Star Schema in Hive and Impala

Star Schema in Big Data

Someone on the Hortonworks Community asked how to design a star schema with Hive. It is a question I hear in one form or another from various stakeholders at the large enterprises we work with at Big Data Partnership, and I usually answer it by taking a step back, as I did in my answer to the community question:

If you work with star schemas in Hive/Impala, take a step back and ask whether you need them and whether the overall data flow in your infrastructure is correct. It may be reasonable, but it can also be a sign of a bigger issue.

Avoid copying and pasting your existing data storage and processing strategies from an RDBMS to Hive and Impala. Bear with my answer; it may seem a little tangential to your question, but I think it is essential to understand the strategic context within which one works in order to make reasonable tactical, day-to-day decisions.

Companies often suffer from fragmented storage systems (silos) and incoherent, disagreeing data that has to be joined manually, with traditional systems struggling with (for their design) expensive table scans. As a quick fix, organisations often copy data, e.g. via Sqoop, into their Hadoop cluster to do some analysis or ETL, and then export it back into their RDBMS or access it in the cluster via ODBC. Usually, they are dissatisfied with the outcome. This is not an issue with the tooling, though; it is more related to how it is being used.
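The quick fix described above often looks like a scheduled Sqoop import. A minimal sketch, where the connection string, credentials, table and target directory are all placeholders:

```shell
# Hypothetical nightly copy of an RDBMS table into HDFS via Sqoop.
# Host, database, user, table and target directory are placeholders.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/staging/orders \
  --num-mappers 4
```

On its own this is just a copy job; the point of the article is that without rethinking the overall data flow, such jobs tend to accumulate into another silo.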

I have seen a lot of people approach Hive and Impala in the same fashion as they were used to with an RDBMS in the past. Naturally, there are a few differences, i.e. an RDBMS can be optimised for small data access and joining via indices, while the big data approach is optimised for large-scale data processing. While both can be bent to either purpose (see Hive's newer ACID support and columnar storage via ORC, for example), planning your infrastructure correctly can remove a lot of pain.
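As a sketch of what "bending" Hive towards RDBMS-style workloads looks like: in Hive releases of that era (1.x/2.x), an ACID table had to be bucketed and stored as ORC, and flagged as transactional. Table and column names here are purely illustrative:

```sql
-- Illustrative only: an ACID table in Hive (pre-3.0) must be bucketed,
-- stored as ORC and marked transactional in its table properties.
CREATE TABLE customer_dim (
  customer_id BIGINT,
  name        STRING,
  updated_at  TIMESTAMP
)
CLUSTERED BY (customer_id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
```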

Think about what the ‘raw’ immutable events you want to store are and how you want to represent them based on your needs, e.g. access patterns and analytics. Then consider whether you can reorganise the flow of data in your organisation (step by step) to cut down on the existing systems. If you don’t need near real-time processing you can, for example, copy data from source systems (logs, CSV, JSON messages, etc.) regularly into HDFS and store it as immutable facts. Try to modify it as little as possible: only conform it (e.g. date normalisation) and optimise it (compress and partition the storage locations). This data will be the foundation of your data lake. On top of this dataset, you can run regular processes to build derived, optimised datasets (think of them as something akin to materialised views), which may join various core datasets and even include denormalisation or join in facts as needed. This data may also be exported into data marts for optimised access patterns or into third-party systems.
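The layering above could be sketched in HiveQL roughly as follows. All paths, table and column names are hypothetical, and the raw layer is assumed to be tab-delimited text landed in HDFS:

```sql
-- Hypothetical raw layer: immutable events landed as-is in HDFS.
CREATE EXTERNAL TABLE raw_events (
  event_time STRING,
  user_id    STRING,
  payload    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw/events';

-- Derived, optimised dataset: conformed timestamps, columnar ORC
-- storage, partitioned by day for partition pruning. Rebuilt by a
-- regular batch job, much like a materialised view.
CREATE TABLE events_by_day (
  event_ts TIMESTAMP,
  user_id  STRING,
  payload  STRING
)
PARTITIONED BY (event_date STRING)
STORED AS ORC;

-- Dynamic partitioning must be enabled for the INSERT below.
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE events_by_day PARTITION (event_date)
SELECT CAST(event_time AS TIMESTAMP),
       user_id,
       payload,
       to_date(event_time) AS event_date
FROM raw_events;
```

The raw external table stays untouched as the system of record, while the derived table can be rebuilt or re-optimised at any time without data loss.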

Note that if you use your data lake as a multi-tenant environment, ensure you have data governance and appropriate access limitations in place to prevent people from circumventing your overall data flow pattern (ideally uni-directional) or, worse, building business-critical processes on top of unsupported datasets.

2 thoughts on “Star Schema in Hive and Impala”

  1. Reply Gustau Franch Aug 24, 2016 19:39

    Your article is very interesting and very relevant for me right now. At the moment we intend to migrate our campaign manager solution to a Hadoop stack. Currently we have problems with scalability and performance, especially performance on the BI side. We prepare a data mart to be ingested by Qlik. I arrived late to this development and was thinking of creating an intermediate layer with a star schema to optimise navigation (drill up/down) in the BI. When I investigated this (such a solution was normal in the DWH ecosystem) I found few solutions or references about it, maybe Kylin and little more. That situation suggests to me that the star schema is not usually used in the Big Data ecosystem. But in what other way can we optimise data access for the BI?

    We are initiating a pilot with Kafka-Flume-Spark-HDFS (data lake)-HBase-Hive-Kylin-Tableau and we would like to know your opinion about it. The main data to ingest and process is the clickstream generated on our customers’ websites, to evaluate the impact of the campaigns launched. From this we calculate CPL/A/V/M, achievement of objectives, budgets, segmentation, navigation tracking, … new strategies.

    Taking into account the intensive use of navigation data in the BI, is it advisable to build this intermediate cube layer with a star schema, or does the new paradigm require new ways to optimise data access?

    • Reply Christian Prokopp Aug 26, 2016 06:22

      Hello Gustau,

      The problem you describe remains a challenge for many companies. It can be addressed, and the approach with OLAP and Kylin is one option. As I highlighted in the article, you may also opt to derive various data marts from the data in Hadoop, e.g. to store in an RDBMS, so that you can use your traditional toolset. Connecting Tableau and the like via ODBC to Hive will give you mixed results unless you know exactly how to optimise for your use cases. Hive with LLAP and Spark SQL as an execution engine may address some performance issues. Others have used Presto, Impala and Drill with varying degrees of success. To be honest, this topic is too complex to discuss via comments or emails, and it is one of the challenges we help our customers with at Think Big. Feel free to contact Think Big or me if you would like an in-depth conversation.
