Someone on the Hortonworks Community asked about how to design star schema with Hive. This is a question I hear in some way or another from various stakeholders in large enterprises we work with at Big Data Partnership. And I usually answer it by taking a step back and I did that answering the community question:
If you work with star schema in Hive/Impala, take a step back and reflect if you need to and if the overall data flow in your infrastructure is correct. It may be reasonable though it can be a sign of a bigger issue.
Avoid copying and pasting your existing data storage and processing strategies from RDBMS to Hive and Impala. Bear with my answer; it may seem a little tangential to your question but I think it is essential to understand the strategic context within which one works to make reasonable tactical, day-to-day decision.
Companies often suffer from fragmented storage systems (silos) and the incoherent and disagreeing data that has to be joined manually with traditional systems struggling with (for their design) expensive table scans. As a quick fix organisations often copy data, e.g. via Sqoop, into their Hadoop cluster to do some analysis or ETL and then export it back into their RDMBS or access it via ODBC in the cluster. And usually, they are dissatisfied with the outcome. This is not an issue of the tooling though and more related to who it is being used.
I have seen a lot of people approach Hive and Impala in the same fashion as they have been used to with RDBMS in the past. Naturally, there are a few differences, i.e. that an RDMBS can be optimised for small data access and joining via indices while the big data approach is optimised for large-scale data processing. While both can be bent to be used for the either purpose (see Hive’s newer ACID support and columnar storage via ORC, for example) planning your infrastructure correctly can remove a lot of pain.
Think about what are the ‘raw’ immutable events you want to store and how to you want to represent them based on your needs, e.g. access patterns and analytics. Then consider if you can reorganise the flow of data in your organisation (step by step) to cut down on the existing systems. If you don’t need near real-time processing you can, for example, copy data from source systems (logs, CSV, JSON messages, etc) regularly into HDFS and store it as immutable facts. Try and modify it as little as possible and only conform it (e.g. date normalisation) and optimise it (compress and partition the storage locations). This data will be the foundation of your data lake. On top of this dataset, you can run regular processes to build derived, optimised datasets (think of them as something alike materialised views) which may join various core datasets and even include denormalisation or join in facts as needed. This data may also be exported into data marts for optimised access patterns or third party systems.
Note, that if you use your data lake as a multi-tenancy environment to ensure that you have data governance and appropriate access limitations in place to prevent people circumventing your overall data flow pattern (ideally uni-directional) and worse build business critical processes on top of unsupported datasets.