Star Schema in Hive and Impala
Someone on the Hortonworks Community asked how to design a star schema with Hive. This is a question I hear in one form or another from various stakeholders in the large enterprises we work with at Big Data Partnership. I usually answer it by taking a step back, and I did that when answering the community ... read more →

ORC: An Intelligent Big Data file format for Hadoop and Hive
RCFile (Record Columnar File), the previous Hadoop Big Data storage format on Hive, is being challenged by the smart ORC (Optimized Row Columnar) format. My first post on the topic, Getting Started with Big Data with Text and Apache Hive, presented a common scenario to illustrate why Hive file formats are significant to its performance and ... read more →

Faster Big Data on Hadoop with Hive and RCFile
SQL on Hadoop with Hive makes Big Data accessible. Yet performance can be lacking. RCFile (Record Columnar File) is a great optimisation for Big Data with Hive. The previous two posts in this four-part series explained why to use text on the periphery of an ETL process and how to optimise text. The inside of a Hive ... read more →

Getting Started with Big Data with Text and Apache Hive
Big Data is stored and exchanged as text more often than expected. Apache Hadoop’s Hive SQL interface helps to reduce costs and to get results fast. Often, things have to get done fast rather than perfectly. However, with big data even a small decision like the choice of file format can have a great impact. What are ... read more →

Hadoop 2.0: Beyond MapReduce with YARN, Drill, Tez
Hadoop 1.0 is increasingly criticised as slow and limited in its applications, now that the hype is dying down. Marketing departments, riding the Big Data wave, wildly exaggerated Hadoop’s abilities. Hadoop 2.0, surprisingly, is about to prove them somewhat right with two major developments. read more →

Democratize Big Data With Hadoop and Hive
You have started to process data with cloud computing platforms like Amazon Web Services (AWS) Elastic MapReduce (EMR). Now that you use it regularly, other stakeholders are getting curious. You increasingly find yourself firing up an EMR cluster to quickly answer a question or try something out. It may be time to change the way ... read more →

Get Started: Big Data Crunching in the Cloud
Even if your business case is constantly evolving, you will still want to leverage big data, but being tied to a single infrastructure will limit your capital and your options. Big data is not a starting point but a destination for many startups and teams. It becomes a conversation point as a result of the ... read more →

Hint: Hive 0.8.1.1 on AWS – avoid a regression bug

read more →