Big Data more often than expected is stored and exchanged as text. Apache Hadoop’s Hive SQL interface helps to reduce costs and to get results fast.
Often, things have to get done fast rather than perfectly. However, with big data even a small decision like a file format could have a great impact. What are some of the best-practices, storage formats, and strategies to balance optimization with getting things done?
This four-part post series explores a common big data scenario around extracting, transforming, loading, measuring, and reporting big data from diverse data stores. This is the domain of Hive and Hadoop. The series will illustrate why file formats matter when querying big data with Hive.
This first post explores why text is an option for data storage and exchange with Hive. The next post will discuss challenges of text formats and how to optimize text-encoded big data with Hive. The third part will introduce sequence file and RCFile formats, respectively the Hadoop default and Hive state-of-the-art in performance. The last part will discuss the major improvements arriving this year with the next generation, the ORC (Optimized Row Columnar) file format.
Our example scenario is a commonplace one: A startup (or department) collects a lot of data and wants insight and products built on top. It has logs in a data sink (e.g., HDFS containing web user actions), transactional and user data in SQL, and factual (e.g., product information) in a NoSQL store. It needs to extract and combine the data for reports, metrics, and to feed merged data into another product (e.g., a search index). The data architect chose Hive to painlessly and economically process the data daily.
Hive on top of Hadoop makes data accessible for ETL and data insight with its SQL-like language. It is powerful and scalable because of Hadoop. Hive can be extended through user-defined functions for demanding data processing. It ingests a variety of data formats and can be modified to read and write nearly anything. Hive enables rapid development cycles, can be scaled with more machines, and getting started with it is easy and inexpensive nowadays. Importantly, mapReduce jobs using Java or abstraction frameworks like Cascading, Crunch or Dumbo, Pig, or machine learning (e.g., with Mahout) can be employed on the same cluster for more complex algorithms, and feed back into Hive.
The case for text formats
Hive is used regularly to read text files like logs, CSV, or JSON exports from other systems. Similarly the output from Hive — to be consumed by Hive again or other systems — can be a text format. Unfortunately, text is not an ideal file format. It is inefficient and error prone because of encoding issues. Nevertheless, text files are great to readily interact with other systems and to prototype.
In our scenario the startup runs a daily export of everything as text: CSV from an SQL store using Sqoop, JSON from a NoSQL store, and text logs from a web server to HDFS. Hive subsequently ingests, transforms, and writes the result back to HDFS. The output can be, for example, in a CSV-like format. It is easily understood by other tools for business intelligence analysis and indexable by the search engine.
This solution gets the startup started without developing or deploying Hadoop adapters to all its data stores. It also avoids the chance that its Hadoop cluster might bring down the data stores by trying to connect with numerous mappers to the (No)SQL stores. So why would it not always use a plain big-text file to store and exchange data?
Firstly, ingesting text is not as straightforward as it seems; encoding and file formats differ and are not easily detected at runtime. More importantly, text is not an optimal storage format, and it pays off to optimize it. In recent posts I highlighted current research on the performance benefits of lazy, adaptive indexing data in Hadoop. Some of this benefit can be attained today, and more is on its way with the Stinger initiative. Compression, binary encoding, splitting and grouping of data, minimal indices, and subsequent IO improvements of factor five and higher are achievable with this optimisation, as the next posts of the series will show. This can make the difference between updating the data daily or weekly, or running 10 or 50 servers.
The following three posts will discuss how to further optimize the scenario. The posts will highlight problems and optimizations for text, the state-of-the-art, and the next generation of Hive data storage formats — how they work and why they matter.