Hadoop’s Hive SQL interface reduces costs and gets results fast with Big Data stored as text. Simple optimizations improve performance significantly.
The previous post, Getting Started with Big Data with Text and Apache Hive, described the case for using text formats to import and export data in a Hive ETL and reporting process. These formats are a common denominator and are convenient for getting products started quickly.
The scenario we used last time was built around a startup importing and transforming data from logs, SQL, and NoSQL stores for reporting and for secondary products like a search index. Storing a table’s data in a single large text file works, but is it scalable?
The challenges of using text formats
The first catch is that dealing with row and field delimiters in text is painstaking. It regularly takes a bit of data munging and can still produce nasty surprises. One experience shared by many is that no matter how unlikely a character may be, some input source will contain it and break the file format. Never rely solely on a character or byte combination being unlikely enough not to show up inside big data.
Out of the box, Hive supports only a trivial text file format as input, which does not allow row or field delimiters inside fields or rows, i.e. it cannot escape them the way most systems reading CSV can. One option to make it work anyway is a pre-processing step using sed or tr, for example, to replace unwanted characters before Hive reads the data.
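A minimal sketch of such a pre-processing step (the file names here are made up): tr deletes NUL bytes and carriage returns that would otherwise break Hive’s default line- and tab-delimited text format.

```shell
# Stand-in for a raw tab-separated export that contains Windows line endings.
printf 'id\tname\r\n1\tAlice\r\n' > raw_events.tsv

# Delete NUL bytes and carriage returns before Hive reads the file.
tr -d '\000\r' < raw_events.tsv > clean_events.tsv

cat clean_events.tsv
```

The same pipeline could use sed instead, e.g. to substitute stray field delimiters rather than delete characters.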
Besides the world of pain with (Unicode) encoding variations and delimiters, there is a performance reason to avoid text: it is slow. Text is a verbose means of storing data, and it forces Hive to scan all of it for every query. Moreover, if the data can contain the delimiters and requires escaping, or is complex like JSON or logs, then a SerDe (a Serializer/Deserializer extension) is essential to interpret the data. SerDes are freely available for CSV, TSV, JSON, and regular expressions, for example, but they cost performance since they parse the input after reading it.
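As an illustration, a table backed by a CSV SerDe might be declared like this (a sketch: OpenCSVSerde ships with recent Hive versions, and the table and column names are made up):

```sql
-- Hypothetical CSV-backed table using Hive's bundled OpenCSVSerde,
-- which handles quoting and escaping but parses every field at read time.
CREATE TABLE raw_events (
  id STRING,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;
```

The convenience of proper quoting and escaping is paid for on every read, since the SerDe parses each record.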
In short, parsing text formats correctly incurs several performance penalties and hurdles. Text’s inefficient encoding results in excessive storage and IO. Text is likely to require SerDes that consume precious CPU cycles. Most importantly, all the data has to be read in a brute-force fashion for any kind of query, since there is no way to filter data beforehand based on query information.
Optimization for text
There are benefits to using text. It enables rapid development due to its simplicity (encoding and escaping aside). It also makes it fairly easy to interact with other systems at the periphery of the Hive ETL process.
Some performance can be gained by partitioning the data in a directory structure, e.g. by year, month, and day for logs, or by region and country for geographic data. This enables Hive to read only the requested parts of the data. It requires domain knowledge and a priori knowledge of the workload, and the table has to be created with a schema that tells Hive about the partitioning.
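As a sketch, a partitioned log table could be declared like this (the table and column names are made up): each (year, month, day) combination becomes its own directory, and queries that filter on those columns only read the matching sub-directories.

```sql
-- Hypothetical log table partitioned by date components.
CREATE TABLE logs (
  ts STRING,
  level STRING,
  message STRING
)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Scans only the partitions for May 2014, not the whole table:
SELECT COUNT(*) FROM logs WHERE year = 2014 AND month = 5;
```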
Hive also ships with support for a variety of compression algorithms. It can transparently read and write Gzip, Bzip2, LZO, or Snappy compressed files, trading space and IO for computing time. Snappy is a good choice since it is a lightweight compression that balances CPU and IO needs; importantly, when it is used with a splittable container format, Hadoop mappers can split the data into smaller chunks as needed. Some compression formats like Gzip are not splittable: Hadoop has to read each file with a single mapper. Consequently, if all the data is compressed into one Gzip file, only one mapper will be able to read it, undoing Hadoop’s parallelism.
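Turning on compressed output is a matter of session settings; a sketch (SnappyCodec is Hadoop’s bundled Snappy codec class):

```sql
-- Write job output compressed with Snappy (session-level settings).
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```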
At this point the limits of text formats are reached. For this scenario, it still makes sense to use text for import and export. Internally, for intermediate results, the data can be stored in any format, which can bring about tremendous speed-ups, as the next post in this series will discuss.