RCFile (Record Columnar File), the previous Hadoop Big Data storage format on Hive, is being challenged by the smart ORC (Optimized Row Columnar) format.
My first post on the topic, Getting Started with Big Data with Text and Apache Hive, presented a common scenario to illustrate why Hive file formats matter for performance and big data processing. The second post, Optimising Hadoop and Big Data with Text and Hive, discussed the case for and against the text format. It highlighted why RCFile is the best choice for intermediate tables, which we covered in Faster Big Data on Hadoop with Hive and RCFile. This is about to change: the next evolution of the format, the ORC file, is around the corner. How much better is it?
The running scenario for this four-part series is a startup that processes data from different sources: SQL and NoSQL stores, and logs. The challenge with big data, as the domain matures and deployments in companies evolve, is not only to process the data but also to do so efficiently, reducing the cost and time required.
In the scenario, and for many companies, tables containing billions of rows and numerous columns are unexceptional. Querying and reporting on this data swiftly requires a sophisticated storage format. Ideally, it stores data compactly and allows readers to skip over irrelevant parts without the need for large, complex, or manually maintained indices. The ORC file format addresses all of these issues.
The Stinger initiative leads the development of the ORC file format as a replacement for RCFile. ORC should become part of the stable Hadoop releases this year. ORC stores collections of rows in one file, and within each collection the row data is stored in a columnar format. This allows row collections to be processed in parallel across a cluster. Each file with the columnar layout is optimised for compression and for skipping data and columns, which reduces the read and decompression load.
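To make the layout concrete, here is a minimal Python sketch of the idea described above: rows are split into fixed-size collections, and each collection is pivoted into columns. It is an illustration only, not ORC's actual on-disk format. The function names, the sample rows, and the collection size of two are assumptions made for the example.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]
RowCollection = Dict[str, List[Any]]  # column name -> values of that column


def to_row_collections(rows: List[Row], rows_per_collection: int) -> List[RowCollection]:
    """Split rows into fixed-size collections and store each one column by column."""
    collections = []
    for start in range(0, len(rows), rows_per_collection):
        group = rows[start:start + rows_per_collection]
        # Pivot the group: one list of values per column instead of one dict per row.
        columns = {name: [row[name] for row in group] for name in group[0]}
        collections.append(columns)
    return collections


def read_column(collections: List[RowCollection], name: str) -> List[Any]:
    """Read a single column across all collections without touching other columns."""
    return [value for columns in collections for value in columns[name]]


# Hypothetical sample data for the sketch.
rows = [
    {"id": 1, "country": "DE", "amount": 10},
    {"id": 2, "country": "DE", "amount": 25},
    {"id": 3, "country": "US", "amount": 7},
    {"id": 4, "country": "US", "amount": 12},
]
collections = to_row_collections(rows, rows_per_collection=2)
amounts = read_column(collections, "amount")
```

Because each collection is self-contained, different workers could process different collections independently, and a query that only needs `amount` never has to read `id` or `country`.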
ORC goes beyond RCFile and uses specific encoders for different column data types to improve compression further, e.g. variable-length encoding of integers. ORC introduces lightweight indexing that makes it possible to skip complete blocks of rows that do not match a query. It comes with basic statistics (min, max, sum, and count) on columns. Lastly, a larger default block size of 256 MB optimises for large sequential reads on HDFS, which gives more throughput, and produces fewer files, which reduces the load on the namenode.
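The skipping idea can be sketched in a few lines of Python. The example below keeps a min and max per block of values and uses them to decide whether a block needs to be read at all for a "greater than" filter. It is a simplified illustration under assumed names (`Block`, `build_blocks`, `scan_greater_than`) and an assumed block size of four values, not the ORC reader's real implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    values: List[int]
    minimum: int  # smallest value in the block
    maximum: int  # largest value in the block


def build_blocks(values: List[int], block_size: int) -> List[Block]:
    """Cut a column into blocks and record min/max statistics for each block."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_greater_than(blocks: List[Block], threshold: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of blocks that had to be read."""
    matches: List[int] = []
    blocks_read = 0
    for block in blocks:
        if block.maximum <= threshold:
            continue  # statistics prove no value in this block can match
        blocks_read += 1
        matches.extend(v for v in block.values if v > threshold)
    return matches, blocks_read


# Hypothetical column of sales amounts.
amounts = [3, 5, 8, 2, 40, 55, 41, 60, 7, 9, 1, 4]
blocks = build_blocks(amounts, block_size=4)
matches, blocks_read = scan_greater_than(blocks, threshold=30)
```

In this sample, only one of the three blocks has a maximum above the threshold, so two thirds of the column are never read. The saving depends heavily on how the data is ordered: values that are clustered skip well, randomly scattered values hardly at all.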
Hive file formats compared
Comparing the different file formats (see the graphic below) illustrates the IO benefits. Text is the baseline with a size of 1. RCFile already reduces the storage requirements significantly. ORC files are even better at storing the same information without compression. In fact, ORC files store it more efficiently without compression than text does with Gzip compression. Interestingly, the sales data in the example is not very compressible in the ORC format because it is already stored efficiently. Consequently, compressing such data in ORC would be a waste of computing time. Storing the demographics data in compressed ORC format, on the other hand, shrinks it considerably and would noticeably reduce disk IO.
Comparing the features of the RC, Trevni, and ORC file formats illustrates how the formats have evolved. The Trevni format is an in-development columnar storage format like the (O)RC formats. ORC files will likely become the default choice for Hive users in the near future. They combine all desirable features and performance benefits.
In summary, you may be bound to ingest and export your data in a simple text format, or to use Hadoop adapters to read and write it directly. For internal, intermediate storage, however, you should weigh your choices. Today RCFile, and soon ORC, is an excellent choice for efficient and fast data storage.
Hadoop and Hive can be used in a brute-force manner, but there is no need for it. You can save a lot of time and money by choosing a modern data format. It requires little more than changing a line or two in your create table statement. Why would you not do it?
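As a rough sketch of how small that change is, the Python helper below renders a Hive create table statement for a given storage format. The helper `create_table_ddl`, the table name, and the columns are hypothetical. The `STORED AS ORC` clause assumes a Hive version that ships with ORC support.

```python
from typing import List, Tuple


def create_table_ddl(table: str, columns: List[Tuple[str, str]], storage_format: str) -> str:
    """Render a basic Hive CREATE TABLE statement for the given storage format."""
    column_list = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return f"CREATE TABLE {table} ({column_list}) STORED AS {storage_format}"


# Hypothetical intermediate table for the example.
columns = [("id", "BIGINT"), ("country", "STRING"), ("amount", "DOUBLE")]

rcfile_ddl = create_table_ddl("sales_intermediate", columns, "RCFILE")
orc_ddl = create_table_ddl("sales_intermediate", columns, "ORC")

print(rcfile_ddl)
print(orc_ddl)
```

The two statements differ only in the final keyword, which is the whole migration as far as the table definition is concerned. Existing data would still need to be rewritten into the new table, for example with an insert from the old one.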