ORC: An Intelligent Big Data file format for Hadoop and Hive

RCFile (Record Columnar File), the previous Hadoop Big Data storage format on Hive, is being challenged by the smart ORC (Optimized Row Columnar) format.

My first post on the topic, Getting Started with Big Data with Text and Apache Hive, presented a common scenario to illustrate why Hive file formats are significant to its performance and to big data processing. The second post, Optimising Hadoop and Big Data with Text and Hive, discussed the case for and against the text format. It highlighted why RCFile is the best choice for intermediate tables, a format we covered in Faster Big Data on Hadoop with Hive and RCFile. This is about to change: the next evolution of the format, the ORC file, is around the corner. How much better is it?

ORC files

The running scenario for this four-part series is a startup that processes data from different sources: SQL and NoSQL stores, and logs. The challenge with big data, as the domain matures and deployments in companies evolve, is not only to process the data but also to do it efficiently, reducing the cost and time required.

In the scenario, and for many companies, tables containing billions of rows and numerous columns are not unusual. Querying and reporting on this data swiftly requires a sophisticated storage format. Ideally, it stores data compactly and enables skipping over irrelevant parts without the need for large, complex, or manually maintained indices. The ORC file format addresses all of these issues.

The Stinger initiative is driving the development of the ORC file format to replace RCFile. The format should become part of the stable Hive releases this year. ORC stores collections of rows in one file, and within each collection the row data is stored in a columnar format. This allows parallel processing of row collections across a cluster. Each file with the columnar layout is optimised for compression and for skipping of data/columns to reduce read and decompression load.

ORC goes beyond RCFile and uses specific encoders for different column data types to improve compression further, e.g. variable-length compression on integers. ORC also introduces lightweight indexing that enables skipping of complete blocks of rows that do not match a query. It comes with basic statistics on columns (min, max, sum, and count). Lastly, a larger block size of 256 MB by default optimises for large sequential reads on HDFS, giving more throughput and fewer files to reduce the load on the NameNode.

ORC File Layout (source)
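As a concrete sketch of how these features surface in Hive, consider the table below. The table and column names are made up for the running scenario; STORED AS ORC is the part that matters, and the TBLPROPERTIES follow the ORC writer options shipped with Hive, with illustrative values only:

    -- Hypothetical intermediate table for the scenario, stored in the ORC format.
    -- STORED AS ORC is the essential clause; the TBLPROPERTIES are optional tuning knobs.
    CREATE TABLE customer_events_orc (
      customer_id BIGINT,
      event_type  STRING,
      event_time  TIMESTAMP,
      payload     STRING
    )
    STORED AS ORC
    TBLPROPERTIES (
      'orc.compress'         = 'ZLIB',       -- column compression codec: ZLIB, SNAPPY, or NONE
      'orc.stripe.size'      = '268435456',  -- 256 MB stripes for large sequential HDFS reads
      'orc.row.index.stride' = '10000'       -- rows between index entries used to skip row blocks
    );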

Hive file formats compared

Comparing the different file formats (see the graphic below) illustrates the IO benefits. Text is the baseline with a size of 1. RCFile already improves the storage requirements significantly. ORC files are even better at storing the same information without compression. In fact, ORC files store it more efficiently without compression than text with Gzip compression. Interestingly, the sales data in the example is not very compressible in the ORC format because it is already stored efficiently. Consequently, compressing such data in ORC would be a waste of computing time. Storing the demographics data in compressed ORC format, on the other hand, shrinks it tremendously and would substantially reduce disk IO.

File size comparison for TPC-DS data (source)
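To make that trade-off tangible, compression can be chosen per table at creation time. A minimal sketch, with hypothetical table names standing in for the sales and demographics data from the example:

    -- Sales facts are already stored compactly by ORC's type-specific encoders,
    -- so an extra compression pass would mostly cost CPU time.
    CREATE TABLE sales_orc (
      sale_id  BIGINT,
      item_id  BIGINT,
      amount   DECIMAL(10,2),
      sold_at  TIMESTAMP
    )
    STORED AS ORC
    TBLPROPERTIES ('orc.compress' = 'NONE');

    -- Demographics data compresses very well, so ZLIB pays for itself in saved disk IO.
    CREATE TABLE demographics_orc (
      person_id BIGINT,
      city      STRING,
      country   STRING,
      segment   STRING
    )
    STORED AS ORC
    TBLPROPERTIES ('orc.compress' = 'ZLIB');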

Comparing the features of the RC, Trevni, and ORC file formats illustrates how the formats have evolved. The Trevni format is an in-development columnar storage format, like the (O)RC format. ORC files will likely become the default choice for Hive users in the near future, since the format combines all the desirable features with the performance benefits.

ORC file structure compared to Trevni and RC File (source)

In summary, you may be bound to ingest and export your data in a simple text format, or to use Hadoop adapters to read and write it directly. Internally, for intermediate storage, you should consider your choices though. Today RCFile, and soon ORC files, make an excellent choice for efficient and fast data storage.

Hadoop and Hive can be used in a brute-force manner, but there is no need for it. You can save a lot of time and money by choosing a modern data format, and it requires little more than changing a line or two in your CREATE TABLE statement, as sketched below. Why would you not do it?
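To illustrate how small that change is, here is a sketch with placeholder table names: the text-format table from the earlier posts only needs its storage clause swapped, and the existing data can be copied over with a single query:

    -- Existing intermediate table kept in delimited text.
    CREATE TABLE logs_text (
      host    STRING,
      level   STRING,
      message STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- The same table as ORC: only the storage clause changes.
    CREATE TABLE logs_orc (
      host    STRING,
      level   STRING,
      message STRING
    )
    STORED AS ORC;

    -- Convert the existing data in one pass.
    INSERT OVERWRITE TABLE logs_orc
    SELECT host, level, message FROM logs_text;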
