Data storage formats & compression

There are multiple storage formats suitable for storing data in HDFS, such as plain text files, rich file formats like Avro and Parquet, and Hadoop-specific formats like SequenceFiles. Each format has its own pros and cons depending on the use case. Broadly, the data we store falls into two categories:

  • Raw Data Formats

  • Processed Data Formats

Access patterns for raw data differ from those for processed data, and hence the preferred file formats differ as well.

When processing raw data, we usually read all the fields of every record, so the underlying storage format should support full-row access efficiently. Analytical queries over processed data, on the other hand, typically touch only a few columns, so the storage format should serve column-level access as efficiently as possible in terms of disk I/O.

Raw Data Formats:

Plain Text File

A very common use case in the Hadoop ecosystem is storing log files or other plain text files containing unstructured data, for both storage and analytics purposes.

These text files can easily eat up the available disk space, so a proper compression mechanism is required depending on the use case.

Structured Text Data

There are more sophisticated forms of text files that hold data in a standardized form, such as CSV, TSV, XML, or JSON.

Binary files

We can store binary files such as images or videos as-is.

Avro

  • Avro is a language-neutral data serialization system.

  • Avro-formatted data is described by a language-independent schema, so it can be shared across applications written in different languages.

  • Avro stores the schema in the file header, so the data is self-describing.

  • Avro files are splittable and compressible, which makes Avro a good candidate for data storage in the Hadoop ecosystem.

  • Schema evolution – the schema used to read an Avro file need not be the same as the schema that was used to write it. This makes it possible to add new fields.

  • An Avro schema is usually written in JSON format. We can also generate schema files from Java POJOs using the utilities Avro provides (see the sketch below).
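
As a minimal, hedged sketch of these points, the snippet below assumes Avro's Java API on the classpath and a made-up User schema written inline as JSON. It writes one record (with the schema embedded in the file header, so the data is self-describing) and reads it back; an evolved reader schema with defaulted new fields could be passed to GenericDatumReader at the read step to exercise schema evolution.

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroRoundTrip {
    // Hypothetical schema: a simple "User" record with two fields, written in JSON.
    private static final String SCHEMA_JSON =
        "{ \"type\": \"record\", \"name\": \"User\", \"fields\": ["
      + "  { \"name\": \"name\", \"type\": \"string\" },"
      + "  { \"name\": \"age\",  \"type\": \"int\" } ] }";

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        File file = new File("users.avro");

        // Write: the schema is stored in the file header, so the file is self-describing.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "alice");
            user.put("age", 30);
            writer.append(user);
        }

        // Read: an evolved reader schema could be supplied to GenericDatumReader here.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord record : reader) {
                System.out.println(record);
            }
        }
    }
}
```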

Processed Data Formats:

Parquet

Parquet is a columnar format. Columnar formats work well when only a few columns are required in a query or analysis.

  • Only the required columns are fetched/read, which reduces disk I/O.

  • Parquet is well suited for data-warehouse style solutions where aggregations are computed on certain columns over a huge set of data.

  • Parquet provides very good compression, up to 75%, when used with compression formats like Snappy.

  • Parquet files can be read and written using the Avro API and an Avro schema (see the sketch after this list).

  • It also supports predicate pushdown, further reducing disk I/O cost.
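
The following is a small sketch of that Avro-on-Parquet path, assuming the parquet-avro and Hadoop client dependencies are available (builder signatures vary slightly across Parquet versions) and reusing the same hypothetical User schema as above. AvroParquetWriter writes Snappy-compressed Parquet from Avro GenericRecords, and AvroParquetReader reads them back through the same Avro view of the data.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetViaAvro {
    public static void main(String[] args) throws Exception {
        // Hypothetical Avro schema reused to describe the Parquet data.
        Schema schema = new Schema.Parser().parse(
            "{ \"type\": \"record\", \"name\": \"User\", \"fields\": ["
          + "  { \"name\": \"name\", \"type\": \"string\" },"
          + "  { \"name\": \"age\",  \"type\": \"int\" } ] }");
        Path path = new Path("users.parquet");

        // Write Avro records into a Snappy-compressed Parquet file.
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter.<GenericRecord>builder(path)
                .withSchema(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "alice");
            user.put("age", 30);
            writer.write(user);
        }

        // Read the records back as Avro GenericRecords.
        try (ParquetReader<GenericRecord> reader =
                 AvroParquetReader.<GenericRecord>builder(path).build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                System.out.println(record);
            }
        }
    }
}
```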

Columnar vs Row Formats (Parquet vs Avro)

Columnar formats are generally used when you need to query only a few columns rather than all the fields of a row, because their column-oriented storage layout is well suited for that access pattern.

Row formats, on the other hand, are used when you need to access all the fields of a row. Avro is therefore generally used to store raw data, because processing usually requires all the fields.

Compression

Big data solutions should be able to process large amounts of data quickly.

Compressing data speeds up I/O operations and saves storage space as well, but it can increase processing time and CPU utilization because of decompression.

So a balance is required: the more the compression, the smaller the data size, but the higher the processing time and CPU utilization.

Compressed files should also be splittable to support parallel processing. If a file is not splittable, we cannot feed it to multiple tasks running in parallel, and we lose the biggest advantage of parallel processing frameworks. For example, gzip-compressed text files are not splittable, whereas bzip2-compressed files are, and container formats like Avro and Parquet remain splittable even when compressed with Snappy.
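
As a hedged illustration of this trade-off in a classic MapReduce setting, the configuration sketch below uses the standard Hadoop 2.x+ property names to enable fast Snappy compression for intermediate map output and a splittable bzip2 codec for the final job output; the codec choices here are only one reasonable balance of CPU cost versus storage and splittability.

```java
import org.apache.hadoop.conf.Configuration;

public class CompressionConfig {
    public static Configuration configure() {
        Configuration conf = new Configuration();

        // Compress intermediate map output with Snappy (fast, low CPU overhead).
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.set("mapreduce.map.output.compress.codec",
                 "org.apache.hadoop.io.compress.SnappyCodec");

        // Compress the final job output with bzip2, which stays splittable
        // for downstream parallel processing.
        conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
        conf.set("mapreduce.output.fileoutputformat.compress.codec",
                 "org.apache.hadoop.io.compress.BZip2Codec");

        return conf;
    }
}
```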
