
Merits of JSON vs CSV file format while writing to HDFS for downstream applications



We are in the process of extracting source data (xls) and ingesting it into HDFS. Is it better to write these files in CSV or JSON format? Before choosing one, we would like to understand the merits and demerits of each.

Factors we are trying to figure out are:

1. Performance (data volume is 2-5 GB)

2. Loading vs Reading Data

3. How easy it is to extract metadata (structure) info from either of these files.

4. Parsing

The ingested data will be consumed by other applications that support both JSON and CSV.


Re: Merits of JSON vs CSV file format while writing to HDFS for downstream applications


Since your source data is tabular (Excel), CSV is the better match.

JSON would be preferable if you had to store hierarchical data (objects).
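To make the difference concrete, here is a minimal Python sketch. The rows, field names, and helper functions are made up for illustration, and writing JSON as one object per line (JSON Lines) is an assumption on my part, not something from your setup. It shows that CSV stores the column names once in the header while JSON repeats them in every record, that CSV reads every value back as a string while JSON keeps numbers typed, and that a nested record fits JSON directly.

```python
import csv
import io
import json

# Hypothetical rows standing in for one sheet of the Excel source.
rows = [
    {"id": 1, "name": "alice", "amount": 10.5},
    {"id": 2, "name": "bob", "amount": 7.25},
]


def to_csv(records):
    """Serialize flat records as CSV; the header carries the column names once."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_json_lines(records):
    """Serialize records as one JSON object per line; field names repeat per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


csv_text = to_csv(rows)
jsonl_text = to_json_lines(rows)

# Reading back: CSV yields strings only, JSON preserves numeric types.
csv_back = list(csv.DictReader(io.StringIO(csv_text)))
json_back = [json.loads(line) for line in jsonl_text.splitlines()]

# A nested (hierarchical) record serializes to JSON as-is,
# but has no natural single-row CSV representation.
nested = {"id": 3, "name": "carol", "orders": [{"sku": "A1", "qty": 2}]}
nested_line = json.dumps(nested)

print(len(csv_text), len(jsonl_text))
print(type(csv_back[0]["amount"]).__name__, type(json_back[0]["amount"]).__name__)
```

For flat data like this, the CSV output is the smaller of the two, but the consumer has to be told (or infer) the column types. With JSON the field names and basic types travel with each record, at the cost of repeating them.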

A good summary of the different file formats in the big data space can be found here:

http://www.slideshare.net/HadoopSummit/file-format-benchmark-avro-json-orc-parquet
