We are in the process of extracting source data (xls) and ingesting it into HDFS. Is it better to write these files in CSV or JSON format? We are considering both, but before deciding we would like to understand the merits and demerits of each.
The factors we are trying to evaluate are:
1. Performance (data volume is 2-5 GB)
2. Loading vs. reading data
3. How easy it is to extract metadata (structure) info from either of these files.
The ingested data will be consumed by other applications that support both JSON and CSV.
Since your source data is in table format (Excel), CSV is a better match.
JSON would be preferable if you had to store hierarchical data (objects).
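To make the difference concrete, here is a minimal sketch using only the Python standard library. The sample rows and the helper names (to_csv, to_json_lines, csv_columns, json_lines_columns) are made up for illustration, and one JSON object per line is just one possible JSON layout. It shows that CSV stores the column names once in a header row, while JSON repeats the keys in every record, so for flat data the structure is cheaper to read from CSV and the file tends to be smaller.

```python
import csv
import io
import json

# Hypothetical rows standing in for one sheet of the xls source.
rows = [
    {"id": 1, "name": "Alice", "city": "Pune"},
    {"id": 2, "name": "Bob", "city": "Delhi"},
]


def to_csv(records):
    """Serialize flat records as CSV: column names appear once, in the header."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_json_lines(records):
    """Serialize records as one JSON object per line: keys repeat in every record."""
    return "".join(json.dumps(record) + "\n" for record in records)


def csv_columns(text):
    """Structure of a CSV file: only the header row has to be read."""
    return next(csv.reader(io.StringIO(text)))


def json_lines_columns(text):
    """Structure of line-delimited JSON: collect the keys seen across all records."""
    keys = []
    for line in text.splitlines():
        for key in json.loads(line):
            if key not in keys:
                keys.append(key)
    return keys


csv_text = to_csv(rows)
jsonl_text = to_json_lines(rows)

print(csv_columns(csv_text))
print(json_lines_columns(jsonl_text))
print(len(csv_text), len(jsonl_text))  # serialized sizes in characters
```

Note that CSV carries no type information (every value reads back as a string), whereas JSON at least distinguishes numbers, strings, booleans and nulls, and can nest objects and lists if the data ever stops being flat.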
A good summary of the different file formats in the big data space can be found here: