
Hadoop File Formats


New Contributor

I need to decide how to write my data to Hadoop.

I'm using Spark to consume messages from a Kafka topic; each message is a JSON record.

I have around 200B records per day.

The data fields may change (not a lot, but they may change in the future).

I need fast writes, fast reads, and a small size on disk.

What should I choose: Avro or Parquet?

If I choose Parquet or Avro, do I need to create the table with all the fields of my JSON?

If not, how do I create the table in Parquet format and in Avro format?

Thanks!!

1 REPLY

Re: Hadoop File Formats

Expert Contributor

Hi @Ya ko,

Why not consider the new ORC support in Spark 2.3?

https://www.slideshare.net/Hadoop_Summit/orc-improvement-in-apache-spark-23-95295487

Then you will get the best performance when querying from Hive. And yes, you have to define your table with all the fields.

Slide 20 shows how to specify the new ORC library; you will just have to add the location setting to point to where your data will be stored in HDFS.
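
Since the table has to list every field, it can help to generate the Hive DDL from a sample message instead of typing it by hand. The following is only a rough sketch, not the method from the slides: the helper name hive_orc_ddl, the sample record, the table name events, and the path /data/events_orc are all made up for illustration, and it assumes flat JSON records (anything nested simply falls back to STRING).

```python
import json

# Mapping from Python types (as produced by json.loads) to Hive column types.
# Assumption: flat records only; nested objects, arrays and nulls fall back to STRING.
HIVE_TYPES = {bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE", str: "STRING"}


def hive_orc_ddl(sample_json, table, location):
    """Build a CREATE EXTERNAL TABLE ... STORED AS ORC statement from one sample record."""
    record = json.loads(sample_json)
    columns = []
    for name, value in record.items():
        # Look up the exact type so that bool is not mistaken for int.
        hive_type = HIVE_TYPES.get(type(value), "STRING")
        columns.append("  `{}` {}".format(name, hive_type))
    return (
        "CREATE EXTERNAL TABLE {} (\n".format(table)
        + ",\n".join(columns)
        + "\n)\nSTORED AS ORC\nLOCATION '{}'".format(location)
    )


# Hypothetical sample message, table name, and HDFS path.
sample = '{"user_id": 42, "event": "click", "score": 0.5, "active": true}'
ddl = hive_orc_ddl(sample, "events", "/data/events_orc")
print(ddl)
```

The printed statement can be reviewed and adjusted (for example, narrowing BIGINT to INT or adding partition columns) before running it in Hive. If new fields show up in the JSON later, the table definition would need to be updated to match.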

Michel