Raw Data Ingestion into Cluster

Contributor

I have a question. We are creating a common reservoir for files in Hadoop that can later be used for any purpose, be it Spark processing, Pig, Hive, etc.

My question is: while Sqooping data into Hadoop, which file format should we choose? Are there any industry-wide standards?

1) Text delimited (compressed or uncompressed?)

2) Avro

3) Parquet

2 Replies


If you're using HDFS just to land tables (rows and columns) extracted from an RDBMS via Sqoop, then just store them as raw text if you're looking for speed. Compress them if you're concerned about space in HDFS.

Use Avro if you want a schema for the data.

I would use Parquet for the final Hive table if the query access patterns select only a few columns and do aggregations. If the queries select all the columns, a columnar format such as Parquet is not needed.

What type of analysis would you do on the files using Spark? Spark has a lot of optimizations for Parquet. Not only can Spark quickly parse and process data in Parquet files, but it can also push filtering down to the storage layer via predicate pushdown. Spark can also process text files very quickly via the CSV parser from Databricks.
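To make that concrete, here is a minimal Spark (Scala) sketch of the two access patterns, assuming Spark 2.x (where the Databricks CSV parser ships as the built-in csv source); the paths and column names (/data/publish/orders_parquet, order_date, customer_id, amount) are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object FormatComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-vs-text")
      .getOrCreate()

    // Parquet: only the referenced columns are read, and the filter can be
    // pushed down to the file level (predicate pushdown).
    val parquetAgg = spark.read
      .parquet("/data/publish/orders_parquet")      // hypothetical path
      .filter(col("order_date") >= "2023-01-01")    // hypothetical column
      .groupBy("customer_id")
      .agg(sum("amount").as("total_amount"))

    // Delimited text: every row must be fully parsed, but the CSV reader is still fast.
    val rawText = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/raw/orders_text")                 // hypothetical path

    parquetAgg.show()
    rawText.printSchema()

    spark.stop()
  }
}
```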

Rising Star

Usually, if we are using Sqoop to import data from an RDBMS, the following is the folder structure we maintain in HDFS (a Spark sketch of the flow appears after the list):

raw_tbl -> /data/raw (landing folder for the initial text data after the Sqoop import)

source_tbl -> /data/source/ (create the source table as ORC by selecting the data from the raw table)

master_tbl -> /data/publish/<partition> (create a master table in ORC with a partition and move the data from source_tbl)
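The original layering is built as Hive tables; the sketch below just mirrors the same raw -> source -> master layout with the Spark DataFrame API, assuming Spark 2.x. The paths, delimiter, header option, and the load_date partition column are all assumptions for illustration.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object LayeredIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("layered-ingest")
      .getOrCreate()

    // raw_tbl: delimited text landed by Sqoop (delimiter and header are assumptions)
    val raw = spark.read
      .option("delimiter", ",")
      .option("header", "true")
      .csv("/data/raw/customers")

    // source_tbl: the same data rewritten as ORC
    raw.write
      .mode(SaveMode.Overwrite)
      .orc("/data/source/customers")

    // master_tbl: ORC, partitioned before publishing
    // (assumes a load_date column exists in the source data)
    spark.read.orc("/data/source/customers")
      .write
      .mode(SaveMode.Overwrite)
      .partitionBy("load_date")
      .orc("/data/publish/customers")

    spark.stop()
  }
}
```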

If you are using Spark to save the DataFrame, it is better to save it in ORC format, since in our experience it gives better compression than other formats such as Avro or Parquet.
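Compression results depend heavily on the data, so it is worth comparing ORC against Parquet on a sample of your own tables. A minimal sketch, assuming Spark 2.3+ and hypothetical paths; the codec choices are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object SaveAsOrc {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("save-as-orc").getOrCreate()

    val df = spark.read.orc("/data/source/customers")   // hypothetical input

    // ORC with ZLIB compression
    df.write
      .option("compression", "zlib")
      .orc("/data/compare/customers_orc")

    // Parquet with Snappy, for a size comparison against the ORC output
    df.write
      .option("compression", "snappy")
      .parquet("/data/compare/customers_parquet")

    spark.stop()
  }
}
```

After both writes finish, comparing the sizes of the two output directories (for example with hdfs dfs -du -s -h) shows which format compresses your data better.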