
Raw Data Ingestion into Cluster


I have a question. We are creating a common reservoir for files in Hadoop which can later be used for any purpose, be it Spark processing, Pig, Hive, etc.

Now, while sqooping data into Hadoop, which file format should we choose? Are there any industry-wide standards?

1) Text delimited (compressed or uncompressed?)

If you're using HDFS just to land tables (rows and columns) extracted from an RDBMS via Sqoop, then just store them as raw text if you're looking for speed. Compress them if you're concerned about space in HDFS.
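As a sketch, a plain-text Sqoop import with optional Snappy compression might look like the following (the JDBC URL, credentials, table name, and target path are all made-up placeholders):

```shell
# Land an RDBMS table in HDFS as delimited text (hypothetical connection details)
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders \
  --as-textfile \
  --fields-terminated-by ',' \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec
```

Dropping the last two flags gives you uncompressed text; Snappy trades a little CPU for a sizable reduction in HDFS space.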

Use Avro if you want a schema for the data.
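A sketch of the same kind of import landed as Avro data files, so the schema is embedded alongside the data (connection details are again placeholders):

```shell
# Same source table, but stored as Avro so each file carries its schema
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders_avro \
  --as-avrodatafile
```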

I would use Parquet for the final Hive table if the query access patterns select only a few columns and do aggregations. If the queries select all the columns, then a columnar format such as Parquet is not needed.
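As an illustration, the final Parquet-backed Hive table could be built from the landed data roughly like this (the HiveServer2 URL, database, and table names are hypothetical):

```shell
# Create a Parquet-backed final table from the raw landed table
beeline -u jdbc:hive2://hiveserver:10000 -e "
CREATE TABLE sales.orders_final
STORED AS PARQUET
AS SELECT * FROM sales.orders_raw;
"
```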

What type of analysis would you do on the files using Spark? Spark has a lot of optimizations for Parquet. Not only can Spark quickly parse and process data in Parquet files, Spark can also push filtering down to the disk layer via Predicate Pushdown Optimization. Spark can also process text files very quickly via the CSV parser from Databricks.
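For example, a column-pruning, filtering aggregation over Parquet might be sketched like this with the spark-sql CLI, where the WHERE clause can be pushed down to the Parquet reader (the path and column names are invented for illustration):

```shell
# Spark reads only the referenced columns and can push the filter
# down to the Parquet files instead of scanning everything
spark-sql -e "
SELECT customer_id, SUM(amount)
FROM parquet.\`/data/publish/orders\`
WHERE order_date >= '2016-01-01'
GROUP BY customer_id;
"
```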


Usually, if we are using Sqoop to import data from an RDBMS, the following is the folder structure we maintain in HDFS:

raw_tbl -> /data/raw (landing folder for the initial text data after the Sqoop import)

source_tbl -> /data/source (create the source table as ORC by selecting the data from raw_tbl)

master_tbl -> /data/publish/<partition> (create a master table in ORC by creating a partition and moving the data from source_tbl)
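The raw -> source -> master flow above could be sketched in HiveQL roughly as follows (the HiveServer2 URL, the example columns, and the partition column and value are all assumptions for illustration):

```shell
beeline -u jdbc:hive2://hiveserver:10000 -e "
-- source table in ORC, built from the raw text-backed table
CREATE TABLE source_tbl STORED AS ORC
AS SELECT * FROM raw_tbl;

-- partitioned master table in ORC (columns are made up)
CREATE TABLE master_tbl (id INT, amount DOUBLE)
PARTITIONED BY (load_date STRING)
STORED AS ORC;

-- move the data from source into a partition of master
INSERT INTO TABLE master_tbl PARTITION (load_date='2016-08-01')
SELECT id, amount FROM source_tbl;
"
```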

If you are using Spark to save the DataFrame, it is better to save it in ORC format, since it gives better compression than other formats such as Avro or Parquet.
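As a minimal sketch, saving a DataFrame as ORC from spark-shell might look like this (assuming Spark 2.x, with a made-up table name and output path):

```shell
# Pipe a short Scala snippet into spark-shell to persist a table as ORC
echo '
// read the source table and write it out in ORC format (names are illustrative)
val df = spark.table("source_tbl")
df.write.format("orc").save("/data/publish/source_orc")
' | spark-shell
```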