Raw Data Ingestion into Cluster

Contributor

I have a question regarding creating a common reservoir for files in Hadoop that can later be used for any purpose, be it Spark processing, Pig, Hive, etc.

Now my question is: while Sqooping data into Hadoop, which file format should we choose? Are there any industry-wide standards?

1) Text delimited (compressed or uncompressed?)
2) Avro
3) Parquet

2 REPLIES


If you're using HDFS just to land tables (rows and columns) extracted from an RDBMS via Sqoop, then just store them as raw text if you're looking for speed. Compress the files if you're concerned about space in HDFS.
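For instance, if you later rewrite a landed extract with Spark, compression is a one-line option. A minimal PySpark sketch, where the paths and the tab delimiter are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("land-text").getOrCreate()

# Hypothetical tab-delimited extract landed by a Sqoop import.
df = spark.read.option("sep", "\t").csv("/data/raw/orders")

# Rewrite as gzip-compressed delimited text to save HDFS space.
df.write.option("sep", "\t").option("compression", "gzip").csv("/data/raw/orders_gz")
```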

Use Avro if you want a schema for the data.
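A short sketch of writing the same data as Avro, assuming the spark-avro package is on the classpath (paths are hypothetical):

```python
from pyspark.sql import SparkSession

# The Avro source ships as a separate package, e.g.
#   spark-submit --packages org.apache.spark:spark-avro_2.12:3.5.0 ...
spark = SparkSession.builder.getOrCreate()

df = spark.read.option("sep", "\t").csv("/data/raw/orders")  # hypothetical path

# Avro embeds the schema in every file, so downstream readers do not
# need to know the delimiter or column layout of the original text.
df.write.format("avro").save("/data/raw/orders_avro")
```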

I would use Parquet for the final Hive table if the query access patterns are to select only a few columns and do aggregations. If the query access patterns are to select all the columns, then a columnar format such as Parquet would not be needed.
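As an illustration, a hedged Spark SQL sketch of a final Parquet-backed Hive table; the table and column names are made up, and a Hive-enabled SparkSession is assumed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical final table stored as Parquet.
spark.sql("""
    CREATE TABLE IF NOT EXISTS orders_final (
        order_id BIGINT, customer_id BIGINT, amount DOUBLE, order_date STRING
    ) STORED AS PARQUET
""")

# Columnar storage pays off when a query reads a few columns and aggregates.
spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM orders_final
    GROUP BY customer_id
""").show()
```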

What type of analysis would you do on the files using Spark? Spark has a lot of optimizations for Parquet. Not only can Spark quickly parse and process data in Parquet files, it can also push filtering down to the disk layer via predicate pushdown. Spark can also process text files very quickly via the CSV parser that originated at Databricks.
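A small PySpark sketch of both points, with hypothetical paths and columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Filters on Parquet are pushed down to the scan, so row groups whose
# column statistics rule out the predicate can be skipped entirely.
pushed = (spark.read.parquet("/data/publish/orders")   # hypothetical path
               .filter("order_date >= '2023-01-01'")
               .select("order_id", "amount"))
pushed.explain()  # the physical plan lists the predicate under PushedFilters

# Delimited text goes through the built-in CSV reader
# (which grew out of the Databricks spark-csv package).
text = (spark.read
             .option("header", "true")
             .option("inferSchema", "true")
             .csv("/data/raw/orders"))
```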

Rising Star

Usually, if we are using Sqoop to import data from an RDBMS, the following is the folder structure we maintain in HDFS:

raw_tbl -> /data/raw (landing folder for the initial text data after the Sqoop import)

source_tbl -> /data/source (create the source table as ORC by selecting the data from raw_tbl)

master_tbl -> /data/publish/<partition> (create a master table in ORC by adding a partition and moving the data from source_tbl; a sketch of this layering follows below)
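A rough sketch of that layering using Spark SQL; the table names match the layers above, while the columns and the partition key load_dt are hypothetical, and a Hive-enabled SparkSession is assumed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Raw layer: external text table over the Sqoop landing folder.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_tbl (
        id BIGINT, payload STRING, load_dt STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/data/raw'
""")

# Source layer: ORC table populated by selecting from the raw table.
spark.sql("CREATE TABLE source_tbl STORED AS ORC AS SELECT * FROM raw_tbl")

# Master layer: partitioned ORC table loaded from the source table.
spark.sql("""
    CREATE TABLE IF NOT EXISTS master_tbl (id BIGINT, payload STRING)
    PARTITIONED BY (load_dt STRING)
    STORED AS ORC
""")
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE master_tbl PARTITION (load_dt)
    SELECT id, payload, load_dt FROM source_tbl
""")
```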

If you are using Spark to save the DataFrame, it is better to save it in ORC format, since it gives better compression than other formats such as Avro or Parquet.
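For example, a minimal sketch with a hypothetical input path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.option("sep", "\t").csv("/data/raw/orders")  # hypothetical input

# Write the DataFrame as ORC (ZLIB compression is the usual ORC default).
df.write.mode("overwrite").orc("/data/source/orders_orc")
```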