
Spark job HDFS Direct read vs Read from Hive External table

Hi,

We have a couple of HDFS directories in which data is stored in delimited format. The directories are created as one directory per ingestion date, and each is added as a partition to a Hive external table.

Directory structure:

/data/table1/INGEST_DATE=20180101

/data/table1/INGEST_DATE=20180102

/data/table1/INGEST_DATE=20180103 etc.

Now we want to process this data in a Spark job. From the program, I can either read these HDFS directories directly by giving the exact directory path (Option 1), or read from the Hive table into a DataFrame and process that (Option 2).
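A minimal sketch of the two options in Scala, assuming the files are comma-delimited and the external table is registered in the metastore as db.table1 (both hypothetical details):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder()
      .appName("IngestDateProcessing")
      .enableHiveSupport() // required for Option 2
      .getOrCreate()

    // Option 1: read the HDFS directory directly. The schema comes from the
    // files themselves, and INGEST_DATE is only recovered as a column if
    // basePath is set for partition discovery.
    val direct = spark.read
      .option("delimiter", ",") // assumption: comma-delimited files
      .option("basePath", "/data/table1")
      .csv("/data/table1/INGEST_DATE=20180101")

    // Option 2: read through the Hive external table. Schema and partition
    // metadata come from the metastore; the filter prunes to one partition.
    val viaHive = spark.table("db.table1")
      .filter(col("INGEST_DATE") === "20180101")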

I would like to know if there is any significant difference between Option 1 and Option 2. Please let me know if you need any other details.

Thanks in advance,

Sundar Gampa

Re: Spark job HDFS Direct read vs Read from Hive External table

@Sundar Gampa I think Option 2 is the simplest approach, and it is also more flexible if you need to repartition or filter across multiple partitions; see the sketch below.
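For illustration, a range filter on the partition column is resolved against the metastore, so only the matching INGEST_DATE directories are scanned. A sketch, again assuming the table is registered as db.table1 (hypothetical name):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder()
      .appName("PartitionFilterExample")
      .enableHiveSupport()
      .getOrCreate()

    // Only the INGEST_DATE directories inside the range are read, because
    // partition pruning happens before any file I/O.
    val january = spark.table("db.table1")
      .filter(col("INGEST_DATE").between("20180101", "20180131"))
      .repartition(200) // example value: rebalance before a heavy stage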
