
Hadoop tools/technology and design recommendation

Explorer

I need to read data from two different Hive views (probably in two different databases), extract the data from those views into files, join those files, perform data formatting, and finally write the result to a final file.

Could you please suggest some tools/technologies and a design on Hadoop for this requirement?

1 ACCEPTED SOLUTION

Explorer

Instead of joining files, you can perform the operations/transformations with a Hive query and write the final results to a file (e.g., Parquet on HDFS, or to the local file system).

 

You can even use Spark for the transformations and for joining the views (Spark will generally give better performance than Hive SQL).
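As a rough sketch of the Spark route (the database/view names, join key, and output path below are placeholders for illustration; this assumes Spark is configured with Hive support and would run on a cluster, not standalone):

```python
from pyspark.sql import SparkSession

# Spark session with access to the Hive metastore
spark = (SparkSession.builder
         .appName("join-hive-views")
         .enableHiveSupport()
         .getOrCreate())

# Read both views directly; Spark resolves them through the metastore
df_a = spark.table("db1.view_a")
df_b = spark.table("db2.view_b")

# Join, select/format the required columns, and write the result as Parquet on HDFS
result = (df_a.join(df_b, on="id", how="inner")
          .select("id", "name", "amount", "event_dt"))

result.write.mode("overwrite").parquet("/data/output/final")
```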


REPLIES

Expert Contributor

Hello, 

Have you tried doing this with Hive queries? I think it should be possible:

 

1. CREATE a new, empty table whose columns have the correct data types per the requirement (i.e., the final file's column structure).

2. INSERT data into this new table using a SELECT query with a JOIN to combine the data from both views.

3. The result files will then be present in the table's HDFS directory. This will be your final file.
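The three steps above might look something like this in HiveQL (the table, view, and column names are placeholders for illustration):

```sql
-- Step 1: create an empty table matching the final file's column structure.
-- STORED AS TEXTFILE (or PARQUET) controls the format of the files in HDFS.
CREATE TABLE final_output (
  id        BIGINT,
  name      STRING,
  amount    DECIMAL(18,2),
  event_dt  DATE
)
STORED AS TEXTFILE;

-- Step 2: populate it by joining the two views
INSERT OVERWRITE TABLE final_output
SELECT a.id, a.name, b.amount, b.event_dt
FROM   db1.view_a a
JOIN   db2.view_b b
  ON   a.id = b.id;

-- Step 3: the result files now live in the table's HDFS directory,
-- which you can inspect with: hdfs dfs -ls <warehouse path>/final_output/
```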

 

Thanks!

Explorer

Thanks for the reply!

The views are already created by joining many underlying tables, so joining the views again for data aggregation would cause performance issues.

 

Here is the first approach I came up with:

1. Extract data from the Hive views into files.

2. Create intermediate Hive tables and load the extracted data into them.

3. Join the new Hive tables to generate the final file.

 

The other approach is to use PySpark to read the data from the views directly, aggregate and transform it, and generate the final output file.
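The intermediate-table approach (steps 1-3 above) could be collapsed into CTAS statements in HiveQL, which materialize each view once and avoid a separate extract-to-file-and-reload round trip. Database, view, and column names here are placeholders:

```sql
-- Steps 1-2: materialize each view into an intermediate table
-- (each view's join logic runs exactly once)
CREATE TABLE stage_a STORED AS ORC AS
SELECT id, name FROM db1.view_a;

CREATE TABLE stage_b STORED AS ORC AS
SELECT id, amount, event_dt FROM db2.view_b;

-- Step 3: join the intermediate tables to produce the final output files
CREATE TABLE final_output STORED AS TEXTFILE AS
SELECT a.id, a.name, b.amount, b.event_dt
FROM   stage_a a
JOIN   stage_b b
  ON   a.id = b.id;
```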

