question Re: Hadoop tools/technology and design recommendation in Support Questions

Hadoop tools/technology and design recommendation

iamsaaj — Fri, 16 Sep 2022 14:32:08 GMT

I have to read data from two different hive views(probably two different databases) , extract data from those views and write it into files and then, join those files and perform data formatting and finally , write it into final file.

Could you please suggest me some tools/technologies and design for hadoop for this requirement.

Re: Hadoop tools/technology and design recommendation

Gomathinayagam — Tue, 30 Jul 2019 10:46:09 GMT

Hello,

Did you try performing this with the help of Hive queries which I think would be possible?

1. CREATE a new empty table with the columns with correct datatypes as per the requirement (meaning the final file's column structure)

2. INSERT data into this new table with a SELECT query with JOIN to join the data from both the views.

3. You will have the files present in the table's HDFS directory. This would be your final file.

Thanks!

Re: Hadoop tools/technology and design recommendation

vijayshc — Tue, 30 Jul 2019 14:51:21 GMT

Instead of joining files, you can perform operations/transformations with hive query and then final results can be written to file(parquet kind to HDFS or to local file system)

You can even use spark for transformations and joining the views. (spark will give better performance than hive sql)

Re: Hadoop tools/technology and design recommendation

iamsaaj — Tue, 06 Aug 2019 13:04:24 GMT

Thanks for the reply !

Views are already created by joining many underlying table, hence joining the views again for data aggregation will result performance issue.

Here are the two approach i came up with

1. Extract data from Hive view into files.

2. Create intermediate Hive tables and load data extracted from views.

3. Join the new hive tables to generate the final file.

Another approach to use PySpark to read data from views directly , aggreate and transform the data and generate the final output file.

Re: Hadoop tools/technology and design recommendation

iamsaaj — Tue, 06 Aug 2019 13:05:40 GMT

Thanks for the reply !

Views are already created by joining many underlying table, hence joining the views again for data aggregation will result performance issue.

Here are the two approach i came up with

1. Extract data from Hive view into files.

2. Create intermediate Hive tables and load data extracted from views.

3. Join the new hive tables to generate the final file.

Another approach to use PySpark to read data from views directly , aggreate and transform the data and generate the final output file.