Created on 07-27-2019 07:42 PM - edited 09-16-2022 07:32 AM
I have to read data from two different Hive views (probably in two different databases), extract the data from those views and write it to files, then join those files, perform data formatting, and finally write the result to a final file.
Could you please suggest tools/technologies and a design on Hadoop for this requirement?
Created 07-30-2019 07:51 AM
Instead of joining files, you can perform the operations/transformations with a Hive query and then write the final results to a file (e.g. Parquet on HDFS, or to the local file system).
You can also use Spark for the transformations and for joining the views (Spark will generally give better performance than Hive SQL).
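Something along these lines. This is only a minimal sketch: the database, view, and column names (db1.view_a, db2.view_b, the join key "id", col1/col2) and the output path are placeholders, so substitute your own objects.

```python
# Minimal PySpark sketch: join two Hive views and write the result as Parquet.
# All object names below are assumptions, not real objects from the thread.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("join-hive-views")
         .enableHiveSupport()      # required so spark.table() can resolve Hive metastore objects
         .getOrCreate())

# Read both views straight from the metastore -- no intermediate files needed.
df_a = spark.table("db1.view_a")
df_b = spark.table("db2.view_b")

# Join and transform in Spark, then write once to HDFS as Parquet.
result = (df_a.join(df_b, on="id", how="inner")
              .select("id", df_a["col1"], df_b["col2"]))

result.write.mode("overwrite").parquet("/user/hive/output/joined_result")
```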
Created on 07-30-2019 03:45 AM - edited 07-30-2019 03:46 AM
Hello,
Have you tried doing this with Hive queries? I think it would be possible:
1. CREATE a new, empty table whose columns and datatypes match the requirement (i.e. the final file's column structure).
2. INSERT data into this new table with a SELECT query that JOINs the data from both views.
3. The data files will then be present in the table's HDFS directory. This would be your final file (see the sketch below).
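A rough sketch of steps 1-3. The statements are plain HiveQL, issued here through PySpark's spark.sql only to keep the examples in one language; the same statements work from Beeline or the Hive CLI. Table, database, and column names are examples only.

```python
# Hypothetical names throughout (reporting.final_output, db1.view_a, db2.view_b).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-join-to-table")
         .enableHiveSupport()
         .getOrCreate())

# 1. Empty target table with the final file's column structure.
spark.sql("""
    CREATE TABLE IF NOT EXISTS reporting.final_output (
        id    BIGINT,
        col_a STRING,
        col_b STRING
    )
    STORED AS PARQUET
""")

# 2. Populate it by joining the two views.
spark.sql("""
    INSERT OVERWRITE TABLE reporting.final_output
    SELECT v1.id, v1.col_a, v2.col_b
    FROM db1.view_a v1
    JOIN db2.view_b v2 ON v1.id = v2.id
""")

# 3. The data files now sit under the table's HDFS location,
#    which becomes your final output.
```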
Thanks!
Created 08-06-2019 06:04 AM
Thanks for the reply!
The views are already built by joining many underlying tables, so joining the views again for data aggregation will cause performance issues.
Here are the two approaches I came up with:
1. Extract data from the Hive views into files.
2. Create intermediate Hive tables and load the data extracted from the views.
3. Join the new Hive tables to generate the final file.
Another approach is to use PySpark to read the data from the views directly, aggregate and transform the data, and generate the final output file (see the sketch below).
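A hedged sketch of that second approach: read the views directly with PySpark, aggregate, and emit a single final file. The view names, the join key, the "amount" column, and the aggregation itself are illustrative assumptions, not objects from this thread.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("views-to-final-file")
         .enableHiveSupport()
         .getOrCreate())

# Read the views directly; no intermediate extract files or staging tables.
view_a = spark.table("db1.view_a")
view_b = spark.table("db2.view_b")

# Example aggregation: total amount per id after joining the two views.
final = (view_a.join(view_b, on="id")
               .groupBy("id")
               .agg(F.sum("amount").alias("total_amount")))

# coalesce(1) forces a single output file; only appropriate when the
# final result is small enough to pass through one writer task.
(final.coalesce(1)
      .write.mode("overwrite")
      .option("header", "true")
      .csv("/user/hive/output/final_file"))
```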