How to load existing partitioned Parquet data into a Hive table from an S3 bucket?
- Labels:
  - Apache Hadoop
  - Apache Spark
Created 02-07-2023 11:18 PM
Hi,
I currently have a setup where partitioned Parquet data already exists in my S3 bucket, and I want to bind it dynamically to a Hive table.
I am able to achieve this for a single partition, but I need help loading the data from all partitions of the existing partitioned Parquet data in the S3 bucket into the table.
Thank you.
Created 02-09-2023 06:14 AM
@codiste_m By default, Hive uses static partitioning. Hive also supports dynamic partitioning, but I am not sure how well that works with data that already sits in existing folders. I believe dynamic partitioning creates the correct partitions based on the schema, and creates those partition folders as the data is inserted into the storage path.
It sounds like you will need to run a load or add-partition command for each partition you want to query, along the lines of the sketch below.
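For example, a minimal sketch of registering the existing folders partition by partition. The table name sales_data, the partition column dt, and the bucket path are made up for illustration, and I have used ALTER TABLE ... ADD PARTITION rather than LOAD DATA so the existing files stay where they are in S3:

```sql
-- Illustrative names only: sales_data, dt, and the bucket path are assumptions.
-- Point each existing S3 partition folder at the table without moving any files.
ALTER TABLE sales_data ADD IF NOT EXISTS
  PARTITION (dt='2023-01-01') LOCATION 's3a://my-bucket/some_location/dt=2023-01-01/'
  PARTITION (dt='2023-01-02') LOCATION 's3a://my-bucket/some_location/dt=2023-01-02/';
```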
Created 09-27-2023 07:43 AM
If the partitioned data already exists in a layout like the one below:
<s3:bucket>/<some_location>/<part_column>=<part_value>/<filename>
you can create an external table pointing at the location above and run 'msck repair table <table_name> sync partitions' to sync the partitions. Validate the data by running some sample SELECT statements, for example as sketched below.
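A rough sketch of that first step; the table name, columns, and bucket path below are placeholders rather than anything from the original post:

```sql
-- Placeholder schema and S3 path; adjust to match your actual Parquet files.
CREATE EXTERNAL TABLE IF NOT EXISTS sales_data (
  id     BIGINT,
  amount DOUBLE
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 's3a://my-bucket/some_location/';

-- Discover the existing <part_column>=<part_value> folders and register them as partitions.
MSCK REPAIR TABLE sales_data SYNC PARTITIONS;

-- Spot-check one partition to confirm the data is readable.
SELECT * FROM sales_data WHERE dt = '2023-01-01' LIMIT 10;
```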
Once that is done, you can create a new external table on another bucket and run an INSERT statement with dynamic partitioning, along the lines of the sketch below.
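A hedged sketch of that second step, again with made-up names (sales_data_copy, a second bucket path) and the session settings Hive typically needs for a fully dynamic insert:

```sql
-- Illustrative target table in another bucket; names and paths are assumptions.
CREATE EXTERNAL TABLE IF NOT EXISTS sales_data_copy (
  id     BIGINT,
  amount DOUBLE
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 's3a://my-other-bucket/some_location/';

-- Allow fully dynamic partitioning for this session.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column goes last in the SELECT list and drives the dynamic partitions.
INSERT INTO TABLE sales_data_copy PARTITION (dt)
SELECT id, amount, dt FROM sales_data;
```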
Ref - https://cwiki.apache.org/confluence/display/hive/dynamicpartitions
