Export command creating files under data folder
Labels: Apache Hive
Created ‎05-29-2024 07:15 AM
I am trying to check the files that are created under the data folder when we export a table to an HDFS location.
I have a single-column table with 5000 rows, but after export it created only one data file, named 000000_0.
I have another table with 2 partitions of about 95 rows each, and for this table's export I see 2 folders, one per partition, with only one file under each.
Can you please help me understand how the data files for an exported table are created and how the data is distributed across them?
I mean, is it always one data file per non-partitioned table? Or will Hive create multiple files based on some row count, size threshold, or other criteria?
Created ‎05-30-2024 10:24 PM
Hi @jayes, may I know how you are exporting the table into HDFS? What is the command?
Created ‎06-03-2024 01:44 AM
Command used to export the table: export table test to '/jayesh/testdump1';
Created ‎06-02-2024 11:45 PM
When exporting tables to HDFS in Hive, the creation and distribution of data files depend on several factors such as the table structure (partitioned or non-partitioned), the underlying storage format, and cluster configurations.
Non-Partitioned Tables
For non-partitioned tables, Hive will typically create a single data file if the data is small enough to be written by a single task. In your case, a table with 5000 rows and a single column is small enough that Hive writes it into one data file (e.g., 000000_0).
Partitioned Tables
For partitioned tables, Hive will create separate directories for each partition and within each partition directory, it will create data files. The number of files within each partition directory can depend on the size of the data and the settings of your Hive and HDFS configurations.
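As a minimal sketch of the behavior described above (table and partition names are hypothetical, and the directory layouts are illustrative rather than exact):

```sql
-- Export a non-partitioned table: Hive copies the table's existing data
-- files as-is, so a small table written by one task yields one file.
EXPORT TABLE test TO '/jayesh/testdump1';
-- Illustrative layout:
--   /jayesh/testdump1/_metadata
--   /jayesh/testdump1/data/000000_0

-- Export a partitioned table: one subdirectory per partition under data/.
EXPORT TABLE test_part TO '/jayesh/testdump2';
-- Illustrative layout:
--   /jayesh/testdump2/_metadata
--   /jayesh/testdump2/data/part_col=a/000000_0
--   /jayesh/testdump2/data/part_col=b/000000_0
```

The key point is that EXPORT copies the table's files as they already exist on disk; it does not re-split or merge them, so the file count you see mirrors how the table was originally written.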
Created ‎06-03-2024 01:45 AM
Hello @ggangadharan ,
Can you please help me understand the size per file, or point me to a setting where that size is configured?
Created ‎06-04-2024 12:02 AM
The export table command automatically creates a distributed copy (distcp) job when the table location contains a large number of files; this improves efficiency for massive datasets.
The total size of the exported files will match the size of the table data. If needed, you can adjust the memory allocated to the distcp job's mappers and reducers to optimize performance for your data size.
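As a hedged sketch of the copy behavior: to my knowledge Hive decides between a plain file copy and a distcp job using the hive.exec.copyfile.maxsize threshold; the property name and its default (commonly 32 MB) should be verified against your Hive version's documentation before relying on them:

```sql
-- Files larger than this threshold (in bytes) are copied via a distcp
-- job instead of a simple copy. 33554432 = 32 MB, believed to be the
-- default (assumption: check your Hive version's configuration docs).
SET hive.exec.copyfile.maxsize=33554432;
EXPORT TABLE test TO '/jayesh/testdump1';
```

Note that this tunes how files are copied during export, not how many files the export produces.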
Created ‎06-07-2024 02:53 AM
Hi @ggangadharan ,
Can we export the table data into multiple files via some option like numfile, or anything similar?
Created ‎06-07-2024 04:12 AM
@jayes
Hive itself doesn't offer a built-in option like numfile to export table data into a specific number of files.
However, you can achieve the same result with a couple of approaches:
1. Spark:
Read the Hive table using Spark SQL (if it's a managed table, use an HWC session):
>>> df = spark.sql("select * from sample_table")
In Apache Spark, you can control the number of partitions in a DataFrame using the repartition or coalesce methods; since each partition is written as one file, use coalesce to set the desired file count:
>>> coalesced_df = df.coalesce(5)
Write the data:
>>> coalesced_df.write.parquet("/tmp/coalesced_df")
Result:
[hive@node4 ~]$ hdfs dfs -ls -h /tmp/coalesced_df
Found 6 items
-rw-r--r-- 3 hive supergroup 0 2024-06-07 10:49 /tmp/coalesced_df/_SUCCESS
-rw-r--r-- 3 hive supergroup 135.0 M 2024-06-07 10:48 /tmp/coalesced_df/part-00000-2e6bf3b8-53fa-4a6e-957e-83769c72e780-c000.snappy.parquet
-rw-r--r-- 3 hive supergroup 200.0 M 2024-06-07 10:48 /tmp/coalesced_df/part-00001-2e6bf3b8-53fa-4a6e-957e-83769c72e780-c000.snappy.parquet
-rw-r--r-- 3 hive supergroup 68.9 M 2024-06-07 10:49 /tmp/coalesced_df/part-00002-2e6bf3b8-53fa-4a6e-957e-83769c72e780-c000.snappy.parquet
-rw-r--r-- 3 hive supergroup 155.4 M 2024-06-07 10:49 /tmp/coalesced_df/part-00003-2e6bf3b8-53fa-4a6e-957e-83769c72e780-c000.snappy.parquet
-rw-r--r-- 3 hive supergroup 132.9 M 2024-06-07 10:49 /tmp/coalesced_df/part-00004-2e6bf3b8-53fa-4a6e-957e-83769c72e780-c000.snappy.parquet
[hive@node4 ~]$
Created ‎06-07-2024 04:17 AM
2. Alternatively, write a script (e.g., in Bash) that interacts with Hive and produces your desired output format.
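For example, staying within Hive itself, one common workaround (a sketch, with hypothetical table and directory names) is to force a reduce stage and set the reducer count, since each reducer writes one output file:

```sql
-- Ask for 5 reducers; each reducer writes one file to the target directory.
SET mapreduce.job.reduces=5;

-- DISTRIBUTE BY forces a shuffle/reduce stage, so rows are spread across
-- the reducers instead of being written by a single map task; rand()
-- distributes them roughly evenly.
INSERT OVERWRITE DIRECTORY '/jayesh/testdump_split'
SELECT * FROM test
DISTRIBUTE BY rand();
```

Unlike EXPORT TABLE, this writes only the data (no _metadata file), so it suits cases where you just need the rows split into a chosen number of files.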
