Created 05-29-2024 07:15 AM
I am trying to understand the files that are created under the data folder when we export a table to an HDFS location.
I have a single-column table with 5000 rows, but after the export it created only one data file, named 000000_0.
I have another table with 2 partitions of approximately 95 rows each, and for this table's export I see 2 folders, one per partition, with only one file under each.
Can you please help me understand how the data files for an exported table are created and how the data is distributed?
I mean, is it always just one data file for a non-partitioned table? Or will it create multiple files based on some row count, size threshold, or other criteria?
Created 05-30-2024 10:24 PM
Hi @jayes, may I know how you are exporting the table into HDFS? What is the command?
Created 06-03-2024 01:44 AM
Command to dump the table: export table test to '/jayesh/testdump1';
Created 06-02-2024 11:45 PM
When exporting tables to HDFS in Hive, the creation and distribution of data files depend on several factors such as the table structure (partitioned or non-partitioned), the underlying storage format, and cluster configurations.
For non-partitioned tables, Hive will typically create a single data file if the data is small enough to fit within a single HDFS block. In your case, a single-column table with 5000 rows is likely small enough that Hive writes it into one data file (e.g., 000000_0).
For partitioned tables, Hive will create separate directories for each partition and within each partition directory, it will create data files. The number of files within each partition directory can depend on the size of the data and the settings of your Hive and HDFS configurations.
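As a rough mental model (a sketch only, not Hive's exact planner logic; the 128 MB split size is an illustrative default): each final map or reduce task writes one output file such as 000000_0, and the number of tasks is driven by how many input splits the data spans:

```python
import math

def expected_output_files(data_size_bytes, split_size_bytes=128 * 1024 * 1024):
    """Rough estimate: one output file per task, one task per input split.

    The 128 MB split size is illustrative; the real value depends on your
    HDFS block size and Hive/MapReduce settings.
    """
    return max(1, math.ceil(data_size_bytes / split_size_bytes))

# A 5000-row single-column table (tens of KB) fits in one split -> 1 file.
print(expected_output_files(40_000))
# A 1 GB table spans several 128 MB splits -> several files.
print(expected_output_files(1024 ** 3))
```

This is why your 5000-row table exported as a single 000000_0 file: the whole table fits comfortably within one split.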
Created 06-03-2024 01:45 AM
Hello @ggangadharan ,
Can you please help me with the size per file? Or is there some setting where the size is specified?
Created 06-04-2024 12:02 AM
The export table command automatically creates a distributed copy (distcp) job when the table location contains a large number of files, which improves efficiency for massive datasets.
Export copies the table's existing data files as-is, so the size of the exported files matches the size of the table data. If needed, you can adjust the memory allocated to the distcp job's mappers and reducers to optimize performance for your specific data size.
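One point worth adding: file sizes are fixed when the table's files are written, not at export time (export copies the existing files). If the goal is fewer, larger files in the table itself, Hive's small-file merge settings are the usual knobs. A sketch, with commonly cited upstream defaults in the comments; verify the exact names and defaults against your Hive version:

```sql
-- Merge small files at the end of a map-only job (default: true)
SET hive.merge.mapfiles=true;
-- Merge small files at the end of a map-reduce job (default: false)
SET hive.merge.mapredfiles=true;
-- Target size for merged files, in bytes (~256 MB default)
SET hive.merge.size.per.task=256000000;
-- Trigger a merge when the average output file size is below this (~16 MB default)
SET hive.merge.smallfiles.avgsize=16000000;
```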
Created 06-07-2024 02:53 AM
Hi @ggangadharan ,
Can we export the table data into multiple files via some command, like numfile or something similar?
Created 06-07-2024 04:12 AM
@jayes
Hive itself doesn't offer a built-in command like numfile to directly export table data into a specific number of files.
However, you can achieve the same using a couple of approaches:
1. Spark:
Read the Hive table using Spark SQL:
>>> df = spark.sql("select * from sample_table")
If it's a managed table, use an HWC (Hive Warehouse Connector) session.
In Apache Spark, you can control the number of partitions in a DataFrame using the repartition or coalesce methods; each partition is written out as one file. Note that coalesce can only reduce the partition count, while repartition can also increase it. Using the coalesce method, set the number of partitions for the DataFrame:
>>> coalesced_df = df.coalesce(5)
Write the data:
>>> coalesced_df.write.parquet("/tmp/coalesced_df")
Result:
[hive@node4 ~]$ hdfs dfs -ls -h /tmp/coalesced_df
Found 6 items
-rw-r--r-- 3 hive supergroup 0 2024-06-07 10:49 /tmp/coalesced_df/_SUCCESS
-rw-r--r-- 3 hive supergroup 135.0 M 2024-06-07 10:48 /tmp/coalesced_df/part-00000-2e6bf3b8-53fa-4a6e-957e-83769c72e780-c000.snappy.parquet
-rw-r--r-- 3 hive supergroup 200.0 M 2024-06-07 10:48 /tmp/coalesced_df/part-00001-2e6bf3b8-53fa-4a6e-957e-83769c72e780-c000.snappy.parquet
-rw-r--r-- 3 hive supergroup 68.9 M 2024-06-07 10:49 /tmp/coalesced_df/part-00002-2e6bf3b8-53fa-4a6e-957e-83769c72e780-c000.snappy.parquet
-rw-r--r-- 3 hive supergroup 155.4 M 2024-06-07 10:49 /tmp/coalesced_df/part-00003-2e6bf3b8-53fa-4a6e-957e-83769c72e780-c000.snappy.parquet
-rw-r--r-- 3 hive supergroup 132.9 M 2024-06-07 10:49 /tmp/coalesced_df/part-00004-2e6bf3b8-53fa-4a6e-957e-83769c72e780-c000.snappy.parquet
[hive@node4 ~]$
Created 06-07-2024 04:17 AM
2. An alternative is to write a script (e.g., in Bash) that extracts the data from Hive and writes it out in your desired number of files or output format.
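A minimal sketch of that idea, assuming a Bash environment with GNU coreutils. The hive -e line is illustrative and commented out; seq stands in for the query output so the sketch is self-contained:

```shell
#!/bin/sh
# In practice you would stream the rows out of Hive, e.g.:
#   hive -e "SELECT * FROM test" > table_dump.txt
# Here, sample rows stand in for the query output.
seq 1 5000 > table_dump.txt

# Split into exactly 5 files without breaking rows across files (GNU split):
# -n l/5 -> 5 line-based chunks, -d -> numeric suffixes (part_00 ... part_04)
split -n l/5 -d table_dump.txt part_

ls part_*
```

The split count plays the same role as the numfile knob asked about above; pick any number of chunks, then upload them back to HDFS with hdfs dfs -put if needed.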