
Export command creating files under data folder

Contributor


I am trying to check the files that are created under the data folder when we export a table to an HDFS location.

I have a single-column table with 5000 rows, but after export it created only one data file, named 000000_0.
I have another, partitioned table with 2 partitions of roughly 95 rows each; for this table the export created 2 folders, one per partition, and only one file under each.

Can you please help me understand how the data files for an exported table are created and how the data is distributed?
I mean, is it always one data file per non-partitioned table? Or will it create multiple files based on some row count, size, or other criterion for splitting?


Master Collaborator

Hi @jayes, may I know how you are exporting the table into HDFS? What is the command?

Contributor

The command used to export the table: export table test to '/jayesh/testdump1';

Super Collaborator

When exporting tables to HDFS in Hive, the creation and distribution of data files depend on several factors such as the table structure (partitioned or non-partitioned), the underlying storage format, and cluster configurations. 

Non-Partitioned Tables

For non-partitioned tables, Hive will typically create a single data file if the data is small enough to fit within a single HDFS block. In your case, a table with 5000 rows and a single column is likely small enough that Hive writes it into one data file (e.g., 000000_0).

Partitioned Tables

For partitioned tables, Hive will create separate directories for each partition and within each partition directory, it will create data files. The number of files within each partition directory can depend on the size of the data and the settings of your Hive and HDFS configurations.
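One set of criteria behind "how many files" is Hive's small-file merge behavior: the number of output files starts out equal to the number of tasks that wrote the table, and Hive can then merge small files at the end of the job. As a hedged sketch (property names are from stock Hive; defaults can differ per distribution, so verify on your version):

```sql
-- Merge small files produced by map-only and map-reduce jobs
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
-- Target size of each merged file, in bytes (here ~256 MB)
SET hive.merge.size.per.task=268435456;
-- Trigger a merge when the average output file is smaller than this
SET hive.merge.smallfiles.avgsize=16777216;
```

With settings like these in effect, even a table written by many tasks can end up as a handful of files, which is consistent with a small 5000-row table exporting as a single 000000_0 file.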


Contributor

Hello @ggangadharan ,

Can you please help me with the size per file? Or is there a setting where the size is specified?

Super Collaborator

The export table command automatically launches a distributed copy (DistCp) job when the table location contains a large number of files; this improves efficiency for massive datasets.

The size of the exported files will match the size of the table data. If needed, you can adjust the memory allocated to the DistCp job's mappers and reducers to optimize performance for your specific data size.
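The settings involved can be sketched as follows (a hedged example: these property names exist in stock Hive and Hadoop, but check the defaults and availability on your specific version before relying on them):

```sql
-- Hive falls back to DistCp when a copy exceeds this size, in bytes
SET hive.exec.copyfile.maxsize=33554432;
-- Standard MapReduce knob for the memory given to each mapper, in MB;
-- raising it can help a struggling DistCp copy of very large files
SET mapreduce.map.memory.mb=4096;
```

Note that these settings affect how the export copy runs, not how many files it produces; the file layout mirrors the table's existing data files.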

Contributor

Hi @ggangadharan ,

Can we export the table data into multiple files via some option, like a numFiles setting or similar?

Super Collaborator

@jayes 

Hive itself doesn't offer a built-in command like numfile to directly export table data into a specific number of files.

However, you can achieve the same using a couple of approaches:

1. Spark: 

Read the Hive table using Spark SQL:

>>> df = spark.sql("select * from sample_table")

(If it is a managed table, use an HWC session.)

In Apache Spark, you can control the number of partitions of a DataFrame using the repartition or coalesce methods; each partition becomes one output file when the DataFrame is written. Use coalesce to set the desired number of partitions:

coalesced_df = df.coalesce(5)


Write the data:

>>> coalesced_df.write.parquet("/tmp/coalesced_df")


Result

[hive@node4 ~]$ hdfs dfs -ls -h /tmp/coalesced_df
Found 6 items
-rw-r--r--   3 hive supergroup          0 2024-06-07 10:49 /tmp/coalesced_df/_SUCCESS
-rw-r--r--   3 hive supergroup    135.0 M 2024-06-07 10:48 /tmp/coalesced_df/part-00000-2e6bf3b8-53fa-4a6e-957e-83769c72e780-c000.snappy.parquet
-rw-r--r--   3 hive supergroup    200.0 M 2024-06-07 10:48 /tmp/coalesced_df/part-00001-2e6bf3b8-53fa-4a6e-957e-83769c72e780-c000.snappy.parquet
-rw-r--r--   3 hive supergroup     68.9 M 2024-06-07 10:49 /tmp/coalesced_df/part-00002-2e6bf3b8-53fa-4a6e-957e-83769c72e780-c000.snappy.parquet
-rw-r--r--   3 hive supergroup    155.4 M 2024-06-07 10:49 /tmp/coalesced_df/part-00003-2e6bf3b8-53fa-4a6e-957e-83769c72e780-c000.snappy.parquet
-rw-r--r--   3 hive supergroup    132.9 M 2024-06-07 10:49 /tmp/coalesced_df/part-00004-2e6bf3b8-53fa-4a6e-957e-83769c72e780-c000.snappy.parquet
[hive@node4 ~]$

Super Collaborator

2. An alternative is to write a script (e.g., in Bash) that runs the Hive query and splits the output into your desired number of files and format.