Created 05-29-2024 07:15 AM
I am trying to understand the files that are created under the data folder when we export a table to an HDFS location.
I have a single-column table with 5000 rows, but after the export it created only one data file, named 000000_0.
I have another table with 2 partitions of approximately 95 rows each, and for this table's export I see 2 folders, one per partition, with only one file under each.
Can you please help me understand how the data files for an exported table are created and how the data is distributed?
I mean, is it always just one data file for a non-partitioned table? Or will it create multiple files based on some row count, size threshold, or other criteria?
Created 05-30-2024 10:24 PM
Hi @jayes, may I know how you are exporting the table into HDFS? What is the command?
Created 06-03-2024 01:44 AM
Command to dump the table: export table test to '/jayesh/testdump1';
Created 06-02-2024 11:45 PM
When exporting tables to HDFS in Hive, the creation and distribution of data files depend on several factors such as the table structure (partitioned or non-partitioned), the underlying storage format, and cluster configurations.
For non-partitioned tables, Hive will typically create a single data file if the data is small enough to fit within a single HDFS block. In your case, a single-column table with 5000 rows is likely small enough that Hive writes it into one data file (e.g., 000000_0).
For partitioned tables, Hive will create separate directories for each partition and within each partition directory, it will create data files. The number of files within each partition directory can depend on the size of the data and the settings of your Hive and HDFS configurations.
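As a rough mental model (a sketch only, not Hive's exact planner logic; the 128 MB split size is an illustrative default): each final map or reduce task writes one output file such as 000000_0, and the number of tasks is driven by how many input splits the data spans:

```python
import math

def expected_output_files(data_size_bytes, split_size_bytes=128 * 1024 * 1024):
    """Rough estimate: one output file per task, one task per input split.

    The 128 MB split size is illustrative; the real value depends on your
    HDFS block size and Hive/MapReduce settings.
    """
    return max(1, math.ceil(data_size_bytes / split_size_bytes))

# A 5000-row single-column table (tens of KB) fits in one split -> 1 file.
print(expected_output_files(40_000))
# A 1 GB table spans several 128 MB splits -> several files.
print(expected_output_files(1024 ** 3))
```

This is why your 5000-row table exported as a single 000000_0 file: the whole table fits comfortably within one split.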
Created 06-03-2024 01:45 AM
Hello @ggangadharan ,
Can you please help me with the size per file? Or is there some setting where the size is specified?
Created 06-04-2024 12:02 AM
The export table command automatically creates a distributed copy (distcp) job when the table location contains a large number of files, which improves efficiency for massive datasets.
Export copies the table's existing data files as-is, so the size of the exported files matches the size of the table data. If needed, you can adjust the memory allocated to the distcp job's mappers and reducers to optimize performance for your specific data size.
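One point worth adding: file sizes are fixed when the table's files are written, not at export time (export copies the existing files). If the goal is fewer, larger files in the table itself, Hive's small-file merge settings are the usual knobs. A sketch, with commonly cited upstream defaults in the comments; verify the exact names and defaults against your Hive version:

```sql
-- Merge small files at the end of a map-only job (default: true)
SET hive.merge.mapfiles=true;
-- Merge small files at the end of a map-reduce job (default: false)
SET hive.merge.mapredfiles=true;
-- Target size for merged files, in bytes (~256 MB default)
SET hive.merge.size.per.task=256000000;
-- Trigger a merge when the average output file size is below this (~16 MB default)
SET hive.merge.smallfiles.avgsize=16000000;
```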
Created 06-07-2024 02:53 AM
Hi @ggangadharan ,
Can we export the table data into multiple files via some command, like numfile or something similar?
Created 06-07-2024 04:12 AM
@jayes
Hive itself doesn't offer a built-in command like numfile to directly export table data into a specific number of files.
However, you can achieve the same using a couple of approaches:
1. Spark:
Read the Hive table using Spark SQL:
>>> df = spark.sql("select * from sample_table")
If it's a managed table, use an HWC (Hive Warehouse Connector) session.
In Apache Spark, you can control the number of partitions in a DataFrame using the repartition or coalesce methods; each partition is written out as one file. Note that coalesce can only reduce the partition count, while repartition can also increase it. Using the coalesce method, set the number of partitions for the DataFrame:
>>> coalesced_df = df.coalesce(5)
Write the data:
>>> coalesced_df.write.parquet("/tmp/coalesced_df")
Result:
[hive@node4 ~]$ hdfs dfs -ls -h /tmp/coalesced_df
Found 6 items
-rw-r--r-- 3 hive supergroup 0 2024-06-07 10:49 /tmp/coalesced_df/_SUCCESS
-rw-r--r-- 3 hive supergroup 135.0 M 2024-06-07 10:48 /tmp/coalesced_df/part-00000-2e6bf3b8-53fa-4a6e-957e-83769c72e780-c000.snappy.parquet
-rw-r--r-- 3 hive supergroup 200.0 M 2024-06-07 10:48 /tmp/coalesced_df/part-00001-2e6bf3b8-53fa-4a6e-957e-83769c72e780-c000.snappy.parquet
-rw-r--r-- 3 hive supergroup 68.9 M 2024-06-07 10:49 /tmp/coalesced_df/part-00002-2e6bf3b8-53fa-4a6e-957e-83769c72e780-c000.snappy.parquet
-rw-r--r-- 3 hive supergroup 155.4 M 2024-06-07 10:49 /tmp/coalesced_df/part-00003-2e6bf3b8-53fa-4a6e-957e-83769c72e780-c000.snappy.parquet
-rw-r--r-- 3 hive supergroup 132.9 M 2024-06-07 10:49 /tmp/coalesced_df/part-00004-2e6bf3b8-53fa-4a6e-957e-83769c72e780-c000.snappy.parquet
[hive@node4 ~]$
Created 06-07-2024 04:17 AM
2. An alternative is to write a script (e.g., in Bash) that extracts the data from Hive and writes it out in your desired number of files or output format.
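A minimal sketch of that idea, assuming a Bash environment with GNU coreutils. The hive -e line is illustrative and commented out; seq stands in for the query output so the sketch is self-contained:

```shell
#!/bin/sh
# In practice you would stream the rows out of Hive, e.g.:
#   hive -e "SELECT * FROM test" > table_dump.txt
# Here, sample rows stand in for the query output.
seq 1 5000 > table_dump.txt

# Split into exactly 5 files without breaking rows across files (GNU split):
# -n l/5 -> 5 line-based chunks, -d -> numeric suffixes (part_00 ... part_04)
split -n l/5 -d table_dump.txt part_

ls part_*
```

The split count plays the same role as the numfile knob asked about above; pick any number of chunks, then upload them back to HDFS with hdfs dfs -put if needed.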