Support Questions

How are the files loaded into a Hive table determined?

Solved - Go to solution

Suppose I have a file of size 150 MB and I'm loading it into a Hive table. The block size in HDFS is 128 MB.

Now, how will the files be laid out underneath the Hive table?

I believe the file will be split and loaded as 0000_0, 0000_1, etc.

Why is it split into multiple chunks? Does each file represent the block size?

Do the block size and the mapred split size have anything to do with the file size in Hive?

If I alter the mapred split size, will the file sizes underneath Hive change?

Do we have any control over the number of files created in Hive while loading? I understand that through a merge MapReduce job we can reduce or increase it. Say I need exactly 10 files, and no more, to be created while loading a Hive table. Is that even possible?

Thanks in advance

1 ACCEPTED SOLUTION


Re: How are the files loaded into a Hive table determined?

Super Guru

@Bala Vignesh N V

Why is it split into multiple chunks? Does each file represent the block size?

Are you importing data using Sqoop? In that case, it is based on the --num-mappers (-m) argument. So in this case, no: each file does not represent a block. It just depends on the number of mappers, and the data is split by the primary key unless you specify --split-by on a different column.

If you are not using Sqoop, then Hive uses mapreduce.input.fileinputformat.split.minsize. Since Hive 0.13, Hive uses org.apache.hadoop.hive.ql.io.CombineHiveInputFormat as the default for hive.input.format. This input format also combines files that are smaller than mapreduce.input.fileinputformat.split.minsize, assuming the data is not on different nodes.

Do the block size and the mapred split size have anything to do with the file size in Hive?

Here is the formula to calculate the split size:

max(mapreduce.input.fileinputformat.split.minsize, min(mapreduce.input.fileinputformat.split.maxsize, dfs.block.size))

Based on the formula above, you should be able to tell how many files will be generated. Hopefully this answers your other questions too. Please let me know if you have a follow-up question.
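As a sketch (not Hive source code), the split-size formula above can be applied to the 150 MB file from the question. The default value assumed for split.minsize here (1 byte) and the 256 MB assumed for split.maxsize are illustrative, not authoritative:

```python
import math

MB = 1024 * 1024

def split_size(min_size, max_size, block_size):
    # max(mapreduce.input.fileinputformat.split.minsize,
    #     min(mapreduce.input.fileinputformat.split.maxsize, dfs.block.size))
    return max(min_size, min(max_size, block_size))

def num_splits(file_size, split):
    # each split becomes (roughly) one mapper / one output file in a map-only job
    return math.ceil(file_size / split)

# Assumed settings: minsize = 1 byte, maxsize = 256 MB, HDFS block size = 128 MB
split = split_size(1, 256 * MB, 128 * MB)        # 128 MB
files = num_splits(150 * MB, split)              # 2: the 150 MB file yields 2 splits
print(split // MB, files)
```

This matches the scenario in the question: with a 128 MB block size, a 150 MB file is read as 2 splits, so a map-only load produces 2 output files.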

8 REPLIES

Re: How are the files loaded into a Hive table determined?

@mqureshi @Neeraj Sabharwal @Jay SenSharma Could anyone help me on this please? Thanks in advance!

Re: How are the files loaded into a Hive table determined?

It depends on the number of mappers. For example, if you are importing files from MySQL into Hive, the command would be:

sqoop import --connect jdbc:mysql://ipaddress/databasename --username abc --password abc --table tablename --driver com.mysql.jdbc.Driver --hive-table hivetablename --hive-import -m 1

Here the -m parameter, which defines the number of mappers, determines the number of files created. In my example I passed -m 1, so only one file is created after the import.

Re: How are the files loaded into a Hive table determined?

@Mohit Varshney I understand that when I ingest data through Sqoop, the number of mappers decides the number of files written into the Hive table. I just wanted to understand the same thing when loading data between Hive tables. Thanks


Re: How are the files loaded into a Hive table determined?

Thanks @mqureshi. Just wanted to confirm that my understanding is correct. Suppose the input file size is 300 MB, dfs.block.size = 256 MB, mapreduce.input.fileinputformat.split.maxsize = 256 MB, and mapreduce.input.fileinputformat.split.minsize = 128 MB. If a Hive command is triggered to load from one Hive table to another, then 2 map tasks will be launched, and after loading the Hive table I should see 2 files, since only 2 map tasks ran.

Adding one more question on top of that: what would be the maximum size of a file stored in a Hive table? I believe it should be equal to the split size. Please correct me if I'm wrong.

I have loaded the Hive table, and the number of files underneath the table is 6. Out of the 6, one file has a size of 10 MB. I loaded one more set of files into the same table and it created 2 more files, one of which is 20 MB. Now there are 2 files (10 MB and 20 MB) stored as blocks, wasting almost 100 MB of a block. Is there a way they can be combined and stored in one block?

Thanks in advance for helping me out in understanding it.

Re: How are the files loaded into a Hive table determined?

Super Guru
@Bala Vignesh N V

You have a file of 300 MB, and both the block size and the split max size are set to 256 MB; the min size is 128 MB.

In this case the split size will be 256 MB, so yes, you should see 2 mappers (it's just one MapReduce job) and you should see two files.

As for the max size, based on the formula above, in this case it would be 256 MB.
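Checking those numbers with a quick sketch of the split-size formula (not Hive source code; sizes in bytes):

```python
import math

MB = 1024 * 1024

# Scenario from the question: 300 MB file, 256 MB block size,
# split.maxsize = 256 MB, split.minsize = 128 MB
split = max(128 * MB, min(256 * MB, 256 * MB))  # split size: 256 MB
mappers = math.ceil(300 * MB / split)           # 2 mappers -> 2 output files
print(split // MB, mappers)
```

So the two output files would be roughly 256 MB and 44 MB, and 256 MB is also the maximum file size you would see in this case.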

Note that so far we have been assuming map-only jobs. If you have reducers running, then you should also look at:

hive.exec.max.created.files, mapred.reduce.tasks, hive.exec.reducers.bytes.per.reducer

Check the following link; you will have to experiment a little to understand the different scenarios.

https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties

As for your merge question, simply use hive.merge.mapfiles (for map-only jobs) or hive.merge.mapredfiles (for MapReduce jobs) to merge small files.
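A minimal sketch of what enabling those merge settings might look like in a Hive session; the size values shown are illustrative assumptions, not recommendations:

```sql
-- Merge small output files after map-only jobs
SET hive.merge.mapfiles=true;
-- Merge small output files after MapReduce jobs
SET hive.merge.mapredfiles=true;
-- Target size of each merged file (example value)
SET hive.merge.size.per.task=256000000;
-- Trigger a merge job when the average output file is smaller than this (example value)
SET hive.merge.smallfiles.avgsize=16000000;
```

With these set, Hive launches an extra merge step after the load when the average output file size falls below the avgsize threshold, which would address the 10 MB and 20 MB files in your example.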

Re: How are the files loaded into a Hive table determined?

Thanks a ton @mqureshi.

Re: How are the files loaded into a Hive table determined?

Rising Star

The number of mappers involved in a job = the number of input splits, and the number of input splits depends on your block size and file size. If the file size is 256 MB and the block size is 128 MB, it will involve 2 mappers. @Bala Vignesh N V