How are the files loaded into a Hive table determined?
Labels: Apache Hadoop, Apache Hive
Created ‎05-11-2017 01:56 PM
Consider that I have a file of size 150 MB and I am loading it into a Hive table. The block size in HDFS is 128 MB.
Now, how will the files be laid out underneath the Hive table?
I believe the file will be split and loaded as 0000_0, 0000_1, etc.
Why is it split into multiple chunks? Does each file represent the block size?
Do the block size and the mapred split size have anything to do with the file sizes in Hive?
If I alter the mapred split size, will the file sizes underneath the Hive table change?
Do we have any control over the number of files created in Hive while loading? I understand that a merge MapReduce job can reduce or increase it. Say I need exactly 10 files, and no more, to be created while loading a Hive table. Is that even possible?
Thanks in advance.
Created ‎05-12-2017 07:49 AM
@mqureshi @Neeraj Sabharwal @Jay SenSharma Could anyone help me on this please? Thanks in advance!
Created ‎05-12-2017 10:45 AM
It depends on the number of mappers. For example, if you are importing files from MySQL into Hive, the command would be:
sqoop import --connect jdbc:mysql://ipaddress/databasename --username abc --password abc --table tablename --driver com.mysql.jdbc.Driver --hive-table hivetablename --hive-import -m 1
Here the -m parameter, which defines the number of mappers, also determines the number of files created. In my example I passed -m 1, so only one file is created after the import.
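As a hypothetical variation of the same command (the column name id used for --split-by is only a placeholder), asking for four mappers would normally leave four files under the Hive table:
sqoop import --connect jdbc:mysql://ipaddress/databasename \
  --username abc --password abc \
  --table tablename --driver com.mysql.jdbc.Driver \
  --hive-table hivetablename --hive-import \
  --num-mappers 4 --split-by id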
Created ‎05-12-2017 03:44 PM
@Mohit Varshney I understand that when I ingest data through Sqoop, the number of mappers decides the number of files ingested into the Hive table. I just wanted to understand the same when I am loading data between Hive tables. Thanks.
Created ‎05-12-2017 02:23 PM
Why is it split into multiple chunks? Does each file represent the block size?
Are you importing the data using Sqoop? In that case it is based on the --num-mappers (-m) argument, and the answer is no: the number of files just depends on the number of mappers, and the data is split by the primary key unless you specify --split-by on a different column.
If you are not using Sqoop, then Hive uses mapreduce.input.fileinputformat.split.minsize. Since Hive 0.13, Hive uses org.apache.hadoop.hive.ql.io.CombineHiveInputFormat by default for hive.input.format. This input format also combines files that are smaller than mapreduce.input.fileinputformat.split.minsize, provided the data is not on different nodes.
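For an INSERT between Hive tables, a minimal sketch of how these properties could be set (the table names and the 256 MB / 128 MB values are placeholders, not recommendations):
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;  -- default since Hive 0.13
SET mapreduce.input.fileinputformat.split.maxsize=268435456;  -- 256 MB upper bound per split
SET mapreduce.input.fileinputformat.split.minsize=134217728;  -- 128 MB lower bound per split
INSERT OVERWRITE TABLE target_table SELECT * FROM source_table;  -- a map-only query writes one file per mapper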
Do the block size and the mapred split size have anything to do with the file sizes in Hive?
Here is the formula used to calculate the split size:
max(mapreduce.input.fileinputformat.split.minsize, min(mapreduce.input.fileinputformat.split.maxsize, dfs.block.size))
Now, based on the formula above, you should be able to tell how many files will be generated. Hopefully this answers your other questions too. Please let me know if you have a follow-up question.
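To tie this back to the 150 MB file in the original question, assuming split.minsize is left at its small default and split.maxsize is not set below the 128 MB block size:
max(minsize, min(maxsize, 128 MB)) = 128 MB per split
ceil(150 MB / 128 MB) = 2 splits
so a map-only load between tables would write 2 files underneath the table.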
Created ‎05-12-2017 03:42 PM
Thanks @mqureshi. Just wanted to confirm that my understanding is correct. Consider an input file size of 300 MB, dfs block size = 256 MB, mapreduce.input.fileinputformat.split.maxsize = 256 MB and mapreduce.input.fileinputformat.split.minsize = 128 MB. If a Hive command is triggered to load from one Hive table to another, then 2 map tasks will be triggered, and after loading the Hive table I should be able to see 2 files, since only 2 map tasks ran.
Adding one more question on top of that: what would be the maximum size of a file stored in a Hive table? I believe it should be equal to the split size of the MapReduce job. Please correct me if I am wrong.
I have loaded the Hive table and the number of files underneath the table is 6. Out of the 6, one file has a size of 10 MB. I loaded one more set of files into the same table and it created 2 more files, of which one is 20 MB. Now there are 2 files (10 MB and 20 MB), each stored in its own block, wasting almost 100 MB of a block. Is there a way they can be combined and stored in one block?
Thanks in advance for helping me understand this.
Created ‎05-12-2017 04:01 PM
You have a file size of 300 MB, and both the block size and the split max size are set to 256 MB; the min size is 128 MB.
In this case the split size will be 256 MB, so yes, you should see 2 mappers (it is just one MapReduce job) and you should see two files.
As for the maximum file size, based on the formula above, in this case it would be 256 MB.
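Plugging those numbers into the formula from the earlier reply:
max(128 MB, min(256 MB, 256 MB)) = 256 MB per split
ceil(300 MB / 256 MB) = 2 splits, hence 2 mappers and 2 output files.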
Note that so far we have been assuming map-only jobs. If you have reducers running, then you should also look at
hive.exec.max.created.files, mapred.reduce.tasks, and hive.exec.reducers.bytes.per.reducer. Check the following link; you will have to play around a little to understand the different scenarios.
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
As for your merge question, simply use hive.merge.mapfiles (for map-only jobs) or hive.merge.mapredfiles (for MapReduce jobs) to merge small output files; a minimal sketch follows.
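A minimal sketch, assuming an INSERT between tables (the table names are placeholders and the size values are only examples; check the configuration page linked above for the defaults in your Hive version):
SET hive.merge.mapfiles=true;               -- merge small files produced by map-only jobs
SET hive.merge.mapredfiles=true;            -- merge small files produced by jobs with a reduce phase
SET hive.merge.size.per.task=268435456;     -- target size of the merged files (256 MB here)
SET hive.merge.smallfiles.avgsize=16000000; -- trigger the merge pass when the average output file is below this
SET mapred.reduce.tasks=10;                 -- caps reducer output files, only when the query has a reduce phase
INSERT OVERWRITE TABLE target_table SELECT * FROM source_table;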
Created 05-12-2017 04:18 PM
Thanks a ton @mqureshi.
Created ‎05-12-2017 05:24 PM
The number of mappers involved in a job = the number of input splits, and the number of input splits depends on your block size and file size. If the file size is 256 MB and the block size is 128 MB, it will involve 2 mappers. @Bala Vignesh N V
