Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Getting exception while inserting the file having multi delimiters in a Hive table

avatar
Super Collaborator

Hi guys,

I have an input file which looks like:

1:Washington Berry Juice   1356:Carrington Frozen Corn-41   446:Red Wing Plastic Knives-39   1133:Tri-State Almonds-41   1252:Skinner Strawberry Drink-39   868:Nationeel Raspberry Fruit Roll-39   360:Carlson Low Fat String Cheese-38   
2:Washington Mango Drink   233:Best Choice Avocado Dip-61   1388:Sunset Paper Plates-63   878:Thresher Semi-Sweet Chocolate Bar-63   529:Fast BBQ Potato Chips-62   382:Moms Roasted Chicken-631	191:Musial Tasty Candy Bar-62  

This is the output from user recommendation engine. The first pair is the main product ID and name. Next 6 are ProductId:Name:Count and all the 6 products are delimited by tab.

I want to load this data in a Hive table. As you can see here, there are multi delimeters, so I created a temporary table first having only one string column and then inserted this file. Next, i created a final table having the correct attributes and data types. Now when I am inserting the data using regular expression by running the query:

insert overwrite table recommendation SELECT  
regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) productId,
regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) productName,  
regexp_extract(col_value, '^(?:([^,]*),?){3}', 1) productId1,  
regexp_extract(col_value, '^(?:([^,]*),?){4}', 1) productName1,  
regexp_extract(col_value, '^(?:([^,]*),?){5}', 1) productCount1,  
regexp_extract(col_value, '^(?:([^,]*),?){6}', 1) productId2,  
regexp_extract(col_value, '^(?:([^,]*),?){7}', 1) productName2,  
regexp_extract(col_value, '^(?:([^,]*),?){8}', 1) productCount2,  
regexp_extract(col_value, '^(?:([^,]*),?){9}', 1) productId3,  
regexp_extract(col_value, '^(?:([^,]*),?){10}', 1) productName3,  
regexp_extract(col_value, '^(?:([^,]*),?){11}', 1) productCount3,  
regexp_extract(col_value, '^(?:([^,]*),?){12}', 1) productId4,  
regexp_extract(col_value, '^(?:([^,]*),?){13}', 1) productName4,  
regexp_extract(col_value, '^(?:([^,]*),?){14}', 1) productCount4,  
regexp_extract(col_value, '^(?:([^,]*),?){15}', 1) productId5,  
regexp_extract(col_value, '^(?:([^,]*),?){16}', 1) productName5,  
regexp_extract(col_value, '^(?:([^,]*),?){17}', 1) productCount5,  
regexp_extract(col_value, '^(?:([^,]*),?){18}', 1) productId6,  
regexp_extract(col_value, '^(?:([^,]*),?){19}', 1) productName6,  
regexp_extract(col_value, '^(?:([^,]*),?){20}', 1) productCount6
from temp_recommendation;

I am getting this exception:

FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org.apache.hadoop.mapreduce.v2.util.MRApps.addLog4jSystemProperties(Lorg/apache/hadoop/mapred/Task;Ljava/util/List;Lorg/apache/hadoop/conf/Configuration;)V

There are no logs generated and this is a pseudo distributed machine. Is this method wrong for handling multi delimiters or is there any other other better way? Thanks in advance

1 ACCEPTED SOLUTION

avatar
Super Collaborator

@Mushtaq Rizvi,

you can define tab delimited table. This will give you already 7 columns, without any regex for them.

then, in order to extract particular parts from each product column use "(\d+):([^-\t]+)-(\d+)" as regex (except for the first one, which would be just "(\d+):([^-\t]+)"...)

or you can also use split instead of regex.

insert overwrite table recommendation SELECT  
regexp_extract(col_p, '(\d+):', 1) 		productId,
regexp_extract(col_p, ':(.+)', 1) 		productName,  
regexp_extract(col_p1, '(\d+):', 1) 	productId1,  
regexp_extract(col_p1, ':([^-]+)', 1) 	productName1,  
regexp_extract(col_p1, '-(.+)', 1) 		productCount1,  
regexp_extract(col_p2, '(\d+):', 1) 	 productId2,  
regexp_extract(col_p2, ':([^-]+)', 1)  productName2,  
regexp_extract(col_p2, '-(.+)', 1) 		 productCount2,  
regexp_extract(col_p3, '(\d+):', 1) 	 productId3,  
regexp_extract(col_p3, ':([^-]+)', 1)  productName3,  
regexp_extract(col_p3, '-(.+)', 1) 		 productCount3,  
regexp_extract(col_p4, '(\d+):', 1) 	 productId4,  
regexp_extract(col_p4, ':([^-]+)', 1)  productName4,  
regexp_extract(col_p4, '-(.+)', 1) 		 productCount4,  
regexp_extract(col_p5, '(\d+):', 1) 	 productId5,  
regexp_extract(col_p5, ':([^-]+)', 1)  productName5,  
regexp_extract(col_p5, '-(.+)', 1) 		 productCount5,  
regexp_extract(col_p6, '(\d+):', 1) 	 productId6,  
regexp_extract(col_p6, ':([^-]+)', 1)  productName6,  
regexp_extract(col_p6, '-(.+)', 1) 		 productCount6
from temp_recommendation;

View solution in original post

5 REPLIES 5

avatar
Super Collaborator

@Mushtaq Rizvi,

you can define tab delimited table. This will give you already 7 columns, without any regex for them.

then, in order to extract particular parts from each product column use "(\d+):([^-\t]+)-(\d+)" as regex (except for the first one, which would be just "(\d+):([^-\t]+)"...)

or you can also use split instead of regex.

insert overwrite table recommendation SELECT  
regexp_extract(col_p, '(\d+):', 1) 		productId,
regexp_extract(col_p, ':(.+)', 1) 		productName,  
regexp_extract(col_p1, '(\d+):', 1) 	productId1,  
regexp_extract(col_p1, ':([^-]+)', 1) 	productName1,  
regexp_extract(col_p1, '-(.+)', 1) 		productCount1,  
regexp_extract(col_p2, '(\d+):', 1) 	 productId2,  
regexp_extract(col_p2, ':([^-]+)', 1)  productName2,  
regexp_extract(col_p2, '-(.+)', 1) 		 productCount2,  
regexp_extract(col_p3, '(\d+):', 1) 	 productId3,  
regexp_extract(col_p3, ':([^-]+)', 1)  productName3,  
regexp_extract(col_p3, '-(.+)', 1) 		 productCount3,  
regexp_extract(col_p4, '(\d+):', 1) 	 productId4,  
regexp_extract(col_p4, ':([^-]+)', 1)  productName4,  
regexp_extract(col_p4, '-(.+)', 1) 		 productCount4,  
regexp_extract(col_p5, '(\d+):', 1) 	 productId5,  
regexp_extract(col_p5, ':([^-]+)', 1)  productName5,  
regexp_extract(col_p5, '-(.+)', 1) 		 productCount5,  
regexp_extract(col_p6, '(\d+):', 1) 	 productId6,  
regexp_extract(col_p6, ':([^-]+)', 1)  productName6,  
regexp_extract(col_p6, '-(.+)', 1) 		 productCount6
from temp_recommendation;

avatar
Super Collaborator

@Mushtaq Rizvi,

if that worked for you, please accept the answer so your question will be marked as resolved. Thanks.

avatar
Super Collaborator

Thanks for the response. I am still getting the same exception while doing regexp_extract.

avatar
Super Collaborator

Just saw that any query in Hive which involves map reduce job is giving the same exception.

avatar
Super Collaborator

Got it right. Actually whenever I was starting my hive shell, I was getting this warning:

Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

So I installed Tez(version 0.8.5) and changed the execution engine of Hive to Tez. Now all Hive queries that involve MapReduce job are running. My hive version is 2.1.1, that I guess do not work with MapReduce

As per the regex, thanks alot @Ed Berezitsky,

Those regex worked.