Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Getting exception while inserting the file having multi delimiters in a Hive table

avatar
Super Collaborator

Hi guys,

I have an input file which looks like:

1:Washington Berry Juice   1356:Carrington Frozen Corn-41   446:Red Wing Plastic Knives-39   1133:Tri-State Almonds-41   1252:Skinner Strawberry Drink-39   868:Nationeel Raspberry Fruit Roll-39   360:Carlson Low Fat String Cheese-38   
2:Washington Mango Drink   233:Best Choice Avocado Dip-61   1388:Sunset Paper Plates-63   878:Thresher Semi-Sweet Chocolate Bar-63   529:Fast BBQ Potato Chips-62   382:Moms Roasted Chicken-631	191:Musial Tasty Candy Bar-62  

This is the output from user recommendation engine. The first pair is the main product ID and name. Next 6 are ProductId:Name:Count and all the 6 products are delimited by tab.

I want to load this data in a Hive table. As you can see here, there are multi delimeters, so I created a temporary table first having only one string column and then inserted this file. Next, i created a final table having the correct attributes and data types. Now when I am inserting the data using regular expression by running the query:

insert overwrite table recommendation SELECT  
regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) productId,
regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) productName,  
regexp_extract(col_value, '^(?:([^,]*),?){3}', 1) productId1,  
regexp_extract(col_value, '^(?:([^,]*),?){4}', 1) productName1,  
regexp_extract(col_value, '^(?:([^,]*),?){5}', 1) productCount1,  
regexp_extract(col_value, '^(?:([^,]*),?){6}', 1) productId2,  
regexp_extract(col_value, '^(?:([^,]*),?){7}', 1) productName2,  
regexp_extract(col_value, '^(?:([^,]*),?){8}', 1) productCount2,  
regexp_extract(col_value, '^(?:([^,]*),?){9}', 1) productId3,  
regexp_extract(col_value, '^(?:([^,]*),?){10}', 1) productName3,  
regexp_extract(col_value, '^(?:([^,]*),?){11}', 1) productCount3,  
regexp_extract(col_value, '^(?:([^,]*),?){12}', 1) productId4,  
regexp_extract(col_value, '^(?:([^,]*),?){13}', 1) productName4,  
regexp_extract(col_value, '^(?:([^,]*),?){14}', 1) productCount4,  
regexp_extract(col_value, '^(?:([^,]*),?){15}', 1) productId5,  
regexp_extract(col_value, '^(?:([^,]*),?){16}', 1) productName5,  
regexp_extract(col_value, '^(?:([^,]*),?){17}', 1) productCount5,  
regexp_extract(col_value, '^(?:([^,]*),?){18}', 1) productId6,  
regexp_extract(col_value, '^(?:([^,]*),?){19}', 1) productName6,  
regexp_extract(col_value, '^(?:([^,]*),?){20}', 1) productCount6
from temp_recommendation;

I am getting this exception:

FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org.apache.hadoop.mapreduce.v2.util.MRApps.addLog4jSystemProperties(Lorg/apache/hadoop/mapred/Task;Ljava/util/List;Lorg/apache/hadoop/conf/Configuration;)V

There are no logs generated and this is a pseudo distributed machine. Is this method wrong for handling multi delimiters or is there any other other better way? Thanks in advance

1 ACCEPTED SOLUTION

avatar
Super Collaborator

@Mushtaq Rizvi,

you can define tab delimited table. This will give you already 7 columns, without any regex for them.

then, in order to extract particular parts from each product column use "(\d+):([^-\t]+)-(\d+)" as regex (except for the first one, which would be just "(\d+):([^-\t]+)"...)

or you can also use split instead of regex.

insert overwrite table recommendation SELECT  
regexp_extract(col_p, '(\d+):', 1) 		productId,
regexp_extract(col_p, ':(.+)', 1) 		productName,  
regexp_extract(col_p1, '(\d+):', 1) 	productId1,  
regexp_extract(col_p1, ':([^-]+)', 1) 	productName1,  
regexp_extract(col_p1, '-(.+)', 1) 		productCount1,  
regexp_extract(col_p2, '(\d+):', 1) 	 productId2,  
regexp_extract(col_p2, ':([^-]+)', 1)  productName2,  
regexp_extract(col_p2, '-(.+)', 1) 		 productCount2,  
regexp_extract(col_p3, '(\d+):', 1) 	 productId3,  
regexp_extract(col_p3, ':([^-]+)', 1)  productName3,  
regexp_extract(col_p3, '-(.+)', 1) 		 productCount3,  
regexp_extract(col_p4, '(\d+):', 1) 	 productId4,  
regexp_extract(col_p4, ':([^-]+)', 1)  productName4,  
regexp_extract(col_p4, '-(.+)', 1) 		 productCount4,  
regexp_extract(col_p5, '(\d+):', 1) 	 productId5,  
regexp_extract(col_p5, ':([^-]+)', 1)  productName5,  
regexp_extract(col_p5, '-(.+)', 1) 		 productCount5,  
regexp_extract(col_p6, '(\d+):', 1) 	 productId6,  
regexp_extract(col_p6, ':([^-]+)', 1)  productName6,  
regexp_extract(col_p6, '-(.+)', 1) 		 productCount6
from temp_recommendation;

View solution in original post

5 REPLIES 5

avatar
Super Collaborator

@Mushtaq Rizvi,

you can define tab delimited table. This will give you already 7 columns, without any regex for them.

then, in order to extract particular parts from each product column use "(\d+):([^-\t]+)-(\d+)" as regex (except for the first one, which would be just "(\d+):([^-\t]+)"...)

or you can also use split instead of regex.

insert overwrite table recommendation SELECT  
regexp_extract(col_p, '(\d+):', 1) 		productId,
regexp_extract(col_p, ':(.+)', 1) 		productName,  
regexp_extract(col_p1, '(\d+):', 1) 	productId1,  
regexp_extract(col_p1, ':([^-]+)', 1) 	productName1,  
regexp_extract(col_p1, '-(.+)', 1) 		productCount1,  
regexp_extract(col_p2, '(\d+):', 1) 	 productId2,  
regexp_extract(col_p2, ':([^-]+)', 1)  productName2,  
regexp_extract(col_p2, '-(.+)', 1) 		 productCount2,  
regexp_extract(col_p3, '(\d+):', 1) 	 productId3,  
regexp_extract(col_p3, ':([^-]+)', 1)  productName3,  
regexp_extract(col_p3, '-(.+)', 1) 		 productCount3,  
regexp_extract(col_p4, '(\d+):', 1) 	 productId4,  
regexp_extract(col_p4, ':([^-]+)', 1)  productName4,  
regexp_extract(col_p4, '-(.+)', 1) 		 productCount4,  
regexp_extract(col_p5, '(\d+):', 1) 	 productId5,  
regexp_extract(col_p5, ':([^-]+)', 1)  productName5,  
regexp_extract(col_p5, '-(.+)', 1) 		 productCount5,  
regexp_extract(col_p6, '(\d+):', 1) 	 productId6,  
regexp_extract(col_p6, ':([^-]+)', 1)  productName6,  
regexp_extract(col_p6, '-(.+)', 1) 		 productCount6
from temp_recommendation;

avatar
Super Collaborator

@Mushtaq Rizvi,

if that worked for you, please accept the answer so your question will be marked as resolved. Thanks.

avatar
Super Collaborator

Thanks for the response. I am still getting the same exception while doing regexp_extract.

avatar
Super Collaborator

Just saw that any query in Hive which involves map reduce job is giving the same exception.

avatar
Super Collaborator

Got it right. Actually whenever I was starting my hive shell, I was getting this warning:

Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

So I installed Tez(version 0.8.5) and changed the execution engine of Hive to Tez. Now all Hive queries that involve MapReduce job are running. My hive version is 2.1.1, that I guess do not work with MapReduce

As per the regex, thanks alot @Ed Berezitsky,

Those regex worked.