Created 04-18-2017 01:50 AM
Hi guys,
I have an input file which looks like:
1:Washington Berry Juice 1356:Carrington Frozen Corn-41 446:Red Wing Plastic Knives-39 1133:Tri-State Almonds-41 1252:Skinner Strawberry Drink-39 868:Nationeel Raspberry Fruit Roll-39 360:Carlson Low Fat String Cheese-38 2:Washington Mango Drink 233:Best Choice Avocado Dip-61 1388:Sunset Paper Plates-63 878:Thresher Semi-Sweet Chocolate Bar-63 529:Fast BBQ Potato Chips-62 382:Moms Roasted Chicken-631 191:Musial Tasty Candy Bar-62
This is the output from user recommendation engine. The first pair is the main product ID and name. Next 6 are ProductId:Name:Count and all the 6 products are delimited by tab.
I want to load this data in a Hive table. As you can see here, there are multi delimeters, so I created a temporary table first having only one string column and then inserted this file. Next, i created a final table having the correct attributes and data types. Now when I am inserting the data using regular expression by running the query:
insert overwrite table recommendation SELECT regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) productId, regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) productName, regexp_extract(col_value, '^(?:([^,]*),?){3}', 1) productId1, regexp_extract(col_value, '^(?:([^,]*),?){4}', 1) productName1, regexp_extract(col_value, '^(?:([^,]*),?){5}', 1) productCount1, regexp_extract(col_value, '^(?:([^,]*),?){6}', 1) productId2, regexp_extract(col_value, '^(?:([^,]*),?){7}', 1) productName2, regexp_extract(col_value, '^(?:([^,]*),?){8}', 1) productCount2, regexp_extract(col_value, '^(?:([^,]*),?){9}', 1) productId3, regexp_extract(col_value, '^(?:([^,]*),?){10}', 1) productName3, regexp_extract(col_value, '^(?:([^,]*),?){11}', 1) productCount3, regexp_extract(col_value, '^(?:([^,]*),?){12}', 1) productId4, regexp_extract(col_value, '^(?:([^,]*),?){13}', 1) productName4, regexp_extract(col_value, '^(?:([^,]*),?){14}', 1) productCount4, regexp_extract(col_value, '^(?:([^,]*),?){15}', 1) productId5, regexp_extract(col_value, '^(?:([^,]*),?){16}', 1) productName5, regexp_extract(col_value, '^(?:([^,]*),?){17}', 1) productCount5, regexp_extract(col_value, '^(?:([^,]*),?){18}', 1) productId6, regexp_extract(col_value, '^(?:([^,]*),?){19}', 1) productName6, regexp_extract(col_value, '^(?:([^,]*),?){20}', 1) productCount6 from temp_recommendation;
I am getting this exception:
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org.apache.hadoop.mapreduce.v2.util.MRApps.addLog4jSystemProperties(Lorg/apache/hadoop/mapred/Task;Ljava/util/List;Lorg/apache/hadoop/conf/Configuration;)V
There are no logs generated and this is a pseudo distributed machine. Is this method wrong for handling multi delimiters or is there any other other better way? Thanks in advance
Created 04-20-2017 12:59 PM
you can define tab delimited table. This will give you already 7 columns, without any regex for them.
then, in order to extract particular parts from each product column use "(\d+):([^-\t]+)-(\d+)" as regex (except for the first one, which would be just "(\d+):([^-\t]+)"...)
or you can also use split instead of regex.
insert overwrite table recommendation SELECT regexp_extract(col_p, '(\d+):', 1) productId, regexp_extract(col_p, ':(.+)', 1) productName, regexp_extract(col_p1, '(\d+):', 1) productId1, regexp_extract(col_p1, ':([^-]+)', 1) productName1, regexp_extract(col_p1, '-(.+)', 1) productCount1, regexp_extract(col_p2, '(\d+):', 1) productId2, regexp_extract(col_p2, ':([^-]+)', 1) productName2, regexp_extract(col_p2, '-(.+)', 1) productCount2, regexp_extract(col_p3, '(\d+):', 1) productId3, regexp_extract(col_p3, ':([^-]+)', 1) productName3, regexp_extract(col_p3, '-(.+)', 1) productCount3, regexp_extract(col_p4, '(\d+):', 1) productId4, regexp_extract(col_p4, ':([^-]+)', 1) productName4, regexp_extract(col_p4, '-(.+)', 1) productCount4, regexp_extract(col_p5, '(\d+):', 1) productId5, regexp_extract(col_p5, ':([^-]+)', 1) productName5, regexp_extract(col_p5, '-(.+)', 1) productCount5, regexp_extract(col_p6, '(\d+):', 1) productId6, regexp_extract(col_p6, ':([^-]+)', 1) productName6, regexp_extract(col_p6, '-(.+)', 1) productCount6 from temp_recommendation;
Created 04-20-2017 12:59 PM
you can define tab delimited table. This will give you already 7 columns, without any regex for them.
then, in order to extract particular parts from each product column use "(\d+):([^-\t]+)-(\d+)" as regex (except for the first one, which would be just "(\d+):([^-\t]+)"...)
or you can also use split instead of regex.
insert overwrite table recommendation SELECT regexp_extract(col_p, '(\d+):', 1) productId, regexp_extract(col_p, ':(.+)', 1) productName, regexp_extract(col_p1, '(\d+):', 1) productId1, regexp_extract(col_p1, ':([^-]+)', 1) productName1, regexp_extract(col_p1, '-(.+)', 1) productCount1, regexp_extract(col_p2, '(\d+):', 1) productId2, regexp_extract(col_p2, ':([^-]+)', 1) productName2, regexp_extract(col_p2, '-(.+)', 1) productCount2, regexp_extract(col_p3, '(\d+):', 1) productId3, regexp_extract(col_p3, ':([^-]+)', 1) productName3, regexp_extract(col_p3, '-(.+)', 1) productCount3, regexp_extract(col_p4, '(\d+):', 1) productId4, regexp_extract(col_p4, ':([^-]+)', 1) productName4, regexp_extract(col_p4, '-(.+)', 1) productCount4, regexp_extract(col_p5, '(\d+):', 1) productId5, regexp_extract(col_p5, ':([^-]+)', 1) productName5, regexp_extract(col_p5, '-(.+)', 1) productCount5, regexp_extract(col_p6, '(\d+):', 1) productId6, regexp_extract(col_p6, ':([^-]+)', 1) productName6, regexp_extract(col_p6, '-(.+)', 1) productCount6 from temp_recommendation;
Created 04-21-2017 08:15 PM
if that worked for you, please accept the answer so your question will be marked as resolved. Thanks.
Created 04-22-2017 07:17 PM
Thanks for the response. I am still getting the same exception while doing regexp_extract.
Created 04-22-2017 07:19 PM
Just saw that any query in Hive which involves map reduce job is giving the same exception.
Created 04-22-2017 10:25 PM
Got it right. Actually whenever I was starting my hive shell, I was getting this warning:
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
So I installed Tez(version 0.8.5) and changed the execution engine of Hive to Tez. Now all Hive queries that involve MapReduce job are running. My hive version is 2.1.1, that I guess do not work with MapReduce
As per the regex, thanks alot @Ed Berezitsky,