REGEX tables with Hive fail map reduce
Created on ‎02-24-2015 07:02 AM - edited ‎09-16-2022 02:22 AM
I've spent a lot of time over the past week trying to get regex tables to work in Hive. The table would create fine, but whenever I ran a query that required MapReduce, the job would blow up. The issue is that, for some reason, the core and task nodes don't have the jar file that contains the org.apache.hadoop.hive.contrib.serde2.RegexSerDe class on their classpath. The same basic issue exists on AWS EMR. I'm not sure whether there is an issue with the way I'm using the class or whether this is a bug in the configuration. My solution on CDH5 is below. On EMR, I had to create a script, run as a bootstrap action, that copies the jar to the appropriate path (I needed to copy other jars as well, so I decided to go this way).
Please comment if you have any suggestions or other solutions.
Thanks!
Scott
Solution:
FIELD1 STRING
, FIELD2 STRING
, FIELD3 STRING
, FIELD4 STRING
, FIELD5 STRING
, FIELD6 STRING
, FIELD7 STRING
, FIELD8 STRING
, FIELD9 STRING
, FIELD10 STRING
, FIELD11 STRING
, FIELD12 STRING
, FIELD13 STRING
, FIELD14 STRING
, FIELD15 STRING
, FIELD16 STRING
, FIELD17 STRING
, FIELD18 STRING
, FIELD19 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex"="(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endr\\*+(.*$)"
)
STORED AS TEXTFILE
LOCATION '/user/jsperson/testdata/'
TBLPROPERTIES("skip.header.line.count"="1", "serialization.null.format"='');
Created ‎02-24-2015 10:56 AM
Created ‎02-24-2015 09:21 AM
Created ‎02-24-2015 10:45 AM
Thanks for the suggestion. I thought about add jar, but reminding all users to add the jar was problematic, and then getting that solution to work with ODBC had me stumped.
I guess my main question is why do we have to do this at all? Since this is fundamental Hive functionality, it would seem like the jar would be available to all nodes. Further, making it available only to the master is a curious configuration. Is there some reason why I would not want this jar in the lib directory all the time?
Thanks again!
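For anyone following along, the per-session `ADD JAR` approach mentioned above would look roughly like the sketch below. The jar path and the table name are placeholders, not values from this thread, so substitute whatever applies on your cluster.

```sql
-- Hypothetical path: point this at the hive-contrib jar on your cluster.
ADD JAR /path/to/hive-contrib.jar;

-- Confirm that the jar is registered for the current session.
LIST JARS;

-- A query that launches MapReduce should now ship the jar to the task nodes.
-- my_regex_table is a placeholder for a table defined with RegexSerDe.
SELECT COUNT(*) FROM my_regex_table;
```

This has to be repeated in every session, which is exactly the drawback described above. A cluster-wide alternative that may avoid it is the `hive.aux.jars.path` property in the Hive configuration, though I have not verified how it behaves with ODBC clients.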
Created ‎02-24-2015 10:56 AM
Created ‎02-24-2015 11:49 AM
brock, thanks again for the reply. I admit to not being entirely convinced - I don't see why I should have to fiddle with jars to finish configuring a cluster in order to use basic functionality. BUT, I do appreciate your time, and the fact that you validated the feature.
v/r,
Scott
