REGEX tables with Hive fail map reduce

I've spent a lot of time over the past week trying to get regex tables to work in Hive. The table would create fine, but whenever I ran a query that required MapReduce, the job would blow up. The issue is that, for some reason, the core and task nodes don't have the jar file that contains the org.apache.hadoop.hive.contrib.serde2.RegexSerDe class on their classpath. The same basic issue exists on AWS EMR. I'm not sure whether there is a problem with the way I'm using the class or whether this is a bug in the configuration. My solution on CDH5 is below. On EMR, I had to create a script, run as a bootstrap action, that copied the jar to the appropriate path (I needed to copy other jars anyway, so I decided to go this way).

 

Please comment if you have any suggestions or other solutions.

Thanks!

Scott

 

 

Solution:

cd /opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hadoop/lib/
ln -s ../../../jars/hive-contrib-0.13.1-cdh5.3.1.jar hive-contrib-0.13.1-cdh5.3.1.jar
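
For anyone repeating this on several nodes, here is a minimal sketch of the same symlink step wrapped in a function that can be re-run safely. The function name link_jar and its two arguments are my own illustration, not part of CDH or EMR; it assumes you pass an absolute path to the jar.

```shell
#!/bin/sh
# Sketch only: link an optional jar into a lib directory, idempotently.
# link_jar is a hypothetical helper name, not a CDH or EMR tool.
link_jar() {
    jar_path="$1"   # absolute path to the jar (e.g. the hive-contrib jar)
    lib_dir="$2"    # directory that is on the task classpath

    if [ ! -f "$jar_path" ]; then
        echo "jar not found: $jar_path" >&2
        return 1
    fi
    if [ ! -d "$lib_dir" ]; then
        echo "lib directory not found: $lib_dir" >&2
        return 1
    fi
    # -f replaces an existing link, so running this twice is harmless
    ln -sf "$jar_path" "$lib_dir/$(basename "$jar_path")"
}

# Usage with the CDH 5.3.1 parcel paths from the commands above:
#   parcel=/opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5
#   link_jar "$parcel/jars/hive-contrib-0.13.1-cdh5.3.1.jar" "$parcel/lib/hadoop/lib"
```

The link has to exist on every node that runs tasks, so the call would need to be repeated per node (or placed in whatever provisioning step you already use).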
 
 
Sample Create table DDL:
CREATE EXTERNAL TABLE TESTDATA (
  FIELD1 STRING
  , FIELD2 STRING
  , FIELD3 STRING
  , FIELD4 STRING
  , FIELD5 STRING
  , FIELD6 STRING
  , FIELD7 STRING
  , FIELD8 STRING
  , FIELD9 STRING
  , FIELD10 STRING
  , FIELD11 STRING
  , FIELD12 STRING
  , FIELD13 STRING
  , FIELD14 STRING
  , FIELD15 STRING
  , FIELD16 STRING
  , FIELD17 STRING
  , FIELD18 STRING
  , FIELD19 STRING
  )
  ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
  WITH SERDEPROPERTIES (
          "input.regex"="(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endr\\*+(.*$)"
    )
STORED AS TEXTFILE
LOCATION '/user/jsperson/testdata/'
TBLPROPERTIES("skip.header.line.count"="1", "serialization.null.format"='');

Contributor
Either use ADD JAR ... in the session, or add the jar to the Hive aux classpath.
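
As a rough sketch of the first option for Hive CLI users: an init file containing the ADD JAR statement can be generated once and passed to hive with -i (the CLI also reads $HOME/.hiverc on startup). The helper name write_hive_init and the jar path in the usage comments are placeholders of mine, not anything shipped with Hive.

```shell
#!/bin/sh
# Sketch only: generate a Hive init file that registers a jar per session.
# write_hive_init is a hypothetical helper name.
write_hive_init() {
    jar_path="$1"    # jar to register with ADD JAR
    init_file="$2"   # init file to create (overwritten if it exists)
    printf 'ADD JAR %s;\n' "$jar_path" > "$init_file"
}

# Hypothetical usage (needs a Hive CLI on the PATH; the jar path is a placeholder):
#   write_hive_init /path/to/hive-contrib.jar "$HOME/.hiverc"
#   hive -i "$HOME/.hiverc" -e 'SELECT COUNT(*) FROM TESTDATA;'
#
# Second option, aux classpath for the CLI via the environment:
#   export HIVE_AUX_JARS_PATH=/path/to/hive-contrib.jar
```

This only covers sessions started from the CLI. Clients that connect through a Hive server would presumably need the aux classpath (hive.aux.jars.path) set on the server side instead.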

Thanks for the suggestion. I thought about add jar, but reminding all users to add the jar was problematic, and then getting that solution to work with ODBC had me stumped.

 

I guess my main question is why do we have to do this at all? Since this is fundamental Hive functionality, it would seem like the jar would be available to all nodes. Further, making it available only to the master is a curious configuration. Is there some reason why I would not want this jar in the lib directory all the time?

 

Thanks again!

Contributor (accepted solution)
Since classpath conflicts are common, most folks don't like to "pollute" the classpath with extra, unused jar files. Doing so can lead to "classpath hell". Since this jar is optional, it has to be added to the classpath, either via the mechanism you used or via one of the two I proposed.

brock, thanks again for the reply. I admit to not being entirely convinced - I don't see why I should have to fiddle with jars to finish configuring a cluster in order to use basic functionality. BUT, I do appreciate your time, and the fact that you validated the feature.

v/r,

Scott