I spent a lot of time over the past week trying to get regex tables to work in Hive. The table would create fine, but whenever I ran a query that required MapReduce, the job would blow up. The issue is that, for some reason, the core and task nodes don't have the jar file that contains the org.apache.hadoop.hive.contrib.serde2.RegexSerDe class on their classpath. The same basic issue exists on AWS EMR. I'm not sure whether this is a problem with the way I'm using the class or a bug in the configuration. My solution on CDH5 is below. On EMR, I had to create a script, run as a bootstrap action, that copies the jar to the appropriate path (I needed to copy other jars as well, so I decided to go this way).
Please comment if you have any suggestions or other solutions.
Thanks!
Scott
Solution (symlink the hive-contrib jar into Hadoop's lib directory):
cd /opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hadoop/lib/
ln -s ../../../jars/hive-contrib-0.13.1-cdh5.3.1.jar hive-contrib-0.13.1-cdh5.3.1.jar
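Since the jar is missing on the nodes that run the tasks, the link presumably has to exist on each of them. A minimal sketch of the same two commands wrapped in a rerunnable function follows. The function name link_contrib_jar and its two arguments are my own invention for illustration, not part of CDH; it only assumes the parcel layout shown above (a jars/ directory and a lib/hadoop/lib/ directory under the parcel root).

```shell
# Hypothetical helper: link a jar from <parcel>/jars into <parcel>/lib/hadoop/lib.
# Assumes the parcel layout used by the two commands above.
link_contrib_jar() {
  local parcel_dir="$1"   # parcel root, e.g. the CDH-5.3.1 directory above
  local jar_name="$2"     # jar file name, e.g. the hive-contrib jar above
  local lib_dir="$parcel_dir/lib/hadoop/lib"

  # Refuse to create a dangling link if the jar is not on this node.
  if [ ! -f "$parcel_dir/jars/$jar_name" ]; then
    echo "missing: $parcel_dir/jars/$jar_name" >&2
    return 1
  fi

  # -f replaces an existing link, so the function can be rerun safely.
  # The relative target is the same one used in the ln -s command above.
  ( cd "$lib_dir" && ln -sf "../../../jars/$jar_name" "$jar_name" )
}
```

On a node it would be called as: link_contrib_jar /opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5 hive-contrib-0.13.1-cdh5.3.1.jar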
Sample CREATE TABLE DDL:
CREATE EXTERNAL TABLE TESTDATA (
FIELD1 STRING
, FIELD2 STRING
, FIELD3 STRING
, FIELD4 STRING
, FIELD5 STRING
, FIELD6 STRING
, FIELD7 STRING
, FIELD8 STRING
, FIELD9 STRING
, FIELD10 STRING
, FIELD11 STRING
, FIELD12 STRING
, FIELD13 STRING
, FIELD14 STRING
, FIELD15 STRING
, FIELD16 STRING
, FIELD17 STRING
, FIELD18 STRING
, FIELD19 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex"="(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endf\\*+(.*)\\*endr\\*+(.*$)"
)
STORED AS TEXTFILE
LOCATION '/user/jsperson/testdata/'
TBLPROPERTIES ("skip.header.line.count"="1", "serialization.null.format"="");
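
Before loading real data, it can help to sanity-check the shape of the input.regex from the shell. The sketch below uses a shortened three-field version of the pattern and a made-up sample record, both purely illustrative. Note that the backslashes are single here; the DDL doubles them because they sit inside a Hive string literal. It also uses sed's POSIX extended regex syntax rather than Java's, which should behave the same for a pattern this simple, but treat that as an assumption.

```shell
# Hypothetical check: split a record with a shortened, three-field
# version of the input.regex used in the DDL above.
# \*endf\*+ matches a literal "*endf" followed by one or more "*".
split_record() {
  printf '%s\n' "$1" |
    sed -nE 's/^(.*)\*endf\*+(.*)\*endf\*+(.*)\*endr\*+(.*)$/\1|\2|\3/p'
}

# Made-up sample record; prints the three captured fields joined by "|".
split_record 'a*endf*b*endf*c*endr*'
```

A line that does not match prints nothing here. As far as I know, RegexSerDe returns NULL for every column of a row that does not match, so an empty result in this check is worth investigating before running the query.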