Loading a large fixed-width file into Hive

Explorer

I have a large fixed-width file with 1881 columns, and I want to create an external Hive table that maps to the file in HDFS. I was able to do this successfully for a small number of columns using the RegexSerDe, as shown below. But when I create the full 1881-column version, Hive creates the table, yet you cannot run describe test_fixed_width: it errors out and throws an exception:  Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unclosed counted closure near index 4000

It appears that some kind of buffer or length limit is being overflowed. How can I resolve this? Are there other techniques that could be applied?

CREATE EXTERNAL TABLE test_fixed_width (
Col1 STRING,
Col2 STRING,
Col3 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES
(
"input.regex" = "(.{3})(.{2})(.{1})",
"output.format.string" = "%1$s %2$s %3$s %4$s"
)
LOCATION '/data/';

4 Replies

Cloudera Employee

Do you see a stack trace in hive.log? Can you paste it here?

Explorer

First: I get no error when creating the table. It's when I DESCRIBE the table or SELECT * from it that I get the following error, both on the Beeline command line and in the Hue interface:

Error while processing statement: FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask. Unclosed counted closure near index 4000 (.{5}) (.{4}) (.{1}) (.{3}) (.{2}) (.{10}) (.{12}) (.{30}) (.{1}) (.{30}) (.{10}) (.{2}) (.{35}) (.{30}) (.{61}) (.{47}) (.{1}) (.{3}) (.{4}) (.{3}) (.{4}) (.{3}) (.{4}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{2}) (.{2}) (.{2}) 

 

In Cloudera Manager -> Diagnostics tab -> Logs, I searched the Hive logs for Error or Fatal and nothing came back.

Explorer

Here's what the hive.log says. As far as I can tell, it's not actually an unclosed bracket in my pattern: the input.regex value appears to be getting cut off around character 4000, which leaves a group unterminated. I tested an input.regex with 400 attributes and it worked; when I went up to 800 it gave this same error. So, given that the error says Unclosed counted closure near index 4000, and each attribute like (.{5}) takes 6 characters of the pattern, I bet you can only fit around 666 attributes in the regular expression, and that's only if each width is a single digit; a two-digit width like (.{10}) makes the group 7 characters, so you'd fit even fewer. (A way to confirm the truncation against the metastore is sketched below the log excerpt.)

 (.{1
^
at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:400)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:239)
at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:269)
at org.apache.hive.service.cli.operation.Operation.run(Operation.java:337)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:439)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:405)
at org.apache.hive.service.cli.CLIService.executeStatement(CLIService.java:257)
at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:501)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor.process(HadoopThriftAuthBridge.java:746)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.regex.PatternSyntaxException: Unclosed counted closure near index 4000
 (.{1})
(.{1
^
at java.util.regex.Pattern.error(Pattern.java:1955)
at java.util.regex.Pattern.closure(Pattern.java:3141)
at java.util.regex.Pattern.sequence(Pattern.java:2134)
at java.util.regex.Pattern.expr(Pattern.java:1996)
at java.util.regex.Pattern.group0(Pattern.java:2905)
at java.util.regex.Pattern.sequence(Pattern.java:2051)
at java.util.regex.Pattern.expr(Pattern.java:1996)
at java.util.regex.Pattern.compile(Pattern.java:1696)
at java.util.regex.Pattern.<init>(Pattern.java:1351)
at java.util.regex.Pattern.compile(Pattern.java:1054)
at org.apache.hadoop.hive.contrib.serde2.RegexSerDe.initialize(RegexSerDe.java:111)
at org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDeWithoutErrorCheck(SerDeUtils.java:542)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:389)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:273)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:266)
at org.apache.hadoop.hive.ql.exec.DDLTask.describeTable(DDLTask.java:3151)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:378)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:214)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1976)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1689)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1421)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1205)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1200)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:237)
... 15 more
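
If it really is truncation, one way to confirm it is to query the metastore's backing database directly. This is just a sketch, assuming a MySQL-backed metastore with the stock schema, where SERDE_PARAMS.PARAM_VALUE is typically defined as VARCHAR(4000):

-- Run this against the metastore's backing database (not in Hive).
-- Assumes the stock Hive metastore schema on MySQL; adjust names if yours differs.
SELECT t.TBL_NAME,
       sp.PARAM_KEY,
       LENGTH(sp.PARAM_VALUE) AS stored_length  -- exactly 4000 suggests the value was cut off
FROM TBLS t
JOIN SDS s ON t.SD_ID = s.SD_ID
JOIN SERDE_PARAMS sp ON s.SERDE_ID = sp.SERDE_ID
WHERE t.TBL_NAME = 'test_fixed_width'
  AND sp.PARAM_KEY = 'input.regex';

If stored_length comes back as exactly 4000 for a pattern that should be much longer, the metastore is truncating the regex, which lines up with the "near index 4000" in the exception.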

Explorer

So I was able to solve the issue, though without using a SerDe. What I did was load each line of the fixed-width file into a single string column, then use SUBSTRING to split the line into its columns, as follows:

CREATE TABLE FIXEDWIDTH_TEMP
(
whole_line STRING
)
LOCATION '/data/shared_workspace/test/fixedwidth_temp';
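
One sanity check worth running on the staging table (assuming every record in the file should be exactly the same width):

-- A fixed-width file should report a single record length;
-- more than one length usually means a malformed record or stray delimiter.
SELECT LENGTH(whole_line) AS record_length, COUNT(*) AS rows_at_length
FROM FIXEDWIDTH_TEMP
GROUP BY LENGTH(whole_line);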

 

CREATE EXTERNAL TABLE FIXEDWIDTH_STG (
ZIP_CODE_3043 STRING,
ZIP4_CODE_3044 STRING,
PREFIX_TITLE_8093 STRING,
STATE_3038 STRING,
PHONE_3032 STRING,
-- ... 1800 more column definitions ...
) LOCATION '/data/shared_workspace/test/fixedwidth_stg';

 

INSERT INTO FIXEDWIDTH_STG
SELECT
SUBSTRING(whole_line,1,5),
SUBSTRING(whole_line,6,4),
SUBSTRING(whole_line,10,1),
SUBSTRING(whole_line,11,3),
SUBSTRING(whole_line,14,2),
SUBSTRING(whole_line,16,10),
SUBSTRING(whole_line,26,12),
-- ... 1800 more SUBSTRING expressions ...
FROM FIXEDWIDTH_TEMP;
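
To spot-check that the offsets line up, I'd eyeball a few of the parsed columns against the raw data, e.g.:

-- Quick visual check that the SUBSTRING offsets map to the right fields
SELECT ZIP_CODE_3043, ZIP4_CODE_3044, STATE_3038, PHONE_3032
FROM FIXEDWIDTH_STG
LIMIT 5;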