Created on 03-31-2017 12:51 PM - edited 09-16-2022 04:23 AM
I have a large fixed-width file with 1881 columns. I want to create an external Hive table mapped to the HDFS file. I was able to do this successfully for a small number of columns using the RegexSerDe, as shown below. But when I create the table with all 1881 columns, Hive creates the table, yet running DESCRIBE test_fixed_width errors out and throws an exception: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unclosed counted closure near index 4000
It appears that some buffer or size limit is being overrun. How can I resolve this? Are there other techniques that could be applied?
CREATE EXTERNAL TABLE test_fixed_width (
Col1 STRING,
Col2 STRING,
Col3 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES
(
"input.regex" = "(.{3})(.{2})(.{1})",
"output.format.string" = "%1$s %2$s %3$s %4$s"
)
LOCATION '/data/';
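For reference, here is how the three capture groups map positionally onto the columns, using a hypothetical 6-character sample line (not from the actual data):

-- Hypothetical input line: 'ABC12X'
--   (.{3}) -> Col1 = 'ABC'
--   (.{2}) -> Col2 = '12'
--   (.{1}) -> Col3 = 'X'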
Created 03-31-2017 02:47 PM
Do you see a stack trace in the hive.log? Can you paste the stack trace here?
Created 03-31-2017 03:03 PM
First: I get no error when creating the table. It's when I try to DESCRIBE the table or run SELECT * from it that I get the following error in the Beeline command line or the Hue interface:
Error while processing statement: FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask. Unclosed counted closure near index 4000 (.{5}) (.{4}) (.{1}) (.{3}) (.{2}) (.{10}) (.{12}) (.{30}) (.{1}) (.{30}) (.{10}) (.{2}) (.{35}) (.{30}) (.{61}) (.{47}) (.{1}) (.{3}) (.{4}) (.{3}) (.{4}) (.{3}) (.{4}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{2}) (.{2}) (.{2})
In Cloudera Manager -> Diagnostics tab -> Logs, I searched Hive for Error or Fatal and nothing came back.
Created 03-31-2017 03:11 PM
Here's what the Hive log says. It's not that my pattern actually has an unclosed bracket; as far as I can tell, the input.regex value itself is being cut off. I tested an input.regex with 400 attributes, then went up to 800, and it gave the same error, always "near index 4000". So if each attribute like (.{5}) takes 6 characters of the pattern (7 once the width needs two digits, as in (.{10})), I'd bet you can only fit around 666 attributes into the regular expression before hitting that 4000-character limit, and fewer when the element sizes are 10 or larger.
(.{1
^
at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:400)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:239)
at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:269)
at org.apache.hive.service.cli.operation.Operation.run(Operation.java:337)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:439)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:405)
at org.apache.hive.service.cli.CLIService.executeStatement(CLIService.java:257)
at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:501)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor.process(HadoopThriftAuthBridge.java:746)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.regex.PatternSyntaxException: Unclosed counted closure near index 4000
(.{1})
(.{1
^
at java.util.regex.Pattern.error(Pattern.java:1955)
at java.util.regex.Pattern.closure(Pattern.java:3141)
at java.util.regex.Pattern.sequence(Pattern.java:2134)
at java.util.regex.Pattern.expr(Pattern.java:1996)
at java.util.regex.Pattern.group0(Pattern.java:2905)
at java.util.regex.Pattern.sequence(Pattern.java:2051)
at java.util.regex.Pattern.expr(Pattern.java:1996)
at java.util.regex.Pattern.compile(Pattern.java:1696)
at java.util.regex.Pattern.<init>(Pattern.java:1351)
at java.util.regex.Pattern.compile(Pattern.java:1054)
at org.apache.hadoop.hive.contrib.serde2.RegexSerDe.initialize(RegexSerDe.java:111)
at org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDeWithoutErrorCheck(SerDeUtils.java:542)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:389)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:273)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:266)
at org.apache.hadoop.hive.ql.exec.DDLTask.describeTable(DDLTask.java:3151)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:378)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:214)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1976)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1689)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1421)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1205)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1200)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:237)
... 15 more
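For what it's worth, that "index 4000" is suspicious for another reason: the standard Hive metastore schema stores SERDEPROPERTIES values in SERDE_PARAMS.PARAM_VALUE, a column typically defined as VARCHAR(4000), so a longer input.regex would be silently truncated when the table is created and then fail to compile when it is read back. A quick way to check is to query the metastore database directly (a sketch, assuming a MySQL-backed metastore with the standard schema):

SELECT PARAM_KEY, LENGTH(PARAM_VALUE)
FROM SERDE_PARAMS
WHERE PARAM_KEY = 'input.regex';

If the length comes back as exactly 4000, the pattern was truncated on write.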
Created 04-01-2017 09:25 PM
So I was able to solve the issue, though not with a SerDe. Instead, I loaded each line of the fixed-width file into a single string column and then used SUBSTRING to parse out the fields, as follows:
CREATE TABLE FIXEDWIDTH_TEMP
(
acxiom_line STRING -- one raw fixed-width line per row; the name must match the SELECT below
)
LOCATION '/data/shared_workspace/test/fixedwidth_temp';
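If the raw file is not already sitting under that location, it can be moved in with LOAD DATA (a sketch; the source path and file name here are illustrative, not from the original post):

LOAD DATA INPATH '/data/shared_workspace/test/raw/fixedwidth.txt'
INTO TABLE FIXEDWIDTH_TEMP;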
CREATE EXTERNAL TABLE FIXEDWIDTH_STG (
ZIP_CODE_3043 STRING,
ZIP4_CODE_3044 STRING,
PREFIX_TITLE_8093 STRING,
STATE_3038 STRING,
PHONE_3032 STRING,
-- ... 1800 more column definitions ...
) LOCATION '/data/shared_workspace/test/fixedwidth_stg';
INSERT INTO TABLE FIXEDWIDTH_STG
SELECT
SUBSTRING(acxiom_line,1,5),
SUBSTRING(acxiom_line,6,4),
SUBSTRING(acxiom_line,10,1),
SUBSTRING(acxiom_line,11,3),
SUBSTRING(acxiom_line,14,2),
SUBSTRING(acxiom_line,16,10),
SUBSTRING(acxiom_line,26,12),
-- ... 1800 more SUBSTRING expressions ...
FROM FIXEDWIDTH_TEMP;
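As a variation on the same idea, the parsing can also live in a view over the temp table, which avoids materializing a second copy of the data (a sketch using the first three columns; untested at the full 1881-column width):

CREATE VIEW FIXEDWIDTH_VW AS
SELECT
SUBSTRING(acxiom_line,1,5) AS ZIP_CODE_3043,
SUBSTRING(acxiom_line,6,4) AS ZIP4_CODE_3044,
SUBSTRING(acxiom_line,10,1) AS PREFIX_TITLE_8093
-- ... remaining columns follow the same SUBSTRING pattern ...
FROM FIXEDWIDTH_TEMP;

The trade-off is that the view re-parses every line on each read, while the INSERT above pays the parsing cost once.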