Created 05-05-2017 10:27 PM
Hello,
I created a Hive table (text format, no partitions) using Sqoop import. Wanting to convert the table to ORC, I then ran "ALTER TABLE TESTDB.DST_TABLE SET FILEFORMAT ORC". I later learned that this does NOT convert the existing data to ORC. Now, when I try "select * from ..." to read data from the table, it throws the error below:
"java.io.IOException: org.apache.hadoop.hive.ql.io.FileFormatException: Malformed ORC file hdfs://haserver/apps/hive/warehouse/testdb.db/dst_table/part-m-00000.deflate. Invalid postscript. (state=,code=0)"
I can't do a CTAS either; that attempt throws a different error:
ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1493740850622_0077_1_00, diagnostics=[Vertex vertex_1493740850622_0077_1_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: dst_table initializer failed, vertex=vertex_1493740850622_0077_1_00 [Map 1], java.lang.RuntimeException: serious problem
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1273)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1300)
  at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:307)
  at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:409)
  at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:155)
  at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:273)
  at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:266)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
  at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:266)
  at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.ExecutionException: org.apache.hadoop.hive.ql.io.FileFormatException: Malformed ORC file hdfs://haserver/apps/hive/warehouse/testdb.db/dst_table/part-m-00000.deflate. Invalid postscript.
  at java.util.concurrent.FutureTask.report(FutureTask.java:122)
  at java.util.concurrent.FutureTask.get(FutureTask.java:188)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1268)
  ... 15 more
Caused by: org.apache.hadoop.hive.ql.io.FileFormatException: Malformed ORC file hdfs://haserver/apps/hive/warehouse/testdb.db/dst_table/part-m-00000.deflate. Invalid postscript.
  at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.ensureOrcFooter(ReaderImpl.java:257)
  at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractFileTail(ReaderImpl.java:384)
  at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:321)
  at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:241)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.populateAndCacheStripeDetails(OrcInputFormat.java:1099)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.callInternal(OrcInputFormat.java:1001)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.access$2000(OrcInputFormat.java:838)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:992)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:989)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:989)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:838)
  ... 4 more ]
ERROR : DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1493740850622_0077_1_00, diagnostics=[Vertex vertex_1493740850622_0077_1_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: dst_switch_dates initializer failed, vertex=vertex_1493740850622_0077_1_00 [Map 1], java.lang.RuntimeException: serious problem (same FileFormatException stack trace as above)] DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0 (state=08S01,code=2)
It would not let me alter the table format back to TEXTFILE either:
Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Changing file format (from ORC) is not supported for table testdb.dst_table (state=08S01,code=1)
What can I do to recover/repair the table so I can get at the data in it?
Thanks in advance.
Created 05-05-2017 11:56 PM
I am not sure there is a way to recover/repair this table.
Here's another thread which discusses a similar issue - https://community.hortonworks.com/questions/68669/csv-query-to-run-from-hivedefaultfileformat-is-orc...
However, I can suggest two approaches for getting the data into an ORC table:
1. Import from the RDBMS into Hive in ORC format via HCatalog, which Sqoop supports. You can create a Hive table stored as ORC like below:
$ hive -e "CREATE TABLE cust (id int, name string) STORED AS ORCFILE;"
When running the Sqoop import to Hive, use the --hcatalog-database and --hcatalog-table options instead of the --hive-table option, as described in https://cwiki.apache.org/confluence/download/attachments/27361435/SqoopHCatIntegration-HadoopWorld20... (a sketch of such an import appears after this list).
2. The same way you did the first time: use Sqoop to import the data into a temporary Hive managed table, create an ORC table in Hive ($ hive -e "CREATE TABLE cust (id int, name string) STORED AS ORCFILE;"), and finally insert from the temporary table into the ORC table ($ hive -e "INSERT OVERWRITE TABLE mydrivers SELECT * FROM drivers;").
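For the first approach, a minimal sketch of such an import; the JDBC URL, credentials, and table names below are placeholders, not values from this thread:
$ sqoop import \
    --connect jdbc:mysql://dbhost/sourcedb \
    --username myuser -P \
    --table CUST \
    --hcatalog-database testdb \
    --hcatalog-table cust
If you would rather have Sqoop create the ORC table itself, the --create-hcatalog-table and --hcatalog-storage-stanza "stored as orcfile" options can replace the separate CREATE TABLE step.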
Created 05-06-2017 03:47 AM
@ksuresh, my question is not about how to do the import again, but how I can get back the data from the table I have now; since this is a large table, I would rather not reload it.
I did try the HCatalog route, but it was taking an extremely long time.
Regarding the second option you mentioned, in that two-step process the second step - writing data from the staging table to the ORC table - takes just as long as loading the staging table from Sqoop, correct?
Created 05-06-2017 08:54 AM
Hi @Raj B
Are you able to view the base file in HDFS under the Hive ORC table's warehouse directory? If so, create a new table with the same structure but with TEXTFILE as the format, then copy the HDFS file from the old table's directory to the new table's directory. Copy just the base file at the HDFS level, not via the Hive table. That should work for you!
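As a minimal sketch of that recovery, assuming hypothetical column names (use dst_table's real schema) and the warehouse paths from the error message above:
$ hive -e "CREATE TABLE testdb.dst_table_txt (id int, name string) STORED AS TEXTFILE;"
$ hdfs dfs -cp /apps/hive/warehouse/testdb.db/dst_table/part-m-* /apps/hive/warehouse/testdb.db/dst_table_txt/
$ hive -e "SELECT * FROM testdb.dst_table_txt LIMIT 10;"
The .deflate extension is not a problem for the new table: when reading TEXTFILE data, Hive picks the decompression codec from the file extension.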
Created 05-06-2017 04:13 PM
It's true that ALTER TABLE won't change the layout of the existing files; however, you can create an external table over the same location, declared in the data's original format, i.e.:
create external table abc (col1 int, col2 string) stored as textfile location '/location/of/modified/orc/table';
(substituting the original format if it was not textfile)
If the underlying data has not changed, you should be able to view it through that external table, and then read it or do a CTAS from it.
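As a sketch, reusing the hypothetical external table abc from above, the CTAS to produce a clean ORC copy could look like:
$ hive -e "CREATE TABLE testdb.dst_table_orc STORED AS ORC AS SELECT * FROM abc;"
Since abc is external, dropping it afterwards removes only the table definition, not the underlying files.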