When inserting data from a partitioned Hive Parquet table into another partitioned Parquet table, an exception is thrown: hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException: Premature EOF: no length prefix available.
- Labels:
  - Apache Hive
Created 02-24-2016 11:49 AM
While inserting from a Hive external table P1, stored as Parquet and partitioned on one column (e.g. col A), into another table P2, stored as Parquet with the same columns but partitioned on a different column (e.g. col B), Hive throws a Premature EOF exception:
exception: hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException: Premature EOF: no length prefix available.
Any idea what causes this issue?
This is an HDP 2.3 cluster with 4 datanodes. The process is running with sufficient map memory and container size. I have tried running with Tez as well as MapReduce, but get the same error.
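For reference, a minimal sketch of the statement pattern that triggers this; table and column names (p1, p2, colA, colB, colC) are placeholders, not the actual schema:

    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    -- dynamic-partition insert; the target's partition column colB must come last
    INSERT OVERWRITE TABLE p2 PARTITION (colB)
    SELECT colA, colC, colB
    FROM p1;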
Thanks,
Harshal
Created 02-24-2016 12:06 PM
Why not use ORC? What's the use case that requires Parquet?
Created 02-25-2016 05:26 AM
So you mean this has something to do with Parquet? Parquet has good integration with Spark.
Created 02-25-2016 06:28 AM
This might be a Parquet problem, but it could also be something else. I have seen some performance and job issues when using Parquet instead of ORC. Have you seen https://issues.apache.org/jira/browse/HDFS-8475?
What features are you missing regarding Spark's ORC support?
I have seen your error before, but in a different context (a query on an ORC table was failing).
Make sure your HDFS (especially the datanodes) is running and healthy. It might be related to some bad blocks, so make sure the blocks related to your job are OK; a quick check is sketched below.
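For example, with a standard HDFS client (the warehouse path below is the HDP default and is only a placeholder; yours may differ):

    # list any corrupt blocks cluster-wide
    hdfs fsck / -list-corruptfileblocks
    # inspect blocks and replica locations for the source table's files
    hdfs fsck /apps/hive/warehouse/p1 -files -blocks -locations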
Created 02-26-2016 05:42 AM
Thanks for your reply, Jonas!
I have verified datanode health and it is all fine; there are no corrupt blocks across the filesystem. I will check by changing the format to ORC.
Created 02-26-2016 07:06 AM
The same exception occurs for an ORC Hive table as well. It looks like this is a generic issue for the following case (a HiveQL sketch follows below):
1. Create external table T1 (cols A, B, C) partitioned on col A, stored as ORC. Load the table with a substantial amount of data; in my case around 85 GB.
2. Create external table T2 (cols A, B, C) partitioned on col B, stored as ORC. Load T2 from T1 with dynamic partitioning.
Output: Premature EOF exception.
Please try it out!
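A minimal HiveQL sketch of that repro; column types and table locations are placeholders:

    CREATE EXTERNAL TABLE t1 (b STRING, c STRING)
    PARTITIONED BY (a STRING)
    STORED AS ORC
    LOCATION '/tmp/t1';

    CREATE EXTERNAL TABLE t2 (a STRING, c STRING)
    PARTITIONED BY (b STRING)
    STORED AS ORC
    LOCATION '/tmp/t2';

    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    -- repartitioning insert: t1 is partitioned on a, t2 on b;
    -- the dynamic partition column b goes last in the SELECT
    INSERT OVERWRITE TABLE t2 PARTITION (b)
    SELECT a, c, b FROM t1;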
Created 11-07-2016 08:22 AM
In the case of Hive on Tez, decreasing tez.grouping.max-size might help. I faced almost the same problem before; I decreased tez.grouping.max-size from 1 GB to 256 MB, and that mostly (though not completely) solved the problem.
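If you want to try this, the property can be set per session from the Hive shell; 268435456 bytes is the 256 MB value mentioned above:

    -- lower Tez's split-grouping cap from the 1 GB default to 256 MB
    SET tez.grouping.max-size=268435456;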
Created 11-08-2016 06:53 AM
Thanks for the reply!
The issue was resolved by increasing dfs.datanode.max.transfer.threads to 16000 (in my case), and by increasing the ulimit value on each worker node.
Regards,
Harshal
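A sketch of those two changes, using the value from this thread (the right limits depend on your cluster, and the user and limit below are only examples):

In hdfs-site.xml on every datanode (a datanode restart is required):

    <property>
      <name>dfs.datanode.max.transfer.threads</name>
      <value>16000</value>
    </property>

And a higher open-file limit in /etc/security/limits.conf on each worker node:

    # example value and user; tune for your workload
    hdfs  -  nofile  65536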
