Created on 06-21-2014 01:04 AM - edited 09-16-2022 02:00 AM
Hello,
I'm trying to use the Parquet file format, and it works fine if I write data using Impala and read it in Hive. However, if I insert data into that table via Hive and then read it using Impala, Impala throws errors like:
ERRORS:
Backend 2: Parquet file should not be split into multiple hdfs-blocks
...
It seems this error is not fatal and Impala is still able to return the query results. What might be the cause, and how can I avoid this error?
Thanks!
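For reference, here is a minimal sketch of the kind of Hive-write / Impala-read flow I mean (the table and column names here are just examples):

-- In Hive: create a Parquet table and insert into it
CREATE TABLE events_parquet (id BIGINT, payload STRING) STORED AS PARQUET;
INSERT OVERWRITE TABLE events_parquet SELECT id, payload FROM events_text;

-- In Impala: refresh the metadata, then query the same table
INVALIDATE METADATA events_parquet;
SELECT COUNT(*) FROM events_parquet;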
Created 10-30-2014 03:06 AM
I had the same issue: I created a partitioned table stored as Parquet in Hive and loaded it with data.
Then, when running the query in Impala, I got the same error message.
I tried these settings in Hive before running the insert, but the files produced are still larger than the HDFS block size (128 MB):
SET parquet.block.size=128000000;
SET dfs.blocksize=128000000;
Can anybody give any advice?
Tomas
Created 11-07-2014 06:08 AM
I solved it by increasing the block size to cover the largest possible partition: since each partition is always less than 800 MB, I set the block size for this table to 1 GB, and the warnings no longer appear.
T.
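For anyone looking for the concrete statements, here is a sketch of what that workaround implies, run in the Hive session before the insert (the table and column names are just examples; 1073741824 bytes = 1 GB):

-- Make both the Parquet row-group size and the HDFS block size 1 GB,
-- so each partition's data stays within a single HDFS block
SET parquet.block.size=1073741824;
SET dfs.blocksize=1073741824;

INSERT OVERWRITE TABLE my_parquet_table PARTITION (dt)
SELECT id, payload, dt FROM my_source_table;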
Created 12-18-2014 01:22 PM
How can this be done when writing data from a Pig script?
Created 11-23-2015 12:59 PM
I'm running into something similar. I'm on 5.4.2, building tables with Hive and then analyzing them with Impala, and I get the same warnings, although the queries execute OK.
Can you please share how you scripted the step you mention in your post, i.e. setting the block size for the table to 1 GB when each partition is always less than 800 MB?