
"Parquet files should not be split into multiple hdfs-blocks" warning and strange record count issue

Contributor

Hello

 

I am trying to import Parquet tables from another Cloudera Impala installation into my own Cloudera Impala cluster.

 

--> I receive the Parquet tables via SFTP.

--> I copy all Parquet files into the proper Impala table directory, e.g. /grid1/hive/warehouse/<database>/<importedTable>, without any error/warning.

--> I create the required partition structure with ALTER TABLE <importedTable> ADD PARTITION (..), without any error/warning.

--> I run the REFRESH <importedTable> command, without any error/warning.

--> I can see the new partitions in the output of the SHOW PARTITIONS <importedTable> command, without any error/warning.

--> I apply the above procedure to all tables (a sketch of the sequence is shown below).

--> When I try to access records in a table, I get the following warning: "WARNINGS: Parquet files should not be split into multiple hdfs-blocks".
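
For reference, here is a minimal sketch of the sequence I follow for one partition (database, table, column, and path names below are placeholders):

# copy the transferred Parquet files into the partition directory (hypothetical local staging path)
hdfs dfs -put /tmp/import/logdate=20160401/*.parq /grid1/hive/warehouse/<database>/<importedTable>/partitionedColumn=value1/logdate=20160401/

-- then, in impala-shell: register the partition and reload the file metadata
alter table <importedTable> add partition (partitionedColumn='value1', logdate=20160401);
refresh <importedTable>;
show partitions <importedTable>;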

 

I use gzip compression on my own tables, but the imported tables keep their default settings. So I have another database with gzipped tables, and I copy the data from each imported table into the corresponding gzipped table with the following commands:

 

set compression_codec=gzip (without any error/warning)

insert into <gzippedTable> partition (part1=value1, part2=value2) select field1, field3, field4 ...... from <importedTable> where partitionedColumn1=value1 and partitionedColumn2=value2 (without any error/warning)
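
Spelled out with hypothetical partition column names (partcol, logdate) and placeholder field names, the copy for a single partition looks like this:

-- in impala-shell: write the target partition as gzip-compressed Parquet
set compression_codec=gzip;
insert into <gzippedTable> partition (partcol='value1', logdate=20160401)
select field1, field3, field4
from <importedTable>
where partcol='value1' and logdate=20160401;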

 

When I compare the record counts for the same partition in the gzipped table and the imported table, there is a difference, as the following output shows:

[host03:21000] > select count(*) from importedTable where logdate=20160401;
Query: select count(*) from importedTable where logdate=20160401
+-----------+
| count(*)  |
+-----------+
| 101565867 |
+-----------+
WARNINGS: Parquet files should not be split into multiple hdfs-blocks. file=hdfs://host01:8020/grid1/hive/warehouse/<database>/importedTable/partitionedColumn=value1/logdate=20160401/51464233716089fd-295e6694028850a0_1358598818_data.0.parq (1 of 94 similar)

Fetched 1 row(s) in 0.96s

[host03:21000] > select count(*) from gzippedTable where logdate=20160401;
Query: select count(*) from gzippedTable where logdate=20160401
+-----------+
| count(*)  |
+-----------+
| 123736525 |
+-----------+
Fetched 1 row(s) in 0.92s

 

So, how can I fix the "WARNINGS: Parquet files should not be split into multiple hdfs-blocks" warning, and why am I getting different record counts after applying the above procedure?

Is the record count difference related to the multiple hdfs-blocks warning?
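
In case it helps with the diagnosis, one way to check whether an imported file really spans more than one HDFS block is to compare its length with its HDFS block size (the path below is just the one from the warning above):

# file length (%b) versus block size (%o), both in bytes
hdfs dfs -stat "length=%b blocksize=%o" /grid1/hive/warehouse/<database>/importedTable/partitionedColumn=value1/logdate=20160401/51464233716089fd-295e6694028850a0_1358598818_data.0.parq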

 

Thanks

1 ACCEPTED SOLUTION

Cloudera Employee
3 REPLIES

Contributor

Hello 

 

So sorry for the delayed update.

 

invalidate metadata;
invalidate metadata <tablename>;
refresh <tablename>;

 

 

commands have solved my problem. The source Parquet tables and the gzipped target tables now have the same record counts in their partitions. I am still getting the "split into multiple hdfs-blocks" warnings, but it looks like they do not have any impact on my record count issue.
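
For what it is worth, the warning only appears for the files that were copied over from the other cluster; the partitions that Impala itself wrote (the gzipped table) do not show it. So my understanding is that rewriting the imported partitions with an Impala INSERT ... SELECT, optionally capping the file size with the PARQUET_FILE_SIZE query option, would make the warning go away as well. A sketch with placeholder names, which I have not actually tried on the original table:

-- in impala-shell: rewrite one partition so each Parquet file fits within a single HDFS block
set compression_codec=gzip;
set PARQUET_FILE_SIZE=134217728;  -- 128 MB in bytes; assumes the HDFS block size is at least this large
insert overwrite <rewrittenTable> partition (partcol='value1', logdate=20160401)
select field1, field3, field4 from <importedTable> where partcol='value1' and logdate=20160401;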

 

BTW: The link that you provided is very good.

 

Thanks for your response 


I am not able to open this link: http://ingest.tips/2015/01/31/parquet-row-group-size/

Can you please check and repost it?