Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

External Table from Parquet folder returns empty result

avatar
New Contributor

Hi, I did search for similar issues but didn't find a perfect match for my case:

 

 

CREATE EXTERNAL TABLE processed_data LIKE PARQUET '/user/myuser/spark/proyecto1/date=2015-08-17/hour=21/part-00087-7a6eac40-0db4-4368-894c-72588f80dfd6.c000.snappy.parquet'

STORED AS PARQUET

LOCATION '/user/myuser/spark/proyecto1'

 

The table creation works, but any query to the table will return 0 rows.

 

Query: SHOW CREATE TABLE processed_data
+---------------------------------------------------------------------------------------------------------------------------+
| result                                                                                                                    |
+---------------------------------------------------------------------------------------------------------------------------+
| CREATE EXTERNAL TABLE josemidb.processed_data (                                                                           |
|   window STRUCT<start:TIMESTAMP,end:TIMESTAMP> COMMENT 'Inferred from Parquet file.',                                     |
|   terminating_cellid BIGINT COMMENT 'Inferred from Parquet file.',                                                        |
|   termination_id BIGINT COMMENT 'Inferred from Parquet file.',                                                            |
|   termination_reason BIGINT COMMENT 'Inferred from Parquet file.',                                                        |
|   numcalls BIGINT COMMENT 'Inferred from Parquet file.'                                                                   |
| )                                                                                                                         |
| STORED AS PARQUET                                                                                                         |
| LOCATION 'hdfs://myhost:8020/user/myuser/spark/proyecto1'                                    |
| TBLPROPERTIES ('COLUMN_STATS_ACCURATE'='false', 'numFiles'='18', 'numRows'='-1', 'rawDataSize'='-1', 'totalSize'='67251') |
+---------------------------------------------------------------------------------------------------------------------------+

This shows numFiles = 18 which is right, but numRows = -1 😞

 

I think my problem is that I'm not handling partitions the right way. Can you help me with this?

 

Thank you.

1 ACCEPTED SOLUTION

avatar

1. You need to add the appropriate PARTITION clause to your CREATE EXTERNAL TABLE. Looks like it should be PARTITIONED BY (`date` STRING)

2. You need to run ALTER TABLE RECOVER PARTITIONS to discover all the partitions for the table

 

After that you should be good to go!

View solution in original post

4 REPLIES 4

avatar
Master Collaborator

Hi @messenjah00,

Can you share with us the proyecto1 table schema?
I think you must use PARTITIONED BY in the create statement.

avatar

1. You need to add the appropriate PARTITION clause to your CREATE EXTERNAL TABLE. Looks like it should be PARTITIONED BY (`date` STRING)

2. You need to run ALTER TABLE RECOVER PARTITIONS to discover all the partitions for the table

 

After that you should be good to go!

avatar
Cloudera Employee
It looks like the path to the parquet file that you've used to infer the schema of the table has multiple directories that correspond to partitions (e.g., date=2015-08-17/hour=21).

Create the table with the schema and partitioning specified, then use alter partition to add the per-partition files (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_parquet.html)

The -1 shown is probably from not running compute stats.

avatar
New Contributor

Thank you all,

I added..

PARTITIONED BY (`date` STRING, hour INT)

 ..to the CREATE External TABLE sentence and then..

ALTER TABLE processed_data_p ADD PARTITION (`date`="2015-08-19", hour=20) LOCATION '/user/myuser/spark/proyecto1/date=2015-08-19/hour=20';

One option to avoid adding all partitions manually is:

alter table processed_data_p recover partitions;