Cannot read parquet files

Contributor

Hello,

 

I have a table pointing to a location in HDFS, described as follows:

 

# col_name              data_type               comment

treceived               string
unix_tstamp             bigint
tryear                  int
trmonth                 int
trday                   int
trhour                  int
trminute                int
trsec                   int
tdur                    float
sip                     string
dip                     string
sport                   int
dport                   int
proto                   string
flag                    string
fwd                     int
stos                    int
ipkt                    bigint
ibyt                    bigint
opkt                    bigint
obyt                    bigint
input                   int
output                  int
sas                     int
das                     int
dtos                    int
dir                     int
rip                     string

# Partition Information
# col_name              data_type               comment

y                       int
m                       int
d                       int
h                       int

# Detailed Table Information
Database:               spotdb
Owner:                  spot
CreateTime:             Thu Feb 23 16:41:20 CLST 2017
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://HDFS-namenode:8020/user/spot/flow/hive
Table Type:             EXTERNAL_TABLE
Table Parameters:
        EXTERNAL                TRUE
        avro.schema.literal     {\n    \"type\":   \"record\"\n  , \"name\":   \"FlowRecord\"\n  , \"namespace\" : \"com.cloudera.accelerators.flows.avro\"\n  , \"fields\": [\n        {\"name\": \"treceived\",                  \"type\":[\"string\",   \"null\"]}\n     ,  {\"name\": \"unix_tstamp\",                 \"type\":[\"long\",     \"null\"]}\n     ,  {\"name\": \"tryear\",                    \"type\":[\"int\",   \"null\"]}\n     ,  {\"name\": \"trmonth\",                    \"type\":[\"int\",   \"null\"]}\n     ,  {\"name\": \"trday\",                    \"type\":[\"int\",   \"null\"]}\n     ,  {\"name\": \"trhour\",                    \"type\":[\"int\",   \"null\"]}\n     ,  {\"name\": \"trminute\",                    \"type\":[\"int\",   \"null\"]}\n     ,  {\"name\": \"trsec\",                    \"type\":[\"int\",   \"null\"]}\n     ,  {\"name\": \"tdur\",                    \"type\":[\"float\",   \"null\"]}\n     ,  {\"name\": \"sip\",              \"type\":[\"string\",   \"null\"]}\n     ,  {\"name\": \"sport\",                 \"type\":[\"int\",   \"null\"]}\n     ,  {\"name\": \"dip\",         \"type\":[\"string\",   \"null\"]}\n     ,  {\"name\": \"dport\",        \"type\":[\"int\",   \"null\"]}\n     ,  {\"name\": \"proto\",            \"type\":[\"string\",   \"null\"]}\n     ,  {\"name\": \"flag\",            \"type\":[\"string\",   \"null\"]}\n     ,  {\"name\": \"fwd\",                 \"type\":[\"int\",   \"null\"]}\n     ,  {\"name\": \"stos\",                 \"type\":[\"int\",   \"null\"]}\n     ,  {\"name\": \"ipkt\",                 \"type\":[\"bigint\",   \"null\"]}\n     ,  {\"name\": \"ibytt\",                 \"type\":[\"bigint\",   \"null\"]}\n     ,  {\"name\": \"opkt\",                 \"type\":[\"bigint\",   \"null\"]}\n     ,  {\"name\": \"obyt\",                 \"type\":[\"bigint\",   \"null\"]}\n     ,  {\"name\": \"input\",                 \"type\":[\"int\",   \"null\"]}\n     ,  {\"name\": \"output\",                 \"type\":[\"int\",   \"null\"]}\n     ,  {\"name\": \"sas\",                 \"type\":[\"int\",   \"null\"]}\n     ,  {\"name\": \"das\",                 \"type\":[\"int\",   \"null\"]}\n     ,  {\"name\": \"dtos\",                 \"type\":[\"int\",   \"null\"]}\n     ,  {\"name\": \"dir\",                 \"type\":[\"int\",   \"null\"]}\n     ,  {\"name\": \"rip\",                    \"type\":[\"string\",   \"null\"]}\n  ]\n}
        transient_lastDdlTime   1487878880

# Storage Information
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        field.delim             ,
        serialization.format    ,

But when I select from that table, it comes back empty. I checked that HDFS location and it does contain parquet files:

sudo -u hdfs hdfs dfs -ls /user/spot/flow/hive/y=2017/m=02/d=23/h=23
Found 12 items
-rwxr-xr-x   3 spot supergroup        440 2017-02-23 20:05 /user/spot/flow/hive/y=2017/m=02/d=23/h=23/000000_0
-rwxr-xr-x   3 spot supergroup        440 2017-02-23 20:10 /user/spot/flow/hive/y=2017/m=02/d=23/h=23/000000_0_copy_1
-rwxr-xr-x   3 spot supergroup        440 2017-02-23 20:55 /user/spot/flow/hive/y=2017/m=02/d=23/h=23/000000_0_copy_10
-rwxr-xr-x   3 spot supergroup        440 2017-02-23 21:00 /user/spot/flow/hive/y=2017/m=02/d=23/h=23/000000_0_copy_11
-rwxr-xr-x   3 spot supergroup        440 2017-02-23 20:15 /user/spot/flow/hive/y=2017/m=02/d=23/h=23/000000_0_copy_2
-rwxr-xr-x   3 spot supergroup        440 2017-02-23 20:20 /user/spot/flow/hive/y=2017/m=02/d=23/h=23/000000_0_copy_3
-rwxr-xr-x   3 spot supergroup        440 2017-02-23 20:25 /user/spot/flow/hive/y=2017/m=02/d=23/h=23/000000_0_copy_4
-rwxr-xr-x   3 spot supergroup        440 2017-02-23 20:30 /user/spot/flow/hive/y=2017/m=02/d=23/h=23/000000_0_copy_5
-rwxr-xr-x   3 spot supergroup        440 2017-02-23 20:35 /user/spot/flow/hive/y=2017/m=02/d=23/h=23/000000_0_copy_6
-rwxr-xr-x   3 spot supergroup        440 2017-02-23 20:40 /user/spot/flow/hive/y=2017/m=02/d=23/h=23/000000_0_copy_7
-rwxr-xr-x   3 spot supergroup        440 2017-02-23 20:45 /user/spot/flow/hive/y=2017/m=02/d=23/h=23/000000_0_copy_8
-rwxr-xr-x   3 spot supergroup        440 2017-02-23 20:50 /user/spot/flow/hive/y=2017/m=02/d=23/h=23/000000_0_copy_9

What did I do wrong?

 

Regards,

Joaquín Silva

1 ACCEPTED SOLUTION

Contributor
I found the issue: the parquet files were empty.


6 REPLIES

Champion
What is the output of 'show partitions <tablename>;'?

I suspect that the Hive Metastore (HMS) does not know about the partitions.

If you don't see any partitions in the output, you can add them manually using one of the commands below.

This will add all missing partitions that it finds in HDFS:

msck repair table <tablename>;

This will add one partition at a time:

alter table <tablename> add partition (y=2017,m=02,d=23,h=23);
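
For example, a minimal end-to-end check from the shell could look like this (just a sketch: it assumes the hive CLI is on the PATH and uses spotdb.flow, the table name that appears later in this thread):

# register any partition directories the metastore doesn't know about yet
hive -e "msck repair table spotdb.flow;"
# confirm the partitions are now visible
hive -e "show partitions spotdb.flow;"
# check whether the table returns any rows at all
hive -e "select count(*) from spotdb.flow;"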

Contributor

This is the result:

show partitions spotdb.flow;
OK
y=2017/m=02/d=23/h=20
y=2017/m=02/d=23/h=21
y=2017/m=02/d=23/h=22
y=2017/m=02/d=23/h=23
y=2017/m=02/d=24/h=00
y=2017/m=02/d=24/h=01
y=2017/m=02/d=24/h=02
y=2017/m=02/d=24/h=03
y=2017/m=02/d=24/h=04
y=2017/m=02/d=24/h=05
y=2017/m=02/d=24/h=06
y=2017/m=02/d=24/h=07
y=2017/m=02/d=24/h=08
y=2017/m=02/d=24/h=09
y=2017/m=02/d=24/h=10
y=2017/m=02/d=24/h=11
y=2017/m=02/d=24/h=12
y=2017/m=02/d=24/h=13
y=2017/m=02/d=24/h=14
y=2017/m=02/d=24/h=15
y=2017/m=02/d=24/h=16
y=2017/m=02/d=24/h=17
y=2017/m=02/d=24/h=18
y=2017/m=02/d=24/h=19
y=2017/m=02/d=24/h=20

As far as I can see, it recognizes the partitions.

 

Champion
Try reading one of the files directly. There are other parquet-tools options as well to validate the schema, etc.

/opt/cloudera/parcels/CDH/lib/parquet/bin/parquet-tools cat /path/to/parquet.file
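
A few other subcommands can help validate a file beyond cat (a sketch; the exact set of subcommands depends on the parquet-tools version shipped in your CDH parcel):

# print the schema stored in the file
/opt/cloudera/parcels/CDH/lib/parquet/bin/parquet-tools schema /path/to/parquet.file
# print the footer metadata, including the row count of each row group
/opt/cloudera/parcels/CDH/lib/parquet/bin/parquet-tools meta /path/to/parquet.file
# print the first few records
/opt/cloudera/parcels/CDH/lib/parquet/bin/parquet-tools head -n 5 /path/to/parquet.file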

Contributor

This is the result:

/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/parquet/bin/parquet-tools cat /user/spot/flow/hive/y=2017/m=02/d=24/h=20/000000_0
File /user/spot/flow/hive/y=2017/m=02/d=24/h=20/000000_0 does not exist

But the file does exist:

sudo -u hdfs hdfs dfs -ls /user/spot/flow/hive/y=2017/m=02/d=24/h=20
Found 12 items
-rwxr-xr-x   3 spot supergroup        440 2017-02-24 17:05 /user/spot/flow/hive/y=2017/m=02/d=24/h=20/000000_0
-rwxr-xr-x   3 spot supergroup        440 2017-02-24 17:10 /user/spot/flow/hive/y=2017/m=02/d=24/h=20/000000_0_copy_1
-rwxr-xr-x   3 spot supergroup        440 2017-02-24 17:55 /user/spot/flow/hive/y=2017/m=02/d=24/h=20/000000_0_copy_10
-rwxr-xr-x   3 spot supergroup        440 2017-02-24 18:00 /user/spot/flow/hive/y=2017/m=02/d=24/h=20/000000_0_copy_11
-rwxr-xr-x   3 spot supergroup        440 2017-02-24 17:15 /user/spot/flow/hive/y=2017/m=02/d=24/h=20/000000_0_copy_2
-rwxr-xr-x   3 spot supergroup        440 2017-02-24 17:20 /user/spot/flow/hive/y=2017/m=02/d=24/h=20/000000_0_copy_3
-rwxr-xr-x   3 spot supergroup        440 2017-02-24 17:25 /user/spot/flow/hive/y=2017/m=02/d=24/h=20/000000_0_copy_4
-rwxr-xr-x   3 spot supergroup        440 2017-02-24 17:30 /user/spot/flow/hive/y=2017/m=02/d=24/h=20/000000_0_copy_5
-rwxr-xr-x   3 spot supergroup        440 2017-02-24 17:35 /user/spot/flow/hive/y=2017/m=02/d=24/h=20/000000_0_copy_6
-rwxr-xr-x   3 spot supergroup        440 2017-02-24 17:40 /user/spot/flow/hive/y=2017/m=02/d=24/h=20/000000_0_copy_7
-rwxr-xr-x   3 spot supergroup        440 2017-02-24 17:45 /user/spot/flow/hive/y=2017/m=02/d=24/h=20/000000_0_copy_8
-rwxr-xr-x   3 spot supergroup        440 2017-02-24 17:50 /user/spot/flow/hive/y=2017/m=02/d=24/h=20/000000_0_copy_9

Champion
Are you sure they are parquet files? I haven't used the parquet tools in a while, so I don't know how they behave when asked to read a non-parquet file. I could see them reporting the file as not existing in that case, and it would also explain why you can't see the data through the table.

How was this data added to HDFS/table?
Is this a Hive or Impala table?
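
One way to check whether they really are parquet files (a sketch; the /tmp destination is only an example): copy one file out of HDFS and look for the 4-byte magic string PAR1 that every parquet file starts and ends with.

# pull one of the files listed above down to the local filesystem
sudo -u hdfs hdfs dfs -get /user/spot/flow/hive/y=2017/m=02/d=24/h=20/000000_0 /tmp/000000_0
# a valid parquet file begins and ends with the magic bytes PAR1
head -c 4 /tmp/000000_0; echo
tail -c 4 /tmp/000000_0; echo
# pointing parquet-tools at the local copy also rules out any HDFS path-resolution issue
/opt/cloudera/parcels/CDH/lib/parquet/bin/parquet-tools meta /tmp/000000_0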

Contributor
I found the issue: the parquet files were empty.
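
For anyone hitting the same symptom: an empty parquet file is not zero bytes, since it still carries a header, footer, and embedded schema, so a directory full of small, identically sized files (like the 440-byte files in the listings above) is worth checking for a zero row count. A quick sketch (reusing the /tmp copy from the earlier example):

# list per-file sizes for one partition directory
sudo -u hdfs hdfs dfs -du /user/spot/flow/hive/y=2017/m=02/d=24/h=20
# the footer metadata shows the definitive row count; 0 rows means the file is empty
/opt/cloudera/parcels/CDH/lib/parquet/bin/parquet-tools meta /tmp/000000_0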