Reply
Explorer
Posts: 13
Registered: ‎09-30-2018
Accepted Solution

issue with hive table

I just created table store as parquet in hive&impala in the HUE, but when I exec query sql, I found the issue:

the same sql which could exec successfully in impala, but find no record in hive. Sometimes it gave results with many columns values NULL which have values in impala query. I'm not sure whether it is a bug in hive; I tried the both "set parquet.column.index.access=true" and "ALTER TABLE abc SET TBLPROPERTIES ("parquet.column.index.access"="true");"  the sql still doesn't work in hive. The sql with easy logic just like :

 

select * from dbname.tablename ad where
ad.month = '2018_11'
and ad.day = '10'
and length(ad.x)>0
and (length(ad.y)=0 or ad.y is null)
and ad.z = 'z'
limit 100

 

And when I want to change the format to orc. I justed found the impala couldn't recognize this format. So how could I do to solve this?

Highlighted
Cloudera Employee
Posts: 739
Registered: ‎03-23-2015

Re: issue with hive table

Couple of questions to understand the issue a bit more:

1. did you insert the data using Impala or hive?
2. did SELECT * from Hive works OK?
3. when query is running, do you see any warning or error messages from HS2 log?
4. Have you tried to remove some conditions to see if some of them might be the cause of the issue?
5. What version of CDH are you using?

Thanks
Explorer
Posts: 13
Registered: ‎09-30-2018

Re: issue with hive table

1.not insert, handle kafka data to hdfs,and use hive table&impala table to do the query

2.the select * from Hive works fine,but sometimes some columns of the table may be NULL in the results

3.the sql ran successfully with no errors

4.yes, I tried. I found the "set parquet.column.index.access=true" worked before I do the query, but it just took effect in the session lifetime(about 30mins); that could be a tmp solution. Is there any idea could solve this totally? Or I should set the parms every time before I want to do a query for this table.

5.the CDH version is 5.14

 

Any way, thanks for your kind reply.

Explorer
Posts: 13
Registered: ‎09-30-2018

Re: issue with hive table

Is it a parquet format problem in hive table? Actually when I use scala spark to connect this table, still have no result, seems could not recognize the value of the columns.

Cloudera Employee
Posts: 739
Registered: ‎03-23-2015

Re: issue with hive table

If "set parquet.column.index.access=true", this means the names of the columns in Parquet data is different compared with the columns defined in Hive.

The workaround is to set this at the table level:

ALTER TABLE parquet_table SET TBLPROPERTIES ('parquet.column.index.access'='true');

Hope above can help.
Explorer
Posts: 13
Registered: ‎09-30-2018

Re: issue with hive table

Thanks EricL first for your kind reply.

 

Yes, I know it's a tmp plan to set the parameter, I also use it at the table level, the whole create table sql as below:

CREATE TABLE `adjust_data_new`(
2 `app_version` string,
3 `tracker` string,
4 `network_name` string,
5 `campaign_name` string,
6 `is_organic` string,
7 `rejection_reason` string,
8 `click_time` string,
9 `impression_time` string,
10 `engagement_time` string,
11 `installed_at` string,
12 `install_begin_time` string,
13 `created_at` bigint,
14 `reattributed_at` string,
15 `time_to_uninstall` string,
16 `time_to_reinstall` string,
17 `uninstalled_at` string,
18 `reinstalled_at` string,
19 `adid` string,
20 `idfa` string,
21 `android_id` string,
22 `idfv` string,
23 `gps_adid` string,
24 `region` string,
25 `country` string,
26 `city` string,
27 `device_type` string,
28 `os_name` string,
29 `timezone` string,
30 `event` string,
31 `event_name` string,
32 `deeplink` string,
33 `fb_campaign_name` string,
34 `fb_campaign_id` string,
35 `fb_adgroup_id` string,
36 `fb_account_id` string,
37 `activity_kind` string,
38 `created_time` bigint)
39 PARTITIONED BY (
40 `month` string,
41 `day` string)
42 ROW FORMAT SERDE
43 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
44 STORED AS INPUTFORMAT
45 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
46 OUTPUTFORMAT
47 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
48 LOCATION
49 'hdfs://nameservice1/user/hive/warehouse/kafka_table.db/adjust_data_new'
50 TBLPROPERTIES (
51 'DO_NOT_UPDATE_STATS'='true',
52 'STATS_GENERATED_VIA_STATS_TASK'='true',
53 'last_modified_by'='hive',
54 'last_modified_time'='1542947483',
55 'numRows'='570367760',
56 'parquet.column.index.access'='true',
57 'spark.sql.create.version'='2.2 or prior',
58 'spark.sql.sources.schema.numPartCols'='2',
59 'spark.sql.sources.schema.numParts'='1',
60 'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"app_version\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"tracker\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"network_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"campaign_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"is_organic\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"rejection_reason\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"click_time\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"impression_time\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"engagement_time\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"installed_at\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"install_begin_time\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"created_at\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"reattributed_at\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"time_to_uninstall\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"time_to_reinstall\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"uninstalled_at\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"reinstalled_at\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"adid\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"idfa\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"android_id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"idfv\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"gps_adid\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"region\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"country\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"city\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"device_type\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"os_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"timezone\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"event\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"event_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"deeplink\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"fb_campaign_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"fb_campaign_id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"fb_adgroup_id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"fb_account_id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"activity_kind\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"created_time\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"month\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"day\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}',
61 'spark.sql.sources.schema.partCol.0'='month',
62 'spark.sql.sources.schema.partCol.1'='day',
63 'totalSize'='24309750927',
64 'transient_lastDdlTime'='1542947483')

 

but when I do the query:

select count(*) from adjust_data_new

where month='2018_12'

and day='10'

and activity_kind='session'

it seems that the condition couldn't be recognized in hive table

 

in impala the result is:

count(*)

290430

 

but in hive the result is 0

 

I confused why the query by

 

select activity_kind from adjust_data_new

where month='2018_12'

and day='10' limit 100

 

could work fine in hive, and their are many 'session' in the result list.

 

 

Explorer
Posts: 13
Registered: ‎09-30-2018

Re: issue with hive table

[ Edited ]

Hi Ericl,

 

Finally we found the correct direction and got the solution as below:

 

change the hive-site.xml in system

 

add

 

<property>
    <name>parquet.column.index.access</name>
    <value>true</value>
    <description>set parquet index access</description>
</property>

 

then restart hive

 

Then everything is fine. This issue gave me deep impression,  hope this reply could help other guys who have the same issue like me or even guide a way to find the solution. Thanks in the end.