Created on 11-26-2018 01:31 AM - edited 09-16-2022 06:55 AM
I just created a table stored as Parquet in both Hive and Impala through Hue, but when I run queries I hit a problem:
the same SQL that runs successfully in Impala finds no records in Hive. Sometimes it returns rows in which many column values are NULL, even though those columns have values in the Impala results. I'm not sure whether this is a bug in Hive. I tried both "set parquet.column.index.access=true" and "ALTER TABLE abc SET TBLPROPERTIES ("parquet.column.index.access"="true");", but the SQL still doesn't work in Hive. The SQL logic is simple, for example:
select * from dbname.tablename ad where
ad.month = '2018_11'
and ad.day = '10'
and length(ad.x)>0
and (length(ad.y)=0 or ad.y is null)
and ad.z = 'z'
limit 100
I also tried changing the format to ORC, but then I found that Impala couldn't recognize that format. What can I do to solve this?
Created on 12-11-2018 01:44 AM - edited 12-11-2018 01:45 AM
Hi EricL,
We finally found the right direction and got the solution below:
edit hive-site.xml on the system and add
<property>
  <name>parquet.column.index.access</name>
  <value>true</value>
  <description>set parquet index access</description>
</property>
then restart Hive.
After that, everything works. This issue left a deep impression on me; I hope this reply helps others who hit the same issue, or at least points them toward a solution. Thanks again.
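For completeness, one way to confirm the change took effect after the restart is to check the property and re-run the failing query in a fresh session (a sketch using the table and predicates from this thread; adjust for your environment):

```sql
-- In a new Hive session, print the effective value of the property;
-- after the hive-site.xml change and restart, it should be 'true'.
SET parquet.column.index.access;

-- Re-run the count that previously returned 0 in Hive;
-- it should now match the Impala result.
SELECT count(*) FROM adjust_data_new
WHERE month = '2018_12'
  AND day = '10'
  AND activity_kind = 'session';
```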
Created 11-26-2018 03:18 AM
Created 11-26-2018 11:27 PM
1. Not insert; we land Kafka data onto HDFS and query it through both a Hive table and an Impala table.
2. The select * from the Hive table works fine, but sometimes some columns of the table come back NULL in the results.
3. The SQL ran successfully with no errors.
4. Yes, I tried. "set parquet.column.index.access=true" works if I run it before the query, but it only takes effect for the session lifetime (about 30 minutes), so it is only a temporary workaround. Is there any way to solve this permanently, or do I have to set the parameter every time before querying this table?
5. The CDH version is 5.14.
Anyway, thanks for your kind reply.
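For reference, the session-level workaround mentioned in point 4 looks like this (a sketch using the placeholder names from the original question; the property must be set again in every new Hive session):

```sql
-- Takes effect only for the lifetime of the current Hive session:
SET parquet.column.index.access=true;

-- Queries run afterwards in the same session resolve Parquet
-- columns by position instead of by name:
SELECT * FROM dbname.tablename ad
WHERE ad.month = '2018_11'
  AND ad.day = '10'
LIMIT 100;
```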
Created 12-06-2018 01:51 AM
Could it be a Parquet format problem in the Hive table? When I use Scala Spark to connect to this table, I also get no results; it seems the column values can't be recognized.
Created 12-09-2018 07:24 PM
Created 12-09-2018 08:16 PM
Thanks EricL for your kind reply.
Yes, I know setting the parameter is only a temporary plan; I also set it at the table level. The whole CREATE TABLE statement is below:
CREATE TABLE `adjust_data_new`(
  `app_version` string,
  `tracker` string,
  `network_name` string,
  `campaign_name` string,
  `is_organic` string,
  `rejection_reason` string,
  `click_time` string,
  `impression_time` string,
  `engagement_time` string,
  `installed_at` string,
  `install_begin_time` string,
  `created_at` bigint,
  `reattributed_at` string,
  `time_to_uninstall` string,
  `time_to_reinstall` string,
  `uninstalled_at` string,
  `reinstalled_at` string,
  `adid` string,
  `idfa` string,
  `android_id` string,
  `idfv` string,
  `gps_adid` string,
  `region` string,
  `country` string,
  `city` string,
  `device_type` string,
  `os_name` string,
  `timezone` string,
  `event` string,
  `event_name` string,
  `deeplink` string,
  `fb_campaign_name` string,
  `fb_campaign_id` string,
  `fb_adgroup_id` string,
  `fb_account_id` string,
  `activity_kind` string,
  `created_time` bigint)
PARTITIONED BY (
  `month` string,
  `day` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://nameservice1/user/hive/warehouse/kafka_table.db/adjust_data_new'
TBLPROPERTIES (
  'DO_NOT_UPDATE_STATS'='true',
  'STATS_GENERATED_VIA_STATS_TASK'='true',
  'last_modified_by'='hive',
  'last_modified_time'='1542947483',
  'numRows'='570367760',
  'parquet.column.index.access'='true',
  'spark.sql.create.version'='2.2 or prior',
  'spark.sql.sources.schema.numPartCols'='2',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"app_version\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"tracker\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"network_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"campaign_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"is_organic\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"rejection_reason\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"click_time\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"impression_time\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"engagement_time\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"installed_at\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"install_begin_time\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"created_at\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"reattributed_at\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"time_to_uninstall\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"time_to_reinstall\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"uninstalled_at\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"reinstalled_at\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"adid\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"idfa\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"android_id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"idfv\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"gps_adid\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"region\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"country\",\"type\":\"string\",\"nullable\":true
,\"metadata\":{}},{\"name\":\"city\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"device_type\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"os_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"timezone\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"event\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"event_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"deeplink\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"fb_campaign_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"fb_campaign_id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"fb_adgroup_id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"fb_account_id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"activity_kind\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"created_time\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"month\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"day\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}',
  'spark.sql.sources.schema.partCol.0'='month',
  'spark.sql.sources.schema.partCol.1'='day',
  'totalSize'='24309750927',
  'transient_lastDdlTime'='1542947483')
But when I run this query:
select count(*) from adjust_data_new
where month='2018_12'
and day='10'
and activity_kind='session'
it seems the condition isn't recognized by the Hive table.
In Impala the result is:
count(*)
290430
but in Hive the result is 0.
I'm confused why the query
select activity_kind from adjust_data_new
where month='2018_12'
and day='10' limit 100
works fine in Hive, and there are many 'session' values in the result list.
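One way to narrow this down is to aggregate on the predicate column itself, which shows whether Hive can decode the values it is supposed to filter on (a diagnostic sketch, not from the original thread):

```sql
-- If this lists an 'activity_kind' of 'session' with a non-zero count
-- while the filtered count(*) still returns 0, the failure is in the
-- predicate evaluation rather than the data itself, which points at
-- how Hive resolves the Parquet column.
SELECT activity_kind, count(*) AS cnt
FROM adjust_data_new
WHERE month = '2018_12'
  AND day = '10'
GROUP BY activity_kind;
```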