- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Create Hive Table from JSON Files
- Labels:
-
Apache Hive
-
Apache Impala
-
Cloudera Manager
-
HDFS
Created on ‎01-25-2018 04:52 AM - edited ‎09-16-2022 05:47 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi All,
I am trying to create a HIVE table from JSON file. Here , JSON file is extracted from Cloudera Manager (JSON file Content: IMPALA query report).
We use CDH5.9
Could you please help me on how to create a hive/impala table which reads the data from JSON file as underlying file?
JSON file content will look like below,
{
"queries" : [ {
"queryId" : "75462e3e4c5d7694:37c50ff300000000",
"statement" : "select * from XXXXXXXXXXXXX",
"queryType" : "QUERY",
"queryState" : "EXCEPTION",
"startTime" : "2018-01-22T15:51:24.231Z",
"endTime" : "2018-01-22T16:33:27.962Z",
"rowsProduced" : 0,
"attributes" : {
"thread_storage_wait_time" : "1",
"session_id" : "fd49d2f82ac32576:bcb0c36d65a6d98f",
"hdfs_bytes_read_remote_percentage" : "0",
"stats_missing" : "false",
"thread_total_time" : "361",
"thread_network_send_wait_time_percentage" : "0",
"delegated_user" : "XXXXXX",
"thread_network_receive_wait_time_percentage" : "9",
"network_address" : "XXXXXXXXXXX:38571",
"pool" : "default-pool",
"hdfs_bytes_read_from_cache" : "0",
"hdfs_bytes_read_local_percentage" : "100",
"hdfs_scanner_average_bytes_read_per_second" : "0.0",
"hdfs_bytes_read_from_cache_percentage" : "0",
"bytes_streamed" : "12535",
"session_type" : "HIVESERVER2",
"cm_cpu_milliseconds" : "326.997635",
"thread_network_receive_wait_time" : "31",
"estimated_per_node_peak_memory" : "33554432",
"query_status" : "Query 75462e3e4c5d7694:37c50ff300000000 expired due to client inactivity (timeout is 10m)\n",
"thread_storage_wait_time_percentage" : "1",
"memory_aggregate_peak" : "574801.92",
"admission_result" : "Admitted immediately",
"hdfs_average_scan_range" : "229336.0",
"planning_wait_time" : "7",
"oom" : "false",
"memory_accrual" : "8.015872E7",
"memory_spilled" : "0",
"thread_cpu_time_percentage" : "90",
"admission_wait" : "0",
"file_formats" : "TEXT/NONE",
"hdfs_bytes_read_local" : "229336",
"hdfs_bytes_read_short_circuit" : "229336",
"hdfs_bytes_read_short_circuit_percentage" : "100",
"planning_wait_time_percentage" : "0",
"client_fetch_wait_time" : "2523686",
"client_fetch_wait_time_percentage" : "100",
"memory_per_node_peak_node" : "XXXXXX:22000",
"memory_per_node_peak" : "574801.92",
"connected_user" : "hue/XXX@xx",
"hdfs_bytes_read_remote" : "0",
"thread_network_send_wait_time" : "0",
"impala_version" : "impalad version 2.7.0-cdh5.9.2 RELEASE (build 2f7871169d894fab16f8a2fb99f2e34f0df8763d)",
"thread_cpu_time" : "326",
"hdfs_bytes_read" : "229336"
},
"user" : "XXXXXX",
"coordinator" : {
"hostId" : "XXXXXXXX"
},
"detailsAvailable" : true,
"database" : "XXXXX",
"durationMillis" : 2523731
} ],
"warnings" : [ ]
}
Thanks,
Venkatesh Kumar
Created ‎01-25-2018 07:26 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Created ‎02-11-2018 06:50 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
We recommend using the JsonSerDe that comes with Hive.
You will have to install the JAR file into the Hive auxiliary directory. The JAR file is hive-hcatalog-core.jar and can be found in several places within the CDH distribution.
https://www.cloudera.com/documentation/enterprise/5-13-x/topics/cm_mc_hive_udf.html
Created ‎08-02-2018 06:49 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
This process will become easier in a future version of CDH.
Created ‎08-02-2018 06:58 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
David @David M.,
Actually, I'm looking to get the Impala logs with a querytext, start time, end time, memory, username, etc.... for tracking the user queries and to create the live dashboards like Cloudera Navigator but with free of cost.
We have Spark or UDF to create the table from JSON in Hive.
>>> df = sqlContext.read.json ("/user/venkata/lineage.json")
>>> df.saveAsTable("secure.lineage")
Thanks,
Venkat
Created on ‎08-01-2018 12:23 PM - edited ‎08-01-2018 12:24 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Kumar @Venkatesh_Kumar,
How did you get the IMPALA query report? I'm trying to download the Impala queries, So Could you help me.
Is there any location or way to download it?
I found the /var/log/impalad/lineage/ location for logs but I'm not able to stream those logs to table.
Thanks,
Venkat.
Created ‎08-02-2018 05:15 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
You can get it from Cloudera Manager.
Go to CM -> Impala -> Queries (tab) -> choose the time frame and click on 'export' button
Created ‎08-02-2018 05:28 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Saranvisa @saranvisa,
Thanks for the reply.
Not just for one time. Export is just manually thing but I would like to do streaming.
In my cluster, the minimum Impala queries : 300queries/30mints. huge. So....
Thanks,
Venkat
Created on ‎08-02-2018 11:42 AM - edited ‎08-02-2018 11:44 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Oh then you can use cloudera API
Link for cloudera API reference:
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/cm_intro_api.html
Link for specific to service properties (you may need to explore little for impala query). It may help you
