Created 04-18-2017 05:57 PM
Part of my JSON File:
X_INF_RequestID=ashg7665hjh-as09-0ahgshd5767@ER X_Requested_With=XMLHttpRequest
Little background of data:
Dataset used: 1.5GB (189276 records)
Input HIVE table: JSON format with gzip compressed
Partitioned: Yes, it is partitioned and 21 partitions have been created for the above data.
I have 2 concerns here:
1. I'm using get_json_object and regexp_extract to extract the data for X_INF_RequestID and X_Requested_With. When I run the select query on the above dataset, the X_INF_RequestID is taking 26 secs to complete, but the X_Requested_With is taking way too long to complete (~1.5hrs). Is there anything that my regex is taking such a long time?
Query used:
SELECT REGEXP_EXTRACT((get_json_object(line, '$.result._raw')),'(.*)X_Requested_With=(\\w(.+?)\\s)',2) as X_Requested_With FROM tablename;
SELECT REGEXP_EXTRACT((get_json_object(line, '$.result._raw')),'(.*)X_INF_RequestID=(\\w(.+?)\\s)',2) as X_INF_RequestID FROM tablename;
Note: One observation from the data is the X_Requested_With is a combination of upper and lower case alphabets. How do I ignore case in Hive?
2. My blocksize is 128MB, which implies that for 1.5GB of data there should be 12 mappers running. But for every select query above it always triggers 21 mappers (which is equal to the number of partitions). Is the any reason behind this?
Any help on this is greatly appreciated. Thanks in advance!
Created 04-19-2017 07:22 PM
Let me start with #2.
Number of mappers is defined by number of input splits. Since you have 21 partitions, you have not less than 21 files, which determines the number of mappers you get (actually, you have exactly 21 files: 1 under each partition).
Regarding #1.
Regex you are using, is a bit complicated... Try to reduce its complexity:
SELECT REGEXP_EXTRACT((get_json_object(line, '$.result._raw')),'.*X_Requested_With=(\\w.+?)\\s',1) as X_INF_RequestID FROM tablename;
Created 04-19-2017 07:22 PM
Let me start with #2.
Number of mappers is defined by number of input splits. Since you have 21 partitions, you have not less than 21 files, which determines the number of mappers you get (actually, you have exactly 21 files: 1 under each partition).
Regarding #1.
Regex you are using, is a bit complicated... Try to reduce its complexity:
SELECT REGEXP_EXTRACT((get_json_object(line, '$.result._raw')),'.*X_Requested_With=(\\w.+?)\\s',1) as X_INF_RequestID FROM tablename;
Created 04-20-2017 02:48 AM
Thanks Ed! Appreciate your help. Ended up using the following which is almost similar to yours:
SELECT REGEXP_EXTRACT((get_json_object(line,'$.result._raw')),'X_Requested_With=(\\w.+?)\\s',1)as X_INF_RequestID FROM tablename;