Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

HIVE Regex extract taking time

avatar
Explorer

Part of my JSON File:

X_INF_RequestID=ashg7665hjh-as09-0ahgshd5767@ER X_Requested_With=XMLHttpRequest

Little background of data:

Dataset used: 1.5GB (189276 records)

Input HIVE table: JSON format with gzip compressed

Partitioned: Yes, it is partitioned and 21 partitions have been created for the above data.

I have 2 concerns here:

1. I'm using get_json_object and regexp_extract to extract the data for X_INF_RequestID and X_Requested_With. When I run the select query on the above dataset, the X_INF_RequestID is taking 26 secs to complete, but the X_Requested_With is taking way too long to complete (~1.5hrs). Is there anything that my regex is taking such a long time?

Query used:

SELECT REGEXP_EXTRACT((get_json_object(line, '$.result._raw')),'(.*)X_Requested_With=(\\w(.+?)\\s)',2) as X_Requested_With FROM tablename;

SELECT REGEXP_EXTRACT((get_json_object(line, '$.result._raw')),'(.*)X_INF_RequestID=(\\w(.+?)\\s)',2) as X_INF_RequestID FROM tablename;

Note: One observation from the data is the X_Requested_With is a combination of upper and lower case alphabets. How do I ignore case in Hive?

2. My blocksize is 128MB, which implies that for 1.5GB of data there should be 12 mappers running. But for every select query above it always triggers 21 mappers (which is equal to the number of partitions). Is the any reason behind this?

Any help on this is greatly appreciated. Thanks in advance!

1 ACCEPTED SOLUTION

avatar
Super Collaborator

@Anusha Akula,

Let me start with #2.

Number of mappers is defined by number of input splits. Since you have 21 partitions, you have not less than 21 files, which determines the number of mappers you get (actually, you have exactly 21 files: 1 under each partition).

Regarding #1.

Regex you are using, is a bit complicated... Try to reduce its complexity:

SELECT REGEXP_EXTRACT((get_json_object(line, '$.result._raw')),'.*X_Requested_With=(\\w.+?)\\s',1) as X_INF_RequestID FROM tablename;

View solution in original post

2 REPLIES 2

avatar
Super Collaborator

@Anusha Akula,

Let me start with #2.

Number of mappers is defined by number of input splits. Since you have 21 partitions, you have not less than 21 files, which determines the number of mappers you get (actually, you have exactly 21 files: 1 under each partition).

Regarding #1.

Regex you are using, is a bit complicated... Try to reduce its complexity:

SELECT REGEXP_EXTRACT((get_json_object(line, '$.result._raw')),'.*X_Requested_With=(\\w.+?)\\s',1) as X_INF_RequestID FROM tablename;

avatar
Explorer

Thanks Ed! Appreciate your help. Ended up using the following which is almost similar to yours:

SELECT REGEXP_EXTRACT((get_json_object(line,'$.result._raw')),'X_Requested_With=(\\w.+?)\\s',1)as X_INF_RequestID FROM tablename;