Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

HIVE Regex extract taking time

Solved Go to solution

HIVE Regex extract taking time

New Contributor

Part of my JSON File:

X_INF_RequestID=ashg7665hjh-as09-0ahgshd5767@ER X_Requested_With=XMLHttpRequest

Little background of data:

Dataset used: 1.5GB (189276 records)

Input HIVE table: JSON format with gzip compressed

Partitioned: Yes, it is partitioned and 21 partitions have been created for the above data.

I have 2 concerns here:

1. I'm using get_json_object and regexp_extract to extract the data for X_INF_RequestID and X_Requested_With. When I run the select query on the above dataset, the X_INF_RequestID is taking 26 secs to complete, but the X_Requested_With is taking way too long to complete (~1.5hrs). Is there anything that my regex is taking such a long time?

Query used:

SELECT REGEXP_EXTRACT((get_json_object(line, '$.result._raw')),'(.*)X_Requested_With=(\\w(.+?)\\s)',2) as X_Requested_With FROM tablename;

SELECT REGEXP_EXTRACT((get_json_object(line, '$.result._raw')),'(.*)X_INF_RequestID=(\\w(.+?)\\s)',2) as X_INF_RequestID FROM tablename;

Note: One observation from the data is the X_Requested_With is a combination of upper and lower case alphabets. How do I ignore case in Hive?

2. My blocksize is 128MB, which implies that for 1.5GB of data there should be 12 mappers running. But for every select query above it always triggers 21 mappers (which is equal to the number of partitions). Is the any reason behind this?

Any help on this is greatly appreciated. Thanks in advance!

1 ACCEPTED SOLUTION

Accepted Solutions

Re: HIVE Regex extract taking time

Expert Contributor

@Anusha Akula,

Let me start with #2.

Number of mappers is defined by number of input splits. Since you have 21 partitions, you have not less than 21 files, which determines the number of mappers you get (actually, you have exactly 21 files: 1 under each partition).

Regarding #1.

Regex you are using, is a bit complicated... Try to reduce its complexity:

SELECT REGEXP_EXTRACT((get_json_object(line, '$.result._raw')),'.*X_Requested_With=(\\w.+?)\\s',1) as X_INF_RequestID FROM tablename;
2 REPLIES 2

Re: HIVE Regex extract taking time

Expert Contributor

@Anusha Akula,

Let me start with #2.

Number of mappers is defined by number of input splits. Since you have 21 partitions, you have not less than 21 files, which determines the number of mappers you get (actually, you have exactly 21 files: 1 under each partition).

Regarding #1.

Regex you are using, is a bit complicated... Try to reduce its complexity:

SELECT REGEXP_EXTRACT((get_json_object(line, '$.result._raw')),'.*X_Requested_With=(\\w.+?)\\s',1) as X_INF_RequestID FROM tablename;

Re: HIVE Regex extract taking time

New Contributor

Thanks Ed! Appreciate your help. Ended up using the following which is almost similar to yours:

SELECT REGEXP_EXTRACT((get_json_object(line,'$.result._raw')),'X_Requested_With=(\\w.+?)\\s',1)as X_INF_RequestID FROM tablename;

Don't have an account?
Coming from Hortonworks? Activate your account here