Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

HIVE Regex extract taking time

avatar
Explorer

Part of my JSON File:

X_INF_RequestID=ashg7665hjh-as09-0ahgshd5767@ER X_Requested_With=XMLHttpRequest

Little background of data:

Dataset used: 1.5GB (189276 records)

Input HIVE table: JSON format with gzip compressed

Partitioned: Yes, it is partitioned and 21 partitions have been created for the above data.

I have 2 concerns here:

1. I'm using get_json_object and regexp_extract to extract the data for X_INF_RequestID and X_Requested_With. When I run the select query on the above dataset, the X_INF_RequestID is taking 26 secs to complete, but the X_Requested_With is taking way too long to complete (~1.5hrs). Is there anything that my regex is taking such a long time?

Query used:

SELECT REGEXP_EXTRACT((get_json_object(line, '$.result._raw')),'(.*)X_Requested_With=(\\w(.+?)\\s)',2) as X_Requested_With FROM tablename;

SELECT REGEXP_EXTRACT((get_json_object(line, '$.result._raw')),'(.*)X_INF_RequestID=(\\w(.+?)\\s)',2) as X_INF_RequestID FROM tablename;

Note: One observation from the data is the X_Requested_With is a combination of upper and lower case alphabets. How do I ignore case in Hive?

2. My blocksize is 128MB, which implies that for 1.5GB of data there should be 12 mappers running. But for every select query above it always triggers 21 mappers (which is equal to the number of partitions). Is the any reason behind this?

Any help on this is greatly appreciated. Thanks in advance!

1 ACCEPTED SOLUTION

avatar
Super Collaborator
hide-solution

This problem has been solved!

Want to get a detailed solution you have to login/registered on the community

Register/Login
2 REPLIES 2

avatar
Super Collaborator
hide-solution

This problem has been solved!

Want to get a detailed solution you have to login/registered on the community

Register/Login

avatar
Explorer

Thanks Ed! Appreciate your help. Ended up using the following which is almost similar to yours:

SELECT REGEXP_EXTRACT((get_json_object(line,'$.result._raw')),'X_Requested_With=(\\w.+?)\\s',1)as X_INF_RequestID FROM tablename;