Support Questions

Find answers, ask questions, and share your expertise

How can i convert location field into 2 fields in Hive and how to store in hive Table ?

avatar
Contributor

Hi ,
I need to convert the location fields into 2 fields based on "=" , Means folder location in one column and partition details in another column

current scenario :

Location Partion Fields
hdfs://servername/sds/sds/erg/rownum=123/columnnumber=456

required scenario :


Location Partion Fields
hdfs://servername/sds/sds/erg/ rownum=123/columnnumber=456

Can you please help on this..!!!

Thanks in advance..!!

1 ACCEPTED SOLUTION

avatar
Master Guru
@Satya Nittala

Use hive Substr function and specify start/end position that you needed for each field

hive> select substr('hdfs://servername/sds/sds/erg/rownum=123/columnnumber=456',0,30) Location, substr('hdfs://servername/sds/sds/erg/rownum=123/columnnumber=456',31) PartionFields;
+---------------------------------+------------------------------+--+
|            location             |        partionfields         |
+---------------------------------+------------------------------+--+
| hdfs://servername/sds/sds/erg/  | rownum=123/columnnumber=456  |
+---------------------------------+------------------------------+--+

Please refer to this link for more details regarding hive string functions documentation.

-

If the Answer helped to resolve your issue, Click on Accept button below to accept the answer, That would be great help to Community users to find solution quickly for these kind of issues.

View solution in original post

8 REPLIES 8

avatar
Master Guru
@Satya Nittala

Use hive Substr function and specify start/end position that you needed for each field

hive> select substr('hdfs://servername/sds/sds/erg/rownum=123/columnnumber=456',0,30) Location, substr('hdfs://servername/sds/sds/erg/rownum=123/columnnumber=456',31) PartionFields;
+---------------------------------+------------------------------+--+
|            location             |        partionfields         |
+---------------------------------+------------------------------+--+
| hdfs://servername/sds/sds/erg/  | rownum=123/columnnumber=456  |
+---------------------------------+------------------------------+--+

Please refer to this link for more details regarding hive string functions documentation.

-

If the Answer helped to resolve your issue, Click on Accept button below to accept the answer, That would be great help to Community users to find solution quickly for these kind of issues.

avatar
Contributor

@Shu Thanks for Quick response

I had total 10 Lakh records like that , with different data and different string length

Above mentioned code only satisfying with that one record only

I need a generic code for all data , It should satisfy with all records,

Can u please suggest..!!!

Thanks in advance..!!!

avatar
Master Guru
@Satya Nittala

Please add some sample records with different string lengths, so that we can prepare the query that satisfies them.

avatar
Contributor

@Shu

Please find the below sample data

hdfs://servername/sds/sds/erg/rownum=123/columnnumber=456
hdfs://servername/sds/sds/erg/meta/data/ccdp/rownum=123/columnnumber=456
hdfs://servername/sds/sds/erg/calims/notes/ccdp/meta/mgntrownum=123/columnnumber=456
hdfs://servername/sds/sds/erg/calims/notes/ccdp/meta/mgnt/lo/rownum=123/columnnumber=456
hdfs://servername/sds/sds/erg/calims/notes/ccdp/meta/mgnt/lo/ap/rownum=123/columnnumber=456

Length is getting increasing for location , so how can we convert based on "=" sign

Can u please suggest..!!!

Thanks in advance..!!!

avatar
Master Guru

@Satya Nittala

Thanks for the sample records..!!

One thing i commonly observed pattern is second from the last / after you need that to be in partitionfields column before second from last / needs to be in location column. Split based on = will not give the desired results but split on rownum= will satisfies.

Using Regexp_extract function:

1.using rownum= matching and extracting

hive> select regexp_extract('hdfs://servername/sds/sds/erg/rownum=123/columnnumber=456',"(.*)(rownum=.*)",1) Location,regexp_extract('hdfs://servername/sds/sds/erg/rownum=123/columnnumber=456',"(.*)(rownum=.*)",2) partitionfields;
+---------------------------------+------------------------------+--+
|            location             |       partitionfields        |
+---------------------------------+------------------------------+--+
| hdfs://servername/sds/sds/erg/  | rownum=123/columnnumber=456  |
+---------------------------------+------------------------------+--+
<br>

2.Using forward slashes and extracting

hive> select regexp_extract('hdfs://servername/sds/sds/erg/rownum=123/columnnumber=456',"(.*(?:\\/))(.*(?:\\/).*)",1) Location,regexp_extract('hdfs://servername/sds/sds/erg/rownum=123/columnnumber=456',"(.*(?:\\/))(.*(?:\\/).*)",2) partitionfields;
+---------------------------------+------------------------------+--+
|            location             |       partitionfields        |
+---------------------------------+------------------------------+--+
| hdfs://servername/sds/sds/erg/  | rownum=123/columnnumber=456  |
+---------------------------------+------------------------------+--+

Using Split Function:

1.Split the string based on rownum=

as we are doing split on rownum= so we are going to miss the rownum= in partitionfields data we need to concat rownum= to get the correct data in partitionfields column.

hive> select split('hdfs://servername/sds/sds/erg/rownum=123/columnnumber=456',"rownum=")[0] Location,concat('rownum=',split('hdfs://servername/sds/sds/erg/rownum=123/columnnumber=456',"rownum=")[1]) partitionfields;
+---------------------------------+------------------------------+--+
|            location             |       partitionfields        |
+---------------------------------+------------------------------+--+
| hdfs://servername/sds/sds/erg/  | rownum=123/columnnumber=456  |
+---------------------------------+------------------------------+--+

Use the any of the above functions which best fits for your case if you still want to split on = then tweak the above regexp_extract (or) split functions to get the desired results.

avatar
Contributor

@Shu , Thanks for your quick Suggestions

Mentioned function is only work if location field is having "rownum="

i had a data without "rownum="

hdfs://servername/sds/sds/erg/tabledata=456/size=123/length=456
hdfs://servername/sds/sds/erg/meta/data/ccdp/hive _table=test/size=123/length=456
hdfs://servername/sds/sds/erg/calims/notes/ccdp/meta/mgnt/work_space=895/rownum=123/columnnumber=456

2.Using forward slashes and extracting method also not working if location fields changes with different data


Like this this records many or there with different variables using for partitioning , I need partitioning fields in the another column , Starting '=" fields in the partition columns

Can you please help on this ..!!!

Thanks in advance...!!

avatar
Contributor

@Shu Please find the Sample data

hdfs://Test/lob/ebia/publish/gss_attribute_details_pub/load_dt=20170303 hdfs://eualty/lob/bi/publish/ccdp/v1/pub/gss_attribute_details_pub/load_dt=20170322 hdfs://Test/lob/ebia/source/ebia/ccdp_genesys_fmr_router_core/load_date=2018-04-29 hdfs://eualty/lob/bi/source/ebia/ccdp_genesys_fmr_router_core/load_date=2018-04-30 hdfs://Test/lob/ebia/publish/ebia/interaction_fact_core/load_date=2017-12-16 hdfs://Test/lob/ebia/publish/ccdp/v1/core/interaction_fact_core/load_date=2017-12-21 hdfs://eualty/lob/bi/publish/ccdp/v1/pub/task_event_details_pub/load_date=2018-04-26/historical_intraday_record=I hdfs://Test/lob/bi/publish/pi/v1/pub/task_event_details_pub/load_date=2018-05-22/historical_intraday_record=I hdfs://eualty/lob/ebia/publish/bi/avaya_call_details_pub/year=2018/quarter=1/month=1 hdfs://eualty/lob/bi/publish/ebia/avaya_call_details_pub/year=2017/quarter=4/month=11 hdfs://Test/lob/ebia/project/pi/avaya_call_details_pub/year=2018/quarter=1/month=3 hdfs://Test/lob/bi/project/bi/avaya_call_details_pub/year=2017/quarter=3/month=8 hdfs://eualty/lob/ebia/project/ebia/avaya_call_details_pub/year=2018/quarter=2/month=5

avatar
Master Guru
@Satya Nittala

As the data is not in the consistent format i have 2 step method to populate field values correctly.

Step1:

Create an temporary table to store this intermediate step data
Split the data on = character and populate the data

hive> select split('hdfs://Test/lob/ebia/publish/gss_attribute_details_pub/load_dt=20170303',"=")[0] Location,split('hdfs://Test/lob/ebia/publish/gss_attribute_details_pub/load_dt=20170303',"=")[1] partitionfields;
+-----------------------------------------------------------------+------------------+--+
|                            location                             | partitionfields  |
+-----------------------------------------------------------------+------------------+--+
| hdfs://Test/lob/ebia/publish/gss_attribute_details_pub/load_dt  | 20170303         |
+-----------------------------------------------------------------+------------------+--+

We are still missing load_dt in partitionfields column data

Step2:

select from temporary table data and insert into final table

hive> select regexp_extract("hdfs://Test/lob/ebia/publish/gss_attribute_details_pub/load_dt","(.*)\\/",1) location,concat_ws("=",reverse(split(reverse('hdfs://Test/lob/ebia/publish/gss_attribute_details_pub/load_dt'), '/')[0]),"partitionfields") partitionfields;
+---------------------------------------------------------+--------------------------+--+
|                        location                         |     partitionfields      |
+---------------------------------------------------------+--------------------------+--+
| hdfs://Test/lob/ebia/publish/gss_attribute_details_pub  | load_dt=partitionfields  |
+---------------------------------------------------------+--------------------------+--+

In step2 first i'm extracting only data before last / i.e after / the data needs to go to partitionfields column, and by using concat_ws i'm joining partitionfields column data and using reverse function with split on / then extracting [0] position value, concatenating with partitionfields value.

(or)

By using two regular expression extracts to prepare the actual final data for the columns

hive> select regexp_extract("hdfs://Test/lob/ebia/publish/gss_attribute_details_pub/load_dt","(.*)\\/",1) location,concat_ws("=",regexp_extract('hdfs://Test/lob/ebia/publish/gss_attribute_details_pub/load_dt', '(.*)\/(.*)',2),"partitionfields") partitionfields;
+---------------------------------------------------------+--------------------------+--+
|                        location                         |     partitionfields      |
+---------------------------------------------------------+--------------------------+--+
| hdfs://Test/lob/ebia/publish/gss_attribute_details_pub  | load_dt=partitionfields  |
+---------------------------------------------------------+--------------------------+--+


I hope this will work for your case correctly.