Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

How can i convert location field into 2 fields in Hive and how to store in hive Table ?

avatar
Contributor

Hi ,
I need to convert the location fields into 2 fields based on "=" , Means folder location in one column and partition details in another column

current scenario :

Location Partion Fields
hdfs://servername/sds/sds/erg/rownum=123/columnnumber=456

required scenario :


Location Partion Fields
hdfs://servername/sds/sds/erg/ rownum=123/columnnumber=456

Can you please help on this..!!!

Thanks in advance..!!

1 ACCEPTED SOLUTION

avatar
Master Guru
hide-solution

This problem has been solved!

Want to get a detailed solution you have to login/registered on the community

Register/Login
8 REPLIES 8

avatar
Master Guru
hide-solution

This problem has been solved!

Want to get a detailed solution you have to login/registered on the community

Register/Login

avatar
Contributor

@Shu Thanks for Quick response

I had total 10 Lakh records like that , with different data and different string length

Above mentioned code only satisfying with that one record only

I need a generic code for all data , It should satisfy with all records,

Can u please suggest..!!!

Thanks in advance..!!!

avatar
Master Guru
@Satya Nittala

Please add some sample records with different string lengths, so that we can prepare the query that satisfies them.

avatar
Contributor

@Shu

Please find the below sample data

hdfs://servername/sds/sds/erg/rownum=123/columnnumber=456
hdfs://servername/sds/sds/erg/meta/data/ccdp/rownum=123/columnnumber=456
hdfs://servername/sds/sds/erg/calims/notes/ccdp/meta/mgntrownum=123/columnnumber=456
hdfs://servername/sds/sds/erg/calims/notes/ccdp/meta/mgnt/lo/rownum=123/columnnumber=456
hdfs://servername/sds/sds/erg/calims/notes/ccdp/meta/mgnt/lo/ap/rownum=123/columnnumber=456

Length is getting increasing for location , so how can we convert based on "=" sign

Can u please suggest..!!!

Thanks in advance..!!!

avatar
Master Guru

@Satya Nittala

Thanks for the sample records..!!

One thing i commonly observed pattern is second from the last / after you need that to be in partitionfields column before second from last / needs to be in location column. Split based on = will not give the desired results but split on rownum= will satisfies.

Using Regexp_extract function:

1.using rownum= matching and extracting

hive> select regexp_extract('hdfs://servername/sds/sds/erg/rownum=123/columnnumber=456',"(.*)(rownum=.*)",1) Location,regexp_extract('hdfs://servername/sds/sds/erg/rownum=123/columnnumber=456',"(.*)(rownum=.*)",2) partitionfields;
+---------------------------------+------------------------------+--+
|            location             |       partitionfields        |
+---------------------------------+------------------------------+--+
| hdfs://servername/sds/sds/erg/  | rownum=123/columnnumber=456  |
+---------------------------------+------------------------------+--+
<br>

2.Using forward slashes and extracting

hive> select regexp_extract('hdfs://servername/sds/sds/erg/rownum=123/columnnumber=456',"(.*(?:\\/))(.*(?:\\/).*)",1) Location,regexp_extract('hdfs://servername/sds/sds/erg/rownum=123/columnnumber=456',"(.*(?:\\/))(.*(?:\\/).*)",2) partitionfields;
+---------------------------------+------------------------------+--+
|            location             |       partitionfields        |
+---------------------------------+------------------------------+--+
| hdfs://servername/sds/sds/erg/  | rownum=123/columnnumber=456  |
+---------------------------------+------------------------------+--+

Using Split Function:

1.Split the string based on rownum=

as we are doing split on rownum= so we are going to miss the rownum= in partitionfields data we need to concat rownum= to get the correct data in partitionfields column.

hive> select split('hdfs://servername/sds/sds/erg/rownum=123/columnnumber=456',"rownum=")[0] Location,concat('rownum=',split('hdfs://servername/sds/sds/erg/rownum=123/columnnumber=456',"rownum=")[1]) partitionfields;
+---------------------------------+------------------------------+--+
|            location             |       partitionfields        |
+---------------------------------+------------------------------+--+
| hdfs://servername/sds/sds/erg/  | rownum=123/columnnumber=456  |
+---------------------------------+------------------------------+--+

Use the any of the above functions which best fits for your case if you still want to split on = then tweak the above regexp_extract (or) split functions to get the desired results.

avatar
Contributor

@Shu , Thanks for your quick Suggestions

Mentioned function is only work if location field is having "rownum="

i had a data without "rownum="

hdfs://servername/sds/sds/erg/tabledata=456/size=123/length=456
hdfs://servername/sds/sds/erg/meta/data/ccdp/hive _table=test/size=123/length=456
hdfs://servername/sds/sds/erg/calims/notes/ccdp/meta/mgnt/work_space=895/rownum=123/columnnumber=456

2.Using forward slashes and extracting method also not working if location fields changes with different data


Like this this records many or there with different variables using for partitioning , I need partitioning fields in the another column , Starting '=" fields in the partition columns

Can you please help on this ..!!!

Thanks in advance...!!

avatar
Contributor

@Shu Please find the Sample data

hdfs://Test/lob/ebia/publish/gss_attribute_details_pub/load_dt=20170303 hdfs://eualty/lob/bi/publish/ccdp/v1/pub/gss_attribute_details_pub/load_dt=20170322 hdfs://Test/lob/ebia/source/ebia/ccdp_genesys_fmr_router_core/load_date=2018-04-29 hdfs://eualty/lob/bi/source/ebia/ccdp_genesys_fmr_router_core/load_date=2018-04-30 hdfs://Test/lob/ebia/publish/ebia/interaction_fact_core/load_date=2017-12-16 hdfs://Test/lob/ebia/publish/ccdp/v1/core/interaction_fact_core/load_date=2017-12-21 hdfs://eualty/lob/bi/publish/ccdp/v1/pub/task_event_details_pub/load_date=2018-04-26/historical_intraday_record=I hdfs://Test/lob/bi/publish/pi/v1/pub/task_event_details_pub/load_date=2018-05-22/historical_intraday_record=I hdfs://eualty/lob/ebia/publish/bi/avaya_call_details_pub/year=2018/quarter=1/month=1 hdfs://eualty/lob/bi/publish/ebia/avaya_call_details_pub/year=2017/quarter=4/month=11 hdfs://Test/lob/ebia/project/pi/avaya_call_details_pub/year=2018/quarter=1/month=3 hdfs://Test/lob/bi/project/bi/avaya_call_details_pub/year=2017/quarter=3/month=8 hdfs://eualty/lob/ebia/project/ebia/avaya_call_details_pub/year=2018/quarter=2/month=5

avatar
Master Guru
@Satya Nittala

As the data is not in the consistent format i have 2 step method to populate field values correctly.

Step1:

Create an temporary table to store this intermediate step data
Split the data on = character and populate the data

hive> select split('hdfs://Test/lob/ebia/publish/gss_attribute_details_pub/load_dt=20170303',"=")[0] Location,split('hdfs://Test/lob/ebia/publish/gss_attribute_details_pub/load_dt=20170303',"=")[1] partitionfields;
+-----------------------------------------------------------------+------------------+--+
|                            location                             | partitionfields  |
+-----------------------------------------------------------------+------------------+--+
| hdfs://Test/lob/ebia/publish/gss_attribute_details_pub/load_dt  | 20170303         |
+-----------------------------------------------------------------+------------------+--+

We are still missing load_dt in partitionfields column data

Step2:

select from temporary table data and insert into final table

hive> select regexp_extract("hdfs://Test/lob/ebia/publish/gss_attribute_details_pub/load_dt","(.*)\\/",1) location,concat_ws("=",reverse(split(reverse('hdfs://Test/lob/ebia/publish/gss_attribute_details_pub/load_dt'), '/')[0]),"partitionfields") partitionfields;
+---------------------------------------------------------+--------------------------+--+
|                        location                         |     partitionfields      |
+---------------------------------------------------------+--------------------------+--+
| hdfs://Test/lob/ebia/publish/gss_attribute_details_pub  | load_dt=partitionfields  |
+---------------------------------------------------------+--------------------------+--+

In step2 first i'm extracting only data before last / i.e after / the data needs to go to partitionfields column, and by using concat_ws i'm joining partitionfields column data and using reverse function with split on / then extracting [0] position value, concatenating with partitionfields value.

(or)

By using two regular expression extracts to prepare the actual final data for the columns

hive> select regexp_extract("hdfs://Test/lob/ebia/publish/gss_attribute_details_pub/load_dt","(.*)\\/",1) location,concat_ws("=",regexp_extract('hdfs://Test/lob/ebia/publish/gss_attribute_details_pub/load_dt', '(.*)\/(.*)',2),"partitionfields") partitionfields;
+---------------------------------------------------------+--------------------------+--+
|                        location                         |     partitionfields      |
+---------------------------------------------------------+--------------------------+--+
| hdfs://Test/lob/ebia/publish/gss_attribute_details_pub  | load_dt=partitionfields  |
+---------------------------------------------------------+--------------------------+--+


I hope this will work for your case correctly.