
Multi-delimiter in Hive using regex


I have a file which has data like follows:

provider_specialty,provider_specialty_desc

1,General,Practice|third column

2,General Surgery|third column

3,Allergy/Immunology|third column

4,Otolaryngology|third column

5,Anesthesiology|third column

6,Cardiology|third column

7,Dermatology|third column

8,Family Practice|third column

9,Interventional Pain Management|third column

It has 3 columns: the first and second columns are separated by ',', and the second and third columns are separated by '|'.

Also, a few values in the second column contain a delimiter character as part of the value. Is it possible to achieve this in Hive using a multi-delimiter approach? Could anyone help with the logic to create the table for the data provided above?

4 REPLIES

Guru

@Bala Vignesh N V

You will need to do a little data engineering to prep your data for the Hive table ... basically, replacing the pipe | with a comma.

You can do this easily in Pig by running this script:

-- load each line of the source file as a single string
a = LOAD '/sourcefilepath' AS (fullrecord:chararray);
-- replace every pipe with a comma (the pipe must be escaped in the regex)
b = FOREACH a GENERATE REPLACE(fullrecord, '\\|', ',');
STORE b INTO '/targetfilepath' USING PigStorage(',');

You could also do this pipe replacement in sed before loading to HDFS. Pig is advantageous, however, because it runs in MapReduce or Tez and is much faster (parallel processing), especially for large files.

The fact that you have some values that include the delimiter is a problem ... unless there is a clear pattern, you will have to write a program that finds each record with too many delimiters and then replace those values one by one (e.g. replace 'new york, usa' with 'new york usa'). If you used Pig, the b = step would have to be repeated for each such value, as in the sketch below.
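
For example, here is a minimal sketch of that chained approach, assuming 'General,Practice' (from the sample data above) is the only value with an embedded comma:

a = LOAD '/sourcefilepath' AS (fullrecord:chararray);
-- one REPLACE per known bad value; 'General,Practice' comes from the sample data
b = FOREACH a GENERATE REPLACE(fullrecord, 'General,Practice', 'General Practice') AS fullrecord;
-- then swap the remaining pipe delimiter for a comma
c = FOREACH b GENERATE REPLACE(fullrecord, '\\|', ',');
STORE c INTO '/targetfilepath' USING PigStorage(',');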

If you are unfamiliar with Pig, this is a good tutorial showing how to implement the above: https://hortonworks.com/tutorial/how-to-process-data-with-apache-pig/


@Greg Keys Thanks. Actually, I have already implemented it the way you mentioned. But that approach is easier for files with a small number of columns; I wanted to build a job that stays manageable even with a large number of columns. In that context I came across multi-delimiter and regex approaches, and wanted to know how they can be implemented.

Expert Contributor

@Bala Vignesh N V

This is a perfect use case for the Hive regular expression SerDe. Below is an example external table that could read this. If you're not familiar with regular expressions, take a look here. With this SerDe, each capture group becomes a column: the first group reads up to the first comma, the second greedily takes everything up to the final pipe (so descriptions with embedded commas survive intact), and the third takes the remainder of the line. For additional information about the Hive Regex SerDe see here.

create external table example_table (
    col1 string,
    col2 string,
    col3 string
    )
    row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
    -- one capture group per column; group 2 is greedy, so embedded commas are preserved
    with serdeproperties ('input.regex'='^([^,]+),(.+)\\|([^|]+)$')
    stored as textfile;
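
As a quick sanity check, a query like the one below should return the three columns split correctly. Note that the example DDL omits a LOCATION clause, so the path here is hypothetical and the table would first need to be pointed at the data:

-- hypothetical path; point the table at the data before querying:
-- alter table example_table set location '/sourcefilepath';
select col1, col2, col3 from example_table limit 3;
-- expected rows, based on the sample data:
-- 1    General,Practice      third column
-- 2    General Surgery       third column
-- 3    Allergy/Immunology    third column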


Thanks @Shawn Weeks