Support Questions

Find answers, ask questions, and share your expertise

how to split name based on special character in hive

avatar

I have a table with column name "fullname" which I would like to split into four columns (LAST_NAME, TITLE, FIRST_NAME, MIDDLE_NAME) while loading into another table.

If it is a person's name, the convention is LAST_NAME TITLE! FIRST_NAME MIDDLE_NAME:

for example I have "abc xxx! def ghi" in my table. This should be split and loaded in to 4 different columns.

the name should be split into LAST, FIRST, MID. Word Preceding the exclamation is TITLE (like xxx here).

If it is an organisation name ("Names of organisations are ended by an exclamation mark) : I should move the entire string to FIRST_NAME.

for example: abc systems! should be loaded to FIRST_NAME.

current state:

fullname
abc xxx! def ghi
abc systems!

expected result:

LastTitleFirstMiddle
abcxxxdefghi
abc systems

Can someone help how to write a query for the above requirement? Thanks in Advance!

3 REPLIES 3

avatar
Expert Contributor

You can use below sql to convert the content as required

0: jdbc:hive2://localhost> select * from temptable;
+---------------------+--+
| temptable.fullname  |
+---------------------+--+
| abc xxx! def ghi    |
| abc systems!        |
+---------------------+--+
0: jdbc:hive2://localhost> select a[0] as LAST, substr(a[1], 0, length(a[1])-1) as TITLE, a[2] as FIRST, a[3] as MID from (
0: jdbc:hive2://localhost> select case
0: jdbc:hive2://localhost> when instr(fullname, "!" )=length(fullname) then array('', '!', substr(fullname, 0, length(fullname)-1), '')
0: jdbc:hive2://localhost> else split(fullname, " ") end as a from temptable) t;
+-------+--------+--------------+------+--+
| last  | title  |    first     | mid  |
+-------+--------+--------------+------+--+
| abc   | xxx    | def          | ghi  |
|       |        | abc systems  |      |
+-------+--------+--------------+------+--+

If my answer helped to resolve the issue, please accept. It might help others in the community.

avatar

Tahnks Naresh for your response.

there is one more scenario which I received today along with above requirements.

If the individual name has only three words (I mean without title) then the fullname should be like below:

current state:

fullname
jkl! mno pqr

expected

LastTitleFirstMiddle
jklmnopqr

avatar
Expert Contributor

If the input is text data, its better to write a java utility which reads line by line, applies all the rules & re-writes it as four column as per conversion required. Then load that new file into hive to query it.
It will be cumbursome to do it in sql, as it needs another case statement inside case statement to validate the length of the split.