Support Questions

Find answers, ask questions, and share your expertise

split column by regex and create a table

avatar
Expert Contributor

Hi,

I have a table in hive like this:

Ticket#All_Comments
atimestamp1 repid1 comments1
timestamp2 repid2 comments2 timestamp3 repid3 comments3
btimestamp1 repid1 comments1 timestamp2 repid2 comments2

timestamp3 repid3 comments3 timestamp4 repid4 comments4

Ticket#TimestampRepidComments
atimestamp1repid1comments1
atimestamp2repid2comments2
atimestamp3repid3comments3
btimestamp1repid1comments1
btimestamp2repid2comments2
btimestamp3repid3comments3
btimestamp4repid4comments4

I am using pyspark to do this.

Comments are a set of words any length.

I have regex to catch the timestamp and rep id togetehr to split the string by regex ... so this will separate comments.

Not sure how to proceed. Please help.

8 REPLIES 8

avatar
Super Collaborator

Hi @PJ,

you can perform the same in Hive (under the hive context SQL) using the lateral view explode

SELECT Ticket, pld
FROM <your Table> LATERAL VIEW explode(split(All_Comments,"<expression to split>")) expTBL AS pld;

the split will convert the All_elements into Array of Strings(you can use the Regex what you are after to split the time between timestamp and comments).

now the explode convert the uneven column length ( array ) into each element into a row.

I have tested the same in spark aswell and did get the output in desired manner.

yourDf.registerTempTable("someData")
hqlContext.sql("""SELECT Ticket, pld FROM someData LATERAL VIEW explode(split(All_Comments,"<expression to split>")) expTBL AS pld;""")

Hope this helps !!

avatar
Expert Contributor

@bkosaraju

Thanks for the reply. when i do this, it is giving me only first match and also the expression i used to split(ts_repid) is not showing in the output.

ts_repid is timestamp<space>repid

Also, expTBL will create a new table? how do i actually show the output of the above command? i tried show() and it only showed first row..

Thanks again.

avatar
Super Collaborator

Hi @PJ,

To get the spilts you need to pas two arguments first one is the column name and the 2nd one is the regular expression to split the content of the column.

the output column is the Array of strings ( the 2nd value can be viewed by specifying the index ex: res[2] ),

However explode takes array as input and convert that into the rows (the above pseudo code I have ran in my environment and able to achieve the output you mentioned)

on the other node expTBL is syntax notation and you can keep anything in place of that(need not to be expTBL).

Hope this helps !!

avatar
Expert Contributor

@bkosaraju

it works but i am getting only 1 value, first match . what to do to get all matches?

avatar
Super Collaborator

Hi @PJ,

Could you please let me know what the separator you are using split returns an array of elements, you could test with the following sql

select Ticket, split(All_Comments,' <separator you specified to split>') from <your table>

Hope this helps !!

avatar
Expert Contributor

@bkosaraju

The separator I used is

Select tickets, split(all_comments, ‘$’) from table

I also used another separator which says “endhere” but nothing works.

Thanks.

avatar
Super Collaborator

Hi @PJ,

'$' signifies end of line in regular expression, that would be the reason to get the content as one split, you can use the escape sequence to handle that with split(all_comments,'\\$').

Hope this helps !!

avatar
Expert Contributor

@bkosaraju

Thanks a lot, the splitting part works.... but i am still getting only the first match ... how do i get all matches?