Created on 12-18-2017 10:57 PM - edited 09-16-2022 05:39 AM
Hi,
I have a table in hive like this:
Ticket# | All_Comments |
a | timestamp1 repid1 comments1 timestamp2 repid2 comments2 timestamp3 repid3 comments3 |
b | timestamp1 repid1 comments1 timestamp2 repid2 comments2 timestamp3 repid3 comments3 timestamp4 repid4 comments4 |
Ticket# | Timestamp | Repid | Comments |
a | timestamp1 | repid1 | comments1 |
a | timestamp2 | repid2 | comments2 |
a | timestamp3 | repid3 | comments3 |
b | timestamp1 | repid1 | comments1 |
b | timestamp2 | repid2 | comments2 |
b | timestamp3 | repid3 | comments3 |
b | timestamp4 | repid4 | comments4 |
I am using pyspark to do this.
Comments are a set of words any length.
I have regex to catch the timestamp and rep id togetehr to split the string by regex ... so this will separate comments.
Not sure how to proceed. Please help.
Created 12-19-2017 04:56 AM
Hi @PJ,
you can perform the same in Hive (under the hive context SQL) using the lateral view explode
SELECT Ticket, pld FROM <your Table> LATERAL VIEW explode(split(All_Comments,"<expression to split>")) expTBL AS pld;
the split will convert the All_elements into Array of Strings(you can use the Regex what you are after to split the time between timestamp and comments).
now the explode convert the uneven column length ( array ) into each element into a row.
I have tested the same in spark aswell and did get the output in desired manner.
yourDf.registerTempTable("someData") hqlContext.sql("""SELECT Ticket, pld FROM someData LATERAL VIEW explode(split(All_Comments,"<expression to split>")) expTBL AS pld;""")
Hope this helps !!
Created 12-19-2017 06:57 PM
Thanks for the reply. when i do this, it is giving me only first match and also the expression i used to split(ts_repid) is not showing in the output.
ts_repid is timestamp<space>repid
Also, expTBL will create a new table? how do i actually show the output of the above command? i tried show() and it only showed first row..
Thanks again.
Created 12-20-2017 01:31 AM
Hi @PJ,
To get the spilts you need to pas two arguments first one is the column name and the 2nd one is the regular expression to split the content of the column.
the output column is the Array of strings ( the 2nd value can be viewed by specifying the index ex: res[2] ),
However explode takes array as input and convert that into the rows (the above pseudo code I have ran in my environment and able to achieve the output you mentioned)
on the other node expTBL is syntax notation and you can keep anything in place of that(need not to be expTBL).
Hope this helps !!
Created 12-20-2017 07:42 PM
it works but i am getting only 1 value, first match . what to do to get all matches?
Created 12-20-2017 11:11 PM
Hi @PJ,
Could you please let me know what the separator you are using split returns an array of elements, you could test with the following sql
select Ticket, split(All_Comments,' <separator you specified to split>') from <your table>
Hope this helps !!
Created 12-21-2017 12:02 AM
The separator I used is
Select tickets, split(all_comments, ‘$’) from table
I also used another separator which says “endhere” but nothing works.
Thanks.
Created 12-21-2017 12:29 AM
Hi @PJ,
'$' signifies end of line in regular expression, that would be the reason to get the content as one split, you can use the escape sequence to handle that with split(all_comments,'\\$').
Hope this helps !!
Created 12-22-2017 02:54 AM
Thanks a lot, the splitting part works.... but i am still getting only the first match ... how do i get all matches?