Support Questions

Find answers, ask questions, and share your expertise

Extract text and Replace text processors regex

avatar
Expert Contributor

I have below data in hdfs

a="alphabet_123_a" b="alphabetb" c="alphabet"c" is third one"

b="newb" d="alphabet@/d" a="new a"

a="changed a", b="changed b" c="changed c" e="alphabet e"

My idea is:

1. Make a table in hive as orc, with columns a, b, c,d,e.

2. extract the attributes from the above data.

3. Mapping attributes according to column names in hive and storing them in hive.

4. in first line a,b,c; second line b,d,a; third line a,b,c,e

5. now after extracting all the lines and storing in hive, the values which are not present in lines (e.g. first line dont have "d" and "e"; second line dont have "c" and "e"; third line dont have "d") should be NULL, by the time they store in hive.

Approach

1. Table "details" is created with columns a,b,c,d,e

2. Extract text processor is configured with custom properties as

(a=)(.*?(?=\s\w+=|$)) --- [This will extract "alphabet_123a" in line 1 along with quotes(") at begening and ending of the values

(b=)(.*?(?=\s\w+=|$)) --- [This will extract "aphabetb" in line 1 along with quotes...)

3. I am confused in the replace text processor, as

1. how to remove double quotes?

2. insert NULL values if the corresponding column name is missing in the line?

3. how to generalize the replace text for search value?

Also let me know, how can I change the regex in extract text processor(if necessary)?

Please help me

Thanks

1 ACCEPTED SOLUTION

avatar
Expert Contributor

I have sorted it. closing the question.

View solution in original post

1 REPLY 1

avatar
Expert Contributor

I have sorted it. closing the question.