Support Questions

mark_hadoop · ‎08-03-2017

I have below data in hdfs

a="alphabet_123_a" b="alphabetb" c="alphabet"c" is third one"

b="newb" d="alphabet@/d" a="new a"

a="changed a", b="changed b" c="changed c" e="alphabet e"

My idea is:

1. Make a table in hive as orc, with columns a, b, c,d,e.

2. extract the attributes from the above data.

3. Mapping attributes according to column names in hive and storing them in hive.

4. in first line a,b,c; second line b,d,a; third line a,b,c,e

5. now after extracting all the lines and storing in hive, the values which are not present in lines (e.g. first line dont have "d" and "e"; second line dont have "c" and "e"; third line dont have "d") should be NULL, by the time they store in hive.

Approach

1. Table "details" is created with columns a,b,c,d,e

2. Extract text processor is configured with custom properties as

(a=)(.*?(?=\s\w+=|$)) --- [This will extract "alphabet_123a" in line 1 along with quotes(") at begening and ending of the values

(b=)(.*?(?=\s\w+=|$)) --- [This will extract "aphabetb" in line 1 along with quotes...)

3. I am confused in the replace text processor, as

1. how to remove double quotes?

2. insert NULL values if the corresponding column name is missing in the line?

3. how to generalize the replace text for search value?

Also let me know, how can I change the regex in extract text processor(if necessary)?

Please help me

Thanks

mark_hadoop · ‎11-23-2017

I have sorted it. closing the question.

View solution in original post

mark_hadoop · ‎11-23-2017

I have sorted it. closing the question.

Cloudera Community

Support Questions

Extract text and Replace text processors regex