
Ingesting unformatted, unordered data from HDFS to Hive using NiFi

Expert Contributor

Hi,

I have a stream of data coming into HDFS, and I want to store the data in Hive.

---------------------------------------------------------------------------------------

Sample data (each record is a single line with multiple attributes):

sample=data1 _source="/s/o/u" destination="/d/e/s" _ip="0.0.0.0" timestamp=20170802 10:00:00 text="sometext_with$spec_char"

sample=data2 destination="/d/e/s" _ip="0.0.0.0" timestamp=20170802 10:00:00 text="sometext_with$spec_char" _source="/s/o/u" technology="r"o"b"ust"

sample=data3 _ip="0.0.0.0" timestamp=20170802 10:00:00 destination="/d/e/s" text="sometext_with$spec_char" _source="/s/o/u"

---------------------------------------------------------------------------------------

Problems with the data:

1. The attributes do not follow the same order.

(As you can see, sample data1 has source, destination, timestamp, text, while sample data2 has destination, timestamp, text, source, etc.)

2. The attributes do not follow the same naming convention (_source, destination, _ip, timestamp, text, etc.; basically, some keys have a leading "_" and some do not).

3. The set of attributes is not fixed (sample data1 has source, destination, timestamp, text; sample data2 has destination, _ip, timestamp, text, source, and technology).

The table I want to end up with in Hive:

sample | source | destination | ip      | text                    | technology
data1  | /s/o/u | /d/e/s      | 0.0.0.0 | sometext_with$spec_char | NULL
data2  | /s/o/u | /d/e/s      | 0.0.0.0 | sometext_with$spec_char | r"o"b"ust
data3  | /s/o/u | /d/e/s      | 0.0.0.0 | sometext_with$spec_char | NULL

Thanks for your support.


4 REPLIES

Super Guru (Accepted Solution)

@Hadoop User

Do all records for "data1" have the same structure? In other words, while data1, data2, and data3 differ from each other, are all data1 records like each other and all data2 records like each other?

You could use NiFi to route the data using regular expressions with the RouteText processor: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache... or the RouteOnContent processor: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache...

This would allow you to land each data type into an appropriate Hive table.
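For instance, a RouteText configuration along these lines could split the stream by record type (the strategy values come from the RouteText documentation; the property names and regexes are assumptions based on your samples, not tested):

    Routing Strategy:  Route to each matching Property Name
    Matching Strategy: Matches Regular Expression

    (dynamic properties, one relationship per record type)
    data1: ^sample=data1\s.*
    data2: ^sample=data2\s.*
    data3: ^sample=data3\s.*

Each line then exits through the relationship named after the property it matched, and anything else goes to "unmatched", so each record type can be sent to its own downstream flow.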

Expert Contributor

@Michael Young

I think I have confused you.

My intention is this: in the HDFS file we have data (say, log messages) in lines, i.e., log message 1 on line 1, log message 2 on line 2, etc.

Basically, all messages are in K:V (key:value) format, and I have around 10 K:V pairs in a line.

It is not mandatory that all 10 K:V pairs be present in a line (i.e., sometimes fewer than 10 K:V pairs is also possible).

e.g:

k1="v1" k2="v2" k3="v3"... k10="v10"

Also, it is not mandatory that the K:V pairs appear in the same order,

i.e.:

k1="v1" k10="v10" k3="v3" k2="v2"... is also possible

Now, my idea is to:

1. create a Hive table with all the keys (k1, k2, ...) as column names and the values (v1, v2, ...) as their column values;

2. build a NiFi flow to read the lines (messages) from the HDFS file;

3. split the lines;

4. match every key with its column name and insert the values into the corresponding columns.

Hope I made the question clear.

Can you please help me approach this?

Thank you

Super Guru

@Hadoop User

Ah, that helps clarify things. You can use the SplitText processor (http://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache....) to split a file into individual record lines. You could probably use the ExtractText processor (https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache...) to extract the K:V pairs and create attributes. At that point, you should be able to put the data into Hive using PutHiveQL.
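As a rough illustration, ExtractText dynamic properties along these lines would pull each key into a flowfile attribute wherever it appears in the line (the optional leading underscore covers the _source/source inconsistency; the property names and patterns are illustrative, not tested against your exact data):

    source      : (?:^|\s)_?source="([^"]*)"
    destination : (?:^|\s)_?destination="([^"]*)"
    ip          : (?:^|\s)_?ip="([^"]*)"
    text        : (?:^|\s)_?text="([^"]*)"

ExtractText places the first capture group into an attribute named after the property, so a ReplaceText processor could then assemble an INSERT statement from those attributes for PutHiveQL to execute.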

While this article isn't doing exactly what you want, you should be able to follow it as an example of the general flow: https://community.hortonworks.com/questions/80211/from-csv-to-hive-via-nifi.html

There are some new processors in NiFi 1.3 around RecordReaders and RecordWriters. They may be a little more complicated to set up at first, but you'll see significantly better performance: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-record-serialization-services.... You might find that using ScriptedReader allows you to use Python as an easier way to parse the data: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-scripting-nar/1.3.0/org.apach...
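To make the parsing idea concrete, here is a minimal sketch using ExecuteScript with the Jython engine rather than a full ScriptedReader, just to keep the example short (the regex and the key normalization are assumptions based on your samples; edge cases such as the unquoted timestamp value or the embedded quotes in technology="r"o"b"ust" would need extra rules):

    import json
    import re
    from org.apache.nifi.processor.io import StreamCallback
    from org.apache.commons.io import IOUtils
    from java.nio.charset import StandardCharsets

    # Quoted values may contain spaces; unquoted values are a single token.
    PAIR = re.compile(r'(_?\w+)="([^"]*)"|(_?\w+)=(\S+)')

    class ParseKV(StreamCallback):
        def process(self, inputStream, outputStream):
            line = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
            record = {}
            for m in PAIR.finditer(line):
                # Strip the leading "_" so _source and source map to one column.
                key = (m.group(1) or m.group(3)).lstrip('_')
                record[key] = m.group(2) if m.group(2) is not None else m.group(4)
            outputStream.write(bytearray(json.dumps(record).encode('utf-8')))

    # Standard ExecuteScript boilerplate: rewrite the flowfile content as JSON.
    flowFile = session.get()
    if flowFile is not None:
        flowFile = session.write(flowFile, ParseKV())
        session.transfer(flowFile, REL_SUCCESS)

The resulting JSON (one record per flowfile, assuming a SplitText with a line split count of 1 upstream) could then be turned into an INSERT statement with ReplaceText for PutHiveQL.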

Expert Contributor

@Michael Young

Thanks for the suggestion.

I started trying the approach:

1. I used GetHDFS to get the file.

2. I split the file into lines with SplitText (line split count = 1).

Here I have a doubt about the extraction: if I am not wrong, I need to extract each attribute using the ExtractText processor.

Today I have 10 attributes; suppose I want to extend to 1,000 attributes, should the same approach be followed? It becomes lengthy, doesn't it?

Also, the K:V pairs are not comma-separated but space-separated, and any value could have a space in the middle of it.

e.g: source="abc def ghi jkl" destination="abcdefabc"

I am a bit confused; please advise.
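On the scaling question, a single generic pattern can sidestep both concerns, since one regex can capture every key at once and a quoted value simply runs to the next double quote. A minimal Python sketch, assuming values are always double-quoted (an assumption the timestamp field above already violates):

    import re

    # One generic pattern instead of one ExtractText property per key:
    # a key is word characters (optionally with a leading "_"), and a
    # quoted value runs to the closing quote, so it may contain spaces.
    PAIR = re.compile(r'(_?\w+)="([^"]*)"')

    line = 'source="abc def ghi jkl" destination="abcdefabc"'
    record = {k.lstrip('_'): v for k, v in PAIR.findall(line)}
    print(record)
    # {'source': 'abc def ghi jkl', 'destination': 'abcdefabc'}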