<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Ingesting unformatted, unordered data from hdfs to hive using nifi in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Ingesting-unformatted-unordered-data-from-hdfs-to-hive-using/m-p/210466#M172408</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I have a stream of data coming into HDFS, and I want to store the data in Hive.&lt;/P&gt;&lt;P&gt;---------------------------------------------------------------------------------------&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Sample data (each record is a single line with multiple attributes):&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;sample=data1 _source="/s/o/u" destination="/d/e/s" _ip="0.0.0.0" timestamp=20170802 10:00:00 text="sometext_with$spec_char"&lt;/P&gt;&lt;P&gt;sample=data2 destination="/d/e/s" _ip="0.0.0.0" timestamp=20170802 10:00:00 text="sometext_with$spec_char" _source="/s/o/u" technology="r"o"b"ust"&lt;/P&gt;&lt;P&gt;sample=data3 _ip="0.0.0.0" timestamp=20170802 10:00:00destination="/d/e/s" text="sometext_with$spec_char" _source="/s/o/u"&lt;/P&gt;&lt;P&gt;---------------------------------------------------------------------------------------&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Problems with the data:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;1. The records do not follow the same attribute order (sample data1 has source, destination, timestamp, text; sample data2 has destination, timestamp, text, source, etc.).&lt;/P&gt;&lt;P&gt;2. The attribute names do not follow the same convention (_source, destination, _ip, timestamp, text, etc.; some keys have a leading "_" and some do not).&lt;/P&gt;&lt;P&gt;3. The attribute set is not fixed (sample data1 has source, destination, timestamp, text; sample data2 has destination, _ip, timestamp, text, source, and technology).&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Desired Hive table:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;sample | source | destination | ip      | text                    | technology |&lt;/P&gt;&lt;P&gt;data1  | a/b/c  | /d/e/s      | 0.0.0.0 | sometext_with$spec_char | NULL       |&lt;/P&gt;&lt;P&gt;data2  | a/b/c  | /d/e/s      | 0.0.0.0 | sometext_with$spec_char | r"o"b"ust  |&lt;/P&gt;&lt;P&gt;data3  | a/b/c  | /d/e/s      | 0.0.0.0 | sometext_with$spec_char | NULL       |&lt;/P&gt;&lt;P&gt;Thanks for your support.&lt;/P&gt;</description>
    <pubDate>Wed, 02 Aug 2017 20:50:08 GMT</pubDate>
    <dc:creator>mark_hadoop</dc:creator>
    <dc:date>2017-08-02T20:50:08Z</dc:date>
    <item>
      <title>Ingesting unformatted, unordered data from hdfs to hive using nifi</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Ingesting-unformatted-unordered-data-from-hdfs-to-hive-using/m-p/210466#M172408</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I have a stream of data coming into HDFS, and I want to store the data in Hive.&lt;/P&gt;&lt;P&gt;---------------------------------------------------------------------------------------&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Sample data (each record is a single line with multiple attributes):&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;sample=data1 _source="/s/o/u" destination="/d/e/s" _ip="0.0.0.0" timestamp=20170802 10:00:00 text="sometext_with$spec_char"&lt;/P&gt;&lt;P&gt;sample=data2 destination="/d/e/s" _ip="0.0.0.0" timestamp=20170802 10:00:00 text="sometext_with$spec_char" _source="/s/o/u" technology="r"o"b"ust"&lt;/P&gt;&lt;P&gt;sample=data3 _ip="0.0.0.0" timestamp=20170802 10:00:00destination="/d/e/s" text="sometext_with$spec_char" _source="/s/o/u"&lt;/P&gt;&lt;P&gt;---------------------------------------------------------------------------------------&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Problems with the data:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;1. The records do not follow the same attribute order (sample data1 has source, destination, timestamp, text; sample data2 has destination, timestamp, text, source, etc.).&lt;/P&gt;&lt;P&gt;2. The attribute names do not follow the same convention (_source, destination, _ip, timestamp, text, etc.; some keys have a leading "_" and some do not).&lt;/P&gt;&lt;P&gt;3. The attribute set is not fixed (sample data1 has source, destination, timestamp, text; sample data2 has destination, _ip, timestamp, text, source, and technology).&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Desired Hive table:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;sample | source | destination | ip      | text                    | technology |&lt;/P&gt;&lt;P&gt;data1  | a/b/c  | /d/e/s      | 0.0.0.0 | sometext_with$spec_char | NULL       |&lt;/P&gt;&lt;P&gt;data2  | a/b/c  | /d/e/s      | 0.0.0.0 | sometext_with$spec_char | r"o"b"ust  |&lt;/P&gt;&lt;P&gt;data3  | a/b/c  | /d/e/s      | 0.0.0.0 | sometext_with$spec_char | NULL       |&lt;/P&gt;&lt;P&gt;Thanks for your support.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Aug 2017 20:50:08 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Ingesting-unformatted-unordered-data-from-hdfs-to-hive-using/m-p/210466#M172408</guid>
      <dc:creator>mark_hadoop</dc:creator>
      <dc:date>2017-08-02T20:50:08Z</dc:date>
    </item>
    <item>
      <title>Re: Ingesting unformatted, unordered data from hdfs to hive using nifi</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Ingesting-unformatted-unordered-data-from-hdfs-to-hive-using/m-p/210467#M172409</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/23208/hadoopuserhadoop.html" nodeid="23208"&gt;@Hadoop User&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Do all records for "data1" have the same structure?  In other words, while data1, data2, and data3 differ from each other, are all data1 records like each other, and all data2 records like each other?&lt;/P&gt;&lt;P&gt;You could use NiFi to route the data using regular expressions with the RouteText processor: &lt;A target="_blank" href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache.nifi.processors.standard.RouteText/index.html"&gt;https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache.nifi.processors.standard.RouteText/index.html&lt;/A&gt; or the RouteOnContent processor: &lt;A href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache.nifi.processors.standard.RouteOnContent/index.html" target="_blank"&gt;https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache.nifi.processors.standard.RouteOnContent/index.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;This would allow you to land each data type in an appropriate Hive table.&lt;/P&gt;</description>
      <pubDate>Thu, 03 Aug 2017 00:40:33 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Ingesting-unformatted-unordered-data-from-hdfs-to-hive-using/m-p/210467#M172409</guid>
      <dc:creator>myoung</dc:creator>
      <dc:date>2017-08-03T00:40:33Z</dc:date>
    </item>
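The routing idea in the reply above can be mimicked outside NiFi to see what RouteText-style matching would do with the sample lines. A minimal Python sketch; the route names and patterns are illustrative assumptions, not configuration from the thread:

```python
import re

# Each candidate data type gets one pattern, as RouteText would be
# configured with one matching rule per relationship.
ROUTES = {
    "data1": re.compile(r'\bsample=data1\b'),
    "data2": re.compile(r'\bsample=data2\b'),
}

def route(line):
    """Return the name of the first matching route, or 'unmatched'."""
    for name, pattern in ROUTES.items():
        if pattern.search(line):
            return name
    return "unmatched"
```

Lines routed to "unmatched" would correspond to RouteText's unmatched relationship, which could land in a catch-all Hive table.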
    <item>
      <title>Re: Ingesting unformatted, unordered data from hdfs to hive using nifi</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Ingesting-unformatted-unordered-data-from-hdfs-to-hive-using/m-p/210468#M172410</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/2695/myoung.html" nodeid="2695"&gt;@Michael Young&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I think I have confused you.&lt;/P&gt;&lt;P&gt;What I mean is: in the HDFS file we have data (say, log messages) in lines, i.e., log message 1 on line 1, log message 2 on line 2, etc.&lt;/P&gt;&lt;P&gt;Basically, all messages have a K:V (key:value) format, and there are around 10 K:V pairs in a line.&lt;/P&gt;&lt;P&gt;It is not mandatory that all 10 K:V pairs be present in a line (i.e., sometimes &amp;lt;10 K:V pairs are also possible).&lt;/P&gt;&lt;P&gt;e.g.:&lt;/P&gt;&lt;P&gt;k1="v1" k2="v2" k3="v3"... k10="v10"&lt;/P&gt;&lt;P&gt;** Also, it is not mandatory that the K:V pairs be in order,&lt;/P&gt;&lt;P&gt;i.e.:&lt;/P&gt;&lt;P&gt;k1="v1" k10="v10" k3="v3" k2="v2"... is also possible.&lt;/P&gt;&lt;P&gt;Now, my idea is to:&lt;/P&gt;&lt;P&gt;1. create a Hive table with all keys (k1, k2, ...) as column names and v1, v2, ... as their column values;&lt;/P&gt;&lt;P&gt;2. build a NiFi flow to read the lines (messages) in the HDFS file;&lt;/P&gt;&lt;P&gt;3. split the lines;&lt;/P&gt;&lt;P&gt;4. match every key with its column name and insert the values into the corresponding columns.&lt;/P&gt;&lt;P&gt;I hope I made the question clear.&lt;/P&gt;&lt;P&gt;Can you please help me approach this?&lt;/P&gt;&lt;P&gt;Thank you&lt;/P&gt;</description>
      <pubDate>Thu, 03 Aug 2017 04:33:29 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Ingesting-unformatted-unordered-data-from-hdfs-to-hive-using/m-p/210468#M172410</guid>
      <dc:creator>mark_hadoop</dc:creator>
      <dc:date>2017-08-03T04:33:29Z</dc:date>
    </item>
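The K:V layout described above (double-quoted values, arbitrary order, any subset of up to 10 keys present) can be parsed with a single generic pattern rather than one rule per key. A minimal Python sketch, assuming every value is double-quoted as in the k1="v1" examples:

```python
import re

# One pattern captures every key="value" pair, whatever the order
# and however many pairs a line carries.
PAIR_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_line(line):
    """Return a dict mapping each key found in the line to its value."""
    return dict(PAIR_RE.findall(line))

# Pairs may arrive in any order, and any subset of keys may be missing.
row = parse_line('k1="v1" k10="v10" k3="v3" k2="v2"')
```

Because the result is a dict, step 4 of the plan above (matching each key to its column) reduces to a lookup per column name.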
    <item>
      <title>Re: Ingesting unformatted, unordered data from hdfs to hive using nifi</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Ingesting-unformatted-unordered-data-from-hdfs-to-hive-using/m-p/210469#M172411</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/23208/hadoopuserhadoop.html" nodeid="23208"&gt;@Hadoop User&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Ah, that helps clarify things somewhat.  You can use the SplitText processor (&lt;A target="_blank" href="http://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache.nifi.processors.standard.SplitText/index.html"&gt;http://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache.nifi.processors.standard.SplitText/index.html&lt;/A&gt;) to split a file into individual record lines.  You could then use the ExtractText processor (&lt;A target="_blank" href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache.nifi.processors.standard.ExtractText/index.html"&gt;https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache.nifi.processors.standard.ExtractText/index.html&lt;/A&gt;) to extract the K:V pairs and create attributes.  At that point, you should be able to put the data into Hive using PutHiveQL.&lt;/P&gt;&lt;P&gt;While this article isn't doing exactly what you want, you should be able to follow it as an example of the general flow: &lt;A target="_blank" href="https://community.hortonworks.com/questions/80211/from-csv-to-hive-via-nifi.html"&gt;https://community.hortonworks.com/questions/80211/from-csv-to-hive-via-nifi.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;There are also some new processors in NiFi 1.3 built around RecordReaders and RecordWriters.  They may be a little more complicated to set up at first, but you'll see significantly better performance: &lt;A target="_blank" href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-record-serialization-services-nar/1.3.0/org.apache.nifi.csv.CSVReader/index.html"&gt;https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-record-serialization-services-nar/1.3.0/org.apache.nifi.csv.CSVReader/index.html&lt;/A&gt;.  You might find that the ScriptedReader lets you use Python as an easier way to parse the data: &lt;A target="_blank" href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-scripting-nar/1.3.0/org.apache.nifi.record.script.ScriptedReader/index.html"&gt;https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-scripting-nar/1.3.0/org.apache.nifi.record.script.ScriptedReader/index.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 03 Aug 2017 05:05:17 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Ingesting-unformatted-unordered-data-from-hdfs-to-hive-using/m-p/210469#M172411</guid>
      <dc:creator>myoung</dc:creator>
      <dc:date>2017-08-03T05:05:17Z</dc:date>
    </item>
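Following the ScriptedReader suggestion above, the per-line parsing could look roughly like this in Python. This is an illustrative sketch only: the column list mirrors the table in the original question, and normalizing keys by stripping the optional leading underscore (_source vs source) is an assumption the thread does not spell out; the two-token unquoted timestamp is left out of scope.

```python
import re

# Quoted values are captured whole; unquoted values stop at the next
# whitespace, so a pair like sample=data1 is still picked up.
COLUMNS = ["sample", "source", "destination", "ip", "text", "technology"]
PAIR_RE = re.compile(r'(\w+)="([^"]*)"|(\w+)=(\S+)')

def to_row(line):
    """Map one raw line to a value list in fixed column order (None = NULL)."""
    pairs = {}
    for m in PAIR_RE.finditer(line):
        # A key may or may not carry a leading "_"; normalize before matching.
        key = (m.group(1) or m.group(3)).lstrip("_")
        value = m.group(2) if m.group(2) is not None else m.group(4)
        pairs[key] = value
    return [pairs.get(col) for col in COLUMNS]
```

Missing attributes come back as None, which maps naturally to NULL in the target Hive columns.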
    <item>
      <title>Re: Ingesting unformatted, unordered data from hdfs to hive using nifi</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Ingesting-unformatted-unordered-data-from-hdfs-to-hive-using/m-p/210470#M172412</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/2695/myoung.html" nodeid="2695"&gt;@Michael Young&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Thanks for the suggestion.&lt;/P&gt;&lt;P&gt;I started trying the approach:&lt;/P&gt;&lt;P&gt;1. I used GetHDFS to get the file.&lt;/P&gt;&lt;P&gt;2. I split the file into lines (line split count = 1).&lt;/P&gt;&lt;P&gt;Here I have a doubt about the extraction: if I am not wrong, I need to extract each attribute using the ExtractText processor.&lt;/P&gt;&lt;P&gt;Today I have 10 attributes; suppose I want to extend to 1000 attributes, is the same approach to be followed? It becomes lengthy, doesn't it?&lt;/P&gt;&lt;P&gt;Also, the K:V pairs are not comma-separated but space-separated, and any value could have a space in the middle of it.&lt;/P&gt;&lt;P&gt;e.g.: source="abc def ghi jkl" destination="abcdefabc"&lt;/P&gt;&lt;P&gt;I am a bit confused; please advise.&lt;/P&gt;</description>
      <pubDate>Thu, 03 Aug 2017 15:50:57 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Ingesting-unformatted-unordered-data-from-hdfs-to-hive-using/m-p/210470#M172412</guid>
      <dc:creator>mark_hadoop</dc:creator>
      <dc:date>2017-08-03T15:50:57Z</dc:date>
    </item>
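On the space-separation concern in the last post: a quoted-pair pattern already tolerates spaces inside values, because it matches up to the closing quote instead of splitting on whitespace. And since one generic pattern matches every pair, going from 10 to 1000 attributes does not require more extraction rules. A short Python sketch using the example line from the post:

```python
import re

# Splitting on spaces would break source="abc def ghi jkl";
# matching each quoted pair directly does not.
PAIR_RE = re.compile(r'(\w+)="([^"]*)"')

line = 'source="abc def ghi jkl" destination="abcdefabc"'
pairs = dict(PAIR_RE.findall(line))
# pairs["source"] keeps its embedded spaces intact
```

The same single pattern covers any number of attributes, so the flow does not grow one ExtractText rule per new key.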
  </channel>
</rss>

