About mark_hadoop

mark_hadoop · ‎08-02-2017

Hi, I have a stream of data coming in to HDFS, and I want to store the data in Hive.
---------------------------------------------------------------------------------------
Sample data (each record is a single line with multiple attributes):
sample=data1 _source="/s/o/u" destination="/d/e/s" _ip="0.0.0.0" timestamp=20170802 10:00:00 text="sometext_with$spec_char"
sample=data2 destination="/d/e/s" _ip="0.0.0.0" timestamp=20170802 10:00:00 text="sometext_with$spec_char" _source="/s/o/u" technology="r"o"b"ust"
sample=data3 _ip="0.0.0.0" timestamp=20170802 10:00:00 destination="/d/e/s" text="sometext_with$spec_char" _source="/s/o/u"
---------------------------------------------------------------------------------------
Problems with the data:
1. The attributes do not follow the same order (sample data1 has source, destination, timestamp, text; sample data2 has destination, timestamp, text, source, etc.).
2. The attribute names do not follow the same convention (_source, destination, _ip, timestamp, text, etc.): some start with "_" and some do not.
3. The set of attributes is not fixed (sample data1 has source, destination, timestamp, text; sample data2 has destination, _ip, timestamp, text, source and technology).
Expected result in Hive:
sample | source | destination | ip      | text                    | technology |
data1  | a/b/c  | /d/e/s      | 0.0.0.0 | sometext_with$spec_char | NULL       |
data2  | a/b/c  | /d/e/s      | 0.0.0.0 | sometext_with$spec_char | r"o"b"ust  |
data3  | a/b/c  | /d/e/s      | 0.0.0.0 | sometext_with$spec_char | NULL       |
Thanks for your support.
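
The three problems listed in this post (order, naming convention, optional attributes) can be illustrated with a small parser. This is only a minimal Python sketch under stated assumptions, not a NiFi or Hive feature: the names `parse_line`, `to_row` and `COLUMNS` are hypothetical, and it assumes a key is always a word followed by "=" that starts the line or follows whitespace. A value that itself contains such a token would be split incorrectly.

```python
import re

# Assumption: a key is a word at the start of the line or right after
# whitespace, immediately followed by "=".
KEY_RE = re.compile(r'(?:^|(?<=\s))(\w+)=')

# Hypothetical fixed column order for the target table.
COLUMNS = ["sample", "source", "destination", "ip", "timestamp", "text", "technology"]


def parse_line(line):
    """Return a dict of attribute -> value for one record line."""
    matches = list(KEY_RE.finditer(line))
    record = {}
    for i, m in enumerate(matches):
        # A value runs from just after "=" up to the start of the next key.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(line)
        value = line[m.end():end].strip()
        # Drop one pair of surrounding quotes; inner quotes are kept.
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]
        # Normalise "_source" and "source" to the same column name.
        record[m.group(1).lstrip("_")] = value
    return record


def to_row(record):
    """Order the values by COLUMNS; missing attributes become None (NULL)."""
    return [record.get(column) for column in COLUMNS]


if __name__ == "__main__":
    sample = ('sample=data2 destination="/d/e/s" _ip="0.0.0.0" '
              'timestamp=20170802 10:00:00 text="sometext_with$spec_char" '
              '_source="/s/o/u" technology="r"o"b"ust"')
    print(to_row(parse_line(sample)))
```

Taking each value as "everything up to the next key" is what lets the unquoted timestamp (which contains a space) and the quoted value with inner quotes survive. Rows produced this way could then be written out in a delimited format that a Hive external table can read.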

mark_hadoop · ‎08-02-2017

@Matt Clarke I will start a new question. Thanks

mark_hadoop · ‎08-02-2017

@Matt Clarke Also, I need some help and would be thankful if you could guide me. I have a file in HDFS that has a lot of fields, which I want to put into Hive. For example:
---------------------------------------------------------------------------------
Text in HDFS:
"These are the attributes to save in hive _source="/a/b/c" _destination="/a/b/d" - - _ip="a.b.c.d" text="hive should save these attributes in different columns"".
I made an external table in Hive with the columns | source | destination | ip | text |. I want to get the key-value pairs from the above text in HDFS and place them in the respective Hive columns.
---------------------------------------------------------------------------------
The HDFS file contains a series of such lines. The attributes are unordered and do not always appear in the order source, destination, etc. Any suggestions? Thank you.
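
For the line quoted in this post, where free text and stray tokens such as "- -" sit between the pairs, one option is to look only for the four known column names. The Python sketch below is an illustration under assumptions: `extract_columns` is a hypothetical helper, and it assumes every wanted value is wrapped in double quotes and contains no double quote itself.

```python
import re

# Only the four columns of the external table are extracted.
# Assumption: values are double-quoted and contain no inner double quote.
PAIR_RE = re.compile(r'(?<!\w)_?(source|destination|ip|text)="([^"]*)"')


def extract_columns(line):
    """Return {column: value} for the known columns found in the line."""
    return {key: value for key, value in PAIR_RE.findall(line)}


if __name__ == "__main__":
    line = ('These are the attributes to save in hive _source="/a/b/c" '
            '_destination="/a/b/d" - - _ip="a.b.c.d" '
            'text="hive should save these attributes in different columns"')
    print(extract_columns(line))
```

Because the pattern names the columns explicitly, the order of the pairs and the optional leading "_" do not matter, and anything that is not a known pair is ignored.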

mark_hadoop · ‎08-02-2017

@Matt Clarke Hi Matt, I followed your suggestion and got the expected text. I am new to NiFi and still have a lot to learn, and your suggestions helped me. Thank you.

mark_hadoop · ‎08-02-2017

@Wynner I replaced the RouteOnContent processor with a new one but kept the parameters the same. Surprisingly, it now works pretty fast (seconds). I am not sure why the old one was not working. Thanks for your extended support.

mark_hadoop · ‎08-01-2017

@Matt Clarke I have used your suggestion, but the result is the same: it fetches the complete line instead of [hdfs....... .log"]. For clarification, these are the steps I am following:
1. GetHDFS
2. SplitText: count = 1
3. ExtractText: (\[hdfs.*log"\])
4. UpdateAttribute
5. PutHDFS
I am not sure why it is pulling the complete line. Thanks
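
One possible explanation, going by the ExtractText documentation: the processor writes regex matches into FlowFile attributes and leaves the content unchanged, so a downstream PutHDFS would still write the whole line. The toy Python model below only illustrates that distinction. It uses plain dicts, not NiFi's API; the function names, the attribute name and the sample line are hypothetical.

```python
import re

PATTERN = re.compile(r'(\[hdfs.*log"\])')


def extract_to_attribute(flowfile, attr_name="extract"):
    """Model of an extract step: the match goes into an attribute only."""
    attributes = dict(flowfile["attributes"])
    match = PATTERN.search(flowfile["content"])
    if match:
        attributes[attr_name] = match.group(1)
    # The content is passed through untouched.
    return {"content": flowfile["content"], "attributes": attributes}


def keep_only_match(flowfile):
    """Model of a content-rewriting step: only the matched text remains."""
    match = PATTERN.search(flowfile["content"])
    content = match.group(1) if match else flowfile["content"]
    return {"content": content, "attributes": dict(flowfile["attributes"])}


if __name__ == "__main__":
    # Hypothetical shortened line in the same shape as the real data.
    ff = {"content": '<start>prefix [hdfs file="/a/b/c" log is some.log"] linedelimited.',
          "attributes": {}}
    print(extract_to_attribute(ff))
    print(keep_only_match(ff))
```

If this is what is happening in the flow, the extracted text is available as an attribute after step 3, and a content-rewriting step (ReplaceText is the usual candidate) would be needed before PutHDFS to store only the bracketed segment.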

mark_hadoop · ‎08-01-2017

Hi, I have streaming data (GetHDFS will be running continuously) that contains a number of lines, for example:
<start>this is 123_@":text coming from [hdfs file="/a/b/c" and' the; '''', "", file is streamed. The location=["/location"] and log is some.log"] linedelimited.
The file contains a stream of lines like the one above. From each message I have to extract this text:
[hdfs file="/a/b/c" and' the; '''', "", file is streamed. The location=["/location"] and log is some.log"]
I tried an ExtractText processor with a custom property named extract set to ([hdfs.*log"]). When I try this in a Java regex evaluator it shows the correct text extracted, but when I run the flow the output contains the complete text.
Expected:
[hdfs file="/a/b/c" and' the; '''', "", file is streamed. The location=["/location"] and log is some.log"]
Actual:
<start>this is 123_@":text coming from [hdfs file="/a/b/c" and' the; '''', "", file is streamed. The location=["/location"] and log is some.log"] linedelimited.
Please help me correct the regex so that it extracts the correct text.
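
As a side check on the pattern itself: written without backslashes, [hdfs.*log"] is a character class that matches a single character, while the escaped form matches the whole bracketed segment. The sketch below uses Python's `re` module purely for illustration (NiFi evaluates Java regular expressions, which treat square brackets the same way as far as this pattern is concerned). The helper name `first_group` is hypothetical.

```python
import re

LINE = """<start>this is 123_@":text coming from [hdfs file="/a/b/c" and' the; '''', "", file is streamed. The location=["/location"] and log is some.log"] linedelimited."""

# Unescaped brackets: a character class, i.e. one character out of the set.
UNESCAPED = re.compile(r'([hdfs.*log"])')
# Escaped brackets: literal "[" and "]" around a greedy ".*".
ESCAPED = re.compile(r'(\[hdfs.*log"\])')


def first_group(pattern, text):
    """Return capture group 1 of the first match, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(repr(first_group(UNESCAPED, LINE)))
    print(repr(first_group(ESCAPED, LINE)))
```

The greedy `.*` runs to the last `log"]` on the line, which is fine here because each line holds one such segment.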

mark_hadoop · ‎07-31-2017

I have changed it to 4 concurrent tasks and a run duration of 2s. For 50k messages it took almost 3 hours (I never expected this). For example, a message looks like this:
this_is_an_example_message <1> [some_"text_and_digits_here"_number="121212"] [some_text_here]
(and similarly for all 50k messages)
RouteOnContent configuration:
Scheduling:
  Concurrent Tasks: 4
  Run Schedule: 2s
Properties:
  Match Requirement: content must contain match
  Character Set: UTF-8
  Content Buffer Size: 1MB
  txt: number="121212"
UpdateAttribute: filename updated here
PutHDFS: configurations and path updated here
Thanks in advance
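
The routing rule in this configuration ("content must contain match" with the property txt set to number="121212") amounts to a substring search on each message. A minimal Python sketch of that decision follows. It is not NiFi code: the function name `route` is hypothetical, and the "unmatched" label for everything else is an assumption.

```python
import re

# The value of the "txt" property from the post, used as a regex.
TXT = re.compile(r'number="121212"')


def route(message):
    """Return the name of the route a message would take."""
    # "Contain" semantics: the pattern may match anywhere in the content.
    return "txt" if TXT.search(message) else "unmatched"


if __name__ == "__main__":
    message = ('this_is_an_example_message <1> '
               '[some_"text_and_digits_here"_number="121212"] [some_text_here]')
    print(route(message))
```

A search like this is cheap for messages of a few KB, which suggests the 3 hours reported here come from scheduling or flow settings rather than from the match itself.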

mark_hadoop · ‎07-27-2017

I tried changing the number of concurrent tasks to 100 (for testing) and tested with 1k messages; it took 11 minutes to complete. Any suggestions, please?

mark_hadoop · ‎07-27-2017

Typically, each message from the split content processor is <= 3 KB, and concurrent tasks is set to 1. Also, more than 50,000 messages will be received every second, split, and sent to the RouteOnContent processor. I tested it with 50k messages: up to RouteOnContent it takes just 2-3 seconds, but after that it takes almost 3 hours! I will increase the number of concurrent tasks and see whether this helps to improve the performance.
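
The numbers in this post can be turned into rates to show the size of the gap. The Python sketch below is only arithmetic on the figures quoted above, with two assumptions: the arrival rate is taken as exactly 50,000 messages per second (the post says more than that), and 1 KB is taken as 1024 bytes. The helper name is hypothetical.

```python
def rate_per_second(messages, seconds):
    """Average throughput in messages per second."""
    return messages / seconds


# 50k messages took almost 3 hours after RouteOnContent.
observed_rate = rate_per_second(50_000, 3 * 60 * 60)

# Assumed arrival rate: 50,000 messages every second.
incoming_rate = 50_000

# Messages that pile up each second if both rates stay constant.
backlog_growth = incoming_rate - observed_rate

# Upper bound on incoming data volume at <= 3 KB per message.
max_bytes_per_second = incoming_rate * 3 * 1024

if __name__ == "__main__":
    print(f"observed: {observed_rate:.2f} msg/s")
    print(f"backlog grows by about {backlog_growth:.0f} msg/s")
    print(f"incoming volume: up to {max_bytes_per_second / 1_000_000:.1f} MB/s")
```

The observed rate comes out at under 5 messages per second, four orders of magnitude below the stated arrival rate, so a modest increase in concurrent tasks alone is unlikely to close the gap.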

Last Visited	‎09-20-2021 09:14 AM

Member Since	‎07-14-2017 11:10 AM
Posts	99
Kudos received	5

Cloudera Community

Re: update TCP stream with batchsize 10000 at once...

Re: listen syslog

Re: puthbasejson

Re: Extract text and Replace text processors regex

Ingesting unformatted, unordered data from hdfs to...

Re: Extract text using Nifi

Re: Extract text using Nifi

Re: Extract text using Nifi

Re: routeoncontent is slow in processing

Re: Extract text using Nifi

Extract text using Nifi

Re: routeoncontent is slow in processing

Re: routeoncontent is slow in processing

Re: routeoncontent is slow in processing