
Extract text using NiFi

Expert Contributor

Hi,

I have streaming data (GetHDFS will be running continuously) that contains a number of lines.

e.g:

<start>this is 123_@":text coming from [hdfs file="/a/b/c" and' the; '''', "", file is streamed. The location=["/location"] and log is some.log"] linedelimited.

A stream of such lines will be in the file.

I have to extract the following text from the above message:

[hdfs file="/a/b/c" and' the; '''', "", file is streamed. The location=["/location"] and log is some.log"]

I tried using an ExtractText processor with the custom property:

extract: ([hdfs.*log"])

I tried the above in a Java regex evaluator and it shows the correct text extracted, but when I run the flow, the output gets the complete line.

expected: [hdfs file="/a/b/c" and' the; '''', "", file is streamed. The location=["/location"] and log is some.log"]

actual : <start>this is 123_@":text coming from [hdfs file="/a/b/c" and' the; '''', "", file is streamed. The location=["/location"] and log is some.log"] linedelimited.

Please help me correct the regex so it extracts the right text.

1 ACCEPTED SOLUTION

Super Mentor

@Hadoop User

Your Java regular expression needs to escape the "[" and "]" characters since they have reserved meaning in Java regular expressions.

Try using the following java regular expression instead:

(\[hdfs.*log"\])

Thanks,

Matt
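To illustrate why the escaping matters, here is a minimal sketch using Python's re module, which treats "[" and "]" the same way Java regex does (the sample line is taken from the question above):

```python
import re

# Sample line from the question above.
line = ('<start>this is 123_@":text coming from [hdfs file="/a/b/c" and\' '
        'the; \'\'\'\', "", file is streamed. The location=["/location"] '
        'and log is some.log"] linedelimited.')

# Without escaping, [hdfs.*log"] is a character class that matches a
# single character from the set {h, d, f, s, ., *, l, o, g, "}.
unescaped = re.search(r'([hdfs.*log"])', line)

# With \[ and \] the brackets are literal, so the whole bracketed
# substring is captured.
escaped = re.search(r'(\[hdfs.*log"\])', line)

print(len(unescaped.group(1)))  # matches just one character
print(escaped.group(1))         # the full [hdfs ... log"] substring
```

The greedy `.*` runs to the last `log"]` in the line, which is exactly the span the question asks for.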


7 REPLIES


Expert Contributor

@Matt Clarke I have used your suggestion, but the result is the same: it fetches the complete line instead of [hdfs....... .log"].

For clarification, these are the steps I am following:

1. GetHDFS

2. SplitText (Line Split Count = 1)

3. ExtractText with custom property: (\[hdfs.*log"\])

4. UpdateAttribute

5. PutHDFS

Not sure why it is pulling the complete line?

Thanks

Super Mentor

@Hadoop User

The ExtractText processor extracts the text that matches your regex and assigns it to a FlowFile attribute matching the property name. The content of the FlowFile remains unchanged. You then update a FlowFile attribute and finally use PutHDFS to write the content (which at this point you have not changed at all) to HDFS.

If your intent is to write the modified string to HDFS, you need to update the actual content of the FlowFile and not just create and modify attributes. For that use case, you would want to use the ReplaceText processor instead.

You would configure ReplaceText similar to the following:

[Screenshot: ReplaceText processor configuration (23384-screen-shot-2017-08-01-at-122929-pm.png)]

The above will result in the actual content of the FlowFile being changed to:

[hdfs file="/a/b/c" and' the; '''', "", file is streamed. The location=["/location"] and log is some.log"]
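The screenshot referenced above may no longer be available. A ReplaceText configuration along the following lines should produce that result (property names are from the standard NiFi ReplaceText processor; the exact values shown in the original screenshot are an assumption):

```
Search Value:          .*(\[hdfs.*log"\]).*
Replacement Value:     $1
Replacement Strategy:  Regex Replace
Evaluation Mode:       Entire text
```

With "Entire text" evaluation, the whole FlowFile content is replaced by the captured group, discarding the surrounding text.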

Thanks,

Matt

Expert Contributor

@Matt Clarke Hi Matt,

I have followed your suggestion and got the expected text.

As I am new to NiFi, I still have a lot to learn. Your suggestions helped me. Thank you.

Expert Contributor

@Matt Clarke

Also, I need some help; I would be thankful if you could guide me.

I have a file in HDFS which has a lot of fields that I want to put into Hive.

e.g:

---------------------------------------------------------------------------------

text in hdfs

"These are the attributes to save in hive _source="/a/b/c" _destination="/a/b/d" - - _ip="a.b.c.d" text="hive should save these attributes in different columns"".

I made an external table in Hive with columns

|source | destination | ip | text |

I want to get the key-value pairs from the above text in HDFS and place them into the respective Hive columns.

---------------------------------------------------------------------------------

The HDFS file contains a series of such lines; they are unordered, and the fields are not always in the same order (source, destination, etc.).

Any suggestions?

Thank you

Super Mentor

@Hadoop User

Please start a new question rather than asking multiple unrelated questions in a single post. This makes it easier for community users to find similar issues.


It also helps other members identify unanswered questions so they may address them. This question would likely go unnoticed otherwise.

I would need to do some investigation to come up with a good solution, but other community members may have already handled this exact scenario. By starting a new question, all members following the "data-processing", "nifi-processor", or "nifi-streaming" tags will be notified of your question.

Thanks,

Matt

Expert Contributor

@Matt Clarke

I will start a new question.

Thanks