After getting a tweet and accessing its text (via GetTwitter and EvaluateJsonPath in NiFi), how do I remove special characters (\n, \t, $, and #) from the text so I can run NLP on the tweet itself?


New Contributor
 

Re: After getting a tweet and accessing its text (via GetTwitter and EvaluateJsonPath in NiFi), how do I remove special characters (\n, \t, $, and #) from the text so I can run NLP on the tweet itself?

Contributor

Run a ReplaceText processor with a regex configured to capture the characters you want to remove. If you have already extracted the text from the JSON, you can use something like ([^a-zA-Z-]), which matches every character that is not a letter or a hyphen (note that this includes spaces and digits).

In general, though, ReplaceText and EvaluateJsonPath are good processors to start with. A flow might look like:

  1. Use EvaluateJsonPath to extract the tweet text to an attribute.
  2. Replace the flow file content with the attribute, formatted as you want (note that this strategy erases the original JSON from Twitter).
  3. Clean up the flow file content with a regex like ([^a-zA-Z-]), replacing each match with nothing or a space.
  4. Submit the result to NLP or wherever it's going.
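
To check what the regex in step 3 does before wiring it into the flow, you can try it outside NiFi. The following is only a rough Python sketch, not the NiFi configuration itself: the function name `clean_text` is made up for illustration, and collapsing the leftover whitespace is an extra step I am assuming you would want, since the pattern also matches spaces.

```python
import re

# Pattern from the reply above: any character that is not an ASCII letter or a hyphen.
NON_LETTER = re.compile(r"[^a-zA-Z-]")


def clean_text(text: str) -> str:
    """Hypothetical helper: strip special characters from already-extracted tweet text."""
    # Replace each disallowed character with a space so that words do not run together.
    spaced = NON_LETTER.sub(" ", text)
    # Assumption: collapse the resulting runs of spaces into single spaces.
    return " ".join(spaced.split())


if __name__ == "__main__":
    print(clean_text("Hello\n#world\t$5 now"))
```

Replacing with a space rather than nothing matters here: because the pattern matches whitespace too, replacing with nothing would glue neighbouring words together.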

New Contributor

How would you remove hyperlinks?

@Chris Gambino

Contributor

@Monil Patel

Is your idea to remove hyperlinks in their entirety? A Java/JavaScript-style regex that detects any URL would be a good start. This guy covers a few strategies for it:

http://www.regexguru.com/2008/11/detecting-urls-in-a-block-of-text/

New Contributor

Sorry, I meant removing URLs in their entirety so that they won't affect running NLP on the text itself.

Contributor

Use the ReplaceText processor with a regex as the search value that matches every URL, and replace each match with an empty string.
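
As a rough illustration of that idea, here is a small Python sketch you could use to test a URL pattern before putting it into ReplaceText. The pattern `https?://\S+` is a deliberately simple assumption of mine (it only catches links that start with http:// or https://), and `strip_urls` is a hypothetical helper name; the article linked earlier in the thread discusses more thorough patterns.

```python
import re

# Assumed, deliberately simple pattern: "http://" or "https://" followed by
# everything up to the next whitespace character.
URL = re.compile(r"https?://\S+")


def strip_urls(text: str) -> str:
    """Hypothetical helper: remove whole URLs from tweet text."""
    # Swap each URL for a space, then collapse the leftover whitespace.
    without_urls = URL.sub(" ", text)
    return " ".join(without_urls.split())


if __name__ == "__main__":
    print(strip_urls("Read this https://t.co/abc123 now"))
```

If you also strip special characters with a pattern like ([^a-zA-Z-]), run the URL removal first. Otherwise the colons, slashes, and dots are removed and the URL is left behind as loose word fragments that the URL pattern can no longer match.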
