Support Questions
Find answers, ask questions, and share your expertise

NLTK, PySpark on Clob imported string data





I am trying to use the NLTK libraries in a PySpark script that pulls a string (<= 50 MB of text data) from a Hive table and parses the text to extract certain pieces into separate columns. Does anybody have a better solution that works at large scale?
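The usual pattern for this is to write a plain Python function that does the per-row parsing, then register it as a PySpark UDF and apply it to the CLOB column. The sketch below shows only the extraction function; it uses the standard-library `re` module so it is self-contained, and the function name, the regexes, and the "capitalized tokens" rule are all illustrative assumptions standing in for whatever NLTK-based parsing the job actually needs.

```python
import re

# Hypothetical per-row extractor. In the real job this function would be
# wrapped with pyspark.sql.functions.udf and applied to the CLOB column
# read from Hive; nltk.sent_tokenize / nltk.word_tokenize could replace
# the regex splits below once Python 2.7 and NLTK are available on the
# executors.
def extract_fields(clob_text):
    """Split a large text blob into sentences and pull out sample tokens."""
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', clob_text.strip())
    # Collect capitalized words as a stand-in for the "text to pull out
    # into separate columns" described in the question.
    tokens = [w for s in sentences
              for w in re.findall(r'\b[A-Z][a-z]+\b', s)]
    return {"sentence_count": len(sentences), "tokens": tokens}

result = extract_fields("Spark reads Hive tables. NLTK parses the text.")
```

Because the UDF runs inside Python worker processes on the executors, NLTK (and any corpora it downloads, e.g. `punkt`) must be installed on every worker node, not just on the driver.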

Also, I use HDP 2.5.3 on Red Hat 6.7, as recommended by the installation instructions, and the installed Python version is 2.6.

Currently I don't have the option to upgrade the OS. Is it advisable to install Python 2.7 (NLTK requires Python 2.7 or above) in a separate directory and use that directory as described in this link:

Also, do I need to install Python 2.7 only on the client node, or on all the nodes in the cluster?
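Since PySpark forks Python worker processes on every executor node, a side-by-side Python 2.7 has to exist at the same path on all worker nodes, not just the client. Spark can then be pointed at it via the standard `PYSPARK_PYTHON` / `PYSPARK_DRIVER_PYTHON` environment variables; the install prefix below is an assumption to adjust to wherever Python 2.7 is actually built:

```shell
# Assumed install prefix; must be the same path on every node in the cluster.
export PYSPARK_PYTHON=/opt/python27/bin/python2.7
# Driver-side interpreter (client node) can be set separately if needed.
export PYSPARK_DRIVER_PYTHON=/opt/python27/bin/python2.7
```

These are typically set in `spark-env.sh` (or exported before calling `spark-submit`) so the system Python 2.6 is left untouched for the OS.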

Please advise.