Archives of Support Questions (Read Only)

Stewart12586 · ‎06-18-2016

I have 45 text files with 5 columns each, and I'm using Pig to add a new column to each file based on its filename.

First question: I uploaded all the files into HDFS manually. Do you think uploading a compressed file would be a better option?

Second question: my code is below. In your opinion, is it the best way to add a new column to my files?

I submitted this code and it has been processing for hours. All of my files are in the data directory.

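-- '-tagFile' prepends the source file name as the first field of each record
Data = LOAD '/user/data' USING PigStorage(' ', '-tagFile');

-- Alias names are case sensitive in Pig, so STORE must reference Data, not DATA
STORE Data INTO '/user/data/Data_Transformation/SourceFiles' USING PigStorage(' ');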

Thanks!!!

MVERVUURT · ‎07-04-2016

If I may take a different approach to your problem: I would use Spark to do the job. Load the data from each file into a separate Spark DataFrame, add a new column with the desired value, and write everything back to HDFS, preferably in a format such as Parquet, compressed with Snappy.
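
A minimal PySpark sketch of this idea, assuming Spark 2.x or later (for SparkSession and the CSV reader) and space-delimited input. It is a variation rather than exactly the per-file approach described above: it reads the whole directory into one DataFrame and uses input_file_name() to recover each row's source file. The column name source_file and the output path /user/data_tagged are hypothetical.

```python
import re

# Matches the last path component, e.g. "file01.txt" in "hdfs://nn/user/data/file01.txt".
BASENAME_PATTERN = r"([^/]+)$"


def file_name_of(path):
    """Plain-Python equivalent of the Spark expression below, for checking the pattern locally."""
    match = re.search(BASENAME_PATTERN, path)
    return match.group(1) if match else ""


def add_file_name_column(spark, input_dir, output_dir):
    """Read space-delimited text files, tag each row with its file name, write Snappy Parquet."""
    # Imported here so the helper above can be used without a Spark installation.
    from pyspark.sql import functions as F

    # Without a header or schema, the columns are named _c0, _c1, ...
    df = spark.read.csv(input_dir, sep=" ")
    tagged = df.withColumn(
        "source_file",  # hypothetical column name
        F.regexp_extract(F.input_file_name(), BASENAME_PATTERN, 1),
    )
    tagged.write.mode("overwrite").parquet(output_dir, compression="snappy")
    return tagged


# Usage (paths are assumptions; keep the output outside the input directory):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("tag-files").getOrCreate()
#   add_file_name_column(spark, "/user/data", "/user/data_tagged")
```

Reading the directory in one pass avoids launching a separate job per file, which may matter more than the output format when there are only 45 small files.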

Stewart12586 · ‎07-08-2016

I'll need to install a notebook to use Spark with Python (is there a tutorial for that?). After that, I think I will use your idea 🙂

MVERVUURT · ‎07-08-2016

The easiest way I know to get Spark working with IPython and the Jupyter Notebook is to set the following two environment variables, as described in the book "Learning Spark":

IPYTHON=1

IPYTHON_OPTS="notebook"

Afterwards, run ./bin/pyspark

NB: it's possible to pass more Jupyter options using IPYTHON_OPTS; by googling a bit you'll find them.
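
One caveat worth knowing: Spark 2.0 and later no longer accept IPYTHON and IPYTHON_OPTS; the equivalent variables there are PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS. As a rough sketch, a small Python launcher could pick the right pair. The function names and the spark_home argument are hypothetical, and it assumes Jupyter is already installed and on the PATH.

```python
import os
import subprocess


def notebook_env(spark_major_version, base_env=None):
    """Return a copy of the environment with the notebook variables for this Spark version."""
    env = dict(os.environ if base_env is None else base_env)
    if spark_major_version >= 2:
        # Spark 2.0+ replaced the IPYTHON variables with these two.
        env["PYSPARK_DRIVER_PYTHON"] = "jupyter"
        env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
    else:
        # Spark 1.x, as described in the post above.
        env["IPYTHON"] = "1"
        env["IPYTHON_OPTS"] = "notebook"
    return env


def launch_pyspark(spark_home, spark_major_version):
    """Start ./bin/pyspark under spark_home with the notebook environment (blocks until exit)."""
    pyspark = os.path.join(spark_home, "bin", "pyspark")
    return subprocess.call([pyspark], env=notebook_env(spark_major_version))
```

For example, launch_pyspark("/opt/spark", 1) would be the programmatic equivalent of the two exports followed by ./bin/pyspark, where /opt/spark is only a placeholder for the actual install location.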

roczei · ‎07-09-2016

Dear Stewart,

Here you can read about Spark notebooks:

http://www.cloudera.com/documentation/enterprise/latest/topics/spark_ipython.html

Best regards,

Gabor

Cloudera Community

Pig statement is taking a long time