Pig statement is taking a long time

Rising Star

I have 45 text files with 5 columns, and I'm using Pig to add a new column to each file based on its filename.

First question: I uploaded all the files into HDFS manually. Do you think it would be better to upload a compressed file instead?

Second question: I put my code below. In your opinion, is this the best way to add a new column to my files?

I submitted this code and it has been processing for hours... All of my files are in the Data directory...

-- '-tagFile' prepends the source filename as the first field of each record
Data = LOAD '/user/data' USING PigStorage(' ', '-tagFile');

STORE Data INTO '/user/data/Data_Transformation/SourceFiles' USING PigStorage(' ');


Thanks!!!

4 REPLIES

Contributor

If I may take a different approach on your problem, I would use Spark to do the job: load the data of each file into a separate Spark DataFrame, add a new column with the desired value, and write everything back to HDFS, preferably in a format such as Parquet compressed with Snappy.
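
Something along the lines of the sketch below could do it. It's untested and makes a few assumptions: Spark 2.x with the built-in CSV reader, space-delimited files with no header, the same HDFS paths as in your Pig script, and placeholder names for the output directory and the new column.

from functools import reduce

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("add-filename-column").getOrCreate()
sc = spark.sparkContext

# List the input files in /user/data through the Hadoop FileSystem API
# (reached via Spark's py4j gateway).
hadoop = sc._jvm.org.apache.hadoop.fs
fs = hadoop.FileSystem.get(sc._jsc.hadoopConfiguration())
statuses = fs.listStatus(hadoop.Path("/user/data"))

frames = []
for status in statuses:
    if not status.isFile():
        continue
    path = status.getPath().toString()
    filename = status.getPath().getName()
    # One DataFrame per file; the filename becomes the value of the new column.
    df = spark.read.csv(path, sep=" ")
    frames.append(df.withColumn("source_file", lit(filename)))

# All frames share the same schema, so a plain union merges them.
result = reduce(lambda a, b: a.union(b), frames)

# Write everything back to HDFS as snappy-compressed Parquet
# (the output path is a placeholder).
result.write.mode("overwrite") \
    .option("compression", "snappy") \
    .parquet("/user/data/Data_Transformation/SourceFiles_parquet")

spark.stop()

Parquet with Snappy compression keeps the output splittable and much smaller than plain text. Depending on your Spark version, pyspark.sql.functions.input_file_name() can also tag each row with its source path in a single read, instead of the per-file loop.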

Rising Star

I'll need to install a notebook to use Spark and Python (is there any tutorial for that?). After that I think I will use your idea 🙂

Contributor

The easiest way I know to get Spark working with IPython and the Jupyter Notebook is to set the following two environment variables, as described in the book "Learning Spark":

IPYTHON=1
IPYTHON_OPTS="notebook"

Afterwards, run ./bin/pyspark.

NB: it's possible to pass more Jupyter options through IPYTHON_OPTS; a quick search will turn them up.

Expert Contributor

Dear Stewart,

Here you can read about Spark notebooks:

http://www.cloudera.com/documentation/enterprise/latest/topics/spark_ipython.html

Best regards,
Gabor