Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.

Pig statement is taking a long time

Rising Star

I have 45 text files with 5 columns and I'm using Pig to add a new column to each file based on its filename.

First question: I uploaded all the files into HDFS manually. Do you think it would be better to upload a compressed file?

Second question: My code is below. In your opinion, is it the best way to add a new column to my files?

I submitted this code and it has been processing for hours. All of my files are in the Data directory.

-- '-tagFile' prepends each record's source file name as the first field
Data = LOAD '/user/data' USING PigStorage(' ', '-tagFile');

STORE Data INTO '/user/data/Data_Transformation/SourceFiles' USING PigStorage(' ');


Thanks!!!

4 REPLIES

Contributor

If I may take a different approach to your problem, I would use Spark to do the job: load the data of each file into a separate Spark DataFrame, add a new column with the desired value, and write everything back to HDFS, preferably in a format such as Parquet compressed with Snappy.
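A minimal PySpark sketch of this suggestion, under a few assumptions: the files are space-delimited (as in the Pig script) and live under /user/data, while the output path /user/data_tagged, the column name source_file, and the function names are hypothetical. Instead of building one DataFrame per file, it reads the whole directory in one pass and derives the new column from Spark's input_file_name() function, which should give the same result with less bookkeeping.

```python
import re

# Regex capturing the last path segment (the file name) of a path or URI.
FILE_NAME_PATTERN = r"([^/]+)$"


def file_name_of(path):
    """Pure-Python equivalent of the Spark expression below, for local checks."""
    match = re.search(FILE_NAME_PATTERN, path)
    return match.group(1) if match else ""


def tag_files(spark, input_dir, output_dir):
    """Read space-delimited text files, add a file-name column, write Parquet."""
    # Imported here so the helper above can be used without PySpark installed.
    from pyspark.sql import functions as F

    df = spark.read.csv(input_dir, sep=" ")  # columns are named _c0, _c1, ...
    tagged = df.withColumn(
        "source_file",  # hypothetical column name
        F.regexp_extract(F.input_file_name(), FILE_NAME_PATTERN, 1),
    )
    # Snappy-compressed Parquet, as suggested in the reply above.
    tagged.write.option("compression", "snappy").parquet(output_dir)
    return tagged


# Usage, assuming an existing SparkSession named `spark`:
# tag_files(spark, "/user/data", "/user/data_tagged")
```

Writing to a directory outside the input directory keeps a later run from picking up its own output.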

Rising Star

I'll need to install a notebook to use Spark and Python (is there a tutorial for that?). After that I think I will use your idea 🙂

Contributor

The easiest way I know to get Spark working with IPython and the Jupyter Notebook is to set the following two environment variables, as described in the book "Learning Spark":

IPYTHON=1
IPYTHON_OPTS="notebook"

Afterwards, run ./bin/pyspark

NB: it's possible to pass more Jupyter options using IPYTHON_OPTS; a quick search will turn them up.
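The same settings can also be applied from a small Python launcher. This is only a sketch: the function names and the spark_major_version parameter are hypothetical. It also covers one caveat worth knowing: Spark 2.0 and later no longer accept IPYTHON and IPYTHON_OPTS, and expect PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS instead.

```python
import os
import subprocess


def notebook_env(base_env, spark_major_version=1):
    """Return a copy of base_env with the variables that make pyspark start a notebook."""
    env = dict(base_env)
    if spark_major_version >= 2:
        # Spark 2.0+ replaced the IPYTHON* variables with these two.
        env["PYSPARK_DRIVER_PYTHON"] = "jupyter"
        env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
    else:
        # Spark 1.x, as described in the reply above.
        env["IPYTHON"] = "1"
        env["IPYTHON_OPTS"] = "notebook"
    return env


def launch_pyspark(spark_home, spark_major_version=1):
    """Start bin/pyspark under spark_home with the notebook settings applied."""
    return subprocess.call(
        [os.path.join(spark_home, "bin", "pyspark")],
        env=notebook_env(os.environ, spark_major_version),
    )
```

Calling launch_pyspark with the Spark installation directory should then open the notebook server in place of the plain pyspark shell.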

Expert Contributor

Dear Stewart,

Here you can read about Spark notebooks:

http://www.cloudera.com/documentation/enterprise/latest/topics/spark_ipython.html

Best regards,
Gabor