Pig statement is taking a long time

SOLVED

I have 45 text files with 5 columns each, and I'm using Pig to add a new column to each file based on its filename.

First question: I uploaded all the files into HDFS manually. Would it be better to upload a compressed file instead?

Second question: my code is below. In your opinion, is this the best way to add a new column to my files?

I submitted this code and it has been processing for hours... All of my files are in the Data directory...

Data = LOAD '/user/data' USING PigStorage(' ', '-tagFile');

 

STORE Data INTO '/user/data/Data_Transformation/SourceFiles' USING PigStorage(' ');
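For context, the '-tagFile' option makes PigStorage prepend each record's source filename as a new first column. A plain-Python sketch of that effect (the filename and records below are illustrative, not the asker's real data):

```python
# Plain-Python sketch of what PigStorage('-tagFile') does: prepend the
# source file's base name to every space-delimited record.
import os

def tag_file(lines, filename):
    """Return each record with the file's base name prepended as a new column."""
    base = os.path.basename(filename)
    return [f"{base} {line}" for line in lines]

rows = tag_file(["1 2 3 4 5", "6 7 8 9 10"], "/user/data/file_a.txt")
# rows -> ["file_a.txt 1 2 3 4 5", "file_a.txt 6 7 8 9 10"]
```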


Thanks!!!

4 REPLIES

Re: Pig statement is taking a long time

Explorer

If I may suggest a different approach to your problem, I would use Spark for the job: load the data of each file into a Spark DataFrame, add a new column with the desired value, and write everything back to HDFS, preferably in a format such as Parquet, compressed with Snappy.

Re: Pig statement is taking a long time

I'll need to install a notebook to use Spark and Python (is there any tutorial for that?). After that I think I will use your idea :)

Re: Pig statement is taking a long time

Explorer

The easiest way I know to get Spark working with IPython and the Jupyter Notebook is by setting the following two environment variables, as described in the book "Learning Spark":

 

IPYTHON=1

IPYTHON_OPTS="notebook"

 

Afterwards, run ./bin/pyspark

 

NB: it's possible to pass more Jupyter options using IPYTHON_OPTS; by googling a bit you'll find them.

Re: Pig statement is taking a long time

Rising Star

Dear Stewart,

 

Here you can read about Spark notebooks:

 

http://www.cloudera.com/documentation/enterprise/latest/topics/spark_ipython.html

 

Best regards,

 

     Gabor