Pig Statement its taking a long time

Stewart12586 — Fri, 16 Sep 2022 10:26:05 GMT

I have 45 text files with 5 columns each, and I'm using Pig to add a new column to each file based on its filename.

First question: I uploaded all the files into HDFS manually. Do you think it would be better to upload a compressed file instead?

Second question: My code is below. In your opinion, is this the best way to add a new column to my files?

I submitted this code and it has been processing for hours. All of my files are in the data directory.

-- '-tagFile' prepends the source file name to each record as the first field.
Data = LOAD '/user/data' USING PigStorage(' ', '-tagFile');

STORE Data INTO '/user/data/Data_Transformation/SourceFiles' USING PigStorage(' ');

Thanks!!!

Re: Pig Statement its taking a long time

MVERVUURT — Mon, 04 Jul 2016 15:54:49 GMT

If I may suggest a different approach to your problem, I would use Spark for this job. Load the data of each file into a separate Spark DataFrame, add a new column with the desired value, and write everything back to HDFS, preferably in a format such as Parquet, compressed with Snappy.
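
A minimal PySpark sketch of this idea, for illustration only. It assumes Spark 2.0 or later (for spark.read.csv), space-delimited input, and that the new column should hold the source file name. Note that it departs from the suggestion above in one respect: rather than one DataFrame per file, it reads all files into a single DataFrame and tags each row using input_file_name(). The paths, the column name source_file, and the helper names are hypothetical.

```python
import re

# Hypothetical paths; the output is kept outside the input directory
# so that a rerun does not read its own results back in.
INPUT_GLOB = "/user/data/*.txt"
OUTPUT_DIR = "/user/data_transformed/source_files"

# Captures the last path component, i.e. the file name.
FILENAME_PATTERN = r"([^/]+)$"


def file_name_of(path):
    """Pure-Python equivalent of the Spark expression below, handy for checking the pattern."""
    return re.search(FILENAME_PATTERN, path).group(1)


def add_filename_column(spark, input_glob=INPUT_GLOB, output_dir=OUTPUT_DIR):
    """Read space-delimited files, tag each row with its file name, write Snappy Parquet."""
    # Imported here so this module loads even where PySpark is not installed.
    from pyspark.sql import functions as F

    # Columns come back as _c0 ... _c4 because the files have no header.
    df = spark.read.option("sep", " ").csv(input_glob)

    # input_file_name() returns the full URI of the file each row came from.
    tagged = df.withColumn(
        "source_file",
        F.regexp_extract(F.input_file_name(), FILENAME_PATTERN, 1),
    )

    (
        tagged.write
        .mode("overwrite")
        .option("compression", "snappy")
        .parquet(output_dir)
    )
    return tagged
```

From a pyspark shell, where a SparkSession named spark already exists, this would be called as add_filename_column(spark).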

Re: Pig Statement its taking a long time

Stewart12586 — Fri, 08 Jul 2016 17:14:24 GMT

I'll need to install a notebook to use Spark with Python (is there a tutorial for that?). After that, I think I'll use your idea 🙂

Re: Pig Statement its taking a long time

MVERVUURT — Sat, 09 Jul 2016 00:04:22 GMT

The easiest way I know to get Spark working with IPython and the Jupyter Notebook is to set the following two environment variables, as described in the book "Learning Spark":

IPYTHON=1

IPYTHON_OPTS="notebook"

Then run ./bin/pyspark

NB: it's possible to pass more Jupyter options using IPYTHON_OPTS; a quick search will turn them up.

Re: Pig Statement its taking a long time

roczei — Sat, 09 Jul 2016 07:31:17 GMT

Dear Stewart,

Here you can read about Spark notebooks:

http://www.cloudera.com/documentation/enterprise/latest/topics/spark_ipython.html

Best regards,

Gabor