Created on 06-18-2016 04:47 AM - edited 09-16-2022 03:26 AM
I have 45 text files with 5 columns each, and I'm using Pig to add a new column to each file based on its filename.
First question: I uploaded all the files into HDFS manually. Do you think uploading a compressed file would be a better option?
Second question: I put my code below. In your opinion, is it the best way to add a new column to my files?
I submitted this code and it's taking hours to process... All of my files are in the Data directory...
-- '-tagFile' prepends each record with the name of the file it came from
Data = LOAD '/user/data' USING PigStorage(' ', '-tagFile');
STORE Data INTO '/user/data/Data_Transformation/SourceFiles' USING PigStorage(' ');
Thanks!!!
Created 07-04-2016 08:54 AM
If I may take a different approach on your problem, I would use Spark to do the job: load the data of each file into a separate Spark DataFrame, add a new column with the desired value, and write everything back to HDFS, preferably in a format such as Parquet compressed with Snappy.
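A minimal PySpark sketch of that approach (assuming the Spark 2.x DataFrame API; the paths come from the original post, and the column name source_file is just a placeholder). Instead of one DataFrame per file, reading the whole directory once and tagging rows with input_file_name() achieves the same result in a single pass:

from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.appName("AddFilenameColumn").getOrCreate()

# Read every space-delimited file under the input directory in one pass,
# tagging each row with the name of the file it came from
df = (spark.read
      .option("sep", " ")
      .csv("/user/data")
      .withColumn("source_file", input_file_name()))

# Parquet is columnar and splittable; Snappy adds fast, lightweight compression
(df.write
   .mode("overwrite")
   .option("compression", "snappy")
   .parquet("/user/data/Data_Transformation/SourceFiles"))

spark.stop()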
Created 07-08-2016 10:14 AM
I'll need to install a notebook to use Spark and Python (is there any tutorial for doing that?). After that, I think I will use your idea 🙂
Created 07-08-2016 05:04 PM
The easiest way I know to get Spark working with IPython and the Jupyter Notebook is to set the following two environment variables, as described in the book "Learning Spark":
IPYTHON=1
IPYTHON_OPTS="notebook"
Afterwards, run ./bin/pyspark.
NB: it's possible to pass more Jupyter options through IPYTHON_OPTS; you'll find them with a bit of googling.
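A side note, in case anyone lands here on a newer release: Spark 2.0 deprecated IPYTHON and IPYTHON_OPTS in favor of the following equivalents (then run ./bin/pyspark as above):
PYSPARK_DRIVER_PYTHON=jupyter
PYSPARK_DRIVER_PYTHON_OPTS="notebook"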
Created 07-09-2016 12:31 AM
Dear Stewart,
Here you can read about Spark notebooks:
http://www.cloudera.com/documentation/enterprise/latest/topics/spark_ipython.html
Best regards,
Gabor