Pig statement is taking a long time

Rising Star

I have 45 text files with 5 columns, and I'm using Pig to add a new column to each file based on its filename.

First question: I uploaded all the files into HDFS manually. Do you think it would be better to upload a compressed file instead?

Second question: I put my code below. In your opinion, is this the best way to add a new column to my files?

I submitted this code and it has been processing for hours... All of my files are in the Data directory...

-- '-tagFile' prepends the source filename as the first field of each record
Data = LOAD '/user/data' USING PigStorage(' ', '-tagFile');

STORE Data INTO '/user/data/Data_Transformation/SourceFiles' USING PigStorage(' ');


Thanks!!!

4 REPLIES

Contributor

If I may take a different approach on your problem, I would use Spark to do the job: load the data of each file into a separate Spark DataFrame, add a new column with the desired value, and write everything back to HDFS, preferably in a format such as Parquet compressed with Snappy.
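
Something along the lines of the sketch below could do it. It's untested and makes a few assumptions: Spark 2.x with the built-in CSV reader, space-delimited files with no header, the same HDFS paths as in your Pig script, and placeholder names for the output directory and the new column.

from functools import reduce

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("add-filename-column").getOrCreate()
sc = spark.sparkContext

# List the input files in /user/data through the Hadoop FileSystem API
# (reached via Spark's py4j gateway).
hadoop = sc._jvm.org.apache.hadoop.fs
fs = hadoop.FileSystem.get(sc._jsc.hadoopConfiguration())
statuses = fs.listStatus(hadoop.Path("/user/data"))

frames = []
for status in statuses:
    if not status.isFile():
        continue
    path = status.getPath().toString()
    filename = status.getPath().getName()
    # One DataFrame per file; the filename becomes the value of the new column.
    df = spark.read.csv(path, sep=" ")
    frames.append(df.withColumn("source_file", lit(filename)))

# All frames share the same schema, so a plain union merges them.
result = reduce(lambda a, b: a.union(b), frames)

# Write everything back to HDFS as snappy-compressed Parquet
# (the output path is a placeholder).
result.write.mode("overwrite") \
    .option("compression", "snappy") \
    .parquet("/user/data/Data_Transformation/SourceFiles_parquet")

spark.stop()

Parquet with Snappy compression keeps the output splittable and much smaller than plain text. Depending on your Spark version, pyspark.sql.functions.input_file_name() can also tag each row with its source path in a single read, instead of the per-file loop.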

Rising Star

I'll need to install a notebook to use Spark and Python (is there any tutorial for that?). After that I think I will use your idea 🙂

Contributor

The easiest way I know to get Spark working with IPython and the Jupyter Notebook is to set the following two environment variables, as described in the book "Learning Spark":

IPYTHON=1
IPYTHON_OPTS="notebook"

Afterwards, run ./bin/pyspark.

NB: it's possible to pass more Jupyter options through IPYTHON_OPTS; a quick search will turn them up.

Expert Contributor

Dear Stewart,

Here you can read about Spark notebooks:

http://www.cloudera.com/documentation/enterprise/latest/topics/spark_ipython.html

Best regards,
Gabor