Member since: 05-06-2014
Posts: 14
Kudos Received: 3
Solutions: 3
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
| | 1998 | 07-08-2016 05:04 PM |
| | 3315 | 07-06-2016 02:42 AM |
| | 2044 | 07-04-2016 08:54 AM |
06-06-2019 11:09 PM
I had added the above values, and that was causing HTTPS to shut down. After deleting those values, it started and is working fine now. Thanks @Harsh J for your reply.
01-29-2019 06:31 AM
Hi, SparkR has been removed from the R CRAN repository. Could you please advise what should be installed instead? Thanks, Sankar
07-09-2016 12:31 AM
2 Kudos
Dear Stewart, Here you can read about Spark notebooks: http://www.cloudera.com/documentation/enterprise/latest/topics/spark_ipython.html Best regards, Gabor
07-07-2016 07:55 AM
I would recommend the SparkR package, which works similarly to the dplyr package. I find it a lot easier to use than RHadoop, which is still based on MapReduce under the hood. The big data community is moving rapidly towards Spark. For more information about SparkR, please see the Cloudera community post linked below. https://community.cloudera.com/t5/Data-Science-and-Machine/Spark-R-in-Cloudera-5-3-0/td-p/37706
07-07-2016 12:26 AM
I am not sure whether what you want to achieve (different virtual envs on the master and worker nodes) is possible yet. However, you could try creating identical virtual envs on all the nodes at the same location using Ansible or Puppet. Afterwards, modify spark-env.sh, which is executed on every node when a Spark job runs: activate the desired virtual env there and set the environment variable PYSPARK_PYTHON to the location of the python binary in that env (see the sketch below). Another alternative could be to use YARN with Docker containers, although that requires some research to get working; the idea would be to have the Spark driver and executors run in Docker containers provided with the desired Python libraries. Fingers crossed 😉
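For illustration, here is a minimal per-application sketch of the PYSPARK_PYTHON mechanism the post describes, assuming the env exists at the same (hypothetical) path on every node; the spark-env.sh approach sets the same variable cluster-wide instead:

```python
# A minimal per-application sketch (not the spark-env.sh approach itself):
# point PySpark at a virtual env's interpreter before the SparkContext starts.
# The path below is hypothetical and must exist at the same location on every node.
import os
os.environ["PYSPARK_PYTHON"] = "/opt/envs/analytics/bin/python"  # hypothetical env

from pyspark import SparkContext

sc = SparkContext(appName="venv-demo")
print(sc.parallelize(range(10)).sum())  # executor tasks run the env's python
sc.stop()
```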
07-06-2016 08:13 AM
I would advise using ipython's internal debugger, ipdb. This debugger allows you to step through every statement (a short example follows below).
* http://quant-econ.net/py/ipython.html#debugging
* https://docs.python.org/3/library/pdb.html
Finally, regarding the other statements above: when you use Anaconda's ipython, remember to set the environment variable PYSPARK_PYTHON to the location of ipython (e.g. /usr/bin/ipython) so PySpark knows where to find it. Good luck.
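A tiny, self-contained illustration of the ipdb workflow; the function and its input are made up for the example:

```python
# Minimal ipdb illustration (install with: pip install ipdb).
# running_total and its input are hypothetical.
import ipdb

def running_total(values):
    total = 0
    for v in values:
        ipdb.set_trace()  # pauses here: 'n' steps, 'p total' inspects, 'c' continues
        total += v
    return total

print(running_total([1, 2, 3]))
```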
07-20-2014 07:15 AM
Please post your Pig grunt log, which will contain the specific exception that prevented your JAR from being registered.
07-20-2014 07:11 AM
What do you intend the UDF to do in the statement X = UDF(Y)? A UDF can process a row or a column, but it needs to be coupled with an actual logical Pig statement, such as FOREACH, to be invoked; a bare X = UDF(Y) does not make much sense in Pig's context (see the sketch below).
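As an illustration of coupling a UDF to FOREACH, here is a minimal Python (Jython) UDF sketch; the file, function, and field names are all hypothetical:

```python
# myudfs.py -- a hypothetical Pig Python (Jython) UDF. Register and invoke it
# from Pig with, for example:
#   REGISTER 'myudfs.py' USING jython AS myudfs;
#   X = FOREACH Y GENERATE myudfs.to_upper(name);
from pig_util import outputSchema

@outputSchema('upper_name:chararray')
def to_upper(s):
    # Invoked once per tuple by the FOREACH statement above.
    return s.upper() if s is not None else None
```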
07-10-2014 02:19 AM
I got a reply on JIRA that hbase-site.xml is read from the HBASE_CLASSPATH. I have set this environment variable in hbase-env.sh, but it has not worked yet.
06-02-2014 01:51 AM
Hi,
Could you try it with the full HDFS URI (scheme, then host and port)?
records = load 'hdfs://localhost:8020/user/cloudera/Employee_pig' AS (A:chararray, B:chararray, C:chararray, D:chararray);
or with just the path?
records = load '/user/cloudera/Employee_pig';
Cheers, Mark
05-07-2014 07:28 AM
In my experience, and as you have seen in the docs, it is generally not advisable to manually manipulate cell timestamps, as HBase uses them for cell versioning. HBase assigns the current time in milliseconds since the epoch to any piece of data you insert into a table, and by default it keeps multiple versions of each cell, so if you overwrite a particular cell's value later, the updated value has a newer timestamp. This allows you to go back and retrieve older versions of that cell if you choose.
If you manually override timestamps when you insert data and end up with two copies of a particular cell (e.g. one that has already been persisted to an HFile, and one that is only in the memstore), then when you read that cell, HBase cannot tell which value is current because the timestamps are the same.
You are better off storing the date/time of the data in a regular column and having your application use that column to organize the data chronologically (see the sketch below). I wouldn't mess with the internal timestamps that HBase uses to organize its data.
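To make that concrete, here is a minimal sketch using the third-party happybase client; the host, table, row key, and column names are assumptions for illustration:

```python
# Sketch with the third-party happybase client (pip install happybase).
# Host, table, row key, and column names below are hypothetical.
import time
import happybase

connection = happybase.Connection('hbase-host')
table = connection.table('sensor_events')

# Store the event time in an ordinary column and let HBase assign its own
# internal cell timestamp, instead of overriding the timestamp on put().
table.put(b'sensor-42', {
    b'cf:value': b'19.5',
    b'cf:event_time': str(int(time.time() * 1000)).encode(),  # epoch millis
})
```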