Member since: 05-06-2014
Posts: 14
Kudos Received: 3
Solutions: 3

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1998 | 07-08-2016 05:04 PM |
| | 3314 | 07-06-2016 02:42 AM |
| | 2044 | 07-04-2016 08:54 AM |
07-08-2016
05:04 PM
1 Kudo
The easiest way I know to get Spark working with IPython and the Jupyter Notebook is to set the following two environment variables, as described in the book "Learning Spark": IPYTHON=1 and IPYTHON_OPTS="notebook". Afterwards, run ./bin/pyspark. NB: you can pass additional Jupyter options through IPYTHON_OPTS; a quick search will turn them up.
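For what it's worth, the same launch can also be scripted; below is a minimal sketch in Python that just exports the two variables and calls bin/pyspark. The Spark home path is a hypothetical placeholder, and it assumes a Spark 1.x layout where bin/pyspark honours IPYTHON and IPYTHON_OPTS.

    # Hypothetical sketch: launch PySpark under the Jupyter Notebook by exporting
    # IPYTHON=1 and IPYTHON_OPTS="notebook" before calling bin/pyspark.
    import os
    import subprocess

    SPARK_HOME = "/opt/spark"  # hypothetical installation path
    env = dict(os.environ, IPYTHON="1", IPYTHON_OPTS="notebook")
    subprocess.run([os.path.join(SPARK_HOME, "bin", "pyspark")], env=env, check=True)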
07-07-2016
07:55 AM
I would recommend the SparkR package, which works similarly to the dplyr package. I find it a lot easier to use than RHadoop, which is still based on MapReduce under the hood, and the big data community is moving rapidly towards Spark. For more information about SparkR, please see the Cloudera community post below: https://community.cloudera.com/t5/Data-Science-and-Machine/Spark-R-in-Cloudera-5-3-0/td-p/37706
07-07-2016
12:26 AM
I am not sure whether what you want to achieve is possible yet with different virtual envs on the master and worker nodes. However, you could try creating virtual envs at the same location on all nodes using Ansible or Puppet. Then modify spark-env.sh, which is executed on every node when a Spark job runs: activate the desired virtual env there and set the environment variable PYSPARK_PYTHON to the location of the Python interpreter in that env (see the sketch below). An alternative could be to run YARN with Docker containers, although that requires some research to get working; the idea would be to have the Spark driver and executors running in Docker containers that ship the desired Python libraries. Fingers crossed 😉
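As a rough illustration, here is a sketch that sets PYSPARK_PYTHON from the driver script instead of spark-env.sh; it assumes a virtualenv exists at the same (hypothetical) path on every node, e.g. provisioned with Ansible or Puppet as described above.

    # Hypothetical sketch: /opt/venvs/myenv is assumed to exist on every node.
    import os
    os.environ["PYSPARK_PYTHON"] = "/opt/venvs/myenv/bin/python"  # hypothetical path

    from pyspark import SparkConf, SparkContext

    def interpreter(_):
        # runs on the executors; reports which Python binary each task uses
        import sys
        return sys.executable

    sc = SparkContext(conf=SparkConf().setAppName("venv-check"))
    print(sc.parallelize(range(2), 2).map(interpreter).collect())
    sc.stop()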
07-06-2016
08:19 AM
It may be a bit of a long shot, but you could mount the directories of the remote server on your local server using Samba and then copy the files to HDFS from the command line.
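For illustration only, a minimal sketch of the copy step, assuming the share is already mounted at a hypothetical /mnt/remote and the standard hdfs CLI is available:

    # Hypothetical sketch: push files from a locally mounted Samba share into HDFS.
    import subprocess

    subprocess.run(
        ["hdfs", "dfs", "-put", "/mnt/remote/data", "/user/cloudera/data"],
        check=True,
    )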
07-06-2016
08:13 AM
I would advise using IPython's debugger, ipdb, which lets you step through every statement one at a time:
* http://quant-econ.net/py/ipython.html#debugging
* https://docs.python.org/3/library/pdb.html
Finally, regarding the other statements above: when you use Anaconda's IPython, remember to set the environment variable PYSPARK_PYTHON to the location of ipython (e.g. /usr/bin/ipython) so PySpark knows where to find it. Good luck.
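A minimal, self-contained example of stepping through a function with ipdb (the function and values are just an illustration):

    # Execution pauses at set_trace(); at the prompt use n (next), s (step),
    # p <expr> (print an expression) and c (continue).
    import ipdb

    def running_total(values):
        total = 0
        for v in values:
            total += v
        return total

    ipdb.set_trace()
    print(running_total([1, 2, 3]))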
07-06-2016
02:42 AM
1 Kudo
(1) I would start by loading the SparkR package into RStudio so you can make use of it. See the following link, under the heading "Using SparkR from RStudio": https://github.com/apache/spark/tree/master/R
(2) Now you are ready to run through the following tutorial. However, instead of reading the data from HDFS, load it from your local file system: http://www.r-bloggers.com/a-first-look-at-spark/
(3) Study the SparkR Guide to gain more in-depth knowledge: http://spark.apache.org/docs/latest/sparkr.html
(4) Study Spark itself (DataFrames, RDDs, etc.), for example with the O'Reilly book "Learning Spark". I find that it always helps to understand how something works under the hood. The same holds for SparkR: you can easily find videos on YouTube that explain how it works under the hood, especially the distributed character of SparkR + Spark.
07-04-2016
08:54 AM
1 Kudo
If I may take a different approach to your problem, I would use Spark for the job: load the data of each file into a separate Spark DataFrame, add a new column with the desired value, and write everything back to HDFS, preferably in a format such as Parquet compressed with Snappy. See the sketch below.
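A minimal PySpark sketch of that flow, using the Spark 2.x DataFrame API and assuming CSV-like input; the paths, column name and values are hypothetical placeholders:

    # One DataFrame per input file, a constant column added, result written back
    # to HDFS as Snappy-compressed Parquet.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("add-column-write-parquet").getOrCreate()

    inputs = [("/data/in/file1.csv", "value1"),   # (path, desired value) placeholders
              ("/data/in/file2.csv", "value2")]

    for path, value in inputs:
        df = spark.read.csv(path, header=True)         # load one file into its own DataFrame
        df = df.withColumn("extra_col", F.lit(value))  # add the new column with the desired value
        out = path.replace("/in/", "/out/").replace(".csv", ".parquet")
        df.write.mode("overwrite").option("compression", "snappy").parquet(out)

    spark.stop()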
07-10-2014
02:19 AM
I got a reply on JIRA that hbase-site.xml is read from the HBASE_CLASSPATH. I have set this environment variable in hbase-env.sh, but it has not worked successfully yet. "hbase-site.xml is read from the HBASE_CLASSPATH."
07-08-2014
02:30 AM
Thanks for the tip! This indeed works, but the solution we are searching for is to have hbase-site.xml picked up automatically for our different environments. By the way, I submitted a bug to the HBase JIRA: https://issues.apache.org/jira/browse/HBASE-11478. Let's hope we can fix this quickly.
07-03-2014
09:35 AM
Dear Experts,

We have written a Pig Java UDF which fetches records one at a time from HBase:

    Configuration config = HBaseConfiguration.create();
    // This instantiates an HTable object that connects you to the table
    HTable table = new HTable(config, hbaseTable);
    // Get the HBase row for the corresponding rowkey
    Get hbaseRow = new Get(Bytes.toBytes(rowkey));
    Result resultRow = table.get(hbaseRow);

However, when we run and use our UDF in Pig, it cannot find hbase-site.xml and looks for our zookeeperQuorum on localhost instead of what is specified in the config file. When we use the Piggybank HBaseStorage, we don't have any problems. I have tried setting PIG_CLASSPATH, PIG_OPTS, etc., but it doesn't work. I would appreciate your help!

Greetings, Mark
06-02-2014
01:51 AM
Hi,

Could you try it with the following? Here I added an extra slash:

    records = load 'hdfs:///localhost:8020/user/cloudera/Employee_pig' AS (A:chararray, B:chararray, C:chararray, D:chararray);

Or with:

    records = load '/user/cloudera/Employee_pig';

Cheers, Mark
06-02-2014
01:49 AM
Is it possible to write a UDF that works as follows?

    B = load ...
    A = PigUDF(B)

I keep getting the following error: "Cannot expand macro 'PigUDF'. Reason: Macro must be defined before expansion." Thank you.
05-20-2014
12:16 AM
Dear Experts,

I have read that external jars or libraries used in Java Pig UDFs can be registered in the following ways:
* REGISTER ../path/library.jar
* pig -Dpig.additional.jars=/local/path/to/your.jar

However, I was writing my SHA1 UDF based on Apache Commons Codec and wasn't successful in registering it: I got an error that the method sha1Hex could not be found. I had the following Maven dependencies:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>0.20.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.pig</groupId>
      <artifactId>piggybank</artifactId>
      <version>0.12.1</version>
    </dependency>
    <dependency>
      <groupId>commons-codec</groupId>
      <artifactId>commons-codec</artifactId>
      <version>1.9</version>
    </dependency>
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-lang3</artifactId>
      <version>3.3.2</version>
    </dependency>

Maybe the issue lies in dependency conflicts: both Apache commons-lang3 and commons-codec are also contained in the piggybank dependency, but those versions are pretty old. Eventually I ended up deleting the two Apache dependencies and using the libs/jars already contained in Piggybank. Why wasn't I able to REGISTER my external jars? How can I register them successfully? Thanks for your help!

Greetings, Mark
05-07-2014
12:02 AM
Dear Experts,

Would you happen to have successful cases where you override HBase timestamps with your own timestamps? Throughout the documentation this is not advised; however, we are importing historical data into HBase. I would be grateful if you have a step-by-step guide or could help us determine the consequences. Does an HBase scan using a timerange query perform a full table scan?

Thank you, Mark