Member since: 10-07-2015
Posts: 107
Kudos Received: 73
Solutions: 23
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 2540 | 02-23-2017 04:57 PM |
 | 1993 | 12-08-2016 09:55 AM |
 | 8894 | 11-24-2016 07:24 PM |
 | 3967 | 11-24-2016 02:17 PM |
 | 9344 | 11-24-2016 09:50 AM |
08-04-2016
05:00 PM
1 Kudo
Assume you have an ORC table "test" in Hive whose schema matches the CSV file "test.csv". With SparkSQL:

sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", ",")
  .load("/tmp/test.csv")
  .write.insertInto("test")
08-04-2016
04:40 PM
Does the Ambari server see all virtual machines on the other machine, i.e. are they in the same network, and can the Ambari server machine resolve the hostnames of the other machine? If so, can root on the Ambari server machine log into the virtual machines on the other machine without a password? These are a few things that need to happen during registration.
08-04-2016
04:31 PM
1 Kudo
Assume you have a file "/tmp/test.csv" like

Col1|Col2|Col3|Col4
12|34|"56|78"|9A
"AB"|"CD"|EF|"GH:"|:"IJ"

If I load it with Spark:

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", "|")
  .option("escape", ":")
  .load("/tmp/test.csv")
df.show()

+----+----+-----+-------+
|Col1|Col2| Col3|   Col4|
+----+----+-----+-------+
|  12|  34|56|78|     9A|
|  AB|  CD|   EF|GH"|"IJ|
+----+----+-----+-------+

So the example contains delimiters inside quotes as well as escaped quotes. I use ":" to escape quotes; you can use many other characters (but don't use e.g. "#"). Is this something you want to achieve?
07-19-2016
07:14 AM
Example from the Spark doc page (http://spark.apache.org/docs/latest/submitting-applications.html):

# Run on a Spark standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

--executor-memory is the parameter you want to adapt.
07-18-2016
09:27 AM
2 Kudos
Have you tried to avoid folders with empty files? As an idea, instead of using

<DStream>.saveAsTextFiles("/tmp/results/ts", "json")

(which creates folders with empty files if nothing gets streamed from the source), I tried

<DStream>.foreachRDD(rdd => {
  try {
    val f = rdd.first() // fails for empty RDDs
    rdd.saveAsTextFile(s"/tmp/results/ts-${System.currentTimeMillis}.json")
  } catch {
    case e: Exception => println("empty rdd")
  }
})

It seems to work for me: no folders with empty files any more.
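If you are on Spark 1.3 or later, a slightly cleaner sketch of the same idea uses RDD.isEmpty instead of driving the control flow through the exception from first() (dstream here stands for your DStream):

dstream.foreachRDD { rdd =>
  // isEmpty only has to look for a single element, so the check is cheap
  if (!rdd.isEmpty()) {
    rdd.saveAsTextFile(s"/tmp/results/ts-${System.currentTimeMillis}.json")
  }
}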
07-15-2016
11:46 AM
This might help: https://community.hortonworks.com/questions/30288/oozie-spark-action-on-hdp-24-nosuchmethoderror-org.html
07-15-2016
11:44 AM
It looks like you are executing the job as user hadoop; however, Spark wants to access staging data under /user/yarn (which can only be accessed by yarn). How did you start the job, and with which user? I am surprised that Spark uses /user/yarn as the staging dir for user hadoop. Is there any staging dir configuration in your system (SPARK_YARN_STAGING_DIR)?
07-14-2016
07:21 AM
1 Kudo
I don't know where the TFS bit comes from; maybe some dependency problem.

For including all dependencies in the workflow I would recommend going for a fat jar (assembly). In Scala with sbt you can see the idea here: Creating fat jars with sbt. The same works with Maven's "maven-assembly-plugin". You should be able to call your code as

spark-submit --master yarn-cluster \
  --num-executors 2 --driver-memory 1g --executor-memory 2g --executor-cores 2 \
  --class com.SparkSqlExample \
  /home/hadoop/SparkParquetExample-0.0.1-SNAPSHOT-with-dependencies.jar

If this works, the jar with dependencies should be the one to use in the Oozie Spark action.
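For the sbt route, a minimal sketch of what the build needs (plugin and Spark versions here are illustrative): mark Spark itself as "provided" so only your own dependencies end up in the fat jar, then build with "sbt assembly".

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

// build.sbt
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.1" % "provided", // not bundled; the cluster provides it
  "org.apache.spark" %% "spark-sql"  % "1.6.1" % "provided"
)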
07-13-2016
04:20 PM
I installed it manually; it was quite straightforward. However, you need Maven 3.3, else some npm stuff will fail. I just did

mvn clean package -DskipTests

I then copied conf/zeppelin-env.sh.template to conf/zeppelin-env.sh and added

export JAVA_HOME=/usr/jdk64/jdk1.8.0_60/
export SPARK_HOME=/usr/hdp/current/spark-client
export HADOOP_HOME=/usr/hdp/current/hadoop-client

and copied zeppelin-site.xml.template to zeppelin-site.xml and changed the port to 9995. Plus, in Zeppelin I changed the "master" property of the Spark interpreter to yarn-client. Seems to work for me on an HDP 2.4.2 cluster.
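For reference, the port change in zeppelin-site.xml boils down to this property (value as used above; the rest of the file comes from the template):

<property>
  <name>zeppelin.server.port</name>
  <value>9995</value>
</property>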