Member since: 04-05-2016
Posts: 36
Kudos Received: 8
Solutions: 9
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1130 | 07-30-2019 11:52 PM |
| | 3105 | 06-07-2019 01:01 AM |
| | 8152 | 04-14-2017 08:31 PM |
| | 4082 | 08-03-2016 12:52 AM |
| | 1962 | 06-22-2016 02:10 AM |
07-30-2019
11:52 PM
1 Kudo
Since the "list" command gets the apps from the ResourceManager and doesn't set any explicit filters or limits on the request (except those provided with it), it technically returns all the applications the RM currently holds. That number is controlled by the "yarn.resourcemanager.max-completed-applications" config. Hope that clarifies.
06-07-2019
01:01 AM
1 Kudo
Since your intent seems to be to capture the driver logs in a separate file while executing the app in cluster mode, make sure that the '/some/path/to/edgeNode/' directory is present on all of the NodeManager hosts, because in cluster mode the driver runs inside the YARN application master. If you can't ensure that, follow the general practice of pointing the log file to a pre-existing path, e.g. "/var/log/SparkDriver.log".
05-14-2019
02:42 AM
Please check whether numpy is actually installed on all of the NodeManager hosts. If not, install it using the below command (for Python 2.x):
pip install numpy
If it is already installed, let us know the following:
1) Can you execute the same command outside of Hue, i.e. using spark2-submit? Please mention the full command here.
2) What Spark command do you use in Hue?
08-20-2018
12:23 AM
Thanks for reporting the 404 for that parcel URL, and apologies for the inconvenience caused. However, I can see that the fix for the mentioned JIRA (SPARK-22306) is present in the below CDS releases:
SPARK2-2.3.0-CLOUDERA1
SPARK2-2.3.0-CLOUDERA2
SPARK2-2.3.0-CLOUDERA3
So feel free to use the below link to download the CDS 2.3 Release 3 parcels in the meantime:
http://archive.cloudera.com/spark2/parcels/2.3.0.cloudera3/
03-06-2018
10:25 PM
I believe you can achieve this by following the below sequence:
1) spark.sql("SET spark.sql.shuffle.partitions=12")
2) Execute operations on the small table
3) spark.sql("SET spark.sql.shuffle.partitions=500")
4) Execute operations on the larger table
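A rough sketch of what that sequence looks like in practice; the table names and the SparkSession value `spark` are placeholders, not from the original question:

// keep shuffle parallelism low while the small table is processed
spark.sql("SET spark.sql.shuffle.partitions=12")
val smallAgg = spark.sql("SELECT key, count(*) AS cnt FROM small_table GROUP BY key")  // hypothetical table
smallAgg.write.saveAsTable("small_table_agg")  // this action runs with 12 shuffle partitions

// raise it again before the heavier query
spark.sql("SET spark.sql.shuffle.partitions=500")
val bigAgg = spark.sql("SELECT key, count(*) AS cnt FROM big_table GROUP BY key")  // hypothetical table
bigAgg.write.saveAsTable("big_table_agg")  // this action runs with 500 shuffle partitions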
02-19-2018
11:08 PM
It would help if you could attach the full stack trace of the error you are seeing. As a side note, make sure that you have added a Hive gateway role to the host from which you are submitting the Spark app.
04-14-2017
08:31 PM
It is the below line which sets the data type of both fields to StringType:

val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

You can define your own custom schema as follows (the types live in org.apache.spark.sql.types):

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val customSchema = StructType(Array(
  StructField("name", StringType, true),
  StructField("age", IntegerType, true)))

You can add additional fields to the above schema definition as well. Then use this customSchema while creating the DataFrame as follows:

val peopleDataFrame = sqlContext.createDataFrame(rowRDD, customSchema)

For details, please see this page.
03-06-2017
10:30 PM
By Spark 2.1, do you mean Cloudera Spark 2.0 Release 1 or Apache Spark 2.1? Regarding Cloudera Spark 2.0 Release 1 or Release 2, note that the minimum required CDH version is CDH 5.7.x, but you are on CDH 5.5.4.
08-30-2016
10:04 PM
The valuable information is at the very bottom:
NameError: name 'master' is not defined
Please make sure you have defined the variable "master" in your code. Alternatively, if you are specifying the master via spark-submit, you should not set it in code.
08-03-2016
08:08 AM
You don't need to export a JAR for unit testing. You can do:
new SparkConf().setMaster("local[2]")
and run the program as a usual Java application in the IDE. Also make sure that you have all the dependent libraries on the classpath.
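For completeness, a minimal sketch of such a local run; the object name and the toy job are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

object LocalSparkCheck {  // hypothetical driver object
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("LocalSparkCheck")
      .setMaster("local[2]")  // run Spark inside the IDE with 2 threads
    val sc = new SparkContext(conf)

    val total = sc.parallelize(1 to 10).map(_ * 2).sum()
    println(s"sum = $total")

    sc.stop()
  }
}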
08-03-2016
12:52 AM
1 Kudo
You are getting this exception because sc.textFile reads a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Since you said that you want to get the data from a URL and save it to HDFS, you should do:

val data = scala.io.Source.fromURL("http://10.3.9.34:9900/messages").mkString
val list = data.split("\n").filter(_ != "")
val rdds = sc.parallelize(list)
rdds.saveAsTextFile(outputDirectory)
- Tags:
- spark streaming
06-22-2016
02:10 AM
1 Kudo
The attached log indicates that the application is accepted by the cluster manager (YARN) but is unable to execute due to a resource crunch. Please make sure there are enough resources available in your cluster when submitting the job. Check the following and configure them based on your hosts:
yarn.nodemanager.resource.memory-mb
yarn.nodemanager.resource.cpu-vcores
yarn.scheduler.maximum-allocation-mb
yarn.scheduler.minimum-allocation-mb
yarn.scheduler.maximum-allocation-vcores
yarn.scheduler.minimum-allocation-vcores
06-21-2016
11:35 PM
This looks strange. Your console output listed the below lines:
com.databricks#spark-avro_2.10 added as a dependency
org.apache.avro#avro-mapred added as a dependency
Can you try once with:
--packages com.databricks:spark-avro_2.10:1.0.0,org.apache.avro:avro-mapred:1.6.3
I suspect some version compatibility issue between avro-mapred and spark-avro.
06-15-2016
04:55 AM
Try starting spark-shell with the following packages:
--packages com.databricks:spark-avro_2.10:2.0.1,org.apache.avro:avro-mapred:1.7.7
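Once the shell is up, a quick way to confirm the package is picked up; the file path below is just an example:

import com.databricks.spark.avro._  // provided by the spark-avro package

val df = sqlContext.read.avro("/path/to/episodes.avro")  // example path
// equivalently: sqlContext.read.format("com.databricks.spark.avro").load("/path/to/episodes.avro")
df.printSchema()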
06-10-2016
03:46 AM
Which Python version are you using? You may want to refer to: http://www.cloudera.com/documentation/enterprise/5-5-x/topics/spark_ipython.html
05-30-2016
03:48 AM
1 Kudo
See the Environment tab of the Job History UI and locate "spark.local.dir". Yes, that is the expected behaviour, as the JAR is required by the executors.
05-30-2016
01:05 AM
This looks weird. Can you confirm that http://192.168.88.28:55310/jars/phoenix-1.2.0-client.jar is still not present? Spark keeps all JARs specified by the --jars option in the job's temp directory on each executor node [1]. There must be some OS setting that led to the deletion of the existing phoenix JAR from temp; when the Spark context cannot find it at its usual location, it tries to download it from the given location. However, this should not happen while the temp directory is actively accessed by the job or process. You can try bundling that JAR with your Spark application JAR and then referring to it in spark-submit. I suspect you will again need 20-odd days to test this workaround 🙂
05-11-2016
03:47 AM
1 Kudo
You are mixing up the arguments of the createPollingStream method. Pass 198.168.1.31 as the sink address, as below, and it should work:
FlumeUtils.createPollingStream(ssc, "198.168.1.31", 8020)
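For reference, a minimal sketch of the pull-based (polling) setup around that call, assuming the Flume Spark sink is listening on 198.168.1.31:8020:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

val conf = new SparkConf().setAppName("FlumePollingExample")
val ssc = new StreamingContext(conf, Seconds(10))

// poll the Flume Spark sink for batches of events
val flumeStream = FlumeUtils.createPollingStream(ssc, "198.168.1.31", 8020)
flumeStream.map(e => new String(e.event.getBody.array())).print()

ssc.start()
ssc.awaitTermination()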
05-11-2016
01:52 AM
Add the below dependency as well:
groupId = org.apache.spark
artifactId = spark-streaming-flume_2.10
version = 1.6.1
See here for the pull-based configuration.
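If you build with sbt instead of Maven, the equivalent line (assuming the same Scala 2.10 / Spark 1.6.1 versions as above) would be:

// build.sbt
libraryDependencies += "org.apache.spark" % "spark-streaming-flume_2.10" % "1.6.1"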
04-20-2016
03:29 AM
1 Kudo
CM supports a single version for Spark on YARN and a single version for the Standalone installation (a single version is the common requirement). To support multiple versions of Spark, you need to install the additional version manually on a single node and copy the YARN and Hive config files into its conf directory. When you use that version's spark-submit, it distributes the Spark core binaries to the YARN nodes that execute your code, so you don't need to install Spark on each YARN node.
04-19-2016
04:31 AM
Yes, YARN provides this flexibility. Here you can find the detailed answer. In CDH there is a "Spark" service, which is meant for YARN, and a "Spark Standalone" service, which runs its daemons standalone on the specified nodes. YARN will do the work for you if you want to test multiple versions simultaneously: keep your multiple versions on a gateway host and launch the Spark applications from there.
04-07-2016
04:26 AM
1 Kudo
That's because no new files arrive in the directory after the streaming application starts; textFileStream only picks up files created after the stream is started. You can try "cp" to drop files into the directory after starting the streaming application.
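A minimal sketch of that behaviour; the monitored directory is just an example path:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.textFileStream("/user/test/streamIn")  // example directory
lines.print()

ssc.start()
// only files copied into /user/test/streamIn AFTER this point are picked up
ssc.awaitTermination()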
04-07-2016
01:57 AM
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.util.ThreadUtils$.runInNewThread$default$2()Z
Compare your code with the below line:
.setMaster("local[2]")
BTW, which version of Spark Streaming are you using?
04-06-2016
05:10 AM
You need to assign a number of threads to Spark when running the master locally; the most obvious choice is 2: one to receive the data and one to process it. So the correct code should be:
.setMaster("local[2]")
If your file is not too big, change to:
val ssc = new StreamingContext(sc, Seconds(1))
You have stopped the streaming but forgot to start it:

file.foreachRDD(t => {
  val test = t.map(x => (x.split(" ")(0) + ";" + x.split(" ")(1), 1)).reduceByKey((x, y) => x + y)
  test.saveAsTextFile("/root/file/file1")
})
ssc.start()
ssc.awaitTermination()

For now, don't use ssc.stop().
04-06-2016
04:54 AM
It seems there is some glitch in your code. It would be much easier if you could post your code.
04-06-2016
03:38 AM
From your code:

val textFile = sc.textFileStream("/root/file/test")
textFile.foreachRDD(t => {
  val test = t.map(x => (x.split(" ")(0) + ";" + x.split(" ")(1), 1)).reduceByKey((x, y) => x + y)
  test.saveAsTextFile("/root/file/file1")
})

Mind the t.map(), not file.map().
04-05-2016
11:51 PM
You have a handy method bundled with Spark, "foreachRDD":

val file = ssc.textFileStream("/root/file/test")
file.foreachRDD(t => {
  val test = t.map(...)  // do the map stuff here
  test.saveAsTextFile("/root/file/file1")
})
sc.stop()