Member since: 11-16-2015
Posts: 195
Kudos Received: 36
Solutions: 16
My Accepted Solutions
Views | Posted
---|---
997 | 10-23-2019 08:44 PM
1332 | 09-18-2019 09:48 AM
3572 | 09-18-2019 09:37 AM
1089 | 07-16-2019 10:58 AM
1674 | 04-05-2019 12:06 AM
05-01-2018
07:46 PM
@JSenzier are you looking for a particular fix/Jira in Spark 1.6.3 that is not in CDH Spark 1.6? It's true that the latest we have in CDH is Spark 1.6, but it is not an exact replica of base (upstream) Spark 1.6. Here is a list of Spark bugs fixed in CDH 5.14 on top of the base 1.6 version.
... View more
05-01-2018
07:22 PM
1 Kudo
Thanks for reporting. Could you share the full error about the missing lineage file, please? I quickly tested an upgrade from 2.2 to 2.3 but didn't hit this. A full error stack trace would certainly help.
... View more
05-01-2018
05:19 AM
4 Kudos
@rams the error is correct, as the syntax in pyspark differs from that of scala. For reference, here are the steps you'd need to query a Kudu table in pyspark2.

Create a Kudu table using impala-shell:

# impala-shell
CREATE TABLE test_kudu (id BIGINT PRIMARY KEY, s STRING) PARTITION BY HASH(id) PARTITIONS 2 STORED AS KUDU;
insert into test_kudu values (100, 'abc');
insert into test_kudu values (101, 'def');
insert into test_kudu values (102, 'ghi');

Launch pyspark2 with the Kudu artifacts and query the table:

# pyspark2 --packages org.apache.kudu:kudu-spark2_2.11:1.4.0
... (Spark banner: version 2.1.0.cloudera3-SNAPSHOT)
Using Python version 2.7.5 (default, Nov 6 2016 00:28:07)
SparkSession available as 'spark'.
>>> kuduDF = spark.read.format('org.apache.kudu.spark.kudu').option('kudu.master', "nightly512-1.xxx.xxx.com:7051").option('kudu.table', "impala::default.test_kudu").load()
>>> kuduDF.show(3)
+---+---+
| id|  s|
+---+---+
|100|abc|
|101|def|
|102|ghi|
+---+---+

For the record, the same thing can be achieved using the following commands in spark2-shell:

# spark2-shell --packages org.apache.kudu:kudu-spark2_2.11:1.4.0
Spark context available as 'sc' (master = yarn, app id = application_1525159578660_0011).
Spark session available as 'spark'.
Welcome to ... (Spark banner: version 2.1.0.cloudera3-SNAPSHOT)

scala> import org.apache.kudu.spark.kudu._
import org.apache.kudu.spark.kudu._
scala> val df = spark.sqlContext.read.options(Map("kudu.master" -> "nightly512-1.xx.xxx.com:7051", "kudu.table" -> "impala::default.test_kudu")).kudu
scala> df.show(3)
+---+---+
| id|  s|
+---+---+
|100|abc|
|101|def|
|102|ghi|
+---+---+
... View more
04-25-2018
08:49 AM
Try this: http://site.clairvoyantsoft.com/installing-sparkr-on-a-hadoop-cluster/
... View more
04-15-2018
11:17 PM
@hedy thanks for sharing. The workaround you received makes sense when you are not using any cluster manager. Local mode (--master local[i]) is generally used when you want to test or debug something quickly, since only one JVM is launched on the node from which you run pyspark, and that JVM acts as driver, executor, and master all in one. But of course, with local mode you lose the scalability and resource management that a cluster manager provides. If you want to debug why simultaneous Spark shells are not working with Spark-on-YARN, we need to diagnose it from the YARN perspective (troubleshooting steps shared in the last post). Let us know.
... View more
04-11-2018
06:05 AM
2 Kudos
If the question is academic in nature then certainly, you can. If it's instead a real use case and I had to choose between Sqoop and Spark SQL, I'd stick with Sqoop. The reason is that Sqoop comes with many connectors it can use directly, while Spark typically goes in via plain old JDBC and so will be substantially slower and put more load on the target database. You may also run into partition-size constraints while extracting the data. So performance and manageability would certainly be key in deciding the solution. Good luck, and let us know which one you finally prefer and how your experience goes. Thanks.
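To illustrate the Spark JDBC path being compared above, here is a minimal Spark 2.x sketch of a partitioned JDBC extract; the URL, table, credentials, and partition bounds are hypothetical placeholders, and the matching JDBC driver would need to be on the classpath (e.g. via --jars or --packages).

import org.apache.spark.sql.SparkSession

// A sketch only: the endpoint, table, user, and bounds below are made-up examples.
object JdbcExtractSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jdbc-extract-sketch").getOrCreate()

    val orders = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/sales")   // hypothetical endpoint
      .option("dbtable", "orders")                        // hypothetical table
      .option("user", "etl_user")
      .option("password", "******")
      .option("partitionColumn", "order_id")              // numeric column used to split the read
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")                       // 8 parallel JDBC connections
      .load()

    // Land the extract in HDFS; Sqoop would do the equivalent with purpose-built connectors.
    orders.write.mode("overwrite").parquet("/data/staging/orders")
    spark.stop()
  }
}

Each of the 8 partitions opens its own connection against the source database, which is exactly where the extra load on the target DB mentioned above comes from.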
... View more
04-11-2018
05:45 AM
Welcome, Eswar! Since you are using Spark 1.6, all you need is a Hive gateway to explore Hive tables from Spark SQL (no need to manually copy hive-site.xml around). You can add/ensure that the Hive gateway is on the node from where you run spark-shell (in your case there is just one node, so it should be your QuickStart VM) using CM > Hive > Instances > Gateway Role.

As for your request for sample code, you can start by creating a sequence or an array from the shell:

scala> val data = Seq(("Falcon", 30), ("IronMan", 40), ("BlackWidow", 10))

Next, parallelize the collection and create a DataFrame from the RDD:

scala> val df = sc.parallelize(data).toDF("Name", "Count")

After this, set the Hive warehouse path:

scala> val options = Map("path" -> "/user/hive/warehouse/avengers")

Followed by saving the table:

scala> df.write.options(options).saveAsTable("default.avengers")

Finally, query the table using Spark SQL and beeline:

scala> sqlContext.sql("select * from avengers").collect.foreach(println)
[Falcon, 30]
[IronMan, 40]
[BlackWidow, 10]

$ beeline …
> show tables;
> select * from avengers;
Falcon 30
IronMan 40
BlackWidow 10

Hope this helps. Let us know if you already got past it and/or if you are still stuck. Good luck!
... View more
04-11-2018
03:45 AM
Interesting. Since you are running in local mode, have you already tried adjusting the number of threads in local[N] (i.e., increasing or decreasing the value of N)? Also, how many logical cores do you have on the server? It would also be good to know what the program is doing and how big the dataset is.
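For reference, here is a minimal sketch (assuming Spark 2.x) of where the local[N] thread count gets pinned; the app name, N=4, and the toy job are arbitrary examples, and the same can be done at launch time with --master local[N].

import org.apache.spark.sql.SparkSession

// Sketch: pin the number of local-mode threads explicitly; "local[*]" would use
// one thread per logical core instead.
object LocalThreadsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("local-threads-sketch")
      .master("local[4]")   // try increasing/decreasing N here
      .getOrCreate()

    // A trivial job just to exercise the configured parallelism.
    val doubled = spark.sparkContext.parallelize(1 to 1000000, 8).map(_ * 2).count()
    println(s"count=$doubled, defaultParallelism=${spark.sparkContext.defaultParallelism}")
    spark.stop()
  }
}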
... View more
04-11-2018
03:33 AM
The closest you could get is YARN Node Labels (once they are implemented in CDH). This YARN feature lets you specify which worker nodes (NodeManagers) you want the application containers to run on. However, it's not available yet, and work is still going on to incorporate it.
... View more
04-10-2018
11:08 PM
1 Kudo
Sorry, this is a bug described in SPARK-22876, which suggests that the current logic of spark.yarn.am.attemptFailuresValidityInterval is flawed. The Jira is still being worked on, but looking at the comments, I don't foresee a fix anytime soon.
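For context, this is roughly where that property gets set; a minimal sketch assuming Spark 2.x on YARN, with an arbitrary one-hour window (which, per SPARK-22876, may not actually reset the AM failure count as intended). The same property can also be passed with --conf on spark2-submit.

import org.apache.spark.sql.SparkSession

// Sketch only: the app name and the 1h window are arbitrary examples.
object AmValiditySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("am-validity-sketch")
      // Intended meaning: AM failures older than 1h should not count toward the attempt limit.
      .config("spark.yarn.am.attemptFailuresValidityInterval", "1h")
      .getOrCreate()

    // ... application logic ...
    spark.stop()
  }
}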
... View more
04-10-2018
09:37 PM
2 Kudos
WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.

^ This generally means that the problem lies beyond the port mapping (i.e., with the queue configuration, the available resources, or YARN itself). Assuming that you are using Spark 1.6, I'd suggest temporarily changing the shell logging level to INFO and seeing if that gives a hint. The quick and easy way to do this is to edit /etc/spark/conf/log4j.properties on the node from which you run pyspark and change the log level from WARN to INFO.

# vi /etc/spark/conf/log4j.properties
shell.log.level=INFO

$ spark-shell
....
18/04/10 20:40:50 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
18/04/10 20:40:50 INFO util.Utils: Successfully started service 'SparkUI' on port 4041.
18/04/10 20:40:50 INFO client.RMProxy: Connecting to ResourceManager at host-xxx.cloudera.com/10.xx.xx.xx:8032
18/04/10 20:40:52 INFO impl.YarnClientImpl: Submitted application application_1522940183682_0060
18/04/10 20:40:54 INFO yarn.Client: Application report for application_1522940183682_0060 (state: ACCEPTED)
18/04/10 20:40:55 INFO yarn.Client: Application report for application_1522940183682_0060 (state: ACCEPTED)
18/04/10 20:40:56 INFO yarn.Client: Application report for application_1522940183682_0060 (state: ACCEPTED)
18/04/10 20:40:57 INFO yarn.Client: Application report for application_1522940183682_0060 (state: ACCEPTED)

Next, open the ResourceManager UI and check the state of the application (i.e., your second invocation of pyspark) -- whether it is registered but stuck in the ACCEPTED state. If yes, look at the Cluster Metrics row at the top of the RM UI page and see whether there are enough resources available. Then kill the first pyspark session and check whether the second session changes to the RUNNING state in the RM UI. If it does, look at the queue placement rules and stats in Cloudera Manager > YARN > Resource Pools Usage (and Configuration). Hopefully this gives us some more clues. Let us know how it goes, and feel free to share screenshots from the RM UI and the spark-shell INFO logging.
... View more
12-27-2017
02:50 AM
The file prefix looks interesting. To me, it suggests that the SHS (Spark2 History Server) is looking for a log directory that is local to the host where the SHS is supposed to run (file:/) and not within HDFS (hdfs:/):

Caused by: java.io.FileNotFoundException: File file:/user/spark/spark2ApplicationHistory does not exist

Can you share the values for:

Cloudera Manager > Spark2 > Configuration > spark.eventLog.dir
$ cat /var/run/cloudera-scm-agent/process/75-spark2_on_yarn-SPARK2_YARN_HISTORY_SERVER/spark2-conf/spark-history-server.conf
$ hdfs dfs -ls /user/spark
... View more
12-27-2017
01:53 AM
1 Kudo
Spark 2.x comes bundled with its own Scala (version 2.11). You do NOT need to install Scala 2.11 separately or upgrade your existing Scala 2.10; the Spark2 installation takes care of the Scala version for you. Once you install Spark2 (just be sure to review the prerequisites and known issues), you can find the Scala 2.11 libraries under /opt/cloudera/parcels/SPARK2/lib/spark2/jars:

# ls -l /opt/cloudera/parcels/SPARK2/lib/spark2/jars/scala*
-rw-r--r-- 1 root root 15487351 Jul 12 19:16 /opt/cloudera/parcels/SPARK2/lib/spark2/jars/scala-compiler-2.11.8.jar
-rw-r--r-- 1 root root 5744974 Jul 12 19:16 /opt/cloudera/parcels/SPARK2/lib/spark2/jars/scala-library-2.11.8.jar
-rw-r--r-- 1 root root 423753 Jul 12 19:16 /opt/cloudera/parcels/SPARK2/lib/spark2/jars/scala-parser-combinators_2.11-1.0.4.jar
-rw-r--r-- 1 root root 4573750 Jul 12 19:16 /opt/cloudera/parcels/SPARK2/lib/spark2/jars/scala-reflect-2.11.8.jar
-rw-r--r-- 1 root root 648678 Jul 12 19:16 /opt/cloudera/parcels/SPARK2/lib/spark2/jars/scala-xml_2.11-1.0.2.jar
-rw-r--r-- 1 root root 802818 Jul 12 19:16 /opt/cloudera/parcels/SPARK2/lib/spark2/jars/scalap-2.11.8.jar

The reason Spark 1.6 and Spark 2.x can coexist is that they ship as separate parcels and are invoked differently. For example, to run an application with Spark2 you use spark2-shell, spark2-submit, or pyspark2; likewise, to run an application with the CDH-bundled Spark 1.6 you use spark-shell, spark-submit, or pyspark.
... View more
12-10-2017
10:11 PM
What you are observing is expected, since Hive-on-Spark currently works only with the CDH-bundled Spark 1.6 and not Spark 2.0. Hive-on-Spark2 is targeted for support starting with CDH 6.x; however, there is no definitive ETA for its GA at the moment. https://www.cloudera.com/documentation/spark2/2-0-x/topics/spark2_known_issues.html#hive_on_spark
... View more
11-28-2017
08:51 AM
A similar query was asked a while back. Please see here for more details and the workarounds.
... View more
11-25-2017
10:40 PM
Sure. One way I can think of achieving this is to create a UDF that generates a random value and call it within withColumn using coalesce. See below:

scala> df1.show()
+----+--------+----+
| id| name| age|
+----+--------+----+
|1201| satish|39 |
|1202| krishna|null| <<
|1203| amith|47 |
|1204| javed|null| <<
|1205| prudvi|null| <<
+----+--------+----+
scala> val arr = udf(() => scala.util.Random.nextInt(10).toString())
scala> val df2 = df1.withColumn("age", coalesce(df1("age"), arr()))
scala> df2.show()
+----+--------+---+
| id| name|age|
+----+--------+---+
|1201| satish| 39|
|1202| krishna| 2 | <<
|1203| amith| 47|
|1204| javed| 9 | <<
|1205| prudvi| 7 | <<
+----+--------+---+
... View more
11-24-2017
03:44 AM
2 Kudos
Can "Spark" stay with "Spark2" at same time? by the error 'java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream' I understand that library is not found in some directory. Yes, spark(1.x) and spark2 can coexist. spark2 binaries are wrapped separately as spark2-shell, spark2-submit, pyspark2. Both the services are configured to not conflict and run on the same YARN cluster. The error 'java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream' simply means that the client configuration couldn't be found on the host from where you are invoking spark2-shell Just now I have found something important, this directory is EMPTY :
/opt/cloudera/parcels/SPARK2-2.0.0.cloudera2-1.cdh5.7.0.p0.118100/etc/spark2/conf.dist Right, I double checked it on a working host. This is empty. It is the directory that command "alternatives --display spark2-conf" shows as 'best version'. I guess that deploy is not working. Right. This gets changed to /etc/spark2/conf.cloudera.spark2* once the spark2 client configuration are correctly deployed. Installation is correct but fails (obviously because this parcel is for cdh 5.12) 'spark2-shell':
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/spark/launcher/Main : Unsupported major.minor version 52.0 Well, spark2.2 does works on CDH 5.8 and above, however, this message just means that the system couldn't find java8 as default (sorry the message itself is not clear). Please see: https://www.cloudera.com/documentation/spark2/latest/topics/spark2_requirements.html For a list of supported JDK8 version (and recommended) please see the documentation And the way to install and configure JDK8 in your CDH cluster Once done, you can do: $ export JAVA_HOME=/usr/java/jdk1.8.0_121 $ spark2-shell BTW, it would be worth if you could share full stdout.log and last few lines of stderr.log from the client configuration deployment directory / var/run/cloudera-scm-agent/process/ccdeploy_spark-conf_etcsparkconf.cloudera.spark_on_yarn2_-22632408179870636/logs
... View more
11-24-2017
12:10 AM
2 Kudos
The problem is that you are running the LOAD query with the OVERWRITE option while the source data file (the location where the CSV file is placed) is in the same directory the table is located in:

Unable to move source hdfs://quickstart.cloudera:8020/user/data/stocks/stocks.csv to destination hdfs://quickstart.cloudera:8020/user/data/stocks/stocks.csv

The solution is to move the source data file into a different HDFS directory and load the data into the table from there; alternatively, if the table is newly created, you can leave the OVERWRITE part out of the query.

Note: In general, if your data is already in the table's location, you don't need to load it again. You can simply define the table using the EXTERNAL keyword, which leaves the files in place but creates the table definition in the Hive metastore. Example:

$ cat /tmp/sample.txt
1 a
2 b
3 c
$ hdfs dfs -mkdir /data1
$ hdfs dfs -chown hive:hive /data1
$ hdfs dfs -cp /tmp/sample.txt /data1
$ hive
hive> CREATE EXTERNAL TABLE weather6 (col1 INT, col2 STRING)
> COMMENT 'Employee details'
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
> STORED AS TEXTFILE
> LOCATION '/data1';
hive> select * from weather6;
OK
1 a
2 b
3 c
... View more
11-23-2017
07:52 PM
Thank you. The gateway role instances seem fine. However, it's evident that the Spark2 client configurations on the gateway node are not deployed (though I understand you don't see any errors while deploying them from CM). Could you please double-check the latest client configuration deployment logs on the host (e.g. node-r3)? You'd find them under /var/run/cloudera-scm-agent/process/ccdeploy_spark2-conf*

Example:

$ cd /var/run/cloudera-scm-agent/process/ccdeploy_spark2-conf_etcspark2conf.cloudera.spark2_on_yarn_-3339029592499272274/logs
$ cat stderr.log
...
+ cp -a /run/cloudera-scm-agent/process/ccdeploy_spark2-conf_etcspark2conf.cloudera.spark2_on_yarn_-3339029592499272274/spark2-conf /etc/spark2/conf.cloudera.spark2_on_yarn
+ chown root /etc/spark2/conf.cloudera.spark2_on_yarn
+ chmod -R ugo+r /etc/spark2/conf.cloudera.spark2_on_yarn
+ '[' -e /etc/spark2/conf.cloudera.spark2_on_yarn/topology.py ']'
+ /usr/sbin/update-alternatives --install /etc/spark2/conf spark2-conf /etc/spark2/conf.cloudera.spark2_on_yarn 51
+ /usr/sbin/update-alternatives --auto spark2-conf

$ cat stdout.log
....
using /usr/sbin/update-alternatives as UPDATE_ALTERNATIVES
Deploying service client configs to /etc/spark2/conf.cloudera.spark2_on_yarn
invoking optional deploy script scripts/control.sh
/run/cloudera-scm-agent/process/ccdeploy_spark2-conf_etcspark2conf.cloudera.spark2_on_yarn_-3339029592499272274/spark2-conf /run/cloudera-scm-agent/process/ccdeploy_spark2-conf_etcspark2conf.cloudera.spark2_on_yarn_-3339029592499272274
Thu Nov 23 19:09:19 PST 2017: Running Spark2 CSD control script...
Thu Nov 23 19:09:19 PST 2017: Detected CDH_VERSION of [5]
Thu Nov 23 19:09:19 PST 2017: Deploying client configuration
deploy script exited with 0
/run/cloudera-scm-agent/process/ccdeploy_spark2-conf_etcspark2conf.cloudera.spark2_on_yarn_-3339029592499272274

Let us know if you see any errors or exceptions in these logs.
... View more
11-22-2017
03:26 AM
Object sharing between different spark-submit jobs is not available currently. However, it would help immensely to know your use case in as much detail as possible and the problem you are trying to solve by sharing DataFrames. My understanding is that if the data changes infrequently and caching is a must-have, you can use HDFS caching. If the data changes often, i.e. records are constantly updated and the data has to be shared among many different applications, use Kudu. Kudu already has basic caching capabilities where frequently read subsets of data are automatically cached. There was a previous thread a while back along the same lines, and some options you could explore (though unsupported) are Spark JobServer or Tachyon. However, I have not used them and can't comment beyond the references.
... View more
11-21-2017
08:12 AM
This generally happens when the Hive service is not enabled for Spark2. Please ensure that you've selected the Hive Service dependency on Spark2:
- Log in to the CM WebUI
- Go to the Spark2 service
- Click on the Configuration tab
- In the search box, type in "hive"
- If it's set to none, select the Hive service, then redeploy the client and stale configurations.

The default database is 'default'. Are your tables (a1, a2, a3) part of the 'default' database in Hive, or were they created in another database? If you do a "show databases" from Spark SQL, does it list all the databases or just 'default'? See the quick check below.
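A quick sketch of that check from spark2-shell (assuming Spark 2.x with the Hive dependency enabled; the table name a1 is taken from your question):

scala> spark.sql("show databases").show()
scala> spark.catalog.listTables("default").show()
scala> spark.sql("select * from default.a1 limit 5").show()

If only 'default' shows up and your tables are missing, the shell is most likely reading an embedded local metastore rather than your Hive metastore, which points back to the missing Hive dependency or client configuration.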
... View more
11-17-2017
05:21 AM
SQL provides the function "rand" for random number generation. In general, we've seen clients use df.na.fill() to replace null values. See if that helps:

scala> df.show()
+----+-----+
|col1| col2|
+----+-----+
|Co |Place|
|null| a1 |
|null| a2 |
+----+-----+
scala> val newDF= df.na.fill(1.0, Seq("col1"))
scala> newDF.show()
+----+-----+
|col1| col2|
+----+-----+
| Co |Place|
| 1 | a1 |
| 1 | a2 |
+----+-----+

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions
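And since rand is mentioned above, here is a hedged sketch of filling the nulls with a random value instead of a constant, using org.apache.spark.sql.functions (the column name follows the example above):

scala> import org.apache.spark.sql.functions.{coalesce, floor, rand}
scala> val randomDF = df.withColumn("col1", coalesce(df("col1"), (floor(rand() * 10) + 1).cast("string")))
scala> randomDF.show()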
... View more
11-17-2017
03:07 AM
1. Ensure that the host from where you are running spark-shell or spark2-shell has the corresponding Spark gateway role enabled:
- Log in to the CM WebUI
- Go to the Spark2/Spark service
- Click on the Instances tab
- Ensure that the Gateway role for the host is there. If not, use Add Roles to add it.

2. Ensure that you have selected the Hive service in the Spark configuration:
- Log in to the CM WebUI
- Go to the Spark2/Spark service
- Click on the Configuration tab
- In the search box, type in "hive"
- Enable the service, then redeploy the client and stale configurations.

3. Once done, open the Spark shell and the Hive context should already be there in the form of the sqlContext variable. The example below shows a very basic SQL query on a Hive table 'sample_07', which contains sample employee data with 4 columns. A transformation is applied using filter, and the result is saved as a text file in HDFS.

$ spark-shell
Welcome to Spark version 1.6.0
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc (master = yarn-client, app id = application_1510839440070_0006).
SQL context available as sqlContext.

scala> sqlContext.sql("show databases").show()
+-------+
| result|
+-------+
|default|
+-------+

scala> sqlContext.sql("show tables").show()
+----------+-----------+
| tableName|isTemporary|
+----------+-----------+
|hive_table|      false|
| sample_07|      false|
| sample_08|      false|
|  web_logs|      false|
+----------+-----------+

scala> val df_07 = sqlContext.sql("SELECT * from sample_07")
scala> df_07.filter(df_07("salary") > 150000).show()
+-------+--------------------+---------+------+
|   code|         description|total_emp|salary|
+-------+--------------------+---------+------+
|11-1011|    Chief executives|   299160|151370|
|29-1022|Oral and maxillof...|     5040|178440|
|29-1023|       Orthodontists|     5350|185340|
|29-1024|     Prosthodontists|      380|169360|
|29-1061|   Anesthesiologists|    31030|192780|
|29-1062|Family and genera...|   113250|153640|
|29-1063| Internists, general|    46260|167270|
|29-1064|Obstetricians and...|    21340|183600|
|29-1067|            Surgeons|    50260|191410|
|29-1069|Physicians and su...|   237400|155150|
+-------+--------------------+---------+------+

scala> df_07.filter(df_07("salary") > 150000).rdd.coalesce(1).saveAsTextFile("/tmp/c")

Note: This might not be the most elegant way to store the transformed DataFrame, but it works for testing. There are other ways to save the transformation as well, and since we are talking about columns and DataFrames, you might want to consider saving it as CSV using the spark-csv library, or better still, in Parquet format. Once saved, you can query the resulting file from HDFS and transfer it locally (if needed).

[root@nightly ~]# hdfs dfs -ls /tmp/c
Found 2 items
-rw-r--r-- 2 systest supergroup   0 2017-11-17 02:41 /tmp/c/_SUCCESS
-rw-r--r-- 2 systest supergroup 455 2017-11-17 02:41 /tmp/c/part-00000
[root@nightly511-unsecure-1 ~]# hdfs dfs -cat /tmp/c/part-00000
[11-1011,Chief executives,299160,151370]
[29-1022,Oral and maxillofacial surgeons,5040,178440]
[29-1023,Orthodontists,5350,185340]
[29-1024,Prosthodontists,380,169360]
[29-1061,Anesthesiologists,31030,192780]
[29-1062,Family and general practitioners,113250,153640]
[29-1063,Internists, general,46260,167270]
[29-1064,Obstetricians and gynecologists,21340,183600]
[29-1067,Surgeons,50260,191410]
[29-1069,Physicians and surgeons, all other,237400,155150]
[root@nightly ~]# hdfs dfs -get /tmp/c/part-00000 result.txt
[root@nightly ~]# ls
result.txt

Reference: https://www.cloudera.com/documentation/enterprise/latest/topics/spark_sparksql.html#spark_sql_example

Let us know if you have any other questions. Good luck!
... View more
11-14-2017
06:45 AM
This is because saveAsTable() doesn't currently work with Hive. It's documented here: https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_spark_ki.html#ki_sparksql_dataframe_saveastable

More context: this was recently reported with Spark 2.2 and we are working internally to test a fix; however, I don't have a timeline for when, or in which future release, this will be fixed. Until then, please see if the workaround in the above doc helps.
... View more
11-14-2017
06:30 AM
link currently points to /opt/cloudera/parcels/SPARK2-2.0.0.cloudera2-1.cdh5.7.0.p0.118100/etc/spark2/conf.dist

Hmm, that is our problem. If the Spark2 gateway instance and the client configurations were correctly deployed, the link would automatically point to /etc/spark2/conf.cloudera.spark2_on_yarn. Could you possibly share a screenshot of the CM UI > Home page (just displaying all the services) and a screenshot of CM UI > Spark2_on_yarn > Instances? Does redeploying client configurations from CM UI > Cluster name (drop-down icon) > Deploy Client Configuration show all okay, or is there an error? From host 'node-r3', can you also share the output of

$ ls -l /etc/spark2/

to see whether the directory /etc/spark2/conf.cloudera.spark2_on_yarn exists? BTW, I believe you can just run the spark2-shell command instead of going into the Spark2 parcel bin directory and launching ./spark2-shell.
... View more
11-10-2017
06:40 PM
This error is almost always the result of not having the Spark2 gateway role configured on the host from where you're trying to run spark2-shell (CM > Spark2 > Instances > Gateway). I'd ensure that the steps to add the Spark2 service, including the CSD, are correctly followed (including a restart of CM and CMS), and double-check that the client configuration is correctly deployed (CM > Cluster Name drop-down menu > Deploy Client Configuration). If all is well, you should see the alternatives pointing to /etc/spark2/conf... (required for running spark2-shell):

[u_m1@cm bin]# alternatives --display spark2-conf
spark2-conf - status is auto.
link currently points to /etc/spark2/conf.cloudera.spark2_on_yarn
/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/etc/spark2/conf.dist - priority 10
/etc/spark2/conf.cloudera.spark2_on_yarn - priority 51
Current `best' version is /etc/spark2/conf.cloudera.spark2_on_yarn.

Since you mentioned "service is added in ok status although cloudera could not deploy well", can you share with us what the error was? You might also want to remove the service from CM and re-add it, making sure the service is configured according to the documentation. Let us know.
... View more
10-31-2017
10:06 PM
Hi Beniamino, I am not very sure about your second question, but regarding the expired Kerberos ticket, this has been fixed in CDH 5.12+. Alternatively (depending on whether you are using Java or Python), you can call UserGroupInformation.checkTGTAndReloginFromKeytab() periodically, before each put, per RDD partition.
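A minimal sketch of that relogin pattern in Scala (assuming Spark 2.x and that the executors were logged in from a keytab); the sample data and the writeToHBase() helper are hypothetical placeholders for your actual put logic.

import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.sql.SparkSession

// Sketch only: refresh the Kerberos TGT before writing each partition.
object KerberosReloginSketch {
  // Hypothetical placeholder for the real HBase put calls.
  def writeToHBase(records: Iterator[String]): Unit = records.foreach(_ => ())

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kerberos-relogin-sketch").getOrCreate()

    spark.sparkContext.parallelize(Seq("row1", "row2", "row3")).foreachPartition { partition =>
      // If this JVM was logged in from a keytab, re-login when the TGT is close to
      // expiring; this is a cheap no-op while the ticket is still fresh.
      UserGroupInformation.getLoginUser.checkTGTAndReloginFromKeytab()
      writeToHBase(partition)
    }
    spark.stop()
  }
}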
... View more
10-31-2017
04:50 AM
Actually, what you are observing is documented here:

"The following limitations apply to Spark applications that access HBase in a Kerberized cluster: The application must be restarted every seven days. This limitation is due to Spark-on-HBase not obtaining delegation tokens and is related to SPARK-12523. (Although this issue is resolved in Spark 2, Spark-on-HBase for Spark 2 is not supported with CDH.)"

Let us know if this answers your question.
... View more
08-24-2017
10:16 AM
You can use "--packages" with the shell to include any additional 3rd party packages you'd want. For graphframes , please refer to https://spark-packages.org/package/graphframes/graphframes, pick the suitable spark version and install it. Since you are on Spark ver 2.0 and assuming the scala ver 2.11, you should likely use 0.5.0-spark2.0-s_2.11 # pyspark2 --packages graphframes:graphframes:0.5.0-spark2.0-s_2.11
... View more
08-03-2017
07:29 PM
Having an SHS (Spark2 History Server) role, or not having a ZK (ZooKeeper) role, shouldn't affect the Spark job. All we require is a Spark2 gateway role on the node from where you're running the Spark2 job. Given that other nodes are able to launch the same job, odds are high that we have a problem with the client configuration or classpath on this node in particular. BTW, the spark2-conf looks fine. Can you please confirm whether you are able to run a simple Spark Pi job, or does that fail too with the same message?

$ spark2-submit --deploy-mode client --class org.apache.spark.examples.SparkPi /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-examples*.jar 10 10
17/07/10 04:14:53 INFO spark.SparkContext: Running Spark version 2.1.0.cloudera1
....
Pi is roughly 3.1397831397831397
... View more