Member since: 11-16-2015
Posts: 195
Kudos Received: 36
Solutions: 16
My Accepted Solutions
Views | Posted
---|---
2056 | 10-23-2019 08:44 PM
2148 | 09-18-2019 09:48 AM
8196 | 09-18-2019 09:37 AM
1875 | 07-16-2019 10:58 AM
2685 | 04-05-2019 12:06 AM
12-27-2017
01:53 AM
1 Kudo
Spark 2.x comes bundled with its own Scala (version 2.11). You do NOT need to install Scala 2.11 separately or upgrade your existing Scala 2.10 installation; the Spark 2 installation takes care of the Scala version for you. Once you install Spark 2 (just be sure to review the prerequisites and known issues), you can find the Scala 2.11 libraries under /opt/cloudera/parcels/SPARK2/lib/spark2/jars:

# ls -l /opt/cloudera/parcels/SPARK2/lib/spark2/jars/scala*
-rw-r--r-- 1 root root 15487351 Jul 12 19:16 /opt/cloudera/parcels/SPARK2/lib/spark2/jars/scala-compiler-2.11.8.jar
-rw-r--r-- 1 root root 5744974 Jul 12 19:16 /opt/cloudera/parcels/SPARK2/lib/spark2/jars/scala-library-2.11.8.jar
-rw-r--r-- 1 root root 423753 Jul 12 19:16 /opt/cloudera/parcels/SPARK2/lib/spark2/jars/scala-parser-combinators_2.11-1.0.4.jar
-rw-r--r-- 1 root root 4573750 Jul 12 19:16 /opt/cloudera/parcels/SPARK2/lib/spark2/jars/scala-reflect-2.11.8.jar
-rw-r--r-- 1 root root 648678 Jul 12 19:16 /opt/cloudera/parcels/SPARK2/lib/spark2/jars/scala-xml_2.11-1.0.2.jar
-rw-r--r-- 1 root root 802818 Jul 12 19:16 /opt/cloudera/parcels/SPARK2/lib/spark2/jars/scalap-2.11.8.jar

Spark 1.6 and Spark 2.x can coexist because they ship as separate parcels and are invoked differently. For example, to run an application with Spark 2 you use spark2-shell, spark2-submit, or pyspark2; likewise, to run an application with the CDH-bundled Spark 1.6 you use spark-shell, spark-submit, or pyspark.
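As a quick check, each shell can confirm from inside the REPL which Scala it is built against. This is only a minimal sketch; the versions shown below match the jars listed above and the Spark 1.6 banner later in this thread, but yours may differ by parcel release:

$ spark-shell      # CDH-bundled Spark 1.6
scala> scala.util.Properties.versionString
res0: String = version 2.10.5

$ spark2-shell     # Spark 2.x parcel
scala> scala.util.Properties.versionString
res0: String = version 2.11.8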
11-25-2017
10:40 PM
Sure. One way I can think of achieving this is by creating a UDF using random and calling the UDF within withColumn together with coalesce. See below:

scala> df1.show()
+----+--------+----+
| id| name| age|
+----+--------+----+
|1201| satish|39 |
|1202| krishna|null| <<
|1203| amith|47 |
|1204| javed|null| <<
|1205| prudvi|null| <<
+----+--------+----+
scala> import org.apache.spark.sql.functions.{udf, coalesce}
scala> val arr = udf(() => scala.util.Random.nextInt(10).toString())
scala> val df2 = df1.withColumn("age", coalesce(df1("age"), arr()))
scala> df2.show()
+----+--------+---+
| id| name|age|
+----+--------+---+
|1201| satish| 39|
|1202| krishna| 2 | <<
|1203| amith| 47|
|1204| javed| 9 | <<
|1205| prudvi| 7 | <<
+----+--------+---+
11-24-2017
12:10 AM
2 Kudos
The problem is that you are running the LOAD query with the OVERWRITE option while the source data file (the location where the CSV file is placed) is in the same directory the table is located in:

Unable to move source hdfs://quickstart.cloudera:8020/user/data/stocks/stocks.csv to destination hdfs://quickstart.cloudera:8020/user/data/stocks/stocks.csv

The solution is to move the source data file into a different HDFS directory and load the data into the table from there, or alternatively, if the table is newly created, leave the OVERWRITE part out of the query.

Note: In general, if your data is already in the table's location, you don't need to load it again; you can simply define the table using the EXTERNAL keyword, which leaves the files in place but creates the table definition in the Hive metastore. Example:

$ cat /tmp/sample.txt
1 a
2 b
3 c
$ hdfs dfs -mkdir /data1
$ hdfs dfs -chown hive:hive /data1
$ hdfs dfs -cp /tmp/sample.txt /data1
$ hive
hive> CREATE EXTERNAL TABLE weather6 (col1 INT, col2 STRING)
> COMMENT 'Employee details'
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
> STORED AS TEXTFILE
> LOCATION '/data1';
hive> select * from weather6;
OK
1 a
2 b
3 c
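For the original LOAD ... OVERWRITE case, here is a minimal sketch of the first workaround; the staging directory /user/data/staging and the table name stocks are assumptions used for illustration only:

$ hdfs dfs -mkdir /user/data/staging
$ hdfs dfs -mv /user/data/stocks/stocks.csv /user/data/staging/
$ hive
hive> LOAD DATA INPATH '/user/data/staging/stocks.csv' OVERWRITE INTO TABLE stocks;

Since LOAD DATA INPATH moves the staged file into the table's directory, the source path no longer collides with the destination.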
11-17-2017
05:21 AM
SQL provides the function rand for random number generation. In general, we've seen clients use df.na.fill() to replace null values. See if that helps.

scala> df.show()
+----+-----+
|col1| col2|
+----+-----+
|Co |Place|
|null| a1 |
|null| a2 |
+----+-----+
scala> val newDF = df.na.fill(1.0, Seq("col1"))
scala> newDF.show()
+----+-----+
|col1| col2|
+----+-----+
| Co |Place|
| 1 | a1 |
| 1 | a2 |
+----+-----+

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions
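If you specifically want random fill values rather than a constant, a minimal sketch using the built-in rand function may help. This is untested here and simply reuses the column names from the example above:

scala> import org.apache.spark.sql.functions.{coalesce, rand}
scala> val newDF2 = df.withColumn("col1", coalesce(df("col1"), (rand() * 10).cast("int").cast("string")))
scala> newDF2.show()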
11-17-2017
03:07 AM
1. Ensure that the host from where you are running spark-shell or spark2-shell has the corresponding Spark gateway role enabled.
- Log in to the CM WebUI
- Go to the Spark2/Spark service
- Click on the Instances tab
- Ensure that the Gateway role for the host is there; if not, use Add Roles to add it.

2. Ensure that you have selected the Hive service in the Spark configuration.
- Log in to the CM WebUI
- Go to the Spark2/Spark service
- Click on the Configuration tab
- In the search box, type "hive"
- Enable the service, then redeploy the client configuration and restart anything with stale configuration.

3. Once done, open the Spark shell; the Hive context should already be there in the form of the sqlContext variable. The example below shows a very basic SQL query on a Hive table 'sample_07', which contains sample employee data with 4 columns. A transformation is applied using filter, and the resulting DataFrame is saved as a text file in HDFS.

$ spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc (master = yarn-client, app id = application_1510839440070_0006).
SQL context available as sqlContext.

scala> sqlContext.sql("show databases").show()
+-------+
| result|
+-------+
|default|
+-------+

scala> sqlContext.sql("show tables").show()
+----------+-----------+
| tableName|isTemporary|
+----------+-----------+
|hive_table|      false|
| sample_07|      false|
| sample_08|      false|
|  web_logs|      false|
+----------+-----------+

scala> val df_07 = sqlContext.sql("SELECT * from sample_07")

scala> df_07.filter(df_07("salary") > 150000).show()
+-------+--------------------+---------+------+
|   code|         description|total_emp|salary|
+-------+--------------------+---------+------+
|11-1011|    Chief executives|   299160|151370|
|29-1022|Oral and maxillof...|     5040|178440|
|29-1023|       Orthodontists|     5350|185340|
|29-1024|     Prosthodontists|      380|169360|
|29-1061|   Anesthesiologists|    31030|192780|
|29-1062|Family and genera...|   113250|153640|
|29-1063| Internists, general|    46260|167270|
|29-1064|Obstetricians and...|    21340|183600|
|29-1067|            Surgeons|    50260|191410|
|29-1069|Physicians and su...|   237400|155150|
+-------+--------------------+---------+------+

scala> df_07.filter(df_07("salary") > 150000).rdd.coalesce(1).saveAsTextFile("/tmp/c")

Note: This might not be the most elegant way to store the transformed DataFrame, but it works for testing. There are other ways to save the transformation as well, and since we are talking about columns and DataFrames, you might want to consider saving it as CSV using the spark-csv library, or even better in Parquet format (a quick sketch follows below).
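As a rough sketch of those alternatives, assuming the same Spark 1.6 session as above (the spark-csv package coordinates below are an assumption and must match your Scala version), the filtered DataFrame could be written directly instead of going through an RDD:

$ spark-shell --packages com.databricks:spark-csv_2.10:1.5.0   # spark-csv is an external package

scala> val result = df_07.filter(df_07("salary") > 150000)
scala> result.write.parquet("/tmp/c_parquet")   // Parquet support is built in
scala> result.write.format("com.databricks.spark.csv").option("header", "true").save("/tmp/c_csv")   // CSV via spark-csv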
Once saved, you can query the resultant file from HDFS and transfer it locally (if needed):

[root@nightly ~]# hdfs dfs -ls /tmp/c
Found 2 items
-rw-r--r--   2 systest supergroup          0 2017-11-17 02:41 /tmp/c/_SUCCESS
-rw-r--r--   2 systest supergroup        455 2017-11-17 02:41 /tmp/c/part-00000
[root@nightly511-unsecure-1 ~]# hdfs dfs -cat /tmp/c/part-00000
[11-1011,Chief executives,299160,151370]
[29-1022,Oral and maxillofacial surgeons,5040,178440]
[29-1023,Orthodontists,5350,185340]
[29-1024,Prosthodontists,380,169360]
[29-1061,Anesthesiologists,31030,192780]
[29-1062,Family and general practitioners,113250,153640]
[29-1063,Internists, general,46260,167270]
[29-1064,Obstetricians and gynecologists,21340,183600]
[29-1067,Surgeons,50260,191410]
[29-1069,Physicians and surgeons, all other,237400,155150]
[root@nightly ~]# hdfs dfs -get /tmp/c/part-00000 result.txt
[root@nightly ~]# ls
result.txt

Reference: https://www.cloudera.com/documentation/enterprise/latest/topics/spark_sparksql.html#spark_sql_example

Let us know if you've any other questions. Good Luck!
11-14-2017
06:45 AM
This is because saveAsTable() doesn't currently work with Hive. It's documented here:
https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_spark_ki.html#ki_sparksql_dataframe_saveastable

More context: this was recently reported with Spark 2.2, and we are working internally to test a fix; however, I don't have a timeline for when, or in which future release, this will be fixed. Until then, please see if the workaround in the above doc helps.
08-03-2017
07:29 PM
Having an SHS (Spark2 History Server) role or not having a ZK (ZooKeeper) role shouldn't affect the Spark job. All we require is a Spark2 gateway role on the node from where you're running the Spark2 job. Given that other nodes are able to launch the same job, odds are high that there is a problem with the client configuration or classpath on this particular node. BTW, the spark2-conf looks fine. Can you please confirm whether you are able to run a simple Spark Pi job, or does that fail too with the same message?

$ spark2-submit --deploy-mode client --class org.apache.spark.examples.SparkPi /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-examples*.jar 10 10
17/07/10 04:14:53 INFO spark.SparkContext: Running Spark version 2.1.0.cloudera1
....
Pi is roughly 3.1397831397831397
08-03-2017
12:15 AM
Caused by: java.lang.ClassNotFoundException: org.slf4j.impl.StaticLoggerBinder

^ This is generally an indication of a missing or incorrect Hadoop/Spark2 client configuration. I'd make sure that the Spark2 gateway role is added to the node from where you're running spark2-submit.

https://www.cloudera.com/documentation/spark2/latest/topics/spark2_installing.html
"When configuring the assignment of role instances to hosts, add a gateway role to every host."

If you've already done that, please share the output of:

# alternatives --display spark2-conf

Example from a working node:
# alternatives --display spark2-conf
spark2-conf - status is auto.
link currently points to /etc/spark2/conf.cloudera.spark2_on_yarn
/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/etc/spark2/conf.dist - priority 10
/etc/spark2/conf.cloudera.spark2_on_yarn - priority 51
Current `best' version is /etc/spark2/conf.cloudera.spark2_on_yarn.

# grep slf4j-log4j /etc/spark2/conf.cloudera.spark2_on_yarn/classpath.txt
/opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/jars/slf4j-log4j12-1.7.5.jar

# ls -l /opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/jars/slf4j-log4j12-1.7.5.jar
-rw-r--r-- 1 root root 8869 Apr  5 21:44 /opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/jars/slf4j-log4j12-1.7.5.jar
07-27-2017
09:28 AM
If I understand it correctly, you are able to get past the earlier messages complaining about "Yarn application has already ended!", and now when you run pyspark2 it gives you a shell prompt; however, running simple commands to convert a list of strings to upper case results in containers getting killed with exit status 1.

$ pyspark2
Using Python version 3.6.1 (default, Jul 27 2017 11:07:01)
sparkSession available as 'spark'.
>>> strings=['old']
>>> s2=sc.parallelize(strings)
>>> s3=s2.map(lambda x:x.upper())
>>> s3.collect()
[Stage 0:> (0 + 0) / 2]17/07/27 14:52:18 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1501131033996_0005_01_000002 on host: Slave3. Exit status: 1.

To review why the application failed, we need to look at the container logs. Container logs are available from the command line by running the yarn logs command:

# yarn logs -applicationId <application ID> | less

OR

Cloudera Manager > Yarn > WebUI > Resource Manager WebUI > application_1501131033996_0005 > check for Logs at the bottom > stderr and stdout
07-26-2017
11:36 PM
+1 @mbigelow This (not seeing Spark2 in 'Add a Service' wizard) is generally a result of Cloudera Management Services not being restarted (or CM not recognizing the CSD)