Member since: 11-16-2015
Posts: 195
Kudos Received: 36
Solutions: 16

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1997 | 10-23-2019 08:44 PM |
 | 2101 | 09-18-2019 09:48 AM |
 | 7891 | 09-18-2019 09:37 AM |
 | 1846 | 07-16-2019 10:58 AM |
 | 2643 | 04-05-2019 12:06 AM |
12-27-2017
01:53 AM
1 Kudo
Spark 2.x comes bundled with its own Scala (version 2.11). You do NOT need to install Scala 2.11 separately or upgrade your existing Scala 2.10 installation; the Spark 2 installation takes care of the Scala version for you. Once you install Spark 2 (just be sure to review the prerequisites and known issues), you can find the Scala 2.11 libraries under /opt/cloudera/parcels/SPARK2/lib/spark2/jars:

# ls -l /opt/cloudera/parcels/SPARK2/lib/spark2/jars/scala*
-rw-r--r-- 1 root root 15487351 Jul 12 19:16 /opt/cloudera/parcels/SPARK2/lib/spark2/jars/scala-compiler-2.11.8.jar
-rw-r--r-- 1 root root 5744974 Jul 12 19:16 /opt/cloudera/parcels/SPARK2/lib/spark2/jars/scala-library-2.11.8.jar
-rw-r--r-- 1 root root 423753 Jul 12 19:16 /opt/cloudera/parcels/SPARK2/lib/spark2/jars/scala-parser-combinators_2.11-1.0.4.jar
-rw-r--r-- 1 root root 4573750 Jul 12 19:16 /opt/cloudera/parcels/SPARK2/lib/spark2/jars/scala-reflect-2.11.8.jar
-rw-r--r-- 1 root root 648678 Jul 12 19:16 /opt/cloudera/parcels/SPARK2/lib/spark2/jars/scala-xml_2.11-1.0.2.jar
-rw-r--r-- 1 root root 802818 Jul 12 19:16 /opt/cloudera/parcels/SPARK2/lib/spark2/jars/scalap-2.11.8.jar

The reason both Spark 1.6 and Spark 2.x can coexist is that they ship as separate parcels and are invoked through separate commands. For example, to run an application with Spark 2 you use spark2-shell, spark2-submit, or pyspark2; likewise, to run an application with the CDH-bundled Spark 1.6 you use spark-shell, spark-submit, or pyspark.
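If you want to double-check which Scala version a given shell is using, a quick check from inside the shell (the output shown is illustrative, matching the 2.11.8 jars listed above; the Spark 1.6 spark-shell would report 2.10.x instead):

scala> util.Properties.versionString
res0: String = version 2.11.8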
11-25-2017
10:40 PM
Sure. One way I can think of achieving this is to create a UDF that generates a random value and call it inside withColumn together with coalesce. See below:

scala> df1.show()
+----+--------+----+
| id| name| age|
+----+--------+----+
|1201| satish|39 |
|1202| krishna|null| <<
|1203| amith|47 |
|1204| javed|null| <<
|1205| prudvi|null| <<
+----+--------+----+
scala> val arr = udf(() => scala.util.Random.nextInt(10).toString())
scala> val df2 = df1.withColumn("age", coalesce(df1("age"), arr()))
scala> df2.show()
+----+--------+---+
| id| name|age|
+----+--------+---+
|1201| satish| 39|
|1202| krishna| 2 | <<
|1203| amith| 47|
|1204| javed| 9 | <<
|1205| prudvi| 7 | <<
+----+--------+---+
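The snippet above relies on the shell auto-importing org.apache.spark.sql.functions._. As a minimal sketch of the same idea in a compiled application (assuming the same df1 DataFrame), the imports would need to be explicit:

import org.apache.spark.sql.functions.{coalesce, udf}

// Replace null ages with a random single-digit string (same logic as the shell example)
val randomAge = udf(() => scala.util.Random.nextInt(10).toString)
val df2 = df1.withColumn("age", coalesce(df1("age"), randomAge()))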
11-24-2017
12:10 AM
2 Kudos
The problem is running the LOAD query with the OVERWRITE option while the source data file (the location where the CSV file is placed) is in the same directory the table is located in:

Unable to move source hdfs://quickstart.cloudera:8020/user/data/stocks/stocks.csv
to destination hdfs://quickstart.cloudera:8020/user/data/stocks/stocks.csv

The solution is to move the source data file into a different HDFS directory and load the data into the table from there. Alternatively, if the table is newly created, you can leave the OVERWRITE part out of the query.

Note: In general, if your data is already in the table's location, you don't need to load it again; you can simply define the table using the EXTERNAL keyword, which leaves the files in place but creates the table definition in the Hive metastore. Example:

$ cat /tmp/sample.txt
1 a
2 b
3 c
$ hdfs dfs -mkdir /data1
$ hdfs dfs -chown hive:hive /data1
$ hdfs dfs -cp /tmp/sample.txt /data1
$ hive
hive> CREATE EXTERNAL TABLE weather6 (col1 INT, col2 STRING)
> COMMENT 'Employee details'
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
> STORED AS TEXTFILE
> LOCATION '/data1';
hive> select * from weather6;
OK
1 a
2 b
3 c
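For the first option (staging the file elsewhere and loading from there), a rough sketch along the same lines; the staging path /user/data/staging and the table name stocks are hypothetical:

$ hdfs dfs -mkdir /user/data/staging
$ hdfs dfs -mv /user/data/stocks/stocks.csv /user/data/staging/
$ hive
hive> LOAD DATA INPATH '/user/data/staging/stocks.csv' OVERWRITE INTO TABLE stocks;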
11-17-2017
05:21 AM
SQL provides the function rand for random number generation. In general, we've seen clients use df.na.fill() to replace null values. See if that helps:

scala> df.show()
+----+-----+
|col1| col2|
+----+-----+
|Co |Place|
|null| a1 |
|null| a2 |
+----+-----+
scala> val newDF= df.na.fill(1.0, Seq("col1"))
scala> newDF.show()
+----+-----+
|col1| col2|
+----+-----+
| Co |Place|
| 1 | a1 |
| 1 | a2 |
+----+-----+

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions
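Since rand was mentioned above but not shown, here is a sketch of filling the nulls with a random value instead of a constant, using the built-in function (assumes the same df and that col1 should stay a string column):

scala> import org.apache.spark.sql.functions.{coalesce, rand}
scala> val randomFilled = df.withColumn("col1", coalesce(df("col1"), (rand() * 10).cast("int").cast("string")))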
11-17-2017
03:07 AM
1. Ensure that the host from which you are running spark-shell or spark2-shell has the corresponding Spark gateway role enabled:
   - Log in to the CM WebUI
   - Go to the Spark2/Spark service
   - Click on the Instances tab
   - Ensure that the Gateway role is present for the host; if not, use Add Roles to add it.
2. Ensure that you have selected the Hive service in the Spark configuration:
   - Log in to the CM WebUI
   - Go to the Spark2/Spark service
   - Click on the Configuration tab
   - Type "hive" in the search box
   - Enable the Hive service and redeploy the client configuration (including any stale configuration).
3. Once done, open the Spark shell; the Hive context should already be available in the form of the sqlContext variable.

The example below shows a very basic SQL query on a Hive table 'sample_07', which contains sample employee data with 4 columns. A transformation is applied using filter, and the result is saved as a text file in HDFS.

$ spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc (master = yarn-client, app id = application_1510839440070_0006).
SQL context available as sqlContext.

scala> sqlContext.sql("show databases").show()
+-------+
| result|
+-------+
|default|
+-------+

scala> sqlContext.sql("show tables").show()
+----------+-----------+
| tableName|isTemporary|
+----------+-----------+
|hive_table|      false|
| sample_07|      false|
| sample_08|      false|
|  web_logs|      false|
+----------+-----------+

scala> val df_07 = sqlContext.sql("SELECT * from sample_07")

scala> df_07.filter(df_07("salary") > 150000).show()
+-------+--------------------+---------+------+
|   code|         description|total_emp|salary|
+-------+--------------------+---------+------+
|11-1011|    Chief executives|   299160|151370|
|29-1022|Oral and maxillof...|     5040|178440|
|29-1023|       Orthodontists|     5350|185340|
|29-1024|     Prosthodontists|      380|169360|
|29-1061|   Anesthesiologists|    31030|192780|
|29-1062|Family and genera...|   113250|153640|
|29-1063| Internists, general|    46260|167270|
|29-1064|Obstetricians and...|    21340|183600|
|29-1067|            Surgeons|    50260|191410|
|29-1069|Physicians and su...|   237400|155150|
+-------+--------------------+---------+------+

scala> df_07.filter(df_07("salary") > 150000).rdd.coalesce(1).saveAsTextFile("/tmp/c")

Note: This might not be the most elegant way to store the transformed DataFrame, but it works for testing. There are other ways to save the transformation as well; since we are talking about columns and DataFrames, you might want to consider saving it as CSV using the spark-csv library, or even better, in Parquet format. Once saved, you can query the resulting file from HDFS and transfer it locally (if needed).
[root@nightly ~]# hdfs dfs -ls /tmp/c
Found 2 items
-rw-r--r--   2 systest supergroup          0 2017-11-17 02:41 /tmp/c/_SUCCESS
-rw-r--r--   2 systest supergroup        455 2017-11-17 02:41 /tmp/c/part-00000
[root@nightly511-unsecure-1 ~]# hdfs dfs -cat /tmp/c/part-00000
[11-1011,Chief executives,299160,151370]
[29-1022,Oral and maxillofacial surgeons,5040,178440]
[29-1023,Orthodontists,5350,185340]
[29-1024,Prosthodontists,380,169360]
[29-1061,Anesthesiologists,31030,192780]
[29-1062,Family and general practitioners,113250,153640]
[29-1063,Internists, general,46260,167270]
[29-1064,Obstetricians and gynecologists,21340,183600]
[29-1067,Surgeons,50260,191410]
[29-1069,Physicians and surgeons, all other,237400,155150]
[root@nightly ~]# hdfs dfs -get /tmp/c/part-00000 result.txt
[root@nightly ~]# ls
result.txt

Reference: https://www.cloudera.com/documentation/enterprise/latest/topics/spark_sparksql.html#spark_sql_example

Let us know if you have any other questions. Good luck!
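As a sketch of the alternatives mentioned in the note above (the output paths /tmp/c_parquet and /tmp/c_csv are illustrative, and the CSV variant assumes the spark-csv package is available on the Spark 1.6 shell):

scala> val highPaid = df_07.filter(df_07("salary") > 150000)
scala> highPaid.write.parquet("/tmp/c_parquet")
scala> highPaid.write.format("com.databricks.spark.csv").option("header", "true").save("/tmp/c_csv")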
11-14-2017
06:45 AM
This is because saveAsTable() doesn't currently work with Hive. It's documented here: https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_spark_ki.html#ki_sparksql_dataframe_saveastable

More context: this was recently reported with Spark 2.2, and we are working internally to test a fix; however, I don't have a timeline on when, or in which future release, this will be fixed. Until then, please see if the workaround in the doc above helps.
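For illustration only (the linked doc has the authoritative workaround), one pattern that often sidesteps saveAsTable is to create the Hive table up front and then append to it with insertInto. A rough sketch, where my_db.my_table is a hypothetical pre-existing table whose schema matches the DataFrame df:

scala> df.write.mode("append").insertInto("my_db.my_table")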
08-03-2017
07:29 PM
Having an SHS (Spark2 History Server) role or not having a ZK (ZooKeeper) role shouldn't affect the Spark job. All we require is a Spark2 gateway role on the node from where you're running the Spark2 job. Given that other nodes are able to launch the same job, odds are high that there is a problem with the client configuration or classpath on this node in particular. BTW, the spark2-conf looks fine. Can you please confirm whether you are able to run a simple Spark Pi job, or does that fail too with the same message?

$ spark2-submit --deploy-mode client --class org.apache.spark.examples.SparkPi /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-examples*.jar 10 10
17/07/10 04:14:53 INFO spark.SparkContext: Running Spark version 2.1.0.cloudera1
....
Pi is roughly 3.1397831397831397
08-03-2017
12:15 AM
Caused by: java.lang.ClassNotFoundException: org.slf4j.impl.StaticLoggerBinder

^ This is generally an indication of a non-existent or incorrect Hadoop/Spark2 client configuration. I'd make sure that the Spark2 gateway role is added to the node from where you're running spark2-submit: https://www.cloudera.com/documentation/spark2/latest/topics/spark2_installing.html ("When configuring the assignment of role instances to hosts, add a gateway role to every host.")

If you've already done that, please share the output of:

# alternatives --display spark2-conf

Example from a working node:
# alternatives --display spark2-conf
spark2-conf - status is auto.
link currently points to /etc/spark2/conf.cloudera.spark2_on_yarn
/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/etc/spark2/conf.dist - priority 10
/etc/spark2/conf.cloudera.spark2_on_yarn - priority 51
Current `best' version is /etc/spark2/conf.cloudera.spark2_on_yarn.

# grep slf4j-log4j /etc/spark2/conf.cloudera.spark2_on_yarn/classpath.txt
/opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/jars/slf4j-log4j12-1.7.5.jar

# ls -l /opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/jars/slf4j-log4j12-1.7.5.jar
-rw-r--r-- 1 root root 8869 Apr 5 21:44 /opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/jars/slf4j-log4j12-1.7.5.jar
07-27-2017
09:28 AM
If I understand it correctly, you are able to get past the earlier messages complaining about "Yarn application has already ended!", and now when you run pyspark2 it gives you a shell prompt; however, running simple commands to convert a list of strings to upper case results in containers getting killed with exit status 1.

$ pyspark2
Using Python version 3.6.1 (default, Jul 27 2017 11:07:01)
sparkSession available as 'spark'.
>>> strings=['old']
>>> s2=sc.parallelize(strings)
>>> s3=s2.map(lambda x:x.upper())
>>> s3.collect()
[Stage 0:> (0 + 0) / 2]17/07/27 14:52:18 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1501131033996_0005_01_000002 on host: Slave3. Exit status: 1.

To review why the application failed, we need to look at the container logs. Container logs are available from the command line by running the yarn logs command:

# yarn logs -applicationId <application ID> | less

OR

Cloudera Manager > Yarn > WebUI > Resource Manager WebUI > application_1501131033996_0005 > check for Logs at the bottom > stderr and stdout
07-26-2017
11:36 PM
+1 @mbigelow. This (not seeing Spark2 in the 'Add a Service' wizard) is generally a result of the Cloudera Management Services not having been restarted (or CM not recognizing the CSD).