Member since: 11-16-2015
Posts: 195
Kudos Received: 36
Solutions: 16
05-31-2019
09:29 AM
And I found a solution by pointing job.local.dir to the directory containing the code:

spark = SparkSession \
    .builder \
    .appName('XML ETL') \
    .master("local[*]") \
    .config('job.local.dir', 'file:/home/zangetsu/proj/prometheus-core/demo/demo-1-iot-predictive-maintainance') \
    .config('spark.jars.packages', 'com.databricks:spark-xml_2.11:0.5.0') \
    .getOrCreate()

Now everything works.
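For completeness, a minimal sketch of reading an XML file once a session like the one above is available; the input path and rowTag value are hypothetical placeholders, not from the original post:

    from pyspark.sql import SparkSession

    # Same idea as above: pull in spark-xml via spark.jars.packages.
    spark = (SparkSession.builder
             .appName('XML ETL')
             .master('local[*]')
             .config('spark.jars.packages', 'com.databricks:spark-xml_2.11:0.5.0')
             .getOrCreate())

    # Hypothetical input path and row tag -- adjust to the actual XML layout.
    df = (spark.read
          .format('com.databricks.spark.xml')
          .option('rowTag', 'record')
          .load('file:/path/to/data.xml'))
    df.printSchema()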
04-10-2019
08:31 AM
Hi, any update on this? Has this issue been resolved? If yes, please share the solution, as we are facing the same issue.
04-05-2019
01:03 AM
1 Kudo
Thank you for the great explanation @AutoIN. This solved my problem. Our CDSW cluster has two nodes, a master and a slave. As described, I was able to figure out that the available CPU and memory on the two hosts are unevenly distributed: for example, I can spin up an engine with many vCPUs but little memory, and vice versa. I was simply not aware that a session cannot share resources across nodes. Thank you very much!
03-08-2019
05:10 AM
I am facing the same issue; can anyone please suggest how to resolve it? When running two Spark applications, one remains in the ACCEPTED state while the other is running. What configuration is needed for both to run? Below is my dynamic resource pool configuration. Please help!
07-05-2018
08:34 PM
1 Kudo
@Rod No, it is unsupported (as of this writing) in both CDH 5 and CDH 6: https://www.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_600_unsupported_features.html#spark ("Spark SQL CLI is not supported").
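As an aside, and only as a hedged illustration (not part of the original reply): the same SQL can still be run programmatically through a SparkSession instead of the unsupported CLI, for example from pyspark2. The table name below is hypothetical.

    from pyspark.sql import SparkSession

    # Run SQL through the SparkSession API rather than the unsupported Spark SQL CLI.
    spark = (SparkSession.builder
             .appName('sql-example')
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("SHOW DATABASES").show()
    spark.sql("SELECT COUNT(*) FROM default.some_table").show()  # hypothetical table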
06-11-2018
09:23 PM
1 Kudo
Just wanted to complete the thread here. This is now documented in the known issues section of the Spark 2.3 documentation, followed by workarounds to mitigate the error. Thx. https://www.cloudera.com/documentation/spark2/latest/topics/spark2_known_issues.html#concept_kgn_j3g_5db

In CDS 2.3 release 2, Spark jobs fail when lineage is enabled because Cloudera Manager does not automatically create the associated lineage log directory (/var/log/spark2/lineage) on all required cluster hosts. Note that this feature is enabled by default in CDS 2.3 release 2.
Implement one of the following workarounds to continue running Spark jobs.
Workaround 1 - Deploy the Spark gateway role on all hosts that are running the YARN NodeManager role
Cloudera Manager only creates the lineage log directory on hosts with Spark 2 roles deployed on them. However, this is not sufficient because the Spark driver can run on any host that is running a YARN NodeManager. To ensure Cloudera Manager creates the log directory, add the Spark 2 gateway role to every cluster host that is running the YARN NodeManager role.
For instructions on how to add a role to a host, see the Cloudera Manager documentation: Adding a Role Instance
Workaround 2 - Disable Spark Lineage Collection
To disable the feature, log in to Cloudera Manager and go to the Spark 2 service. Click Configuration. Search for the Enable Lineage Collection property and uncheck the checkbox to disable lineage collection. Click Save Changes.
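For reference, a minimal sketch of what Workaround 2 amounts to at the application level, assuming the CDS property name spark.lineage.enabled (the setting Cloudera Manager toggles for lineage collection); verify the property against your CDS release before relying on it:

    from pyspark.sql import SparkSession

    # Assumed property name: spark.lineage.enabled -- the setting toggled by
    # Cloudera Manager when lineage collection is disabled (Workaround 2).
    spark = (SparkSession.builder
             .appName('lineage-disabled-example')
             .config('spark.lineage.enabled', 'false')
             .getOrCreate())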
05-02-2018
03:44 AM
1 Kudo
@jirapong this is a known issue which we've recently seen in CDS 2.3. On Spark 2.3 the nativeLoader's (SnappyNativeLoader's) parentClassLoader is now an ExecutorClassLoader, whereas prior to Spark 2.3 the parentClassLoader was a Launcher$ExtClassLoader. This creates an incompatibility with the Snappy version (snappy-java 1.0.4.1) packaged with CDH. We are currently working on a fix for a future release, but there are two workarounds:

1) Use a later version of the Snappy library that works with the above-mentioned class loader change, for example snappy-java-1.1.4. Place the new snappy-java library on a local file system (for example /var/snappy), then run your Spark application with the user classpath options as shown below:

spark2-shell --jars /var/snappy/snappy-java-1.1.4.jar --conf spark.userClassPathFirst=true --conf spark.executor.extraClassPath="./snappy-java-1.1.4.jar"

2) Instead of using Snappy, you can change the compression codec to LZ4 or UNCOMPRESSED (which you've already tested).
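If it helps, a hedged sketch of what workaround 2 can look like in practice, assuming the failure comes from Spark's internal (shuffle/RDD) compression; spark.io.compression.codec is a standard Spark property, but confirm it is the codec actually involved in your failure before switching:

    from pyspark.sql import SparkSession

    # Workaround 2: switch Spark's internal compression away from Snappy.
    # Assumption: the error comes from shuffle/RDD compression; if it comes from
    # file output (e.g. Parquet), that output codec must be changed instead.
    spark = (SparkSession.builder
             .appName('lz4-instead-of-snappy')
             .config('spark.io.compression.codec', 'lz4')
             .getOrCreate())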
05-01-2018
08:26 PM
@Swasg by any chance are you passing the package name to spark-shell? Something like:

spark-shell --packages org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11-2.3.0

The error suggests that the coordinate should be in the form 'groupId:artifactId:version', but in your case it is 'groupId:artifactId-version'. If you are using the package on the command line or somewhere in your configuration, please change it to:

org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.3.0
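Equivalently, and only as an illustration of the corrected coordinate (not from the original reply), the dependency can be declared through spark.jars.packages when building the session, the same property used elsewhere in this thread:

    from pyspark.sql import SparkSession

    # The Maven coordinate must be groupId:artifactId:version (note the final colon).
    spark = (SparkSession.builder
             .appName('kafka-streaming-example')
             .config('spark.jars.packages',
                     'org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.3.0')
             .getOrCreate())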
05-01-2018
05:19 AM
4 Kudos
@rams the error is expected, as the pyspark syntax differs from the Scala syntax. For reference, here are the steps you'd need to query a Kudu table in pyspark2.

Create a Kudu table using impala-shell:

# impala-shell
CREATE TABLE test_kudu (id BIGINT PRIMARY KEY, s STRING)
  PARTITION BY HASH(id) PARTITIONS 2
  STORED AS KUDU;
insert into test_kudu values (100, 'abc');
insert into test_kudu values (101, 'def');
insert into test_kudu values (102, 'ghi');

Launch pyspark2 with the Kudu artifacts and query the table:

# pyspark2 --packages org.apache.kudu:kudu-spark2_2.11:1.4.0
version 2.1.0.cloudera3-SNAPSHOT
Using Python version 2.7.5 (default, Nov 6 2016 00:28:07)
SparkSession available as 'spark'.
>>> kuduDF = spark.read.format('org.apache.kudu.spark.kudu').option('kudu.master', "nightly512-1.xxx.xxx.com:7051").option('kudu.table', "impala::default.test_kudu").load()
>>> kuduDF.show(3)
+---+---+
| id|  s|
+---+---+
|100|abc|
|101|def|
|102|ghi|
+---+---+

For the record, the same thing can be achieved with the following commands in spark2-shell:

# spark2-shell --packages org.apache.kudu:kudu-spark2_2.11:1.4.0
Spark context available as 'sc' (master = yarn, app id = application_1525159578660_0011).
Spark session available as 'spark'.
version 2.1.0.cloudera3-SNAPSHOT
scala> import org.apache.kudu.spark.kudu._
import org.apache.kudu.spark.kudu._
scala> val df = spark.sqlContext.read.options(Map("kudu.master" -> "nightly512-1.xx.xxx.com:7051", "kudu.table" -> "impala::default.test_kudu")).kudu
scala> df.show(3)
+---+---+
| id|  s|
+---+---+
|100|abc|
|101|def|
|102|ghi|
+---+---+
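As a small follow-up sketch (not in the original reply), continuing from the pyspark2 session above: once the Kudu-backed DataFrame is loaded it can also be registered as a temporary view and queried with SQL; the view name below is arbitrary.

    # Register the Kudu-backed DataFrame and query it with Spark SQL.
    kuduDF.createOrReplaceTempView('test_kudu_view')   # arbitrary view name
    spark.sql("SELECT id, s FROM test_kudu_view WHERE id > 100").show()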
04-19-2018
06:49 PM
Thanks a lot! Finally, Sqoop.. 🙂