Member since: 11-14-2018
Posts: 9
Kudos Received: 0
Solutions: 0
05-31-2019
10:32 AM
It's a problem with permissions: you need to let Spark know about the local dir. The following code then works:

import time
from pyspark.sql import SparkSession


def xmlConvert(spark):
    etl_time = time.time()
    df = spark.read.format('com.databricks.spark.xml').options(rowTag='HistoricalTextData').load(
        '/home/zangetsu/proj/prometheus-core/demo/demo-1-iot-predictive-maintainance/dataset/train/')
    df = df.withColumn("TimeStamp", df["TimeStamp"].cast("timestamp")).groupBy("TimeStamp").pivot("TagName").sum(
        "TagValue").na.fill(0)
    df.repartition(1).write.csv(
        path="/home/zangetsu/proj/prometheus-core/demo/demo-1-iot-predictive-maintainance/result/",
        mode="overwrite",
        header=True,
        sep=",")
    print("Time taken to do xml transformation: --- %s seconds ---" % (time.time() - etl_time))


if __name__ == '__main__':
    spark = SparkSession \
        .builder \
        .appName('XML ETL') \
        .master("local[*]") \
        .config('job.local.dir', '/home/zangetsu/proj/prometheus-core/demo/demo-1-iot-predictive-maintainance') \
        .config('spark.driver.memory', '64g') \
        .config('spark.debug.maxToStringFields', '200') \
        .config('spark.jars.packages', 'com.databricks:spark-xml_2.11:0.5.0') \
        .getOrCreate()

    print('Session created')

    try:
        xmlConvert(spark)
    finally:
        spark.stop()
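As a quick sanity check (a minimal sketch, not part of the original post), the written result can be read back with the same session, before spark.stop() runs; the path is the result directory used above:

# Sketch: read the pivoted CSV back and inspect the schema and a few rows.
result = spark.read.csv(
    '/home/zangetsu/proj/prometheus-core/demo/demo-1-iot-predictive-maintainance/result/',
    header=True, inferSchema=True)
result.printSchema()
result.show(5)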
05-31-2019
09:29 AM
And I found a solution by pointing job.local.dir to the directory with the code:

spark = SparkSession \
    .builder \
    .appName('XML ETL') \
    .master("local[*]") \
    .config('job.local.dir', 'file:/home/zangetsu/proj/prometheus-core/demo/demo-1-iot-predictive-maintainance') \
    .config('spark.jars.packages', 'com.databricks:spark-xml_2.11:0.5.0') \
    .getOrCreate()

Now everything works.
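A minimal follow-up sketch (not from the original post): while the session above is still live, the standard spark.conf.get API can confirm both settings were actually applied:

# Sketch: echo the configs back from the running session.
print(spark.conf.get('job.local.dir'))
print(spark.conf.get('spark.jars.packages'))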
05-31-2019
09:24 AM
My simple ETL code:

import time
from pyspark.sql import SparkSession


def xmlConvert(spark):
    etl_time = time.time()
    df = spark.read.format('com.databricks.spark.xml').options(rowTag='HistoricalTextData').load(
        'file:///home/zangetsu/proj/prometheus-core/demo/demo-1-iot-predictive-maintainance/dataset/data_train')
    df = df.withColumn("TimeStamp", df["TimeStamp"].cast("timestamp")).groupBy("TimeStamp").pivot("TagName").sum(
        "TagValue").na.fill(0)
    df.repartition(1).write.csv(
        path="file:///proj/prometheus-core/demo/demo-1-iot-predictive-maintainance/dataset/",
        mode="overwrite",
        header=True,
        sep=",")
    print("Time taken to do xml transformation: --- %s seconds ---" % (time.time() - etl_time))


if __name__ == '__main__':
    spark = SparkSession \
        .builder \
        .appName('XML ETL') \
        .master("local[*]") \
        .config('spark.jars.packages', 'com.databricks:spark-xml_2.11:0.5.0') \
        .getOrCreate()

    print('Session created')

    try:
        xmlConvert(spark)
    finally:
        spark.stop()

It is still throwing the issue reported above.
02-03-2019
04:37 PM
Hi, I am not an expert at administering Cloudera, but starting from an existing Express docker image I found, I upgraded to the latest 5.*. I created this docker image as a kind of base for new project development: https://cloud.docker.com/u/archenroot/repository/docker/archenroot/cloudera-cdap-jdk8 What would be the right steps to upgrade this image to 6.*? @maziyar - if you are willing to help, I can add you as a collaborator on dockerhub...
11-14-2018
09:04 PM
I am happy you fixed the issue, but next time you might consider writing some details about how you got out of that trouble, as others might be in the same situation as well 🙂