Member since: 06-13-2017 · Posts: 25 · Kudos Received: 3 · Solutions: 0
05-31-2019
10:32 AM
It's a problem with permissions; you need to let Spark know about the local dir. The following code then works:

import time
from pyspark.sql import SparkSession


def xmlConvert(spark):
    etl_time = time.time()
    df = spark.read.format('com.databricks.spark.xml') \
        .options(rowTag='HistoricalTextData') \
        .load('/home/zangetsu/proj/prometheus-core/demo/demo-1-iot-predictive-maintainance/dataset/train/')
    # Pivot each TagName into its own column, summing TagValue per timestamp
    df = df.withColumn("TimeStamp", df["TimeStamp"].cast("timestamp")) \
        .groupBy("TimeStamp").pivot("TagName").sum("TagValue").na.fill(0)
    df.repartition(1).write.csv(
        path="/home/zangetsu/proj/prometheus-core/demo/demo-1-iot-predictive-maintainance/result/",
        mode="overwrite", header=True, sep=",")
    print("Time taken to do xml transformation: --- %s seconds ---" % (time.time() - etl_time))


if __name__ == '__main__':
    spark = SparkSession \
        .builder \
        .appName('XML ETL') \
        .master("local[*]") \
        .config('job.local.dir', '/home/zangetsu/proj/prometheus-core/demo/demo-1-iot-predictive-maintainance') \
        .config('spark.driver.memory', '64g') \
        .config('spark.debug.maxToStringFields', '200') \
        .config('spark.jars.packages', 'com.databricks:spark-xml_2.11:0.5.0') \
        .getOrCreate()
    print('Session created')
    try:
        xmlConvert(spark)
    finally:
        spark.stop()
05-31-2019
10:29 AM
Not helpful yet, but promising: the PIVOT keyword is reserved for future use! https://www.cloudera.com/documentation/enterprise/6/6.2/topics/impala_reserved_words.html
03-28-2018
09:43 AM
@alpertankut current link is https://www.cloudera.com/documentation/enterprise/latest/topics/impala_analytic_functions.html#row_number
02-13-2018
11:13 PM
How to integrate Impala and Spark using Scala?
09-01-2017
02:26 AM
Finally I tested your solution and it worked for me! I'm going to mark your answer as the solution. Thank you so much 😄 Jose.
08-30-2017
12:22 AM
There is a uuid() function in Impala that you can use to generate surrogate keys for Kudu, or you can write an Impala UDF to generate unique bigints.
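To illustrate the second option, here is a small Python sketch of the kind of value such a UDF might compute (a real Impala UDF would be written in C++ against the UDF API; the function names and bit layout below are invented for illustration only):

```python
import itertools
import time
import uuid

def surrogate_uuid():
    """String surrogate key, analogous to what Impala's uuid() returns."""
    return str(uuid.uuid4())

# Hypothetical bigint scheme: millisecond timestamp in the high bits,
# a per-process counter in the low 20 bits, masked to stay positive.
_counter = itertools.count()

def surrogate_bigint():
    ms = int(time.time() * 1000)
    return ((ms << 20) | (next(_counter) & 0xFFFFF)) & 0x7FFFFFFFFFFFFFFF

# Generate a batch of ids; the counter keeps them unique within a process.
ids = [surrogate_bigint() for _ in range(1000)]
```

Note the timestamp-plus-counter layout is only unique per process; a production scheme would also need to encode a worker or host id.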
07-26-2017
11:26 PM
You could use 'tinker step 500' and have the effect that stepping would only be enabled for time differences of more than 500ms. I wouldn't consider this breaking your production environment, but I guess you may have some reason that '-x' is important to you. We'll work on addressing this in a future release so that no system-wide changes are necessary. -Todd
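For reference, a minimal sketch of how that directive might sit in /etc/ntp.conf (the driftfile path is a common default, not from this thread; check the step threshold's units and semantics against your ntpd version's documentation):

```
# /etc/ntp.conf (sketch)
tinker step 500          # per the suggestion above; verify units for your ntpd
driftfile /var/lib/ntp/drift
```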
06-19-2017
05:24 PM
The best way to deal with small files is to not have to deal with them at all. You might want to explore using Kudu or HBase as your storage engine instead of HDFS (Parquet).