Member since: 06-13-2017 · Posts: 25 · Kudos Received: 3 · Solutions: 0
05-31-2019
10:32 AM
It's a problem with permissions; you need to let Spark know about the local dir. The following code then works:

import time
from pyspark.sql import SparkSession


def xmlConvert(spark):
    etl_time = time.time()
    df = spark.read.format('com.databricks.spark.xml') \
        .options(rowTag='HistoricalTextData') \
        .load('/home/zangetsu/proj/prometheus-core/demo/demo-1-iot-predictive-maintainance/dataset/train/')
    # Pivot each TagName into its own column, summing TagValue per timestamp
    df = df.withColumn("TimeStamp", df["TimeStamp"].cast("timestamp")) \
        .groupBy("TimeStamp").pivot("TagName").sum("TagValue").na.fill(0)
    df.repartition(1).write.csv(
        path="/home/zangetsu/proj/prometheus-core/demo/demo-1-iot-predictive-maintainance/result/",
        mode="overwrite", header=True, sep=",")
    print("Time taken to do xml transformation: --- %s seconds ---" % (time.time() - etl_time))


if __name__ == '__main__':
    spark = SparkSession \
        .builder \
        .appName('XML ETL') \
        .master("local[*]") \
        .config('job.local.dir', '/home/zangetsu/proj/prometheus-core/demo/demo-1-iot-predictive-maintainance') \
        .config('spark.driver.memory', '64g') \
        .config('spark.debug.maxToStringFields', '200') \
        .config('spark.jars.packages', 'com.databricks:spark-xml_2.11:0.5.0') \
        .getOrCreate()
    print('Session created')
    try:
        xmlConvert(spark)
    finally:
        spark.stop()
05-31-2019
10:29 AM
Not helpful yet, but promising: the PIVOT keyword is reserved for future use! https://www.cloudera.com/documentation/enterprise/6/6.2/topics/impala_reserved_words.html
03-28-2018
09:43 AM
@alpertankut current link is https://www.cloudera.com/documentation/enterprise/latest/topics/impala_analytic_functions.html#row_number
02-13-2018
11:13 PM
How to integrate Impala and Spark using Scala?
09-01-2017
02:26 AM
Finally I tested your solution and it worked for me! I'm going to mark your answer as the solution. Thank you so much 😄 Jose.
08-30-2017
12:22 AM
There is a uuid() function in Impala that you can use to generate surrogate keys for Kudu, or you can write an Impala UDF to generate unique bigints.
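To illustrate the second option, here is a small Python sketch of the kind of value such a UDF might compute (a real Impala UDF would be written in C++ against the UDF API; the function names and bit layout below are invented for illustration only):

```python
import itertools
import time
import uuid

def surrogate_uuid():
    """String surrogate key, analogous to what Impala's uuid() returns."""
    return str(uuid.uuid4())

# Hypothetical bigint scheme: millisecond timestamp in the high bits,
# a per-process counter in the low 20 bits, masked to stay positive.
_counter = itertools.count()

def surrogate_bigint():
    ms = int(time.time() * 1000)
    return ((ms << 20) | (next(_counter) & 0xFFFFF)) & 0x7FFFFFFFFFFFFFFF

# Generate a batch of ids; the counter keeps them unique within a process.
ids = [surrogate_bigint() for _ in range(1000)]
```

Note the timestamp-plus-counter layout is only unique per process; a production scheme would also need to encode a worker or host id.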
07-26-2017
11:26 PM
You could use 'tinker step 500' and have the effect that stepping would only be enabled for time differences of more than 500ms. I wouldn't consider this breaking your production environment, but I guess you may have some reason that '-x' is important to you. We'll work on addressing this in a future release so that no system-wide changes are necessary. -Todd
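For reference, a minimal sketch of how that directive might sit in /etc/ntp.conf (the driftfile path is a common default, not from this thread; check the step threshold's units and semantics against your ntpd version's documentation):

```
# /etc/ntp.conf (sketch)
tinker step 500          # per the suggestion above; verify units for your ntpd
driftfile /var/lib/ntp/drift
```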
06-19-2017
05:24 PM
The best way to deal with small files is to not have to deal with them at all. You might want to explore using Kudu or HBase as your storage engine instead of HDFS (Parquet).