Created on 08-01-2019 01:33 PM - last edited on 08-02-2019 09:43 AM by VidyaSargur
Hello,
I am trying to load tables from Kudu to HDFS using Spark 2, and I have noticed that timestamps are off by 8 hours between Kudu and HDFS.
df = spark_session.read.format('org.apache.kudu.spark.kudu') \
    .option('kudu.master', 'dcaldd163:7051,dcaldd162:7051,dcaldd161:7051') \
    .option('kudu.table', "impala::DB.kudu_table_name") \
    .load()
df.write.format("parquet").mode('overwrite').saveAsTable("db_name.kudu_table_name")
I have tried setting the timezone locally for the Spark 2 session, but it still does not solve the issue.
Can someone give a hint on how to solve this issue?
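For reference, this is roughly what I mean by setting the timezone locally for the session (a sketch assuming spark_session is the SparkSession from the snippet above; spark.sql.session.timeZone is the Spark SQL session timezone setting available since Spark 2.2):

# Sketch: set the SQL session timezone on the existing session.
# This affects Spark SQL's own timestamp conversions, not the JVM's default zone.
spark_session.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")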
Created 08-02-2019 02:35 AM
Hello @GopiG,
Have you tried setting the executor's and the driver's params in spark-defaults.conf?
spark.driver.extraJavaOptions -Duser.timezone=UTC
spark.executor.extraJavaOptions -Duser.timezone=UTC
You can set the default time zone to UTC or to any other zone you want, such as GMT+8.
Cheers.
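If it helps, here is a sketch of the same idea when building the session in PySpark. Note that spark.driver.extraJavaOptions generally must be set before the driver JVM starts, so for the driver prefer spark-defaults.conf or --conf on spark2-submit; the executor option can be set on the builder:

from pyspark.sql import SparkSession

# Sketch: pin the executor JVMs to UTC via user.timezone.
# The driver JVM is already running at this point, so its -Duser.timezone
# belongs in spark-defaults.conf or on the spark2-submit command line.
spark = (SparkSession.builder
         .config("spark.executor.extraJavaOptions", "-Duser.timezone=UTC")
         .getOrCreate())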
Created 08-06-2019 01:11 PM
Thank you so much for your response.
Unfortunately the solution did not work for me.
Cloudera Version -> CDH-5.16.1-1.cdh5.16.1.p0.3
spark version -> 2.3.0
Instead of making changes in the spark-defaults.conf file, I have passed the executor's and the driver's params along with the spark2-submit command.
I have tried it with the UTC, UTC+8, GMT+8 and America/Los_Angeles timezones, but none of them changed the time in the date portion.
I have copied the entire spark2-submit command for your reference.
===========================================================================
command = "spark2-submit --deploy-mode cluster --master yarn --executor-memory " + executor_memory + \
" --name " + job_name + " --executor-cores " + executor_cores + " --driver-memory " + driver_memory \
+ " --conf spark.dynamicAllocation.initialExecutors=" + num_executors \
+ " --conf spark.dynamicAllocation.minExecutors=2" \
+ " --conf spark.dynamicAllocation.maxExecutors=" + str(max_executor) \
+ " --py-files " + utils_file + "," + module_name \
+ " --conf spark.dynamicAllocation.executorIdleTimeout=10" \
+ " --conf spark.serializer=org.apache.spark.serializer.KryoSerializer" \
+ " --conf spark.task.maxFailures=14" \
+ " --conf spark.port.maxRetries=50" \
+ " --conf spark.yarn.max.executor.failures=14" \
+ " --conf spark.executor.memoryOverhead=2000" \
+ " --conf spark.yarn.maxAppAttempts=1" \
+ " --packages org.apache.kudu:kudu-spark2_2.11:1.6.0 "
command += " --files {4},{1},{5},{7} --conf spark.executor.extraJavaOptions=\'-Dlog4j.configuration={6} -Duser.timezone=UTC+8\' --conf spark.driver.extraJavaOptions=\'-Dlog4j.configuration={6} -Duser.timezone=UTC+8\' {0} {3} {2}".format(PROCESS_HANDLER_FILE_PATH, CONFIG_FILE_PATH, job_name, os.path.basename(CONFIG_FILE_PATH), process_csv, log4j_file, os.path.basename(log4j_file), module_base_table_path)
===========================================================================
After submitting the above command, I could see the params being set properly in the Spark properties in YARN. The lines below are copied from the Spark properties while the job was running.
spark.executor.extraJavaOptions -Dlog4j.configuration=spark2_log4j.properties -Duser.timezone=UTC+8
spark.driver.extraJavaOptions -Dlog4j.configuration=spark2_log4j.properties -Duser.timezone=UTC+8
Appreciate your response.
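As a side check (a hypothetical diagnostic, not part of the original job): as far as I know, java.util.TimeZone silently falls back to GMT for IDs it does not recognize, and "UTC+8" is not a valid java.util.TimeZone ID ("GMT+8" or a region ID like "Asia/Shanghai" is), so it is worth printing the timezone the driver JVM actually resolved. This assumes `spark` is the job's SparkSession:

# Diagnostic sketch: ask the driver JVM which timezone it actually resolved.
# `_jvm` is a py4j internal, but handy for a one-off spot check.
jvm_tz = spark.sparkContext._jvm.java.util.TimeZone.getDefault().getID()
print("Driver JVM timezone:", jvm_tz)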
Created 01-05-2021 02:42 AM
Hi @GopiG,
There are several issues you have to consider.
1. How the data was written to the Kudu table:
- via Impala: the timestamp remains local
- via Spark: the timestamp will be converted to UTC in Kudu (however, you can change this behavior in spark.conf)
2. Reading the Kudu table in Spark:
The timestamp is converted from UTC to local, so you have local times in your DataFrame.
3. Writing the DataFrame to Hive parquet:
The local timestamp is converted back to UTC.
You have to check the following configuration options (see the sketch below the link):
- spark.sql.parquet.int96TimestampConversion (a Spark SQL setting)
- use_local_tz_for_unix_timestamp_conversions (an Impala daemon startup flag)
- convert_legacy_hive_parquet_utc_timestamps (an Impala daemon startup flag)
https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_timestamp.html
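A minimal sketch of where the Spark-side option from the list above goes (the session builder); whether it should be true or false depends on which engine wrote the parquet files you read back:

from pyspark.sql import SparkSession

# Sketch: spark.sql.parquet.int96TimestampConversion tells Spark whether to
# adjust INT96 parquet timestamps (as written by Impala/Hive) when reading.
spark = (SparkSession.builder
         .config("spark.sql.parquet.int96TimestampConversion", "true")
         .getOrCreate())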
Created 07-13-2022 12:10 AM
Hi @pszabados,
"via Spark: the timestamp will be converted to UTC in Kudu (however, you can change this behavior in spark.conf)"
Could you please share which option to set?