Support Questions

Find answers, ask questions, and share your expertise

Kudu to HDFS data load timestamp issue.

New Contributor

Hello,

I am trying to load tables from Kudu to HDFS using Spark 2, and I have noticed that timestamps are off by 8 hours between Kudu and HDFS.

 

df = (spark_session.read.format('org.apache.kudu.spark.kudu')
      .option('kudu.master', 'dcaldd163:7051,dcaldd162:7051,dcaldd161:7051')
      .option('kudu.table', 'impala::DB.kudu_table_name')
      .load())

 

df.write.format("parquet").mode('overwrite').saveAsTable("db_name.kudu_table_name")

 

I have tried to set the time zone locally for the session in Spark 2, but it does not solve the issue.
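For reference, this is roughly what I tried (a minimal sketch; spark_session is the session from above, and spark.sql.session.timeZone is the session-level setting available since Spark 2.2):

# Session-level time zone for Spark SQL; this affects how Spark SQL
# renders and converts timestamps, but it did not remove the
# 8-hour offset in my case.
spark_session.conf.set('spark.sql.session.timeZone', 'UTC')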

 

Can someone give me a hint on how to solve this issue?

1 ACCEPTED SOLUTION

Rising Star

Hello @GopiG,
have you tried setting the executor's and the driver's parameters in spark-defaults.conf?

spark.driver.extraJavaOptions -Duser.timezone=UTC
spark.executor.extraJavaOptions -Duser.timezone=UTC


You can set the default time zone to UTC, or to any other zone you want, such as GMT+8.
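If you build the session programmatically instead, a minimal sketch of the same idea (note: spark.driver.extraJavaOptions only takes effect if it is set before the driver JVM starts, so for the driver it really belongs in spark-defaults.conf or on the spark2-submit command line; this is an illustration, not something tested on your cluster):

from pyspark.sql import SparkSession

# Pin the executor JVMs to UTC; the driver flag shown here is only
# reliable when passed before the driver JVM launches.
spark = (SparkSession.builder
         .config('spark.executor.extraJavaOptions', '-Duser.timezone=UTC')
         .config('spark.driver.extraJavaOptions', '-Duser.timezone=UTC')
         .getOrCreate())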

Cheers.


4 REPLIES


New Contributor

Thank you so much for your response. 

 

Unfortunately, the solution did not work for me.

Cloudera Version -> CDH-5.16.1-1.cdh5.16.1.p0.3

Spark version -> 2.3.0


Instead of making changes in the spark-defaults.conf file, I have passed the executor's and the driver's parameters along with the spark2-submit command.

I have tried it with the UTC, UTC+8, GMT+8, and America/Los_Angeles time zones, but none of them changed the time portion of the timestamps.

I have copied the entire spark2-submit command for your reference.

 

===========================================================================

command = "spark2-submit --deploy-mode cluster --master yarn --executor-memory " + executor_memory + \
" --name " + job_name + " --executor-cores " + executor_cores + " --driver-memory " + driver_memory \
+ " --conf spark.dynamicAllocation.initialExecutors=" + num_executors \
+ " --conf spark.dynamicAllocation.minExecutors=2" \
+ " --conf spark.dynamicAllocation.maxExecutors=" + str(max_executor) \
+ " --py-files " + utils_file + "," + module_name \
+ " --conf spark.dynamicAllocation.executorIdleTimeout=10" \
+ " --conf spark.serializer=org.apache.spark.serializer.KryoSerializer" \
+ " --conf spark.task.maxFailures=14" \
+ " --conf spark.port.maxRetries=50" \
+ " --conf spark.yarn.max.executor.failures=14" \
+ " --conf spark.executor.memoryOverhead=2000" \
+ " --conf spark.yarn.maxAppAttempts=1" \
+ " --packages org.apache.kudu:kudu-spark2_2.11:1.6.0 "

command += " --files {4},{1},{5},{7} --conf spark.executor.extraJavaOptions=\'-Dlog4j.configuration={6} -Duser.timezone=UTC+8\' --conf spark.driver.extraJavaOptions=\'-Dlog4j.configuration={6} -Duser.timezone=UTC+8\' {0} {3} {2}".format(PROCESS_HANDLER_FILE_PATH, CONFIG_FILE_PATH, job_name, os.path.basename(CONFIG_FILE_PATH), process_csv, log4j_file, os.path.basename(log4j_file), module_base_table_path)

===========================================================================

 

After submitting the above command, I could see the parameters being set properly in the Spark properties in YARN. The lines below are copied from the Spark properties while the job was running.

 

spark.executor.extraJavaOptions -Dlog4j.configuration=spark2_log4j.properties -Duser.timezone=UTC+8
spark.driver.extraJavaOptions -Dlog4j.configuration=spark2_log4j.properties -Duser.timezone=UTC+8
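One thing still worth verifying is what the JVM actually resolves this value to: Java only accepts known zone IDs or custom IDs of the GMT+8 / GMT+08:00 form, and an unrecognized value such as UTC+8 may silently fall back to GMT. A minimal check from inside the job (assuming the live SparkSession is named spark):

# Print the time zone the driver JVM actually resolved
# -Duser.timezone to; an invalid ID typically shows up as GMT.
print(spark.sparkContext._jvm.java.util.TimeZone.getDefault().getID())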

 

Appreciate your response. 

New Contributor

Hi @GopiG,

 

There are several issues you have to consider.

 

1. How the data has been written to the Kudu table:

- via Impala: the timestamp remains local

- via Spark: the timestamp will be converted to UTC in Kudu (however, you can change this behavior in spark.conf)

 

2. Reading the Kudu table in Spark

The timestamp will be converted from UTC to local time, so you have local times in your DataFrame.

 

3. Writing the DataFrame to Hive Parquet

The local timestamp is converted to UTC.

 

You have to check the following configuration options:

- spark.sql.parquet.int96TimestampConversion

- use_local_tz_for_unix_timestamp_conversions

- convert_legacy_hive_parquet_utc_timestamps

https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_timestamp.html
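For example, the first option can be toggled when reading the Parquet data back in Spark (a minimal sketch; the warehouse path is hypothetical):

# Tell Spark to adjust INT96 timestamps written by Impala/Hive
# when reading Parquet (available from Spark 2.3).
spark.conf.set('spark.sql.parquet.int96TimestampConversion', 'true')

df = spark.read.parquet('/user/hive/warehouse/db_name.db/kudu_table_name')  # hypothetical path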

 

New Contributor

Hi @pszabados,


"via Spark: the timestamp will be converted to UTC in Kudu (however, you can change this behavior in spark.conf)"

Please, can you share the option to set?