Member since: 08-01-2019 · Posts: 3 · Kudos Received: 0 · Solutions: 0
07-29-2020 08:20 AM
I am trying to create a range partition by year on a timestamp column of a Kudu table, but I could not find a solution yet. The table has hash partitioning on its primary keys and holds 131M records. I have a requirement to extract the records created/updated in the last 6 months, and I assume a range partition on the lastupdateddate column would help fetch that data faster, since it would avoid a full table scan. Appreciate your thoughts.
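Worth noting: Kudu fixes the partition schema when a table is created, and range-partition columns must be part of the primary key, so adding yearly ranges on lastupdateddate to an existing hash-partitioned table generally means creating a new table (with lastupdateddate in the key) and copying the data over. Below is a minimal sketch of such a table using the kudu-python client; the master address, column names, bucket count and year range are placeholders, not details from this thread:
===========================================================================
import kudu
from kudu.client import Partitioning
from datetime import datetime

# Placeholder master address.
client = kudu.connect(host='kudu-master-host', port=7051)

# Range-partition columns must be part of the primary key in Kudu.
builder = kudu.schema_builder()
builder.add_column('id', type_=kudu.int64, nullable=False)
builder.add_column('lastupdateddate', type_=kudu.unixtime_micros, nullable=False)
builder.set_primary_keys(['id', 'lastupdateddate'])
schema = builder.build()

# Hash on id to spread writes, plus one range partition per year.
part = Partitioning().add_hash_partitions(column_names=['id'], num_buckets=8)
part.set_range_partition_columns(['lastupdateddate'])
for year in range(2015, 2022):
    part.add_range_partition(
        lower_bound={'lastupdateddate': datetime(year, 1, 1)},
        upper_bound={'lastupdateddate': datetime(year + 1, 1, 1)})

client.create_table('my_range_partitioned_table', schema, part)
===========================================================================
The same shape is commonly expressed in Impala DDL with PARTITION BY HASH (...) PARTITIONS N, RANGE (lastupdateddate) (...), followed by an INSERT ... SELECT from the old table.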
Labels:
- Apache Kudu
08-06-2019 01:11 PM
Thank you so much for your response. Unfortunately the solution did not work for me.

Cloudera version: CDH-5.16.1-1.cdh5.16.1.p0.3
Spark version: 2.3.0

Instead of making changes in the spark-defaults.conf file, I have passed the executor and driver params along with the spark2-submit command. I have tried it with the UTC, UTC+8, GMT+8 and America/Los_Angeles timezones, but none of them changed the time in the date portion. I have copied the entire spark2-submit command for your reference.

===========================================================================
command = "spark2-submit --deploy-mode cluster --master yarn --executor-memory " + executor_memory \
    + " --name " + job_name + " --executor-cores " + executor_cores \
    + " --driver-memory " + driver_memory \
    + " --conf spark.dynamicAllocation.initialExecutors=" + num_executors \
    + " --conf spark.dynamicAllocation.minExecutors=2" \
    + " --conf spark.dynamicAllocation.maxExecutors=" + str(max_executor) \
    + " --py-files " + utils_file + "," + module_name \
    + " --conf spark.dynamicAllocation.executorIdleTimeout=10" \
    + " --conf spark.serializer=org.apache.spark.serializer.KryoSerializer" \
    + " --conf spark.task.maxFailures=14" \
    + " --conf spark.port.maxRetries=50" \
    + " --conf spark.yarn.max.executor.failures=14" \
    + " --conf spark.executor.memoryOverhead=2000" \
    + " --conf spark.yarn.maxAppAttempts=1" \
    + " --packages org.apache.kudu:kudu-spark2_2.11:1.6.0 "
command += (" --files {4},{1},{5},{7}"
            " --conf spark.executor.extraJavaOptions='-Dlog4j.configuration={6} -Duser.timezone=UTC+8'"
            " --conf spark.driver.extraJavaOptions='-Dlog4j.configuration={6} -Duser.timezone=UTC+8'"
            " {0} {3} {2}").format(PROCESS_HANDLER_FILE_PATH, CONFIG_FILE_PATH, job_name,
                                   os.path.basename(CONFIG_FILE_PATH), process_csv, log4j_file,
                                   os.path.basename(log4j_file), module_base_table_path)
===========================================================================

After submitting the above command, I could see the params being set properly in the Spark properties in YARN. The lines below are copied from the Spark properties while the job is running:

spark.executor.extraJavaOptions  -Dlog4j.configuration=spark2_log4j.properties -Duser.timezone=UTC+8
spark.driver.extraJavaOptions    -Dlog4j.configuration=spark2_log4j.properties -Duser.timezone=UTC+8

Appreciate your response.
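One workaround that avoids depending on -Duser.timezone at all is to shift the timestamp values explicitly in Spark before writing to Parquet. The sketch below is only a hypothesis, assuming the 8-hour offset comes from correct UTC instants being re-read as UTC+8 local time; df and lastupdateddate stand in for the actual DataFrame and timestamp column:
===========================================================================
from pyspark.sql import functions as F

# Hypothetical column name: render the stored UTC instants in a UTC+8 zone
# so readers that treat them as local wall-clock time line up.
# Use to_utc_timestamp instead if the skew runs in the other direction.
df_adjusted = df.withColumn(
    'lastupdateddate',
    F.from_utc_timestamp(F.col('lastupdateddate'), 'Asia/Shanghai'))
df_adjusted.write.format('parquet').mode('overwrite') \
    .saveAsTable('db_name.kudu_table_name')
===========================================================================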
08-01-2019 01:33 PM
Hello,
I am trying to load tables from Kudu to HDFS using Spark2, and I have noticed that the timestamps are off by 8 hours between Kudu and HDFS.
df = (spark_session.read.format('org.apache.kudu.spark.kudu')
      .option('kudu.master', 'dcaldd163:7051,dcaldd162:7051,dcaldd161:7051')
      .option('kudu.table', 'impala::DB.kudu_table_name')
      .load())
df.write.format('parquet').mode('overwrite').saveAsTable('db_name.kudu_table_name')
I have tried setting the timezone locally for the session in Spark2, but it still does not solve the issue.
Can someone give a hint on how to solve this?
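For reference, "setting the timezone locally for the session" in Spark 2.3 would look roughly like the line below. This conf has existed since Spark 2.2, but it mainly affects SQL datetime functions and casts, so it plausibly leaves the epoch values returned by kudu-spark untouched, which would match the behaviour described above:
===========================================================================
# Session-level timezone; affects SQL datetime functions and casts,
# not the raw epoch values kudu-spark hands back.
spark_session.conf.set('spark.sql.session.timeZone', 'UTC')
===========================================================================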
Labels:
- Apache Kudu