Issue with String to Timestamp Conversion in Hive and Spark Resulting in Unexpected Hour Changes

Explorer

We are experiencing an issue with converting data from string to timestamp in Hive, where the timestamps undergo unexpected hour changes during conversion and processing.

We would like to verify whether this is a bug in the tool or whether additional configuration is required.

Process and Steps to Reproduce the Issue (a HiveQL sketch of these steps follows the list):

1. Loading Data into HDFS: We use the hdfs dfs -put command to load a positional (fixed-width) file into HDFS, into a directory that is read by the external table ex_sa_beneficios_mantidos, where the data is kept as strings.

2. We create an external table in Hive to access the data as strings.

3. Converting String to Timestamp and Writing to ORC Table: When converting the string to timestamp, we notice that the hour component of the dt_afast_trabalho field changes.

4. Verifying the Conversion and Data Reading: When reading the data back from the ORC table, the hours in the dt_afast_trabalho field are different from what we expect.

5. Counting Timestamps by Hour: We performed a count of the timestamps grouped by hour to verify the changes.
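For reference, here is a minimal HiveQL sketch of steps 1 to 4. The HDFS path, the ORC table name, and the column layout are placeholders; only ex_sa_beneficios_mantidos and dt_afast_trabalho are the real names from our environment, and the actual external table parses a positional (fixed-width) file rather than the simplified layout shown here.

-- Step 1 (shell): load the positional file into HDFS
--   hdfs dfs -put beneficios_mantidos.txt /dados/datalake/beneficios_mantidos/

-- Step 2: external table exposing the raw data as strings
CREATE EXTERNAL TABLE ex_sa_beneficios_mantidos (
  dt_afast_trabalho STRING
  -- other columns omitted
)
LOCATION '/dados/datalake/beneficios_mantidos';

-- Step 3: convert the string to TIMESTAMP while writing to an ORC table
CREATE TABLE tb_beneficios_mantidos_orc STORED AS ORC AS
SELECT CAST(dt_afast_trabalho AS TIMESTAMP) AS dt_afast_trabalho
FROM ex_sa_beneficios_mantidos;

-- Step 4: read the ORC table back and compare the hours with the source strings
SELECT dt_afast_trabalho FROM tb_beneficios_mantidos_orc LIMIT 10;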

Spark Configuration in Zeppelin: When initializing Hive in Zeppelin, we used the following parameters:
%livy.spark
from pyspark_llap import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()
spark.conf.set("spark.datasource.hive.warehouse.read.mode", value="SECURE_ACCESS")
spark.conf.set("spark.datasource.hive.warehouse.load.staging.dir", value="hdfs://pcdpclusterdatalakennha:8020/dados/datalake/zona_temporaria/dadosprovisorios/hwc")
spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation", value="true")
spark.conf.set("spark.sql.session.timeZone", "America/Sao_Paulo")

Expected Behavior: We expect the hour values of the dt_afast_trabalho field to remain consistent throughout the conversion and processing stages, respecting the America/Sao_Paulo timezone.

Current Behavior: We observe that the hour values of the dt_afast_trabalho field are altered during the string-to-timestamp conversion and subsequent processing in Hive and Spark. (Screenshots of the query results are attached.)
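The count by hour mentioned in step 5 was a query roughly along these lines (the ORC table name is the same placeholder as above):

SELECT hour(dt_afast_trabalho) AS hora, COUNT(*) AS qtd
FROM tb_beneficios_mantidos_orc
GROUP BY hour(dt_afast_trabalho)
ORDER BY hora;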


Community Manager

@evanle96, Welcome to our community! To help you get the best possible answer, I have tagged our Hive/Spark experts @RangaReddy @asish @smruti  who may assist you further.

Please feel free to provide any additional information or details about your query, and we hope that you will find a satisfactory solution to your question.



Regards,

Vidya Sargur,
Community Manager



Master Collaborator

@evanle96 Hive treats the timestamp data as being in UTC, and when we read it, it is converted from UTC to local time. But I understand that the actual data is already in the America/Sao_Paulo time zone. It would be interesting to know how you transform the epoch data, stored as strings, into timestamps in the ORC table.

from_utc_timestamp(to_utc_timestamp(dt_afast_trabalho, 'America/Sao_Paulo'), 'America/Sao_Paulo') AS sao_paulo_time_corrected

Can we try reading with the above transformation and see if the hour field shows up as expected?
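For example, something along these lines (the ORC table name is just a placeholder):

SELECT
  dt_afast_trabalho,
  from_utc_timestamp(to_utc_timestamp(dt_afast_trabalho, 'America/Sao_Paulo'), 'America/Sao_Paulo') AS sao_paulo_time_corrected
FROM tb_beneficios_mantidos_orc
LIMIT 10;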

Explorer

Hello @smruti 

Here's a printout of a query against the source table, where the data is in string format; I'll also provide a query against the ORC table.

(Screenshot of the query output attached.)

 

Explorer

@smruti 

I've validated it with the user, and here's what happens.

The date field comes in as a string containing only the date portion ("yyyy-mm-dd"), and it was being converted to timestamp, which adds the time-of-day fields; that is where the divergence comes from.

Is there anything in the tool that can be done to standardize this or do I need to change the data engineering?
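For illustration (a sketch only, using the same placeholder names as above): because the string is date-only, casting it to TIMESTAMP fills in a 00:00:00 time component, and it is that added time-of-day which later shows up shifted; if the field never needs a time of day, casting it to DATE instead is one way to keep it stable, assuming that fits the downstream processing.

-- a date-only string cast to TIMESTAMP gains a midnight time component
SELECT CAST('2024-03-01' AS TIMESTAMP);   -- 2024-03-01 00:00:00

-- keeping the field as DATE carries no time-of-day at all
SELECT CAST(dt_afast_trabalho AS DATE) AS dt_afast_trabalho
FROM ex_sa_beneficios_mantidos
LIMIT 10;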