Created 08-02-2024 07:49 AM
We are experiencing an issue with converting data from string to timestamp in Hive, where the timestamps undergo unexpected hour changes during conversion and processing.
We would like to verify whether this is a bug in the tool or whether additional configuration is required.
Steps to reproduce the issue:
1. Loading data into HDFS: we use the `hdfs dfs -put` command to load a positional (fixed-width) file into HDFS, in a directory accessed by the external table ex_sa_beneficios_mantidos, where the data is in string format.
2. We create an external table in Hive to access the data as strings.
3. Converting string to timestamp and writing to an ORC table: when converting the string to timestamp, we notice an alteration in the hour values of the dt_afast_trabalho field.
4. Verifying the conversion and reading the data: when reading the data back from the ORC table, the hours in the dt_afast_trabalho field differ from what we expect.
5. Counting timestamps by hour: we performed a count of the timestamps grouped by hour to verify the changes.

Spark configuration in Zeppelin: when initializing Hive in Zeppelin, we used the following parameters:
Expected Behavior: We expect the hour values of the dt_afast_trabalho field to remain consistent throughout the conversion and processing stages, respecting the America/Sao_Paulo timezone.
Current Behavior: We observe that the hour values of the dt_afast_trabalho field are being altered during the process of converting from string to timestamp and subsequent processing in Hive and Spark.
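The kind of hour shift described above can be reproduced outside Hive. The sketch below is purely illustrative (hypothetical values, not the actual cluster data): it shows what happens when a wall-clock string is interpreted as UTC on write but rendered in America/Sao_Paulo on read.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

raw = "2024-01-15 08:00:00"  # example string value, as loaded into HDFS
naive = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

# Writer treats the wall-clock value as UTC:
as_utc = naive.replace(tzinfo=ZoneInfo("UTC"))

# Reader renders it in the session time zone (America/Sao_Paulo is UTC-3):
local = as_utc.astimezone(ZoneInfo("America/Sao_Paulo"))
print(local.hour)  # 5, not 8: a 3-hour alteration like the one observed
```

If the hour deltas seen in the count-by-hour check cluster around 3 hours, a UTC-vs-local interpretation mismatch of this kind is a likely cause.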
Created 08-05-2024 02:59 AM
@evanle96, Welcome to our community! To help you get the best possible answer, I have tagged our Hive/Spark experts @RangaReddy @asish @smruti who may assist you further.
Please feel free to provide any additional information or details about your query, and we hope that you will find a satisfactory solution to your question.
Regards,
Vidya Sargur
Created 08-05-2024 03:31 AM
@evanle96 Hive treats the timestamp data as being in UTC, and when we read it, it converts it from UTC to local time. But I understand that the actual data is already in the America/Sao_Paulo time zone. It would be interesting to know how you transform the epoch data in string format to a timestamp in ORC.
from_utc_timestamp(to_utc_timestamp(dt_afast_trabalho, 'America/Sao_Paulo'), 'America/Sao_Paulo') AS sao_paulo_time_corrected
Can we try reading with the above transformation and see if the hour field shows up as expected?
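The semantics of that double conversion can be sketched outside Hive. The Python helpers below are simplified models of Hive's `to_utc_timestamp`/`from_utc_timestamp` (not the actual implementation): the first interprets a wall-clock value in the given zone and shifts it to UTC, the second does the reverse, so composing them with the same zone is a wall-clock identity.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

SP = ZoneInfo("America/Sao_Paulo")
UTC = ZoneInfo("UTC")

def to_utc_timestamp(wall: datetime, tz) -> datetime:
    # Model of Hive's to_utc_timestamp: read the wall clock in tz, shift to UTC
    return wall.replace(tzinfo=tz).astimezone(UTC).replace(tzinfo=None)

def from_utc_timestamp(wall: datetime, tz) -> datetime:
    # Model of Hive's from_utc_timestamp: read the wall clock as UTC, shift to tz
    return wall.replace(tzinfo=UTC).astimezone(tz).replace(tzinfo=None)

x = datetime(2024, 1, 15, 8, 0, 0)
roundtrip = from_utc_timestamp(to_utc_timestamp(x, SP), SP)
print(roundtrip == x)  # True: the pair cancels out on the wall-clock value
```

So if the hours still come out wrong after this transformation, the shift is being introduced at a different stage (for example at write time, before the read-side correction runs).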
Created 08-08-2024 08:38 AM
Hello @smruti
Here's a printout of a query against the source table in string format; I'll also provide a query against the ORC table.
Created 08-09-2024 11:03 AM
I've validated it with the user, and here's what happens: the date field arrives as a string carrying only the date portion ("yyyy-MM-dd"), with no time-of-day fields. Converting it to timestamp adds a synthetic time component, and that's where the divergence comes from.
Is there anything in the tool that can standardize this, or do we need to change the data engineering pipeline?
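The failure mode for a date-only field can be sketched as follows (illustrative Python, hypothetical values): casting a "yyyy-MM-dd" string to a timestamp fills in midnight, and if that midnight is later reinterpreted across time zones, the value can even slide into the previous day. One common way to avoid this is to keep date-only fields typed as DATE rather than TIMESTAMP.

```python
from datetime import date, datetime
from zoneinfo import ZoneInfo

raw = "2024-01-15"  # date-only source value, like dt_afast_trabalho

# Casting a date-only string to a timestamp invents a time component:
ts = datetime.strptime(raw, "%Y-%m-%d")
print(ts.time())  # 00:00:00 (midnight filled in)

# If that midnight is then treated as UTC and rendered in America/Sao_Paulo,
# it lands at 21:00 on the *previous* day:
shifted = ts.replace(tzinfo=ZoneInfo("UTC")).astimezone(ZoneInfo("America/Sao_Paulo"))
print(shifted.day, shifted.hour)  # 14 21

# A plain date carries no time component, so there is nothing to shift:
d = date.fromisoformat(raw)
print(d)  # 2024-01-15
```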