Member since: 11-10-2016
Posts: 6
Kudos Received: 1
Solutions: 0
12-21-2016
07:51 AM
Thanks! Only one question comes to my mind with your response: do you store the dates as strings with an added time zone column and then perform the transformations at query time, or do you store dates & timestamps in the Parquet TIMESTAMP type? -------- It depends on whether you would prefer your timestamps to be converted to UTC (knowing the HiveServer would need to reside in the time zone where the data was generated) or left alone. I personally prefer not modifying data automatically and controlling any time zone corrections within my ETL process. I also prefer to include a time zone column specifying the time zone the data originated in, to be able to do any corrections later.
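As a minimal sketch of the string-plus-time-zone approach described above (table and column names are hypothetical, not from the thread), the data stays exactly as written and any correction happens explicitly in ETL or at query time:

```sql
-- Hypothetical table: timestamps kept as plain strings, untouched by the engine,
-- with a column recording the time zone the data originated in.
CREATE TABLE events (
  event_id   BIGINT,
  event_time STRING,   -- e.g. '2016-12-21 07:51:00', stored exactly as generated
  event_tz   STRING    -- e.g. 'Europe/Madrid', origin zone kept for later corrections
)
STORED AS PARQUET;

-- Correction applied explicitly at query time (or in an ETL step),
-- rather than letting Hive or Impala convert behind your back:
SELECT event_id,
       to_utc_timestamp(CAST(event_time AS TIMESTAMP), event_tz) AS event_time_utc
FROM events;
```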
12-13-2016
11:16 AM
Thanks! I have run some tests and have definitely seen how the Lineage Graph works.
12-12-2016
10:19 AM
Thanks, Mark, for your reply. I was really asking about the method Cloudera Navigator uses to generate the lineage among different actions. My understanding is that Cloudera Navigator links each action (Hive, Spark) by examining logs and what each action writes to and reads from HDFS, but I am not sure. Regards,
12-12-2016
10:11 AM
Hi Cloudera people, I am really surprised reading this documentation about Parquet files, the TIMESTAMP type, and Hive vs. Impala: https://www.cloudera.com/documentation/enterprise/5-6-x/topics/impala_timestamp.html. Above all because of these two things:

When the table uses Parquet format, Impala expects any time zone adjustment to be applied prior to writing, while TIMESTAMP values written by Hive are adjusted to be in the UTC time zone. When Hive queries Parquet data files that it wrote, it adjusts the TIMESTAMP values back to the local time zone, while Impala does no conversion. Hive does no time zone conversion when it queries Impala-written Parquet files. If you have data files written by Hive, those TIMESTAMP values represent the local time zone of the host where the data was written, potentially leading to inconsistent results when processed by Impala. To avoid compatibility problems or having to code workarounds, you can specify one or both of these impalad startup flags: -use_local_tz_for_unix_timestamp_conversions=true and -convert_legacy_hive_parquet_utc_timestamps=true. Although -convert_legacy_hive_parquet_utc_timestamps is turned off by default to avoid performance overhead, Cloudera recommends turning it on when processing TIMESTAMP columns in Parquet files written by Hive, to avoid unexpected behavior.

Considering the previous link, I would like to ask your opinion: what is the best way to write a timestamp value (in a Parquet file) through Hive and/or Spark so that it can be queried with Hive, Impala, and Spark?

- Writing the file using Hive and/or Spark and suffering the performance problem derived from setting these two properties: -use_local_tz_for_unix_timestamp_conversions=true and -convert_legacy_hive_parquet_utc_timestamps=true.
- Writing the file using Impala (preparing the table with Hive or Spark beforehand to avoid complex queries in Impala), since "Hive does no time zone conversion when it queries Impala-written Parquet files."
- Another approach? (See the sketch below for one possibility.)

Regards,
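One pattern that sidesteps the engine-specific conversions (a sketch only, with hypothetical table names and an assumed origin zone, not a recommendation from the linked page) is to normalize to UTC yourself during the Hive/Spark write, so no reader has to apply a correction:

```sql
-- Hypothetical tables: a staging table holding local wall-clock timestamps
-- and a Parquet target table holding explicitly normalized UTC values.
CREATE TABLE staging_events (event_id BIGINT, event_time_local TIMESTAMP);
CREATE TABLE events_parquet (event_id BIGINT, event_time_utc TIMESTAMP)
STORED AS PARQUET;

-- ETL step: convert the local value to UTC explicitly before it lands in Parquet,
-- so Hive, Impala, and Spark all read the same instant without relying on the
-- impalad conversion flags. 'Europe/Madrid' is an assumed origin time zone.
INSERT INTO TABLE events_parquet
SELECT event_id,
       to_utc_timestamp(event_time_local, 'Europe/Madrid') AS event_time_utc
FROM staging_events;
```

Whether this is preferable to enabling -convert_legacy_hive_parquet_utc_timestamps on the impalad side depends on how much control you have over the write path.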
Labels:
- Apache Hive
- Apache Impala
- Apache Spark
11-10-2016
08:17 AM
How does the lineage diagram of Cloudera Navigator work? I mean, does Cloudera Navigator draw the diagram because Oozie groups a set of actions (Hive, Pig, MapReduce, and so on), or is CN able to draw the diagram if you submit independent, sequential jobs that you execute, for example, manually?
Labels:
- Cloudera Navigator