Member since: 12-30-2019
Posts: 2
Kudos Received: 0
Solutions: 0
01-28-2020 04:57 AM
Thank you very much for your answer, SahilTakiar. Could you tell me what "offset" means and how I can make Impala show me the specific row(s) causing the errors? Your answer is very much appreciated! 🙂 Sorry if the question is simple; I am new to HDFS. Best
12-30-2019 04:44 AM
Dear community,
I have created a new table by uploading a CSV file (incl. header; the file contains data for one specific month) to HDFS (via Hue). Afterwards I cleared the cache and uploaded the other CSV files (all following CSV files have the same column order BUT NO HEADER; average size of each monthly CSV file: ~2-4 GB; number of columns: 54).
Typical procedure after uploading a new csv file to the database: INVALIDATE METADATA database_xy
When I run a query that selects every column, I get the following error messages in Impala:
Error converting column: 6 to TIMESTAMP
Error converting column: 8 to TIMESTAMP
Error converting column: 23 to TIMESTAMP
Error converting column: 50 to TIMESTAMP
Error converting column: 35 to TIMESTAMP
Error converting column: 43 to TIMESTAMP
Data for these columns only becomes available after 4 months. Until then they contain only NULL values.
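To narrow down which values might be behind the conversion errors, a local copy of a file could be scanned roughly like this. It is only a sketch: the accepted formats in `ASSUMED_FORMATS` are my assumption about what a TIMESTAMP column expects, the helper names are made up, empty fields are skipped on the assumption that they are the NULL values, and I am not sure whether the column numbers in the error messages start at 0 or 1.

```python
import csv
from datetime import datetime

# Assumed timestamp layouts; these are not taken from the error log.
ASSUMED_FORMATS = ("%Y-%m-%d %H:%M:%S.%f", "%Y-%m-%d %H:%M:%S", "%Y-%m-%d")


def parses_as_timestamp(value):
    """Return True if `value` matches one of the assumed formats."""
    for fmt in ASSUMED_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            pass
    return False


def unparseable_values(path, column_index, limit=20):
    """Collect up to `limit` (line_number, value) pairs from one column
    whose values are non-empty and match none of the assumed formats."""
    found = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="|", quotechar='"')
        for row in reader:
            if column_index >= len(row):
                continue  # row is too short to have this column
            value = row[column_index]
            if value and not parses_as_timestamp(value):
                found.append((reader.line_num, value))
                if len(found) >= limit:
                    break
    return found
```

A header line that ended up in the data would show up here as well, since a column name does not parse as a timestamp.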
Query to reproduce these error messages:
SELECT * FROM database_xy
LIMIT 100
For a specific TIMESTAMP column:
SELECT min(exp_date) FROM database_xy
Error (Just a sample of the log box in Hue):
Error parsing row: file: hdfs://blabla/foo_042019.csv, before offset: 2432696320
Error converting column: 21 to TIMESTAMP
Error parsing row: file: hdfs://blabla/foo_032019.csv, before offset: 1895825408
Error converting column: 21 to TIMESTAMP
Error converting column: 21 to TIMESTAMP
Error parsing row: file: hdfs://blabla/foo_022019.csv, before offset: 2969567232
Error converting column: 21 to TIMESTAMP
Error converting column: 21 to TIMESTAMP
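I am not sure what "before offset" refers to. Assuming it is a byte position in the file and the affected rows lie somewhere before it, the lines just ahead of that position could be read from a local copy of the file with something like the sketch below. The function name `lines_before_offset` and the window size are made up for illustration.

```python
def lines_before_offset(path, offset, window=4096):
    """Return the complete lines found in the `window` bytes before `offset`.

    Assumption: `offset` is a byte position in the file, as the
    "before offset" wording in the log seems to suggest.
    """
    start = max(0, offset - window)
    with open(path, "rb") as f:
        f.seek(start)
        chunk = f.read(offset - start)

    lines = chunk.split(b"\n")
    if start > 0:
        # The window may begin in the middle of a line; drop that first piece.
        lines = lines[1:]
    if lines and not chunk.endswith(b"\n"):
        # The window ends in the middle of a line; drop that last piece.
        lines = lines[:-1]
    return [line.decode("utf-8", errors="replace") for line in lines if line]
```

For example, `lines_before_offset("foo_042019.csv", 2432696320)` would return the rows in the last 4 KB before the first reported position, given a local copy of that file.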
When I run the same queries in Hive I get no error messages at all. How come? And how do I get rid of those error messages in Impala?
Information about how I created the csv-files locally:
First CSV:
Python (Pandas): Set Options: Separator: Pipe, (only for first csv:) header=True, index=False (so there is no additional useless index column)
Subsequent CSVs:
Python (Pandas): Set Options: Separator: Pipe, header=False, index=False (so there is no additional useless index column)
When I created the table with the first CSV in Hue I selected the following options:
Field Separator: Pipe
Record Separator: New line
Quote Character: Double Quote
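As a sanity check on the files themselves, the same three settings can be applied locally to count the fields in every row. This is just a sketch that uses Python's `csv` module as a stand-in (it is not how Impala or Hive parse the files); the function name is made up, and the default of 54 is the column count mentioned above.

```python
import csv


def rows_with_wrong_field_count(path, expected_fields=54):
    """Yield (line_number, field_count) for every row whose number of
    fields differs from `expected_fields`.

    `line_number` is the source line on which the record ends.
    """
    with open(path, newline="", encoding="utf-8") as f:
        # Pipe as field separator, double quote as quote character,
        # newline as record separator (the csv module's default).
        reader = csv.reader(f, delimiter="|", quotechar='"')
        for row in reader:
            if len(row) != expected_fields:
                yield reader.line_num, len(row)
```

An empty result from `list(rows_with_wrong_field_count("foo_042019.csv"))` would suggest that at least the column count is consistent in that file.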
Afterwards I have uploaded all the other CSVs in the database's folder to add the new months and invalidated the metadata.
Thank you for your help in advance! I hope you enjoyed the Christmas holidays and I wish you a happy New Year's Eve!
Best
somedatadude
Labels:
- Apache Hive
- Apache Impala
- Cloudera Hue