Created on 12-30-2019 04:44 AM - last edited on 12-30-2019 07:08 AM by VidyaSargur
Dear community,
I have created a new table by uploading a CSV file (incl. header; the CSV file contains data for a specific month) to HDFS (via Hue). Afterwards I cleared the cache and uploaded the other CSV files (all following CSV files have the same column order BUT NO HEADER; average size of each monthly CSV file: ~2-4 GB; number of columns: 54).
Typical procedure after uploading a new CSV file to the database: INVALIDATE METADATA database_xy
When I run a query that displays every column, I get the following error messages in Impala:
Error converting column: 6 to TIMESTAMP
Error converting column: 8 to TIMESTAMP
Error converting column: 23 to TIMESTAMP
Error converting column: 50 to TIMESTAMP
Error converting column: 35 to TIMESTAMP
Error converting column: 43 to TIMESTAMP
Information for these columns only becomes available after 4 months. Until then there are only NULL values.
Query to reproduce these error messages:
SELECT *
FROM database_xy
LIMIT 100
For a specific TIMESTAMP column:
SELECT min(exp_date)
FROM database_xy
Error (just a sample of the log box in Hue):
Error parsing row: file: hdfs://blabla/foo_042019.csv,
before offset: 2432696320
Error converting column: 21 to TIMESTAMP
Error parsing row: file: hdfs://blabla/foo_032019.csv,
before offset: 1895825408
Error converting column: 21 to TIMESTAMP
Error converting column: 21 to TIMESTAMP
Error parsing row: file: hdfs://blabla/foo_022019.csv,
before offset: 2969567232
Error converting column: 21 to TIMESTAMP
Error converting column: 21 to TIMESTAMP
When I run the queries in Hive I get no error messages at all. How come? And how do I get rid of those error messages in Impala?
Information about how I created the CSV files locally:
First CSV:
Python (Pandas): options: separator: pipe, header=True, index=False (so there is no additional, useless index column)
Subsequent CSVs:
Python (Pandas): options: separator: pipe, header=False, index=False (so there is no additional, useless index column); see the sketch below.
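Roughly, the export calls look like the following sketch (the DataFrame df, its columns, and the file names are placeholders, not the real data):

import pandas as pd

# Hypothetical monthly DataFrame; the real one has 54 columns.
df = pd.DataFrame({"id": [1, 2], "exp_date": ["2019-04-01 00:00:00", None]})

# First month: write the header row so Hue can pick up the column names.
df.to_csv("foo_012019.csv", sep="|", header=True, index=False)

# Subsequent months: same column order, but no header row, so the header
# line is not read back later as a data row.
df.to_csv("foo_042019.csv", sep="|", header=False, index=False)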
When I created the table with the first CSV in Hue I selected the following options:
Field Separator: Pipe
Record Separator: New line
Quote Character: Double Quote
Afterwards I uploaded all the other CSVs into the database's folder to add the new months and invalidated the metadata.
Thank you for your help in advance! I hope you enjoyed the Christmas holidays and I wish you a happy New Year's Eve!
Best
somedatadude
Created 01-17-2020 03:48 PM
The error "Error converting column: 35 to TIMESTAMP" means there was an error when converting column 35 to the TIMESTAMP type.
The error "Error parsing row: file: hdfs://blabla/foo_042019.csv, before offset: 2432696320" means there was an error while parsing the row at file offset 2432696320, in the file foo_042019.csv.
So it looks like there are several rows in your dataset where certain fields cannot be converted to TIMESTAMPs. You should be able to open the file and seek to the specified offset to find the corrupted rows.
I believe Hive does not throw an exception when given the same dataset; instead, it converts the corrupted values to NULL. The same behavior can be emulated in Impala by setting 'abort_on_error=false'. However, be warned that setting this option can mask data corruption issues. See https://impala.apache.org/docs/build/html/topics/impala_abort_on_error.html for details.
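The option can be set per session, e.g. by running SET ABORT_ON_ERROR=FALSE in impala-shell or the Hue editor before the query. A rough sketch of the same idea from Python, assuming the impyla client library (host, port and the query are placeholders):

from impala.dbapi import connect  # impyla, assumed to be installed

# Placeholder connection details; point this at one of your impalad hosts.
conn = connect(host="impalad.example.com", port=21050)
cur = conn.cursor()

# With abort_on_error=false, values that fail TIMESTAMP conversion become
# NULL and are reported as warnings instead of failing the whole query.
cur.execute("SET ABORT_ON_ERROR=FALSE")
cur.execute("SELECT min(exp_date) FROM database_xy")
print(cur.fetchall())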
Created 01-28-2020 04:57 AM
Thank you very much for your answer, SahilTakiar. Could you tell me what offset means and how I can make Impala show me the specific row(s) causing the errors?
Your answer is very much appreciated! 🙂 Sorry if the question is simple. I am just new to HDFS.
Best
Created 01-28-2020 07:59 AM
Offset means the offset into the actual CSV file. So in this case, that means the 2432696320th byte of the file foo_042019.csv. There are multiple tools that allow you to open the file and seek to the desired offset. For example, you could open the file in vim and run
:goto 2432696320
which should move the cursor to the 2432696320th byte of the file, and thus to the offending row.
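If you prefer a scripted check, here is a small Python sketch that reads the bytes around the reported offset; it assumes you first copied the file out of HDFS to a local path (e.g. with hdfs dfs -get), and the file name and offset are the ones from your error message:

# Inspect the bytes around the reported offset in a local copy of the file.
offset = 2432696320

with open("foo_042019.csv", "rb") as f:
    f.seek(max(offset - 1024, 0))   # back up a bit for context
    chunk = f.read(4096)            # read a few KB around the offset

# Print the surrounding lines; the malformed row should be among them.
for line in chunk.split(b"\n"):
    print(line.decode("utf-8", errors="replace"))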