
"Error parsing row: file" Table consists of multiple csv files

New Contributor

Dear community,

 

I have created a new table by uploading a CSV file (including a header; the file contains data for one specific month) to HDFS via Hue. Afterwards I cleared the cache and uploaded the other CSV files (all subsequent CSV files have the same column order BUT NO HEADER; average size of each monthly CSV file: ~2-4 GB; number of columns: 54).

Typical procedure after uploading a new csv file to the database: INVALIDATE METADATA database_xy

 

When I run a query that selects every column, I get the following error messages in Impala:

Error converting column: 6 to TIMESTAMP
Error converting column: 8 to TIMESTAMP
Error converting column: 23 to TIMESTAMP
Error converting column: 50 to TIMESTAMP
Error converting column: 35 to TIMESTAMP
Error converting column: 43 to TIMESTAMP

 

Data for these columns only becomes available after 4 months. Until then there are only NULL values.

 

Query to reproduce these error messages: 

 

SELECT *
FROM database_xy
LIMIT 100

 

For a specific TIMESTAMP column:

SELECT min(exp_date)
FROM database_xy

 

Errors (just a sample from the log box in Hue):

Error parsing row: file: hdfs://blabla/foo_042019.csv,
before offset: 2432696320
Error converting column: 21 to TIMESTAMP
Error parsing row: file: hdfs://blabla/foo_032019.csv,
before offset: 1895825408
Error converting column: 21 to TIMESTAMP
Error converting column: 21 to TIMESTAMP
Error parsing row: file: hdfs://blabla/foo_022019.csv,
before offset: 2969567232
Error converting column: 21 to TIMESTAMP
Error converting column: 21 to TIMESTAMP

 

 

When I run the queries in Hive I get no error messages at all. How come? And how do I get rid of those error messages in Impala?

 

 

Information about how I created the CSV files locally:

First CSV:

Python (pandas), with these options: separator: pipe; header=True (only for this first CSV); index=False (so there is no additional, useless index column)

 

Subsequent CSVs:

Python (pandas), with these options: separator: pipe; header=False; index=False (so there is no additional, useless index column)
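
For reference, a minimal sketch of the kind of pandas calls that match the options above (the DataFrame contents and file names are just hypothetical placeholders, not my real data):

import pandas as pd

# df stands for one month of data; a tiny dummy frame is used here
df = pd.DataFrame({"exp_date": ["2019-01-01"], "value": [1]})

# First CSV: pipe-separated, with a header row, no index column
df.to_csv("foo_012019.csv", sep="|", header=True, index=False)

# Subsequent monthly CSVs: same options, but without the header row
df.to_csv("foo_022019.csv", sep="|", header=False, index=False)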

 

When I created the table with the first CSV in Hue I selected the following options:

Field Separator: Pipe

Record Separator: New line

Quote Character: Double Quote

 

Afterwards I uploaded all the other CSVs into the database's folder to add the new months and invalidated the metadata.

 

Thank you for your help in advance! I hope you enjoyed the Christmas holidays and I wish you a happy New Year's Eve!

 

Best

somedatadude

3 REPLIES

Contributor

The error "Error converting column: 35 to TIMESTAMP" means there was an error when converting column 35 to the TIMESTAMP type.

 

The error "Error parsing row: file: hdfs://blabla/foo_042019.csv, before offset: 2432696320" means there was an error while parsing the row at file offset 2432696320, in the file foo_042019.csv.

 

So it looks like there are several rows in your dataset where certain fields cannot be converted to TIMESTAMPs. You should be able to open up the file, and seek to the specified offset to find the rows that are corrupted.
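
For example, a quick Python sketch for inspecting the data around one of the reported offsets, assuming you first pull a local copy of the file from HDFS (file name and offset are taken from the log above):

# Fetch the file locally first, e.g.: hdfs dfs -get /blabla/foo_042019.csv .
offset = 2432696320

with open("foo_042019.csv", "rb") as f:
    # Back up a little so the window starts before the reported position
    f.seek(max(offset - 1024, 0))
    window = f.read(4096)

# Print the raw lines around the offset; the malformed row should be among them
print(window.decode("utf-8", errors="replace"))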

 

I believe Hive does not throw an exception when given the same dataset; instead, it converts the corrupted rows to NULL. The same behavior can be emulated in Impala by setting 'abort_on_error=false'. However, be warned that setting this option can mask data corruption issues. See https://impala.apache.org/docs/build/html/topics/impala_abort_on_error.html for details.
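
If you want to set that option from Python, a rough sketch using the impyla client could look like the following (the host name is a placeholder, 21050 is Impala's default HiveServer2 port, and this assumes an Impala version that accepts SET statements from client sessions):

from impala.dbapi import connect

# Hypothetical coordinator host; adjust to your cluster
conn = connect(host="impalad.example.com", port=21050)
cur = conn.cursor()

# Do not abort on conversion errors; bad fields become NULL plus a warning
cur.execute("SET ABORT_ON_ERROR=FALSE")
cur.execute("SELECT min(exp_date) FROM database_xy")
print(cur.fetchall())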

New Contributor

Thank you very much for your answer, SahilTakiar. Could you tell me what offset means and how I can make Impala show me the specific row(s) causing the errors?

 

Your answer is very much appreciated! 🙂 Sorry if the question is simple. I am just new to HDFS.

 

Best

Contributor

Offset means the offset into the actual csv file. So in this case, that means the 2432696320th byte of the file foo_042019.csv. There are multiple tools that should allow you to open the file and seek to the desired offset. For example, you could open the file in vim and run

:goto 2432696320

which should move the cursor to the 2432696320th byte of the file, and thus to the offending row.