Created on 07-01-2019 08:24 AM - edited 09-16-2022 07:29 AM
another case to think about!
I created a table in HIVE as a textfile. When I query it, it looks fine for all records.
Next, in Impala, I use the INVALIDATE METADATA statement and afterwards , query the table. Now Impala shows me for some records a question mark as if there are special characters for a couple of records (�). I notice the data in these fields can not be used anywhere else in following steps (instead of reading values, he will complain the values are not valid).
When I examine the textfile on hdfs (through some text editor like sublime with UTF-8 encoding ), I see no special characters and all characters encountered, look as expected.
As said, invalidate metadata nor refresh fixes the issue but after a restart of the impala services, the data is available as expected in impala.
Currently we create the table as text file and get the behavior described above. Before we created the table as a parquet file, but then got the error :
File 'hdfs://ourcluster/user/hive/warehouse/tmp.db/thetable/000000_0' has an invalid version number: <some value> This could be due to stale metadata. Try running "refresh tmp.thetable".
Note that this <some value> would always be something that comes from the data (a part of a string that is in the data). The refresh would not fix it, (and as said we already do an invalidate metadata). Note that when we restart the impala service, this error goes away and the data can be queried. The files then seem to be "uncorrupted". I have read a similar post elsewhere that suggests the data would be corrupted when one encounters this error.
Note: we use a collect_set function to create the field that gives the problem during the creation of the table in HIVE.Our current trail of thought is that in some cases (15 out of several million) this gives problematic results but what happens exactly is not understood.
Thanks for any input!
I noticed my title is wrong (did not find the edit button )- it should be :
Thank you for your interest!
We are using cdh 6.2 , impala 3.2.0-cdh6.2.0, HIVE Hive 2.1.1-cdh6.2.0.
Some updated info on the case above: I notice the files are not corrupted.
So the files are created in some table in HIVE, but when queried he shows sometimes special characters.
I have also encountered that he shows a concatenation of 2 lines that are not related to each other, on one line. I could even trace that he was getting one line from 1 of the files that make up the table, and incorrectly combined it with the end of another line that was even in a different file!
Situation like this:
file 1 contains id: some_id_1, data1, data2, data3
file 10 contains: some_id_2, otherdata1,otherdata2,otherdata3
SELECT * FROM <problematictable> WHERE id='some_id_1'
should return --> some_id_1, data1, data2, data3
some_id_1, data1, data2herdata2,otherdata3
When I restart impala services, and the table is queried, it shows the results as expected.
When you create a new external table with a different location, and cp the files to that location, and query this new table , the results are as expected.
It might have to do with the metadata store? Maybe he has problems to know where he needs to retrieve the data? And after a restart of the services, everything is flushed and he does this correctly