I am trying to create a table in Impala (CDH 5.5), but it fails with the error message below. The Parquet file has many struct data types whose type definitions are more than 4000 characters long. I did not face this issue when using Hive, but surprisingly it came up with Impala. I would appreciate any solution.
[quickstart.cloudera:21000] > CREATE TABLE columns_from LIKE PARQUET '/projects/dps/test/parquet/test.parquet' STORED AS PARQUET;
Query: create TABLE columns_from LIKE PARQUET '/projects/dps/test/parquet/test.parquet' STORED AS PARQUET
ERROR: AnalysisException: Type of column 'segment1' exceeds maximum type length of 4000 characters:
I'm surprised you made it work with Hive because the Hive Metastore backend DB has a limit of 4000 characters on the column containing the type. Specifically, the TYPE_NAME column in the COLUMNS_V2 table has this limit.
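To see how close an existing table is to that limit, you can inspect the metastore backend DB directly. This is a sketch assuming a MySQL-backed metastore; `COLUMNS_V2` and `TYPE_NAME` are from the standard Hive metastore schema, where `TYPE_NAME` is defined as `VARCHAR(4000)`:

```sql
-- Run against the Hive metastore backend database (MySQL assumed).
-- Lists the longest stored column type names, so you can see which
-- columns are near or over the 4000-character limit.
SELECT COLUMN_NAME, LENGTH(TYPE_NAME) AS type_len
FROM COLUMNS_V2
ORDER BY type_len DESC
LIMIT 10;
```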
Do you mind posting the steps to make it work with Hive?
I created a Hive table pointing to an Avro schema, and that Avro schema has this particular nesting with columns totaling more than 4000 characters. It did not cause any issues when running queries against that Hive table. Since Impala supports complex data types only with Parquet, I went ahead and created a Parquet file with data in it, tried creating the table in Impala, and encountered this issue. Let me know if you have more questions.
Thanks for the explanation. I'm guessing that since you are creating a table from an Avro schema, some of the restrictions may not apply because the type name is not stored in the metastore backend (this is just a guess; I do not know for sure). When you DESCRIBE the original Avro table, does the column in question really have a type with more than 4000 characters? If you want to confirm or deny my suspicion, you could try creating an Avro table with CREATE TABLE (c1 type, c2 type, ...) STORED AS AVRO (without providing an Avro schema file) and see if Hive accepts that as well. Using SHOW CREATE TABLE in Impala or Hive might also help.
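The experiment described above might look like the following sketch; the table name, column names, and the nested type are all placeholders, not your actual schema:

```sql
-- Hypothetical sketch: declare the columns explicitly instead of
-- pointing the table at an Avro schema file, then see whether Hive
-- accepts the DDL or rejects the over-long type name.
CREATE TABLE avro_explicit_test (
  segment1 STRUCT<field1:STRING, field2:INT>,  -- placeholder nested type
  other_col STRING
)
STORED AS AVRO;

-- Compare what actually got recorded in the metastore:
SHOW CREATE TABLE avro_explicit_test;
DESCRIBE avro_explicit_test;
```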
I'm afraid there is currently no way to increase the 4000-character limit. It seems like splitting up the big type or shrinking the names is the only recourse for now.
If I try to create the table using a DDL with explicit column names in Hive, it fails, but if I use the Avro schema to define the table, it does not. I guess the only way forward for me now is to flatten out the schema. I was hopeful that Impala would be able to handle structs as long as the nesting has fewer than 100 columns, but it looks like there is a limit on the total length of the column names and types in the struct as well. Thanks for your help!
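Flattening in this context might look like the following sketch (table and column names are placeholders); each struct field becomes a top-level column, so no single type definition grows beyond the limit:

```sql
-- Nested version: the whole struct is stored as one TYPE_NAME entry,
-- which can exceed 4000 characters for wide structs.
CREATE TABLE nested_t (
  segment1 STRUCT<a:STRING, b:INT, c:STRING>
) STORED AS PARQUET;

-- Flattened version: each field is its own column, so every TYPE_NAME
-- entry stays short (e.g. STRING, INT).
CREATE TABLE flat_t (
  segment1_a STRING,
  segment1_b INT,
  segment1_c STRING
) STORED AS PARQUET;
```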
Unfortunately, that is correct. There's a character limit on the type name. It's good to hear your feedback on this issue, because it helps us work on increasing the default limit, or on finding a practical way to change it in future releases.
So I am encountering this even in Hive (CDH 5.5) when running a DDL that used to work with CDH 5.3.2 (where it was failing silently on the metastore). I am guessing I will have to restructure my schema to avoid this (keep the aggregated column names and types of each struct under 4000 bytes). The problem is that much of my Java code operating on this schema has already been written. What would be the best way to reorganize the schema? I would really appreciate some pointers.
We hit the same problem. We have a nested schema whose type length exceeds 4000 characters. Has there been any progress so far?
@Nishan how are you dealing with it now?
There is a ticket for Hive here. As a workaround, someone suggested "[to] manually alter the metastore database schema, increasing the column width or changing the column type". Have you tried that?
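For reference, that workaround would amount to something like the following sketch, assuming a MySQL-backed metastore. This is an unsupported, at-your-own-risk change to the metastore schema (and the exact target type is a guess; the ticket only says to widen or retype the column):

```sql
-- Widen TYPE_NAME in the metastore backend DB (MySQL syntax assumed).
-- Back up the metastore database before attempting this.
ALTER TABLE COLUMNS_V2 MODIFY TYPE_NAME MEDIUMTEXT;
```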
We also suspect that it might be implemented in the Impala client itself as some kind of validation, independent of Hive.
It's correct that Impala does its own validation, so altering only the metastore backend DB will not help for Impala. If you want this issue prioritized, please leave a short note on the Impala JIRA. I agree it's an annoying limitation.