Some context: I was using an old CDH version (6.3.2), and I am now migrating my NiFi flows to a new CDP CFM Data Hub.
Generally, my use cases follow this process:
- Download a file (csv)
- Convert it to Avro (inferring the schema, as I do not know it in advance)
- Convert it to ORC
- Create a Hive external table on top of the ORC file.
With the CDP cluster, I am facing the following issue:
Some columns contain mixed types (e.g. INT and STRING), and the following error is raised:
Unknown field type: uniontype<int,string>
I tried to write a custom script to manually modify the schema, but after the conversion to ORC my table schema does not match the underlying data, because the ORC conversion tries to infer the schema again.
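To illustrate where `uniontype<int,string>` most likely comes from, here is a minimal Python sketch. It assumes that schema inference declared the mixed column as an Avro union of `int` and `string`. The helper `avro_to_hive_type` is hypothetical and simplified; it is not NiFi or Hive code, it only mimics the usual convention where a union with `null` becomes a nullable column and any other union becomes a Hive `uniontype`.

```python
# Hypothetical, simplified mapping from Avro types to Hive type strings.
# It only covers primitives and unions, which is enough to show the problem.
PRIMITIVES = {
    "int": "int",
    "long": "bigint",
    "float": "float",
    "double": "double",
    "boolean": "boolean",
    "string": "string",
    "bytes": "binary",
}


def avro_to_hive_type(avro_type):
    """Return a Hive type string for a primitive Avro type or an Avro union."""
    if isinstance(avro_type, str):
        return PRIMITIVES[avro_type]
    if isinstance(avro_type, list):
        # "null" in a union only marks the column as nullable.
        branches = [t for t in avro_type if t != "null"]
        if len(branches) == 1:
            return avro_to_hive_type(branches[0])
        # Several non-null branches: this is the mixed-type case.
        inner = ",".join(avro_to_hive_type(t) for t in branches)
        return "uniontype<" + inner + ">"
    raise ValueError("unsupported Avro type in this sketch: %r" % (avro_type,))


print(avro_to_hive_type(["null", "int"]))
print(avro_to_hive_type(["int", "string"]))
```

Under that assumption, a column with a single non-null branch keeps its type, while the mixed column ends up as a union, which is the type the DDL step rejects.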
The only way I found to load data into Hive through NiFi is to:
- Download a file (csv)
- Convert it to Avro (without inferring the schema, using string for every column)
- Convert it to Parquet
- Create a Hive external table on top of the Parquet file.
Thus I am losing a lot of information by converting all my data columns to string.
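What I would ideally like is something in between: keep the inferred type where a column is consistent, and fall back to string only for the mixed columns. A minimal Python sketch of that idea follows. It is not a NiFi processor; the function names and the sample data are hypothetical, and the type names are just labels chosen for illustration.

```python
import csv
import io


def classify(value):
    """Return the narrowest type label for one CSV cell, or None if it is empty."""
    if value is None or value == "":
        return None
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "double"
    except ValueError:
        return "string"


def widen(current, new):
    """Combine two type labels into one that can hold both."""
    if current is None:
        return new
    if new is None or current == new:
        return current
    if {current, new} == {"int", "double"}:
        return "double"
    # Any other mix (for example int and string) falls back to string.
    return "string"


def infer_column_types(csv_text):
    """Infer one type per column, using string only where values are mixed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    types = {}
    for row in reader:
        for name, value in row.items():
            types[name] = widen(types.get(name), classify(value))
    # Columns with no usable values default to string.
    return {name: (t or "string") for name, t in types.items()}


sample = "id,mixedtype,price\n1,42,1.5\n2,abc,2\n"
print(infer_column_types(sample))
```

With this sample, only `mixedtype` is downgraded to string, while `id` and `price` keep numeric types, so no union type is ever produced.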
From the doc you mentioned: "Apache ORC (Optimized Row Columnar) format from Spark application".
So that applies only to Spark; is there anything related to Spark in the question?
ORC is the native format of Hive, so it is certainly supported.
Hi @asish,
I have uploaded the NiFi flow definition so you can reproduce my issue. (It is a .txt file because I can't upload files with a .json extension; just change the extension.)
It generates a CSV:
It then converts the CSV to Parquet (V1). The issue comes from the UpdateHive processor:
Error while compiling statement: FAILED: Execution Error, return code 40000 from org.apache.hadoop.hive.ql.ddl.DDLTask. java.lang.UnsupportedOperationException: Unknown field type: uniontype<int,string>
The issue does not appear in NiFi when I convert to another format (Avro, for example), but it does appear when I try to read the data back from the table (from Hue):
AnalysisException: Could not load table open_data.test from catalog CAUSED BY: TableLoadingException: Could not load table open_data.test from catalog CAUSED BY: TException: TGetPartialCatalogObjectResponse(status:TStatus(status_code:GENERAL, error_msgs:[TableLoadingException: Unsupported type 'uniontype' in column 'mixedtype' of table 'test']), lookup_status:OK)
Currently, I am on the latest version of NiFi CFM in CDP: 7.2.11 - Flow Management Light Duty with Apache NiFi, Apache NiFi Registry
Thank you for your help,