Created 05-15-2018 02:03 PM
I'm attempting to use Nifi to take data from an Oracle database, and then write it as Parquet files to HDFS, from which I'll then build tables for use in Impala.
I've setup a QueryDatabaseTable processor and linked it up to PutParquet. By default everything works, but all the values are Strings rather than having any dates, ints, doubles etc. After some quick research I realised I need to set the "Use Avro Logical Types" to true in the QueryDatabaseTable, so I did this.
However, when I look at the Avro output from a simple test query, it doesn't look right. Running it through avro-tools tojson I get:
{"TRADEID":{"string":"71416037"},"SPREAD":{"bytes":";"}}
Why would bytes be showing ";"? I'm expecting a value of 58.75 here.
The schema looks ok:
{ "type" : "record", "name" : "CDS_CTC", "namespace" : "VOL_DATA", "fields" : [ { "name" : "TRADEID", "type" : [ "null", "string" ] }, { "name" : "SPREAD", "type" : [ "null", { "type" : "bytes", "logicalType" : "decimal", "precision" : 10, "scale" : 0 } ] } ] }
and without using avro logical types, I get the values I'd expect:
{"TRADEID":{"string":"71416037"},"SPREAD":{"string":"58.75"}}
Any suggestions please?
Created 05-15-2018 02:16 PM
What version of avro-tools are you using? You'll need version 1.8.x or later to support logical types. Also, you won't see readable data for a decimal type, as decimals are encoded in Avro as two's-complement big-endian integer values, so they'll show up as something odd in the JSON output.
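To illustrate that encoding: the ";" in the tojson output above is the single byte 0x3B, which decodes as the two's-complement big-endian integer 59, i.e. 58.75 rounded to the schema's scale of 0. A minimal sketch of the decoding in plain Python (nothing NiFi- or Avro-library-specific; the helper name is just for illustration):

```python
# Decode an Avro "decimal" logical type: the raw bytes hold a
# two's-complement, big-endian unscaled integer; the real value
# is that integer scaled down by 10**scale from the schema.
from decimal import Decimal

def decode_avro_decimal(raw: bytes, scale: int) -> Decimal:
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    return Decimal(unscaled).scaleb(-scale)

# The ";" seen in the JSON dump is the byte 0x3B = 59; with the
# schema's scale of 0, that is the decimal value 59 (58.75 rounded).
print(decode_avro_decimal(b";", 0))  # prints 59
```

So the data isn't corrupt; the value just isn't human-readable once it's encoded as decimal bytes (and note the scale of 0 in your schema is dropping the fractional part).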
Created 05-15-2018 02:31 PM
I'm using avro-tools 1.8.2; Impala is on 2.10.0-cdh5.13.1.
I started checking the data because I was getting an error when trying to query the impala table, which was created from the parquet file produced by PutParquet. I've attached both the avro file from QueryDatabaseTable, and the parquet file produced by PutParquet.
Below is the code I used to create and then query the table (paths abbreviated for conciseness):
create external table test2
LIKE PARQUET '..../test/577533060432928'
STORED AS PARQUET
LOCATION '..../test';

select * from test2;
WARNINGS: File 'hdfs://nameservice1/.../test/577533060432928' has an incompatible Parquet schema for column 'test2.spread'. Column type: DECIMAL(10,0), Parquet schema: optional byte_array SPREAD [i:1 d:1 r:0]
Created 05-23-2018 03:49 PM
I left a response on my other answer but will leave it here too in case you hadn't seen it:
Looking at the parquet-avro code, I think your suggestion of the workaround to change decimal values to fixed is the right approach (for now). We could update the version of parquet-avro but I didn't see anything in there that would improve your situation, it was Impala that needed to support more incoming types.
Created 05-15-2018 03:51 PM
I think I've found the issue:
https://issues-test.apache.org/jira/browse/IMPALA-2494
Other than upgrading Impala to 2.11 or trying to hack PutParquet to use fixed_len_byte_array rather than binary (is that viable?), are there any ways to work around this?
Edit: I'm thinking I can maybe map the avro decimal values to fixed type with some schema conversion between the QueryDatabaseTable and PutParquet processors. Would love to hear any other suggestions though.
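For reference, the schema change I have in mind would swap the bytes-backed decimal for a fixed-backed one, so parquet-avro writes FIXED_LEN_BYTE_ARRAY instead of BYTE_ARRAY. Roughly like this for the SPREAD field from my schema above (the fixed type's "name" and "size" here are my own guesses; size just needs enough bytes for precision 10):

```json
{ "name" : "SPREAD", "type" : [ "null",
    { "type" : "fixed", "name" : "SPREAD_fixed", "size" : 8,
      "logicalType" : "decimal", "precision" : 10, "scale" : 0 } ] }
```

The idea would be to apply this schema in a conversion step between QueryDatabaseTable and PutParquet.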
Created 05-16-2018 11:52 AM
Looking at the parquet-avro code, I think your suggestion of the workaround to change decimal values to fixed is the right approach (for now). We could update the version of parquet-avro but I didn't see anything in there that would improve your situation, it was Impala that needed to support more incoming types.