Avro logical types data format from QueryDatabaseTable

Explorer

I'm attempting to use NiFi to take data from an Oracle database and write it as Parquet files to HDFS, from which I'll then build tables for use in Impala.

I've set up a QueryDatabaseTable processor and linked it up to PutParquet. With the default settings everything works, but all the values come through as strings rather than dates, ints, doubles, etc. After some quick research I realised I needed to set "Use Avro Logical Types" to true on QueryDatabaseTable, so I did that.

However, when I look at the Avro output from a simple test query, it doesn't look right. Running it through avro-tools tojson I get:
{"TRADEID":{"string":"71416037"},"SPREAD":{"bytes":";"}}

Why would the bytes value show ";"? I'm expecting a value of 58.75 here.

The schema looks ok:
{ "type" : "record", "name" : "CDS_CTC", "namespace" : "VOL_DATA", "fields" : [ { "name" : "TRADEID", "type" : [ "null", "string" ] }, { "name" : "SPREAD", "type" : [ "null", { "type" : "bytes", "logicalType" : "decimal", "precision" : 10, "scale" : 0 } ] } ] }

and without using avro logical types, I get the values I'd expect:
{"TRADEID":{"string":"71416037"},"SPREAD":{"string":"58.75"}}

Any suggestions please?

5 REPLIES

Master Guru

What version of avro-tools are you using? You'll need 1.8.x or later to support logical types. Also, you won't see readable values for a decimal type, since Avro encodes them as two's-complement big-endian integer values, so they'll show up as something weird in the JSON output.
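For illustration, here's a quick sketch in Python of that decoding, assuming the single byte ';' from your dump and the scale of 0 from your schema:

raw = b";"  # the "bytes" value from the tojson output; ';' is 0x3B
# Avro stores the unscaled decimal as two's-complement big-endian bytes
unscaled = int.from_bytes(raw, byteorder="big", signed=True)  # -> 59
scale = 0  # from the schema: "scale" : 0
print(unscaled / 10 ** scale)  # -> 59.0

So the byte decodes to 59, which looks like your 58.75 rounded to a whole number; with scale 0 in the schema there's nowhere to keep the fractional part, so it's worth checking the reported precision and scale against the actual Oracle column definition.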

Explorer

I'm using avro-tools 1.8.2; Impala is on 2.10.0-cdh5.13.1.

I started checking the data because I was getting an error when querying the Impala table, which was created from the Parquet file produced by PutParquet. I've attached both the Avro file from QueryDatabaseTable and the Parquet file produced by PutParquet.

Below is the code I used to create and then query the table (paths abbreviated for conciseness):

create external table test2
LIKE PARQUET '..../test/577533060432928'
STORED AS PARQUET
LOCATION '..../test';

select * from test2;
WARNINGS: File 'hdfs://nameservice1/.../test/577533060432928' has an incompatible Parquet schema for column 'test2.spread'. Column type: DECIMAL(10,0), Parquet schema: optional byte_array SPREAD [i:1 d:1 r:0]
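If you want to double-check what PutParquet actually wrote, a quick Python sketch using pyarrow (assuming you have it installed; path abbreviated as above) will print the schema recorded in the Parquet footer:

import pyarrow.parquet as pq

# Print the schema from the Parquet file footer. Based on the warning
# above, SPREAD should show up as a plain binary (byte_array) column,
# which is what pre-2.11 Impala can't read as DECIMAL.
print(pq.read_schema('..../test/577533060432928'))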

avro-parquet.zip

Master Guru

I left a response on my other answer but will leave it here too in case you hadn't seen it:

Looking at the parquet-avro code, I think your suggested workaround of changing decimal values to fixed is the right approach (for now). We could update the version of parquet-avro, but I didn't see anything in there that would improve your situation; it's Impala that needed to support more incoming types.

Explorer

I think I've found the issue:
https://issues.apache.org/jira/browse/IMPALA-2494

Other than upgrading Impala to 2.11 or trying to hack PutParquet to use fixed_len_byte_array rather than binary (is that viable?), are there any ways to work around this?

Edit: I'm thinking I can maybe map the Avro decimal values to a fixed type with some schema conversion between the QueryDatabaseTable and PutParquet processors; a sketch of what that field might look like is below. Would love to hear any other suggestions though.
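For reference, here's a sketch of how the SPREAD field could look as a fixed-based decimal (the fixed type requires a name, which is just a placeholder here; size 5 is enough to hold ten decimal digits in two's complement):

{ "name" : "SPREAD",
  "type" : [ "null",
    { "type" : "fixed", "name" : "SPREAD_dec", "size" : 5,
      "logicalType" : "decimal", "precision" : 10, "scale" : 0 } ] }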

Master Guru

Looking at the parquet-avro code, I think your suggested workaround of changing decimal values to fixed is the right approach (for now). We could update the version of parquet-avro, but I didn't see anything in there that would improve your situation; it's Impala that needed to support more incoming types.