Can DECIMAL data type be defined in a HIVE external table when using Avro file format?

New Contributor

Currently, all of our AMOUNT columns are defined as the DOUBLE data type in our Hive tables. We would like to convert them to DECIMAL(15,3), since according to the Cloudera online documentation:

 

FLOAT and DOUBLE can cause problems or unexpected behavior due to inability to precisely represent certain fractional values, for example dollar and cents values for currency. You might find output values slightly different than you inserted, equality tests that do not match precisely, or unexpected values for GROUP BY columns. DECIMAL can help reduce unexpected behavior and rounding errors, at the expense of some performance overhead for assignments and comparisons.
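As a quick illustration of the kind of imprecision the documentation describes (an illustrative query only, not taken from our actual tables):

select cast(0.1 as double) + cast(0.2 as double);
-- may return 0.30000000000000004 rather than 0.3, so an equality test against 0.3 does not match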

 

However, since Avro does not support DECIMAL, when I create an external Hive table using the Avro file format, I can only define those amount columns as DOUBLE. Is there any way to create an external Hive table with a DECIMAL column while using Avro files?
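What I would ideally like to define is something along these lines (the table name, columns, and location here are only placeholders for illustration, assuming a Hive version that supports STORED AS AVRO):

create external table amounts_avro (
  account_id int,
  amount decimal(15,3))
stored as avro
location '/data/amounts_avro';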

1 REPLY

Master Collaborator

Hi,

Starting with Hive 0.14 and Avro 1.7.7 (CDH 5.x), support for the DECIMAL datatype has been added. Please refer to the JIRA below for details:

https://issues.apache.org/jira/browse/AVRO-1402

The following is the procedure to do the conversion. Change the schema as per your requirements; I am just outlining the way I did the transformation.

 

Step 1: This is my test table, with column "c" defined as a DOUBLE:

show create table test1;
+-------------------------------------------------------------------------+--+
| createtab_stmt |
+-------------------------------------------------------------------------+--+
| CREATE TABLE `test1`( |
| `a` int COMMENT 'from deserializer', |
| `b` int COMMENT 'from deserializer', |
| `c` double COMMENT 'from deserializer') |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' |
| LOCATION |
| 'hdfs://nightly54-1.vpc.cloudera.com:8020/user/hive/warehouse/test1' |
| TBLPROPERTIES ( |
| 'COLUMN_STATS_ACCURATE'='true', |
| 'numFiles'='1', |
| 'numRows'='1', |
| 'rawDataSize'='0', |
| 'totalSize'='306', |
| 'transient_lastDdlTime'='1456655270') |
+-------------------------------------------------------------------------+--+
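For reference, a table like this could have been created and populated with statements along these lines (a minimal sketch on Hive 0.14+, not the exact commands used here):

create table test1 (a int, b int, c double) stored as avro;
insert into table test1 values (1, 2, 123.456);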

Step 2: The schema header in the Avro data files of table test1 looks as follows.

 

a. java -jar /usr/lib/avro/avro-tools-1.7.6-cdh5.4.10-SNAPSHOT.jar getschema 000000_0

 

{
  "type" : "record",
  "name" : "test1",
  "namespace" : "default",
  "fields" : [ {
    "name" : "a",
    "type" : [ "null", "int" ],
    "doc" : "\u0000\u0000",
    "default" : null
  }, {
    "name" : "b",
    "type" : [ "null", "int" ],
    "default" : null
  }, {
    "name" : "c",
    "type" : [ "null", "double" ],
    "default" : null
  } ]
}

 

Step 3: For Hive to read the Avro data as a DECIMAL datatype, the schema header in the Avro data file should look as follows:

{
  "type" : "record",
  "name" : "test1",
  "namespace" : "default",
  "fields" : [ {
    "name" : "a",
    "type" : [ "null", "int" ],
    "doc" : "\u0000\u0000",
    "default" : null
  }, {
    "name" : "b",
    "type" : [ "null", "int" ],
    "default" : null
  }, {
    "name" : "c",
    "type" : [ "null", { "type" : "bytes", "logicalType" : "decimal", "precision" : 15, "scale" : 3 } ],
    "default" : null
  } ]
}


The Avro SerDe reads each data file's own schema header when reading data, so creating an external table that merely refers to a decimal schema file with
WITH SERDEPROPERTIES ('avro.schema.url'='/path-to-decimal-datatype-schema-file') will not work; you will receive errors such as:

Error: java.io.IOException: org.apache.avro.AvroTypeException: Found double, expecting union (state=,code=0)
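For example, a definition along these lines (the table name is made up, the schema file path is the same placeholder as above, and the location points at the original table's data directory) creates the table, but queries against the existing double-encoded files then fail with the error shown above:

create external table test1_decimal_broken
row format serde 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
with serdeproperties ('avro.schema.url'='/path-to-decimal-datatype-schema-file')
stored as inputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
location '/user/hive/warehouse/test1';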

 

Step 4: So we need to create another table with the desired DECIMAL datatype and copy the content of the original table into this new temp table, so that data files with the appropriate decimal schema header get written.

 

a. create table temp (a int, b int, c decimal(15,3)) stored as avro;  -- this creates the Avro schema shown in Step 3

b. insert overwrite table temp select * from test1;
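Since the original question was about an external table, one possible follow-up (only a sketch; the table name is made up and the location would be wherever the newly written Avro files live, here the temp table's warehouse directory) is to expose the converted data through an external DECIMAL table:

create external table test1_decimal (a int, b int, c decimal(15,3))
stored as avro
location '/user/hive/warehouse/temp';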

 

Now we have the data in the DECIMAL datatype in Avro format; this can be confirmed with a DESCRIBE statement and also by looking at the stored data.
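For instance (just a sketch of that check; the exact output depends on your data):

describe temp;                -- column c should now be reported as decimal(15,3)
select * from temp limit 5;   -- values come back as decimals, e.g. 123.456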

 

Thanks