Created on 01-20-2016 01:46 PM - edited 09-16-2022 02:58 AM
Currently, all of our AMOUNT columns are defined with the DOUBLE data type in our Hive tables. We would like to convert them to the DECIMAL(15,3) data type because, according to the Cloudera online documentation:
FLOAT and DOUBLE can cause problems or unexpected behavior due to inability to precisely represent certain fractional values, for example dollar and cents values for currency. You might find output values slightly different than you inserted, equality tests that do not match precisely, or unexpected values for GROUP BY columns. DECIMAL can help reduce unexpected behavior and rounding errors, at the expense of some performance overhead for assignments and comparisons.
However, since Avro doesn't support DECIMAL, when I create an external Hive table using the Avro file format, I can only create the DECIMAL column as a DOUBLE. Is there any way to create an external Hive table with a DECIMAL column data type while using Avro files?
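The rounding behavior the documentation warns about is easy to reproduce with plain doubles; a minimal Python sketch (using the standard library `decimal` module as a stand-in for Hive's DECIMAL semantics):

```python
from decimal import Decimal

# DOUBLE-style arithmetic: ten payments of $0.10 do not sum to exactly $1.00
total = sum([0.1] * 10)
print(total == 1.0)   # False: IEEE-754 doubles cannot represent 0.1 exactly
print(total)          # 0.9999999999999999

# DECIMAL-style arithmetic keeps exact dollar-and-cent values
exact = sum([Decimal("0.1")] * 10)
print(exact == Decimal("1.0"))  # True
```

This is exactly the kind of equality-test mismatch that shows up in currency aggregations over DOUBLE columns.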
Created 02-28-2016 03:09 AM
Hi,
Starting from Hive 0.14 and Avro 1.7.7, support for the DECIMAL data type has been added (available in CDH 5.x). Please refer to the JIRA below for details:
https://issues.apache.org/jira/browse/AVRO-1402
Following is the procedure to do the conversion. Adjust the schema to your own requirements; I am just outlining the way I did the transformation.
Step 1: This is my test table, with column "c" in the DOUBLE data type:
show create table test1;
+-------------------------------------------------------------------------+--+
| createtab_stmt |
+-------------------------------------------------------------------------+--+
| CREATE TABLE `test1`( |
| `a` int COMMENT 'from deserializer', |
| `b` int COMMENT 'from deserializer', |
| `c` double COMMENT 'from deserializer') |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' |
| LOCATION |
| 'hdfs://nightly54-1.vpc.cloudera.com:8020/user/hive/warehouse/test1' |
| TBLPROPERTIES ( |
| 'COLUMN_STATS_ACCURATE'='true', |
| 'numFiles'='1', |
| 'numRows'='1', |
| 'rawDataSize'='0', |
| 'totalSize'='306', |
| 'transient_lastDdlTime'='1456655270') |
+-------------------------------------------------------------------------+--+
Step 2: The schema header in the Avro data files of table test1 looks as follows:
a. java -jar /usr/lib/avro/avro-tools-1.7.6-cdh5.4.10-SNAPSHOT.jar getschema 000000_0
{
"type" : "record",
"name" : "test1",
"namespace" : "default",
"fields" : [ {
"name" : "a",
"type" : [ "null", "int" ],
"doc" : "\u0000\u0000",
"default" : null
}, {
"name" : "b",
"type" : [ "null", "int" ],
"default" : null
}, {
"name" : "c",
"type" : [ "null", "double" ],
"default" : null
} ]
}
Step 3: For Hive to read Avro data with the DECIMAL data type, the schema header in the Avro data file should look as follows:
{
"type" : "record",
"name" : "test1",
"namespace" : "default",
"fields" : [ {
"name" : "a",
"type" : [ "null", "int" ],
"doc" : "\u0000\u0000",
"default" : null
}, {
"name" : "b",
"type" : [ "null", "int" ],
"default" : null
}, {
"name" : "c",
"type" : [ "null", { "type" : "bytes", "logicalType" : "decimal", "precision" : 15, "scale" : 3 } ],
"default" : null
} ]
}
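The change from the Step 2 schema to the Step 3 schema can also be scripted. A sketch using only Python's standard `json` module (the field name "c" and the precision/scale match the example above; adjust both for your tables):

```python
import json

# Original Avro schema as dumped by avro-tools getschema (Step 2)
schema = json.loads("""
{
  "type": "record", "name": "test1", "namespace": "default",
  "fields": [
    {"name": "a", "type": ["null", "int"], "default": null},
    {"name": "b", "type": ["null", "int"], "default": null},
    {"name": "c", "type": ["null", "double"], "default": null}
  ]
}
""")

# Replace the "double" branch of column "c" with Avro's decimal logical type
for field in schema["fields"]:
    if field["name"] == "c":
        field["type"] = ["null", {
            "type": "bytes",
            "logicalType": "decimal",
            "precision": 15,
            "scale": 3,
        }]

print(json.dumps(schema, indent=2))
```

Note that editing the schema file alone is not enough, as the next paragraph explains: the schema embedded in the existing data files still says "double".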
The Avro SerDe reads the schema header embedded in the actual data files, so creating an external table that merely points at the new schema file with
WITH SERDEPROPERTIES ('avro.schema.url'='/path-to-decimal-datatype-schema-file') will not work; you will receive errors such as:
Error: java.io.IOException: org.apache.avro.AvroTypeException: Found double, expecting union (state=,code=0)
Step 4: So we need to create another table with the desired DECIMAL data type and copy the contents of the original table into it, so that new data files are written with the appropriate header for the DECIMAL type:
a. CREATE TABLE temp (a int, b int, c decimal(15,3)) STORED AS AVRO; -- writes the Avro schema shown in Step 3
b. INSERT OVERWRITE TABLE temp SELECT * FROM test1;
Now the data is stored as DECIMAL in Avro format, which can be confirmed with a DESCRIBE statement and by inspecting the stored data files.
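For anyone curious what "bytes with logicalType decimal" means physically: per the Avro specification, the bytes hold the unscaled integer value in big-endian two's-complement form, with the scale fixed by the schema. A minimal sketch (the helper name `encode_avro_decimal` is mine, not part of any library):

```python
from decimal import Decimal

def encode_avro_decimal(value: Decimal, scale: int) -> bytes:
    """Encode a Decimal as Avro's bytes/decimal representation:
    the unscaled integer in big-endian two's-complement form."""
    unscaled = int(value.scaleb(scale))  # e.g. 12.345 at scale 3 -> 12345
    length = max(1, (unscaled.bit_length() + 8) // 8)  # leave room for the sign bit
    return unscaled.to_bytes(length, byteorder="big", signed=True)

print(encode_avro_decimal(Decimal("12.345"), 3).hex())   # 3039  (12345 = 0x3039)
print(encode_avro_decimal(Decimal("-0.001"), 3).hex())   # ff    (-1 in two's complement)
```

This is why a decimal(15,3) column sorts and compares exactly: the stored value is an integer, not an approximation.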
Thanks