Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

AVRO Schema evolution, not able to query old data

Highlighted

AVRO Schema evolution, not able to query old data

New Contributor

Hi,

I have created a AVRO format External Hive table. Then I added few more columns in AVSC and recreated new Hive table, with new columns having default values. But when I am querying the table, old data is not appearing.

Created old table as-

SET avro.output.codec=snappy; CREATE EXTERNAL TABLE finpolicy1 PARTITIONED BY (ds string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION '/user/mamta.chawla/fin/avro' TBLPROPERTIES ('avro.schema.url'='/user/mamta.chawla/fin/avsc/fin_1.avsc');

Created new table as

SET avro.output.codec=snappy; CREATE EXTERNAL TABLE finpolicy1 PARTITIONED BY (ds string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION '/user/mamta.chawla/fin/avro' TBLPROPERTIES ('avro.schema.url'='/user/mamta.chawla/fin/avsc/fin_2.avsc');

fin_1.avsc is

{ "namespace": "testing.hive.avro.serde", "name": "cards", "type": "record", "fields": [ { "name":"batchID", "type":"string", "doc":"Order of playing the role" }, { "name":"color", "type":"string", "doc":"Order of playing the role" }, { "name":"suit", "type":"string", "doc":"card suit" }]}

fin_2.avsc is

{ "namespace": "testing.hive.avro.serde", "name": "cards", "type": "record", "fields": [ { "name":"batchID", "type":"string", "doc":"Order of playing the role" }, { "name":"color", "type":"string", "doc":"Order of playing the role" }, { "name":"suit", "type":"string", "doc":"card suit" }, { "name":"PIA", "type":"string", "doc":"last name of actor playing role", "default":"" }]}

When I query data after schema evolution and query the table as select * from finpolicy, I am getting no record returned.

I am loading data into avro table from text table using

hive -e "insert into table ${avro_table} PARTITION(ds='Q1') select t.* from ${tabName} t;"

Is there I am missing something??

Regards

Mamta Chawla

1 REPLY 1

Re: AVRO Schema evolution, not able to query old data

Super Guru

Hi @Mamta Chawla,
You are having finpolicy1 table with 4(batchid,color,suit,ds(partition)) columns.

For finpolicy2 table you have added another column so there are 5(batchid,color,suit,PIA,ds(partition)). both tables are pointing to same location on HDFS i.e /user/mamta.chawla/fin/avro with different schemas.

when you are trying to insert data into finpolicy2 from text table with static partition value ds='Q1' it will create the below partition directory

drwxr-xr-x - cloudera supergroup 0 2017-09-24 14:54 /user/mamta.chawla/fin/avro/ds=Q1 

which having new finpolicy2(5) columns in it.

Reason no record returned:-
we have only inserted records from text table to finpolicy2 table, if you want to see the records in finpolicy1 table then do

msck repair table finpolicy1;

(or)

alter table finpolicy1 add partition(ds='Q1') location '/user/mamta.chawla/fin/avro/ds=Q1';

Keep in mind as we are having 4 columns in finpolicy1 if you did

insert into table finpolicy1 partition(ds='Q1') select batchid,color,suit from text_table;

this means we are getting only batchid,color,suit from text table so we are missing PIA column

msck repair table finpolicy2;

the result for finpolicy2 will have NULL in PIA column.

it's better to load data to finpolicy2 table as it is having all the columns that we are using for finpolicy1

insert into table finpolicy2 partition(ds='Q1') select * from text_table;

then do

msck repair table finpolicy1; 
select * from finpolicy1;

now the table finpolicy1 will have all the data for batchid,color,suit columns because those columns are available in finpolicy2 table.

Don't have an account?
Coming from Hortonworks? Activate your account here