Support Questions

Find answers, ask questions, and share your expertise

Importing updated records from one hive table (avro) to another hive table (orc)

New Contributor

Step 1:

I Sqoop imported data into a HDFS location from Oracle in Avro format using following command.

sqoop import --connect jdbc:oracle:thin:system/system@192.XXX.56.188:1521:xe --username xxx -password xxx --table DOB --target-dir '/user/root/DOB_AVRO/' --as-avrodatafile --num-mappers 1;

Step 2:

Then Created a External table on top of that data using following command.

create external table avrodata_new

> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'

> STORED as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'

> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'

> location '/user/root/DOB_AVRO/'

> TBLPROPERTIES ('avro.schema.url'='hdfs://sandbox.hortonworks.com/user/root/DOB.avsc');

Step 3:

hive> select * from avrodata_new;

OK

SOURYA 477360000000 PATTANAYAK 1

SONU 507945600000 RANA 2

SAROJ NULL BEHERA NULL

MANAS NULL NULL 3

TANKA 623980800000 GIRI 4

Time taken: 1.383 seconds, Fetched: 5 row(s)

Step 4:

Created another table orctab with schema in ORC format using following command.

hive> create external table orctab (name string,s_date date,l_name string,id int)

> row format delimited

> fields terminated by '\t'

> lines terminated by '\n'

> stored as orc

> location '/user/root/orctable';

OK Time taken: 1.868 seconds

Step 5:

Inserted the data from avrodata_new to orctab.

hive> insert into orctab select name,to_date(timestamp(s_date)),l_name,cast(id as Int) from avrodata_new;

Step 6:

hive> describe orctab;

OK

name string

s_date date

l_name string

id int

Time taken: 0.675 seconds, Fetched: 4 row(s)

hive> select * from orctab;

OK

SOURYA 1985-02-16 PATTANAYAK 1

SONU 1986-02-05 RANA 2

SAROJ NULL BEHERA NULL

MANAS NULL NULL 3

TANKA 1989-10-10 GIRI 4

Time taken: 1.388 seconds, Fetched: 5 row(s)

QUESTION:

Now i have added 2 records to that HDFS location ('/user/root/DOB_AVRO/') or to avrodata_new table.

How can i import only those 2 new records from avrodata_new table to orctab table.

2 REPLIES 2

Super Guru
@LOKASHIS RANA

one way is using insert statement with where clause then specify the filter to identify your new records from avro table.

hive# insert into orctab select name,to_date(timestamp(s_date)),l_name,cast(id as Int) from avrodata_new where <your-filter-to-find-added-records>;
(or)
Dropping and recreating table again
hive# drop table orctab; hive# insert into orctab select name,to_date(timestamp(s_date)),l_name,cast(id as Int) from avrodata_new; //drop and reload the table again.

For these cases best practice would be

having time stamp column would be perfect to load only the changes that made in the base table i.e if there is any updates or appends to the table those records needs to have new time stamp value associated with them.

So lets consider you have loaded orc table until 2017-11-02 16:00:00 timestamp and then you have added 2 records into avro table with 2017-11-02 16:10:00 new time stamp associated with those new records.

Then our insert query would be

insert into orctab select name,to_date(timestamp(s_date)),l_name,cast(id as Int) from avrodata_new where timestamp_column >'2017-11-02 16:00:00'; //we are going to load only new records that got added in avro table

New Contributor

@Shu Thank You very much for your brief explanation making it clear..

Actually i am working on a dummy table without timestamp.

Hope this will help me with timestamp column.