Created 11-02-2017 08:05 PM
Step 1:
I Sqoop imported data into a HDFS location from Oracle in Avro format using following command.
sqoop import --connect jdbc:oracle:thin:system/system@192.XXX.56.188:1521:xe --username xxx -password xxx --table DOB --target-dir '/user/root/DOB_AVRO/' --as-avrodatafile --num-mappers 1;
Step 2:
Then Created a External table on top of that data using following command.
create external table avrodata_new
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> STORED as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> location '/user/root/DOB_AVRO/'
> TBLPROPERTIES ('avro.schema.url'='hdfs://sandbox.hortonworks.com/user/root/DOB.avsc');
Step 3:
hive> select * from avrodata_new;
OK
SOURYA 477360000000 PATTANAYAK 1
SONU 507945600000 RANA 2
SAROJ NULL BEHERA NULL
MANAS NULL NULL 3
TANKA 623980800000 GIRI 4
Time taken: 1.383 seconds, Fetched: 5 row(s)
Step 4:
Created another table orctab with schema in ORC format using following command.
hive> create external table orctab (name string,s_date date,l_name string,id int)
> row format delimited
> fields terminated by '\t'
> lines terminated by '\n'
> stored as orc
> location '/user/root/orctable';
OK Time taken: 1.868 seconds
Step 5:
Inserted the data from avrodata_new to orctab.
hive> insert into orctab select name,to_date(timestamp(s_date)),l_name,cast(id as Int) from avrodata_new;
Step 6:
hive> describe orctab;
OK
name string
s_date date
l_name string
id int
Time taken: 0.675 seconds, Fetched: 4 row(s)
hive> select * from orctab;
OK
SOURYA 1985-02-16 PATTANAYAK 1
SONU 1986-02-05 RANA 2
SAROJ NULL BEHERA NULL
MANAS NULL NULL 3
TANKA 1989-10-10 GIRI 4
Time taken: 1.388 seconds, Fetched: 5 row(s)
QUESTION:
Now i have added 2 records to that HDFS location ('/user/root/DOB_AVRO/') or to avrodata_new table.
How can i import only those 2 new records from avrodata_new table to orctab table.
Created 11-02-2017 08:59 PM
one way is using insert statement with where clause then specify the filter to identify your new records from avro table.
hive# insert into orctab select name,to_date(timestamp(s_date)),l_name,cast(id as Int) from avrodata_new where <your-filter-to-find-added-records>;
(or) Dropping and recreating table again
hive# drop table orctab; hive# insert into orctab select name,to_date(timestamp(s_date)),l_name,cast(id as Int) from avrodata_new; //drop and reload the table again.
For these cases best practice would be
having time stamp column would be perfect to load only the changes that made in the base table i.e if there is any updates or appends to the table those records needs to have new time stamp value associated with them.
So lets consider you have loaded orc table until 2017-11-02 16:00:00 timestamp and then you have added 2 records into avro table with 2017-11-02 16:10:00 new time stamp associated with those new records.
Then our insert query would be
insert into orctab select name,to_date(timestamp(s_date)),l_name,cast(id as Int) from avrodata_new where timestamp_column >'2017-11-02 16:00:00'; //we are going to load only new records that got added in avro table
Created 11-03-2017 02:36 PM
@Shu Thank You very much for your brief explanation making it clear..
Actually i am working on a dummy table without timestamp.
Hope this will help me with timestamp column.