Sqoop is a popular client for moving datasets between relational databases and Hive/HDFS, in both directions; it also supports importing tables into HBase. The most common use case is a bulk import from a database such as Oracle for the initial load, followed by a tool like GoldenGate to continuously replicate changes. In this article, we will import roughly 4 million rows from a MySQL table into an HBase table. The steps are as follows:
Create a table in MySQL:
CREATE TABLE sqooptest (id int NOT NULL PRIMARY KEY AUTO_INCREMENT, val int);
Create a procedure in MySQL to generate test data for the columns defined:
DELIMITER $$
CREATE PROCEDURE prepare_data1()
BEGIN
  DECLARE i INT DEFAULT 100;
  WHILE i < 4000000 DO
    INSERT INTO sqooptest (val) VALUES (i);
    SET i = i + 1;
  END WHILE;
END$$
DELIMITER ;
When you call this procedure, you'll have 3,999,900 rows (just under 4 million) in the MySQL table.
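To run the procedure and sanity-check the row count (the counter runs from 100 to 3,999,999, so 3,999,900 inserts), you can use standard MySQL statements:

```
CALL prepare_data1();
SELECT COUNT(*) FROM sqooptest;
-- expect 3999900
```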
Let's create the HBase table from the HBase shell:
Shell > create 'sqooptest', 'A'
Insert one dummy record:
Shell > put 'sqooptest', '1', 'A:val', '123'
Now let's run the Sqoop command to import data from MySQL to HBase (this uses regular HBase "put" operations internally). There is another approach, bulk loading with Sqoop, which we will discuss shortly.
sqoop import --connect jdbc:mysql://example.com/test --driver com.mysql.jdbc.Driver --username root -P --table sqooptest --hbase-table sqooptest --hbase-row-key id --column-family A --split-by id -m 8
--driver specifies the JDBC driver class.
--connect is the JDBC connection string for MySQL.
--username is the database user (-P prompts for the password).
--table is the source database table name.
--hbase-table is the target table name on the HBase side.
--hbase-row-key is the primary key column of the source table to use as the HBase row key. It can also be a comma-separated list of columns for a composite key.
--column-family is the column family in HBase where the writes will land.
--split-by is the source column used to split the data to be imported across the given number of mappers. So if you have defined 8 mappers for this job, the range of values in this column is divided into 8 roughly equal parts, one served by each mapper.
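Sqoop's actual split calculation is more involved, but conceptually it queries SELECT MIN(id), MAX(id) on the source table and carves that range into equal slices. A rough sketch of the resulting per-mapper WHERE clauses, using the id range generated above (the arithmetic here is an illustration, not Sqoop's exact algorithm):

```shell
# Hypothetical split computation: min/max come from the generated data above.
min=100
max=3999999
mappers=8
range=$(( (max - min + 1) / mappers ))
for i in $(seq 0 $((mappers - 1))); do
  lo=$(( min + i * range ))
  if [ "$i" -eq $((mappers - 1)) ]; then
    hi=$max            # last mapper absorbs any remainder
  else
    hi=$(( lo + range - 1 ))
  fi
  echo "Mapper $i: WHERE id >= $lo AND id <= $hi"
done
```

Each mapper then issues its own bounded SELECT against MySQL, which is why a sensible --split-by column (evenly distributed, indexed) matters for balanced load.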
Some additional useful options:
--fetch-size is the number of rows each mapper fetches from the source database in one round trip. Tuning fetch-size to match the mapper container size (taking memory, CPU, and bandwidth into account) can make the job noticeably faster and more efficient.
--hbase-bulkload enables bulk loading of data into HBase: instead of issuing puts, the job writes HFiles and loads them directly into the region servers.
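For example, the earlier import command can be switched to bulk loading, with a larger fetch size, by appending these two flags (the fetch-size value of 10000 is just an illustrative starting point, not a recommendation):

```
sqoop import --connect jdbc:mysql://example.com/test --driver com.mysql.jdbc.Driver \
  --username root -P --table sqooptest --hbase-table sqooptest --hbase-row-key id \
  --column-family A --split-by id -m 8 \
  --fetch-size 10000 --hbase-bulkload
```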
Finally, the output of this job looks like any other regular mapreduce job.
18/05/16 06:42:52 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 32.4191 seconds (0 bytes/sec)
18/05/16 06:42:52 INFO mapreduce.ImportJobBase: Retrieved 3999900 records.
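As a final sanity check, you can count the rows on the HBase side from the HBase shell; the INTERVAL option only controls how often progress is printed during the full-table scan. Expect a figure in line with the 3,999,900 records reported above (the dummy row's key may collide with an imported id):

```
Shell > count 'sqooptest', INTERVAL => 1000000
```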