Sqoop is a popular client for importing and exporting datasets between relational databases and Hive / HDFS, and it supports importing tables into HBase as well. The most common use case is to bulk-import data from databases such as Oracle for the initial loads, and then use tools like GoldenGate to continuously replicate changes. In this article, we will import roughly 4 million rows from a MySQL table into an HBase table. The steps are as follows:

Create a table in MySQL:

CREATE TABLE sqooptest (id int NOT NULL PRIMARY KEY AUTO_INCREMENT, val int);


Create a stored procedure in MySQL to populate the columns defined above with test data:

DELIMITER $$
CREATE PROCEDURE prepare_data1()
BEGIN
  DECLARE i INT DEFAULT 100;
  WHILE i < 4000000 DO
    INSERT INTO sqooptest (val) VALUES (i);
    SET i = i + 1;
  END WHILE;
END$$
DELIMITER ;

When you call this procedure, you'll have roughly 4 million records in the MySQL table:

CALL prepare_data1();
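
To verify, you can count the rows from the MySQL client. The test database and root user here are assumptions taken from the Sqoop connection string used later in this article; adjust them to your environment. Since the loop runs from 100 to 3,999,999, you should see 3,999,900 rows:

mysql -u root -p -e "SELECT COUNT(*) FROM test.sqooptest"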

Let's create the HBase table from the HBase shell:

Shell > create 'sqooptest', 'A'

Insert one dummy record:

Shell > put 'sqooptest', '1', 'A:val', '123'
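
You can scan the table from the same shell to confirm the column family and the dummy row are in place:

Shell > scan 'sqooptest'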

Now let's run the Sqoop command to import data from MySQL to HBase (this path uses regular HBase "put" calls internally). There is another way, bulk loading with Sqoop, which we will discuss shortly.

sqoop import --connect jdbc:mysql://example.com/test --driver com.mysql.jdbc.Driver --username root -P --table sqooptest --hbase-table sqooptest --hbase-row-key id --column-family A --split-by id -m 8

Where

--driver specifies the JDBC driver class to use.

--connect is the JDBC connection string to MySQL.

--username is the database user (-P prompts for the password).

--table is the source database table name.

--hbase-table is the target table name on the HBase side.

--hbase-row-key is the source column (usually the primary key) you would like to use as the HBase row key. It can also be a comma-separated list of columns for a composite key.

--column-family is the HBase column family the writes will land in.

--split-by is the column at the source DB used to split the data to be imported across the given number of mappers. With 8 mappers for this job, the range of values in this column is divided into roughly equal parts, each served by one mapper (see the boundary query example below).
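
For context on how the split works: Sqoop issues a boundary query on the --split-by column, essentially SELECT MIN(id), MAX(id) FROM sqooptest, and divides that range into one slice per mapper. You can run the same query yourself from the MySQL client to see the boundaries it will use (database and user as assumed in the example above):

mysql -u root -p -e "SELECT MIN(id), MAX(id) FROM test.sqooptest"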

Some additional useful options:

--fetch-size is the number of rows each mapper fetches from the source DB in one round trip. Increasing the fetch size in line with the mapper container size (consider memory, CPU, and bandwidth) has proved beneficial in making the job faster and more efficient.

--hbase-bulkload enables bulk loading of data into HBase (see the example command below).
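
Putting these together, a bulk-load variant of the import command above might look like the following sketch; the fetch size of 10000 is just an illustrative value to tune against your mapper container resources:

sqoop import --connect jdbc:mysql://example.com/test --driver com.mysql.jdbc.Driver --username root -P --table sqooptest --hbase-table sqooptest --hbase-row-key id --column-family A --split-by id -m 8 --fetch-size 10000 --hbase-bulkload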

Finally, the output of this job looks like that of any other regular MapReduce job:

18/05/16 06:42:52 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 32.4191 seconds (0 bytes/sec)
18/05/16 06:42:52 INFO mapreduce.ImportJobBase: Retrieved 3999900 records.
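
As a final sanity check, you can count the rows on the HBase side from the HBase shell. This performs a full scan, so it takes a while over roughly 4 million rows; the INTERVAL option only controls how often progress is reported:

Shell > count 'sqooptest', INTERVAL => 1000000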

References:

http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_importing_data_into_hbase

Comments

I've heard that Sqoop 1.4.7 enables import through Phoenix directly. Can you please add a section on how to do that with HDP 3.0 or a new post?

Phoenix shipped with HDP does not support import from Sqoop yet.