Sqoop is a tool designed to transfer data between Hadoop and relational databases or
mainframes. You can use Sqoop to import data from a relational database
management system (RDBMS) such as MySQL or Oracle, or from a mainframe, into the Hadoop
Distributed File System (HDFS), transform the data in Hadoop MapReduce, and
then export the data back into an RDBMS.
Sqoop automates most of this process, relying on the database to describe the schema
of the data to be imported. Sqoop uses MapReduce to import and export the
data, which provides parallel operation as well as fault tolerance.
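As a minimal sketch of this round trip, the following commands import a table into HDFS and export processed results back to the database. The connection string, table names, and HDFS paths (db.example.com, sales, orders, orders_summary, /data/...) are placeholders for this example, not values prescribed by Sqoop:

sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --table orders \
  --target-dir /data/orders

sqoop export \
  --connect jdbc:mysql://db.example.com/sales \
  --table orders_summary \
  --export-dir /data/orders_summary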
Sqoop Performance Tuning Best Practices
Tune the following Sqoop arguments in the JDBC connection or
Sqoop mapping to optimize performance:
• batch
• split-by and boundary-query
• direct
• fetch-size
• num-mappers
1. Inserting Data in Batches
Specifies that related SQL statements can be grouped into
a batch when you export data.
The JDBC interface exposes an API for doing batches in a
prepared statement with multiple sets of values. With the --batch parameter,
Sqoop can take advantage of this. This API is present in all JDBC drivers
because it is required by the JDBC interface.
The default values can vary from connector to connector.
Sqoop defaults to disabled batching, and to 100 for both the sqoop.export.records.per.statement and sqoop.export.statements.per.transaction properties.
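As an illustrative sketch, a batched export might look like the following; the connection details, table name, and export directory are assumptions for the example, and the two -D properties override the per-statement and per-transaction defaults mentioned above:

sqoop export \
  -Dsqoop.export.records.per.statement=100 \
  -Dsqoop.export.statements.per.transaction=100 \
  --connect jdbc:mysql://db.example.com/sales \
  --table orders \
  --export-dir /data/orders \
  --batch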
2. Custom Boundary Queries
Specifies the range of values that you can import. You can
use boundary-query if you do not get the desired results by using the split-by
argument alone. When you configure the boundary-query argument, you must
specify the min(id) and max(id) along with the table name. If you do not
configure the argument, Sqoop runs the following query:
select min(<split-by column>), max(<split-by column>) from <table name>
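A sketch of a custom boundary query, assuming a hypothetical orders table with a numeric id column:

sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --table orders \
  --split-by id \
  --boundary-query "SELECT min(id), max(id) FROM orders WHERE id >= 1000" \
  --target-dir /data/orders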
3. Importing Data Directly
Specifies the direct import fast path when you
import data from an RDBMS.
Rather than using the JDBC interface for
transferring data, direct mode delegates the job of transferring data to
the native utilities provided by the database vendor. In the case of MySQL, the
mysqldump and mysqlimport utilities are used for retrieving data from the database
server and for moving data back. In the case of PostgreSQL, Sqoop takes
advantage of the pg_dump utility to import data. Using native utilities
greatly improves performance, as they are optimized to provide the best possible
transfer speed while putting less burden on the database server. This faster
import comes with several limitations. For one, not all databases have native
utilities available, so this mode is not available for every supported database.
Out of the box, Sqoop has direct support only for MySQL and PostgreSQL.
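A minimal sketch of a direct-mode import, reusing the hypothetical MySQL connection from the earlier examples; note that the vendor utility (here mysqldump) must be installed and on the path of the task nodes:

sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --table orders \
  --direct \
  --target-dir /data/orders_direct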
4. Increasing the Fetch Size
Specifies the number of entries that Sqoop can
import at a time.
Use the following syntax:
--fetch-size=<n>
Where <n> represents the number of entries
that Sqoop must fetch at a time. The default is 1000.
Increase the value of the fetch-size argument based
on the volume of data that you need to read. Set the value based on the available
memory and bandwidth.
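For example, a sketch that raises the fetch size to 10,000 rows per database round trip (connection details again assumed):

sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --table orders \
  --fetch-size=10000 \
  --target-dir /data/orders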
5. Controlling the Number of Map Tasks
Specifies the number of map tasks that can run in parallel.
The default is 4. To optimize performance, set the number of map tasks to a value
lower than the maximum number of connections that the database supports.
Use the --num-mappers parameter if you want Sqoop to use a
different number of mappers. For example, to suggest 10 concurrent tasks, use
the following Sqoop command:
sqoop import --connect <jdbc-uri> --table <table> --num-mappers 10
Controlling the amount of parallelism that Sqoop will use to
transfer data is the main way to control the load on your database. Using more
mappers will lead to a higher number of concurrent data transfer tasks, which
can result in faster job completion. However, it will also increase the load on
the database as Sqoop will execute more concurrent queries.
6. Splitting the Workload with split-by
Specifies the column based on which Sqoop must split
the work units. For an even distribution of work across mappers, choose a column
whose values are uniformly distributed, such as a numeric primary key.
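As a closing sketch, split-by can be combined with a mapper count; customer_id stands in for any evenly distributed numeric column in the hypothetical orders table:

sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --table orders \
  --split-by customer_id \
  --num-mappers 8 \
  --target-dir /data/orders_split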