Community Articles

aervits · ‎02-18-2017

In this tutorial, I will walk you through creating a Sqoop action using WFM on HDP 2.5+.

First we need a table, we're going to use MySQL as source database and table.

create table imported (rowkey int, value varchar(25));
insert into imported (rowkey, value) values (1, "john doe");
insert into imported (rowkey, value) values (2, "jane doe");

I want to make sure that all cluster nodes can access this table and going to grant access to user centos on the LAN, you may have different restrictions on the network and by all means consult your DBAs.

GRANT ALL PRIVILEGES ON *.* TO 'centos'@'172.22.65.%'
    IDENTIFIED BY 'password'
    WITH GRANT OPTION;
FLUSH PRIVILEGES;

GRANT ALL PRIVILEGES ON *.* TO 'centos'@'localhost'
    IDENTIFIED BY 'password'
    WITH GRANT OPTION;
FLUSH PRIVILEGES;

I want to make sure user centos can access the table

mysql –u centos –p
➢ password

mysql> select * from test.imported;
+--------+----------+
| rowkey | value    |
+--------+----------+
|      1 | john doe |
|      2 | jane doe |
+--------+----------+

Finally, I'd like to test my sqoop works

sqoop list-tables --connect jdbc:mysql://source-1/test --username centos --password password

17/02/18 15:13:13 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
imported

Also, in case of Oozie with HCatalog and Sqoop, every node that will execute job attempts must have HCat and Sqoop client installed.

I want to save a password in a file so that I could access it w/out a prompt and not in clear text.

echo -n "password" > .password 
hdfs dfs -put .password /user/$USER/ 
hdfs dfs -chmod 400 /user/$USER/.password 
rm .password

[centos@source-1 ~]$ hdfs dfs -ls
Found 1 items
-r--------   3 centos hdfs          8 2017-02-18 15:13 .password
[centos@source-1 ~]$ hdfs dfs -cat .password
password[centos@source-1 ~]$

Let's run the list command again referencing the file instead of --password argument

sqoop list-tables --connect jdbc:mysql://source-1/test --username centos --password-file /user/centos/.password

17/02/18 15:14:43 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
imported

You can find more details in our comprehensive documentation on data movement http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_data-movement-and-integration/content/sqo...

Also, make sure mysql-connector-java is at an appropriate version. RHEL 6 bundles version 5.1.17 which does not work in later versions of HDP, we bundle 5.1.37 in HDP-UTILS and the only way to active it is to run the following

yum downgrade mysql-connector-java

then in your /usr/share/java directory you should be able to see correct connectors

lrwxrwxrwx.  1 root root       31 Feb 18 15:29 jdbc-mysql.jar -> mysql-connector-java-5.1.37.jar
lrwxrwxrwx.  1 root root       31 Feb 18 15:29 mysql-connector-java.jar -> mysql-connector-java-5.1.37.jar

You have a choice to update the Oozie sharelib with this connector or bundle it as part of workflow lib. I'm going to do the latter for time's sake.

Before I start authoring a workflow, I'd like to confirm my sqoop import works, I will execute it on the command line first

sqoop import --connect jdbc:mysql://172.22.65.123/test --username centos --password-file /user/$USER/.password --table imported --hcatalog-table imported --create-hcatalog-table --hcatalog-storage-stanza "STORED AS ORCFILE" --hcatalog-home /usr/hdp/current/hive-webhcat --map-column-hive value=STRING --split-by rowkey

I'm choosing HCatalog import as it is more efficient than --hive-import, in case of latter, it needs to make one extra step of moving imported data from staging directory to hive and spawning an extra container. With --hcatalog-table everything happens in one shot. Also, benefit here is that you can create an ORC table from the command line instead of going into Hive and altering a table to set it to ORC.

Let's see what we got as a result

Map-Reduce Framework
  Map input records=2
  Map output records=2
  Input split bytes=213
  Spilled Records=0
  Failed Shuffles=0
  Merged Map outputs=0
  GC time elapsed (ms)=837
  CPU time spent (ms)=8890
  Physical memory (bytes) snapshot=718036992
  Virtual memory (bytes) snapshot=9154256896
  Total committed heap usage (bytes)=535298048
 File Input Format Counters
  Bytes Read=0
 File Output Format Counters
  Bytes Written=0
17/02/18 15:32:05 INFO mapreduce.ImportJobBase: Transferred 628 bytes in 70.6267 seconds (8.8918 bytes/sec)
17/02/18 15:32:05 INFO mapreduce.ImportJobBase: Retrieved 2 records.
17/02/18 15:32:05 INFO mapreduce.ImportJobBase: Publishing Hive/Hcat import job data to Listeners

Let's see what it looks like in Hive

[centos@source-1 ~]$ beeline
Beeline version 1.2.1000.2.6.0.0-493 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000 "" ""
Connecting to jdbc:hive2://localhost:10000
Connected to: Apache Hive (version 1.2.1000.2.6.0.0-493)
Driver: Hive JDBC (version 1.2.1000.2.6.0.0-493)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000> select * from default.imported;
+------------------+-----------------+--+
| imported.rowkey  | imported.value  |
+------------------+-----------------+--+
| 1                | john doe        |
| 2                | jane doe        |
+------------------+-----------------+--+
2 rows selected (6.414 seconds)

Let's truncate the table in order to prepare for Oozie imports and additionally describe the table to demonstrate it is in fact in ORC

0: jdbc:hive2://localhost:10000> truncate table default.imported;
No rows affected (0.4 seconds)

0: jdbc:hive2://localhost:10000> describe formatted imported;

| SerDe Library:                | org.apache.hadoop.hive.ql.io.orc.OrcSerde                         | NULL                        |
| InputFormat:                  | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat                   | NULL                        |
| OutputFormat:                 | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat

I'm ready to start working on a workflow, let's import sqoop action and save workflow to create a directory for it.

I want to make sure I have a valid directory so I could upload a few files that are necessary for this to complete successfully.

hdfs dfs -mkdir /user/centos/sqoop/lib
hdfs dfs -put /usr/share/java/mysql-connector-java-5.1.37.jar /user/centos/sqoop/lib/
hdfs dfs -put /etc/hive/conf/hive-site.xml /user/centos/sqoop/lib/
hdfs dfs -put /etc/tez/conf/tez-site.xml /user/centos/sqoop/lib/

I'm going to use my own MySQL driver than the one in the sharelib and therefore I'm uploading it to my wf, again if you update the sharelib with associated jar you don't have to do that. Secondly, I'm going to include hive-site.xml and tez-site.xml. Until 2.5, you only needed hive-site.xml but now we also need tez-site.xml. It is a small fact that will save you a lot of hours of debugging, trust me I know.

Your wf lib directory should look like so

hdfs dfs -ls /user/centos/sqoop/lib/
Found 3 items
-rw-r--r--   3 centos hdfs      19228 2017-02-18 15:38 /user/centos/sqoop/lib/hive-site.xml
-rw-r--r--   3 centos hdfs     977873 2017-02-18 15:37 /user/centos/sqoop/lib/mysql-connector-java-5.1.37.jar
-rw-r--r--   3 centos hdfs       6737 2017-02-18 15:38 /user/centos/sqoop/lib/tez-site.xml

Finally, I want to modify my sqoop command as I no longer need --create-hcatalog-table command and want to replace $USER argument with my username, you can also use Oozie EL functions for string replacement.

import --connect jdbc:mysql://172.22.65.123/test --username centos --password-file /user/centos/.password --table imported --hcatalog-table imported --hcatalog-home /usr/hdp/current/hive-webhcat --map-column-hive value=STRING --split-by rowkey

That's what my command will look like in Oozie, notice missing "sqoop" command, it's inferred when you select sqoop action in WFM.

Edit the sqoop action on WFM canvas and enter the command in. We are working on refreshing the UI before WFM is released and your dialog box may look slightly different but fields should remain the same.

Let's tell WFM that we also expect tez-site.xml and hive-site.xml files

Finally, we need to tell Oozie that we will pull in HCatalog and Hive jars for this to work

At this point my wf is finished, let's inspect the XML.

When you submit the job, it should succeed and you can look at the results.

Again, this is more efficient and actually works on HDP 2.5+, I highly recommend checking out WFM and hcatalog options in Sqoop.

Cloudera Community

Community Articles

Apache Ambari Workflow Manager View for Apache Oozie: Part 8 (Sqoop with HCatalog)

Apache Ambari

Apache Oozie

Apache Sqoop