Community Articles

Find and share helpful community-sourced technical articles.
Labels (2)
avatar
Guru

in the documentation, you can find on the Sqoop part

You can use Sqoop to import data into HDFS or directly into Hive. However, Sqoop can only import data into Hive as a text file or as a SequenceFile. To use the ORC file format, you must use a two-phase approach: first use Sqoop to move the data into HDFS, and then use Hive to convert the data into the ORC file format[...]

However, we can use the Sqoop-HCatalog integration feature, which is a table abstraction.

Let's use the HDP 2.3.2 sandbox, starting with creating our Hive table, stored as ORC:

[root@sandbox ~]# hive 
hive> CREATE TABLE cds (id int, artist string, album string) STORED AS ORCFILE; 
hive> INSERT INTO TABLE cds values (1,"The Shins","Port of Morrow"); 
hive> select * from cds; 
1     The Shins     Port of Morrow

Now let's build our MySQL table

[root@sandbox ~]# mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 85
Server version: 5.1.73 Source distribution
Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> use test;
Database changed
mysql> CREATE TABLE albums (id INT, artist varchar(255), album varchar(255), primary key(id));
Query OK, 0 rows affected (0.03 sec)
mysql> INSERT INTO albums VALUES (2, 'Family Of The Year', 'Loma Vista'), (3, 'Michel Petrucciani', 'Trio in Tokyo');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0
mysql> GRANT ALL PRIVILEGES ON * . * TO 'newbie'@'localhost';
Query OK, 0 rows affected (0.01 sec)

mysql> exit;
Bye

Now the import part, with the added "--driver com.mysql.jdbc.Driver" to avoid error like

16/01/13 22:04:14 ERROR manager.SqlManager: Error reading from database: java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@4d574915 is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.

as described by @Mark Lochbihler in this HCC article

[root@sandbox ~]# sqoop import --connect jdbc:mysql://localhost/test --username newbie --table albums --hcatalog-table cds --driver com.mysql.jdbc.Driver
[...]
16/01/13 22:58:20 INFO mapreduce.ImportJobBase: Beginning import of albums
16/01/13 22:58:21 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM albums AS t WHERE 1=0
16/01/13 22:58:21 INFO hcat.SqoopHCatUtilities: Configuring HCatalog for import job
[...]
16/01/13 23:00:03 INFO mapreduce.ImportJobBase: Retrieved 2 records.

We retrieved our 2 records, let's check that in Hive :

hive> select * from cds;
OK
1  The Shins  Port of Morrow
2  Family Of The Year  Loma Vista
3  Michel Petrucciani  Trio in Tokyo
Time taken: 11.485 seconds, Fetched: 3 row(s)
15,760 Views
Comments
avatar
New Contributor

Instead of pre creating the hive table and then doing a sqoop import, this can be replaced by creating a hive table from the sqoop query itself. Below is the example:

sqoop import --connect jdbc:mysql://localhost/test --username newbie --table albums --m 1 --hcatalog-table cds --create-hcatalog-table --hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")' --driver com.mysql.jdbc.Driver

avatar

Hi @Laurent Edel - Nice Article! I do have a question - are there performance issues when using this method (HCatalog integration) to go from Sqoop directly to an ORC format?

In other words, is Option A:

Sqoop -> directly to ORC format table via HCatalog integration

equivalent performance (or better performance) when compared to Option B?

Sqoop -> text files/external Hive table -> Hive CTAS/Insert into ORC format table

Would like to ensure the best possible Sqoop performance.

Thanks!

avatar
New Contributor

Laurent Edel

I tried your query like same as this-

sqoop import --connect jdbc:mysql://localhost:3306/office --username root -P --table abc --hcatalog-table abc --driver com.mysql.jdbc.Driver --m 1

But, facing a error, even i can't understand this error.

2018-12-20 11:07:11,484 INFO [main] conf.HiveConf: Found configuration file file:/usr/lib/hive/conf/hive-site.xml
2018-12-20 11:07:11,992 WARN [main] conf.HiveConf: HiveConf of name hive.enforce.bucketing does not exist
2018-12-20 11:07:12,370 INFO [main] metastore.HiveMetaStore: 0: Opening raw store with implementation class:org.apache.hadoop.hive.metastore.ObjectStore
2018-12-20 11:07:12,538 INFO [main] metastore.ObjectStore: ObjectStore, initialize called
2018-12-20 11:07:12,989 INFO [main] DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
2018-12-20 11:07:12,991 INFO [main] DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
Thu Dec 20 11:07:14 IST 2018 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.

2018-12-20 11:07:19,083 INFO [main] metastore.MetaStoreDirectSql: Using direct SQL, underlying DB is MYSQL
2018-12-20 11:07:19,091 INFO [main] metastore.ObjectStore: Initialized ObjectStore
2018-12-20 11:07:20,921 INFO [main] metastore.HiveMetaStore: Added admin role in metastore
2018-12-20 11:07:20,954 INFO [main] metastore.HiveMetaStore: Added public role in metastore
2018-12-20 11:07:21,114 INFO [main] metastore.HiveMetaStore: No user is added in admin role, since config is empty
2018-12-20 11:07:22,091 INFO [main] metastore.HiveMetaStore: 0: get_databases: NonExistentDatabaseUsedForHealthCheck
2018-12-20 11:07:22,093 INFO [main] HiveMetaStore.audit: ugi=hduser ip=unknown-ip-addr cmd=get_databases: NonExistentDatabaseUsedForHealthCheck
2018-12-20 11:07:22,230 INFO [main] metastore.HiveMetaStore: 0: get_table : db=default tbl=abc
2018-12-20 11:07:22,231 INFO [main] HiveMetaStore.audit: ugi=hduser ip=unknown-ip-addr cmd=get_table : db=default tbl=abc
2018-12-20 11:07:22,301 ERROR [main] tool.ImportTool: ngHMSHandler.invoke(RetryingHMSHandler.java:99)
at com.sun.proxy.$Proxy8.get_table(Unknown Source)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1332)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:150)
at com.sun.proxy.$Proxy9.getTable(Unknown Source)
at org.apache.hive.hcatalog.common.HCatUtil.getTable(HCatUtil.java:180)
at org.apache.hive.hcatalog.mapreduce.InitializeInput.getInputJobInfo(InitializeInput.java:105)
at org.apache.hive.hcatalog.mapreduce.InitializeInput.setInput(InitializeInput.java:88)
at org.apache.hive.hcatalog.mapreduce.HCatInputFormat.setInput(HCatInputFormat.java:95)
... 14 more

So, what is the solution now ? I am confused.

Please help.