Created 02-03-2016 04:01 AM
On HDP 2.3.4, using Oozie 4.2.0 and Sqoop 1.4.2, I'm trying to create a coordinator app that will execute sqoop jobs on a daily basis. I need the sqoop action to execute jobs because these are incremental imports.
I've configured `sqoop-site.xml` and started the `sqoop-metastore` and I'm able to create, list, and delete jobs via the command line but the workflow encounters the error: Cannot restore job: streamsummary_incremental
stderr
Sqoop command arguments : job --exec streamsummary_incremental Fetching child yarn jobs tag id : oozie-26fcd4dc0afd8f53316fc929ac38eae2 2016-02-03 09:46:47,193 INFO [main] client.RMProxy (RMProxy.java:createRMProxy(98)) - Connecting to ResourceManager at <myhost>/<myIP>:8032 Child yarn jobs are found - ================================================================= >>> Invoking Sqoop command line now >>> 2241 [main] WARN org.apache.sqoop.tool.SqoopTool - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration. 2016-02-03 09:46:47,404 WARN [main] tool.SqoopTool (SqoopTool.java:loadPluginsFromConfDir(177)) - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration. 2263 [main] INFO org.apache.sqoop.Sqoop - Running Sqoop version: 1.4.6.2.3.4.0-3485 2016-02-03 09:46:47,426 INFO [main] sqoop.Sqoop (Sqoop.java:<init>(97)) - Running Sqoop version: 1.4.6.2.3.4.0-3485 2552 [main] ERROR org.apache.sqoop.metastore.hsqldb.HsqldbJobStorage - Cannot restore job: streamsummary_incremental 2016-02-03 09:46:47,715 ERROR [main] hsqldb.HsqldbJobStorage (HsqldbJobStorage.java:read(254)) - Cannot restore job: streamsummary_incremental 2552 [main] ERROR org.apache.sqoop.metastore.hsqldb.HsqldbJobStorage - (No such job) 2016-02-03 09:46:47,715 ERROR [main] hsqldb.HsqldbJobStorage (HsqldbJobStorage.java:read(255)) - (No such job) 2553 [main] ERROR org.apache.sqoop.tool.JobTool - I/O error performing job operation: java.io.IOException: Cannot restore missing job streamsummary_incremental at org.apache.sqoop.metastore.hsqldb.HsqldbJobStorage.read(HsqldbJobStorage.java:256) at org.apache.sqoop.tool.JobTool.execJob(JobTool.java:198) at org.apache.sqoop.tool.JobTool.run(JobTool.java:283) at org.apache.sqoop.Sqoop.run(Sqoop.java:148) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:184) at org.apache.sqoop.Sqoop.runTool(Sqoop.java:226) at org.apache.sqoop.Sqoop.runTool(Sqoop.java:235) at org.apache.sqoop.Sqoop.main(Sqoop.java:244) at org.apache.oozie.action.hadoop.SqoopMain.runSqoopJob(SqoopMain.java:197) at org.apache.oozie.action.hadoop.SqoopMain.run(SqoopMain.java:177) at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:47) at org.apache.oozie.action.hadoop.SqoopMain.main(SqoopMain.java:46) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:241) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
sqoop-site.xml
<property> <name>sqoop.metastore.client.enable.autoconnect</name> <value>false</value> <description>If true, Sqoop will connect to a local metastore for job management when no other metastore arguments are provided. </description> </property> <property> <name>sqoop.metastore.client.autoconnect.url</name> <value>jdbc:hsqldb:hsql://<myhost>:12345</value> <description>The connect string to use when connecting to a job-management metastore. If unspecified, uses ~/.sqoop/. You can specify a different path here. </description> </property> <property> <name>sqoop.metastore.client.autoconnect.username</name> <value>SA</value> <description>The username to bind to the metastore. </description> </property> <property> <name>sqoop.metastore.client.autoconnect.password</name> <value></value> <description>The password to bind to the metastore. </description> </property> <property> <name>sqoop.metastore.server.location</name> <value>/tmp/sqoop-metastore/shared.db</value> <description>Path to the shared metastore database files. If this is not set, it will be placed in ~/.sqoop/. </description> </property> <property> <name>sqoop.metastore.server.port</name> <value>12345</value> <description>Port that this metastore should listen on. </description> </property>
workflow.xml
<action name="sqoop-import-job"> <sqoop xmlns="uri:oozie:sqoop-action:0.2"> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <prepare> <delete path="${outputDir}"/> </prepare> <arg>job</arg> <arg>--exec</arg> <arg>${jobId}</arg> </sqoop> <ok to="hive-load"/> <error to="kill-sqoop"/> </action>
Additional info:
I'm thinking maybe Oozie isn't able to connect to the metastore because we don't have sqoop server? Could anyone confirm this? If not that, could I have missed anything else?
Thanks!
Created 02-04-2016 01:37 AM
After some more testing, I finally resolved this issue by explicitly passing in the metastore URL in the workflow.xml like so:
<arg>job</arg> <arg>--meta-connect</arg> <arg>jdbc:hsqldb:hsql://<myhost>:12345/sqoop</arg> <arg>--exec</arg> <arg>myjob</arg>
I'm not exactly sure, but I think that Oozie tries to connect to a local metastore because it doesn't have a copy of `sqoop-site.xml` and so it doesn't know the metastore url (even though I'm running on a single-node configuration)?
Created 02-03-2016 03:01 PM
please issue sqoop job --list from this node and from the node where oozie server runs.
Created 02-04-2016 12:35 AM
@Artem Ervits our HDP is running on a single-node configuration, and I am able to list the sqoop jobs from this node
Created 02-04-2016 01:37 AM
After some more testing, I finally resolved this issue by explicitly passing in the metastore URL in the workflow.xml like so:
<arg>job</arg> <arg>--meta-connect</arg> <arg>jdbc:hsqldb:hsql://<myhost>:12345/sqoop</arg> <arg>--exec</arg> <arg>myjob</arg>
I'm not exactly sure, but I think that Oozie tries to connect to a local metastore because it doesn't have a copy of `sqoop-site.xml` and so it doesn't know the metastore url (even though I'm running on a single-node configuration)?
Created 02-04-2016 01:44 AM
excellent, you beat me by 5min, I was about to point you to the doc for that. I should've looked at your workflow more carefully. @Luis Antonio Torres
Created 02-04-2016 01:49 AM
Thanks for all of your help so far, Artem. I do have a question regarding the metastore though - hopefully you could shed some light on this for me. So far, I've only been able to start the metastore via command-line and it runs in foreground. This is of course unacceptable in a fully automated process. I'm assuming there's a way to run this as a service instead, and for that I would need sqoop server?
Created 02-04-2016 01:55 AM
unfortunately I don't know of any built-in solution to run it as a service at least in 1.x line. What I have been doing was putting sqoop metastore service in background. YOu'd do that by issuing your command "sqoop metastore &" you can also run screen and run sqoop metastore in screen session and detach from it. Of course you have no idea if it goes down until jobs start failing. There was a promise to have metastore running in a relational database but I haven't heard anything yet. @Luis Antonio Torres
Created 02-04-2016 02:24 AM
I find that odd and such a shame. It seems like a metastore service is something that would be an important requirement in order to run incremental imports as jobs and then called from a coordinator app. I did see this blog describing how to manually set-up mysql to work as the metastore, though I never tried it myself. I wonder if this would be preferable to running `sqoop metastore &`?
Created 02-04-2016 02:34 AM
try it out and blog about it here. I'm sure people will find it valuable. The other thing I just thought of is you can write a script that will constantly ping the metastore and send email if it's down. I've done that to check for firewall status. I feel your pain and I was in the same boat. You can look at Attunity Replicate or Golden Gate for CDC solution and not rely on sqoop incremental jobs. @Luis Antonio Torres
Created 05-09-2016 02:21 PM
@Luis Antonio Torres: @Artem ErvitsThat sounds like what would fix it for me. But I do have one more question, how do I pass username and password through <arg> for the metaconnect? I do have a password set and it does not allow to pass --username and --password like we usually do. Alternatively, I tried connectURL/sqoop?user=sqoop&password=sqoop, which actually works through command line but not when passed through oozie.