Support Questions

Find answers, ask questions, and share your expertise

Oozie Sqoop job - cannot restore job

avatar
Expert Contributor

On HDP 2.3.4, using Oozie 4.2.0 and Sqoop 1.4.2, I'm trying to create a coordinator app that will execute sqoop jobs on a daily basis. I need the sqoop action to execute jobs because these are incremental imports.

I've configured `sqoop-site.xml` and started the `sqoop-metastore` and I'm able to create, list, and delete jobs via the command line but the workflow encounters the error: Cannot restore job: streamsummary_incremental

stderr

Sqoop command arguments :
             job
             --exec
             streamsummary_incremental
Fetching child yarn jobs
tag id : oozie-26fcd4dc0afd8f53316fc929ac38eae2
2016-02-03 09:46:47,193 INFO  [main] client.RMProxy (RMProxy.java:createRMProxy(98)) - Connecting to ResourceManager at <myhost>/<myIP>:8032
Child yarn jobs are found - 
=================================================================

>>> Invoking Sqoop command line now >>>

2241 [main] WARN  org.apache.sqoop.tool.SqoopTool  - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
2016-02-03 09:46:47,404 WARN  [main] tool.SqoopTool (SqoopTool.java:loadPluginsFromConfDir(177)) - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
2263 [main] INFO  org.apache.sqoop.Sqoop  - Running Sqoop version: 1.4.6.2.3.4.0-3485
2016-02-03 09:46:47,426 INFO  [main] sqoop.Sqoop (Sqoop.java:<init>(97)) - Running Sqoop version: 1.4.6.2.3.4.0-3485
2552 [main] ERROR org.apache.sqoop.metastore.hsqldb.HsqldbJobStorage  - Cannot restore job: streamsummary_incremental
2016-02-03 09:46:47,715 ERROR [main] hsqldb.HsqldbJobStorage (HsqldbJobStorage.java:read(254)) - Cannot restore job: streamsummary_incremental
2552 [main] ERROR org.apache.sqoop.metastore.hsqldb.HsqldbJobStorage  - (No such job)
2016-02-03 09:46:47,715 ERROR [main] hsqldb.HsqldbJobStorage (HsqldbJobStorage.java:read(255)) - (No such job)
2553 [main] ERROR org.apache.sqoop.tool.JobTool  - I/O error performing job operation: java.io.IOException: Cannot restore missing job streamsummary_incremental
	at org.apache.sqoop.metastore.hsqldb.HsqldbJobStorage.read(HsqldbJobStorage.java:256)
	at org.apache.sqoop.tool.JobTool.execJob(JobTool.java:198)
	at org.apache.sqoop.tool.JobTool.run(JobTool.java:283)
	at org.apache.sqoop.Sqoop.run(Sqoop.java:148)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
	at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:184)
	at org.apache.sqoop.Sqoop.runTool(Sqoop.java:226)
	at org.apache.sqoop.Sqoop.runTool(Sqoop.java:235)
	at org.apache.sqoop.Sqoop.main(Sqoop.java:244)
	at org.apache.oozie.action.hadoop.SqoopMain.runSqoopJob(SqoopMain.java:197)
	at org.apache.oozie.action.hadoop.SqoopMain.run(SqoopMain.java:177)
	at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:47)
	at org.apache.oozie.action.hadoop.SqoopMain.main(SqoopMain.java:46)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:241)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)

sqoop-site.xml

  <property>
    <name>sqoop.metastore.client.enable.autoconnect</name>
    <value>false</value>
    <description>If true, Sqoop will connect to a local metastore for job management when no other metastore arguments are provided.
    </description>
  </property>
  <property>
    <name>sqoop.metastore.client.autoconnect.url</name>
    <value>jdbc:hsqldb:hsql://<myhost>:12345</value>
    <description>The connect string to use when connecting to a job-management metastore. If unspecified, uses ~/.sqoop/. You can specify a different path here.
    </description>
  </property>
  <property>
    <name>sqoop.metastore.client.autoconnect.username</name>
    <value>SA</value>
    <description>The username to bind to the metastore.
    </description>
  </property>
  <property>
    <name>sqoop.metastore.client.autoconnect.password</name>
    <value></value>
    <description>The password to bind to the metastore.
    </description>
  </property>
  <property>
    <name>sqoop.metastore.server.location</name>
    <value>/tmp/sqoop-metastore/shared.db</value>
    <description>Path to the shared metastore database files. If this is not set, it will be placed in ~/.sqoop/.
    </description>
  </property>   
  <property>
    <name>sqoop.metastore.server.port</name>
    <value>12345</value>
    <description>Port that this metastore should listen on.
    </description>
  </property>

workflow.xml

  <action name="sqoop-import-job">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <prepare>
        <delete path="${outputDir}"/>
      </prepare>
      <arg>job</arg>
      <arg>--exec</arg>
      <arg>${jobId}</arg>
    </sqoop>
    <ok to="hive-load"/>
    <error to="kill-sqoop"/>
  </action>

Additional info:

  • We're only running a single-node cluster.
  • Only Sqoop Client is installed.

I'm thinking maybe Oozie isn't able to connect to the metastore because we don't have sqoop server? Could anyone confirm this? If not that, could I have missed anything else?

Thanks!

1 ACCEPTED SOLUTION

avatar
Expert Contributor

After some more testing, I finally resolved this issue by explicitly passing in the metastore URL in the workflow.xml like so:

<arg>job</arg>
<arg>--meta-connect</arg>
<arg>jdbc:hsqldb:hsql://<myhost>:12345/sqoop</arg>
<arg>--exec</arg>
<arg>myjob</arg>

I'm not exactly sure, but I think that Oozie tries to connect to a local metastore because it doesn't have a copy of `sqoop-site.xml` and so it doesn't know the metastore url (even though I'm running on a single-node configuration)?

View solution in original post

10 REPLIES 10

avatar
Master Mentor
@Luis Antonio Torres

please issue sqoop job --list from this node and from the node where oozie server runs.

avatar
Expert Contributor

@Artem Ervits our HDP is running on a single-node configuration, and I am able to list the sqoop jobs from this node

avatar
Expert Contributor

After some more testing, I finally resolved this issue by explicitly passing in the metastore URL in the workflow.xml like so:

<arg>job</arg>
<arg>--meta-connect</arg>
<arg>jdbc:hsqldb:hsql://<myhost>:12345/sqoop</arg>
<arg>--exec</arg>
<arg>myjob</arg>

I'm not exactly sure, but I think that Oozie tries to connect to a local metastore because it doesn't have a copy of `sqoop-site.xml` and so it doesn't know the metastore url (even though I'm running on a single-node configuration)?

avatar
Master Mentor

excellent, you beat me by 5min, I was about to point you to the doc for that. I should've looked at your workflow more carefully. @Luis Antonio Torres

avatar
Expert Contributor

Thanks for all of your help so far, Artem. I do have a question regarding the metastore though - hopefully you could shed some light on this for me. So far, I've only been able to start the metastore via command-line and it runs in foreground. This is of course unacceptable in a fully automated process. I'm assuming there's a way to run this as a service instead, and for that I would need sqoop server?

avatar
Master Mentor

unfortunately I don't know of any built-in solution to run it as a service at least in 1.x line. What I have been doing was putting sqoop metastore service in background. YOu'd do that by issuing your command "sqoop metastore &" you can also run screen and run sqoop metastore in screen session and detach from it. Of course you have no idea if it goes down until jobs start failing. There was a promise to have metastore running in a relational database but I haven't heard anything yet. @Luis Antonio Torres

avatar
Expert Contributor

I find that odd and such a shame. It seems like a metastore service is something that would be an important requirement in order to run incremental imports as jobs and then called from a coordinator app. I did see this blog describing how to manually set-up mysql to work as the metastore, though I never tried it myself. I wonder if this would be preferable to running `sqoop metastore &`?

avatar
Master Mentor

try it out and blog about it here. I'm sure people will find it valuable. The other thing I just thought of is you can write a script that will constantly ping the metastore and send email if it's down. I've done that to check for firewall status. I feel your pain and I was in the same boat. You can look at Attunity Replicate or Golden Gate for CDC solution and not rely on sqoop incremental jobs. @Luis Antonio Torres

avatar
Expert Contributor

@Luis Antonio Torres: @Artem ErvitsThat sounds like what would fix it for me. But I do have one more question, how do I pass username and password through <arg> for the metaconnect? I do have a password set and it does not allow to pass --username and --password like we usually do. Alternatively, I tried connectURL/sqoop?user=sqoop&password=sqoop, which actually works through command line but not when passed through oozie.