Member since: 12-10-2015
Posts: 43
Kudos Received: 39
Solutions: 3
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 857 | 02-04-2016 01:37 AM |
| | 3944 | 02-03-2016 02:03 AM |
| | 1926 | 01-26-2016 08:00 AM |
05-10-2016
01:13 AM
Unfortunately, I can't help you there. It's not something I've tried. You should create a new question for that to get better visibility and hopefully attract answers.
03-31-2016
05:48 AM
I'm able to do the incremental import when I don't specify --fields-terminated-by '\0'.
03-30-2016
08:20 AM
1 Kudo
I'm trying to import data from a MySQL table which contains free text, using the following command:
sqoop import \
--connect jdbc:mysql://mysqlhost:3306/db \
--table t1 \
--username u \
-P \
--incremental lastmodified \
--check-column updated \
--last-value \
--split-by id \
--fields-terminated-by '\0' \
--hive-drop-import-delims \
--target-dir /path/to/target \
--merge-key id
The import part seems to work correctly, but then the merge fails with:
Error: java.lang.RuntimeException: Can't parse input data: '12016-03-17 13:48:28.0nullfoo'
at t1.__loadFromFields(t1.java:343)
at t1.parse(t1.java:276)
at org.apache.sqoop.mapreduce.MergeTextMapper.map(MergeTextMapper.java:53)
at org.apache.sqoop.mapreduce.MergeTextMapper.map(MergeTextMapper.java:34)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.lang.NumberFormatException: For input string: "12016-03-17 13:48:28.0nullfoo"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.valueOf(Integer.java:766)
at t1.__loadFromFields(t1.java:325)
... 11 more
I also tried providing my own jar, generated with sqoop codegen using the same formatting options, but it made no difference.
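For completeness, here's a sketch of the same import using Hive's default ^A delimiter instead of the null byte - an untested variation on my part, since the thread only confirms that dropping --fields-terminated-by '\0' avoids the error (other arguments unchanged from above, with --last-value still omitted):
sqoop import \
  --connect jdbc:mysql://mysqlhost:3306/db \
  --table t1 --username u -P \
  --incremental lastmodified --check-column updated \
  --split-by id \
  --fields-terminated-by '\001' \
  --hive-drop-import-delims \
  --target-dir /path/to/target \
  --merge-key id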
03-18-2016
07:13 AM
2 Kudos
I did try it before I updated my post, and the OR condition doesn't work for me. Only the condition on the left side of the OR seems to be evaluated. For example, in the condition "where created >= yesterday or updated >= yesterday" - only "created >= yesterday" is evaluated. I tried reversing the order of the conditions giving me "where updated >= yesterday or created >= yesterday" - and in this case only "updated >= yesterday" is evaluated.
03-18-2016
01:30 AM
2 Kudos
Just went through the documentation though, and there's a note saying free-form queries can't have OR in the WHERE clause?
03-18-2016
12:57 AM
1 Kudo
I'd thought of that actually, though I try to avoid hard-coding SQL whenever I use Sqoop. To me, it feels a bit hacky. It'd be interesting to hear people's thoughts on that, however.
03-17-2016
08:43 AM
3 Kudos
I have to import daily data from SQL Server using Sqoop (1.4.6). Naturally, incremental imports would seem to be the way to go, but here's the thing - the source tables I have to import have separate 'created' and 'updated' columns. That is, newly-created records only have a timestamp value under the 'created' column, and records will only have a value for the 'updated' column if they are edited anytime after first creation. Looking through the documentation, it doesn't look like it's possible to have 2 check columns when doing incremental imports, so the only way I've managed to get this done is with 2 separate imports:
1. Incremental import with the primary key as the check column, for NEW records
2. Incremental import with the updated column as the check column, for UPDATED records
This works, but I wonder if there's a better way to do this. Any thoughts?
Update: I tried the suggestion by @Sourygna Luangsay to use free-form query imports; however, the documentation quite clearly states that the query can't contain OR conditions in the WHERE clause. Besides, since these will be incremental imports, the output directory would exist and I'd still need a two-step workflow: the first step for importing, then the second for merging.
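For reference, a rough sketch of the two-step approach above using saved Sqoop jobs (so the last-value bookkeeping is handled by the metastore); the connection string, job names, and the 'id' primary-key column are placeholders of mine, not details from the thread:
# Step 1: pick up NEW records, keyed on the primary key
sqoop job --create t1_new_rows -- import \
  --connect 'jdbc:sqlserver://sqlhost:1433;databaseName=db' \
  --table t1 --username u -P \
  --incremental append --check-column id \
  --target-dir /data/t1

# Step 2: pick up UPDATED records, keyed on the 'updated' timestamp and merged on the key
sqoop job --create t1_updated_rows -- import \
  --connect 'jdbc:sqlserver://sqlhost:1433;databaseName=db' \
  --table t1 --username u -P \
  --incremental lastmodified --check-column updated \
  --merge-key id \
  --target-dir /data/t1

# Run both daily
sqoop job --exec t1_new_rows
sqoop job --exec t1_updated_rows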
02-10-2016
01:47 AM
Thanks @Predrag Minovic! I guess I'll have to do some trial and error to see which column, other than the binary-type primary keys, we can use as a splitting column. The fact that all of the primary keys in this database are binary-type has given me quite a number of head-scratching moments, particularly with the incremental imports.
02-10-2016
01:44 AM
1 Kudo
@Artem Ervits our DBAs probably know how to perform traces on the DB Server. I think they're just used to asking our dev team for the actual SQL first, since most of the applications using this DB server are written as stored procedures.
02-10-2016
01:43 AM
1 Kudo
Thanks for the link @Neeraj Sabharwal. It's very helpful. I guess I'll have to try to play around with the Sqoop commands to find a splitting column that will work best for us.
02-10-2016
01:23 AM
@Geoffrey Shelton Okot thanks for the link, but it doesn't really help in this instance since the DBAs are only concerned with the source (SQL Server) system. The DBAs can probably generate the stats and execution plan on the DB Server. I think they're just used to asking our dev team for the actual SQL first, since most of the applications using this DB server are written as stored procedures.
02-10-2016
01:21 AM
Thanks for this @Scott Shaw. By "running queries" do you mean the SQL you provided will only display results for queries that are currently running? In any case, the DBAs can probably generate the stats and execution plan on the DB Server. I think they're just used to asking our dev team for the actual SQL first, since most of the applications using this DB server are written as stored procedures.
02-10-2016
01:17 AM
Hi @Predrag Minovic, if in the Sqoop imports I'm running I always specify 1 mapper (the primary keys in the source table are binary type; using these columns to split the job into tasks throws an exception), would it be correct to assume that the generated SQL statement becomes a simple SELECT * with no WHERE condition?
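For illustration, a sketch of the two options discussed here; the connection string and column names are placeholders, and my understanding (not confirmed in this thread) is that with a single mapper Sqoop skips the bounding-values query and the per-split WHERE clauses:
# Option 1: single mapper - no split column required
sqoop import \
  --connect 'jdbc:sqlserver://sqlhost:1433;databaseName=db' \
  --table t1 --username u -P \
  -m 1 \
  --target-dir /data/t1

# Option 2: keep parallelism by splitting on a non-binary column
sqoop import \
  --connect 'jdbc:sqlserver://sqlhost:1433;databaseName=db' \
  --table t1 --username u -P \
  --split-by updated \
  --target-dir /data/t1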
02-09-2016
06:17 AM
2 Kudos
Hi @Shigeru Takehara, the jdbc driver was in the Oozie sharelib. Also, the import was working fine - I see the data within HDFS. It's just the loading into Hive that fails. That's why I just opted to break the import into two separate steps.
02-09-2016
06:04 AM
1 Kudo
In Sqoop 1.4.2, is there a way to output the exact SQL statements executed? The standard output always displays something like this:
INFO org.apache.sqoop.manager.SqlManager - Executing SQL statement: SELECT t.* FROM [tableName] AS t WHERE 1=0
but this isn't very helpful for our purposes. Our database admins asked me if I could provide the SQL statements so they can adjust database indices if necessary. They probably know how to perform analysis on the DB Server themselves (we're using SQL Server for this; I can't really say on their behalf), but since most of the applications using this DB server are written as stored procedures, I think they're used to just asking our dev team for the actual SQL instead.
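One thing that may be worth trying (a suggestion of mine, not something confirmed in this thread) is Sqoop's --verbose flag, which raises the log level and should surface more of the generated SQL, such as the bounding-values and split queries, in the console and task logs. A sketch with placeholder connection details:
sqoop import --verbose \
  --connect 'jdbc:sqlserver://sqlhost:1433;databaseName=db' \
  --table t1 --username u -P \
  --target-dir /data/t1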
02-04-2016
02:24 AM
I find that odd, and such a shame. A metastore service seems like something that would be an important requirement for running incremental imports as jobs called from a coordinator app. I did see this blog describing how to manually set up MySQL to work as the metastore, though I never tried it myself. I wonder if this would be preferable to running `sqoop metastore &`?
02-04-2016
01:49 AM
Thanks for all of your help so far, Artem. I do have a question regarding the metastore though - hopefully you could shed some light on this for me. So far, I've only been able to start the metastore via the command line, and it runs in the foreground. This is of course unacceptable in a fully automated process. I'm assuming there's a way to run this as a service instead, and for that I would need Sqoop server?
02-04-2016
01:37 AM
2 Kudos
After some more testing, I finally resolved this issue by explicitly passing in the metastore URL in the workflow.xml, like so:
<arg>job</arg>
<arg>--meta-connect</arg>
<arg>jdbc:hsqldb:hsql://<myhost>:12345/sqoop</arg>
<arg>--exec</arg>
<arg>myjob</arg>
I'm not exactly sure, but I think that Oozie tries to connect to a local metastore because it doesn't have a copy of `sqoop-site.xml` and so it doesn't know the metastore URL (even though I'm running on a single-node configuration)?
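For reference, the equivalent command line for testing the same fix outside Oozie (the host placeholder and job name are taken from the arguments above):
sqoop job --meta-connect jdbc:hsqldb:hsql://<myhost>:12345/sqoop --exec myjob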
02-04-2016
12:35 AM
@Artem Ervits our HDP is running on a single-node configuration, and I am able to list the sqoop jobs from this node.
02-04-2016
12:33 AM
2 Kudos
@Shigeru Takehara I did try adding hive-site.xml, placing it in the root of the workflow directory on HDFS, but I was still running into the error. The error message in the logs is this:
ERROR [main] tool.ImportTool (ImportTool.java:run(613)) - Encountered IOException running import job: java.io.IOException: Hive exited with status 1
I eventually had to go with my workaround because I couldn't get hive import to work and I had deadlines to meet. I'd still like to try to get hive import to work, though.
02-03-2016
04:01 AM
1 Kudo
On HDP 2.3.4, using Oozie 4.2.0 and Sqoop 1.4.2, I'm trying to create a coordinator app that will execute sqoop jobs on a daily basis. I need the sqoop action to execute jobs because these are incremental imports. I've configured `sqoop-site.xml` and started the `sqoop-metastore`, and I'm able to create, list, and delete jobs via the command line, but the workflow encounters this error:
Cannot restore job: streamsummary_incremental
stderr
Sqoop command arguments :
job
--exec
streamsummary_incremental
Fetching child yarn jobs
tag id : oozie-26fcd4dc0afd8f53316fc929ac38eae2
2016-02-03 09:46:47,193 INFO [main] client.RMProxy (RMProxy.java:createRMProxy(98)) - Connecting to ResourceManager at <myhost>/<myIP>:8032
Child yarn jobs are found -
=================================================================
>>> Invoking Sqoop command line now >>>
2241 [main] WARN org.apache.sqoop.tool.SqoopTool - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
2016-02-03 09:46:47,404 WARN [main] tool.SqoopTool (SqoopTool.java:loadPluginsFromConfDir(177)) - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
2263 [main] INFO org.apache.sqoop.Sqoop - Running Sqoop version: 1.4.6.2.3.4.0-3485
2016-02-03 09:46:47,426 INFO [main] sqoop.Sqoop (Sqoop.java:<init>(97)) - Running Sqoop version: 1.4.6.2.3.4.0-3485
2552 [main] ERROR org.apache.sqoop.metastore.hsqldb.HsqldbJobStorage - Cannot restore job: streamsummary_incremental
2016-02-03 09:46:47,715 ERROR [main] hsqldb.HsqldbJobStorage (HsqldbJobStorage.java:read(254)) - Cannot restore job: streamsummary_incremental
2552 [main] ERROR org.apache.sqoop.metastore.hsqldb.HsqldbJobStorage - (No such job)
2016-02-03 09:46:47,715 ERROR [main] hsqldb.HsqldbJobStorage (HsqldbJobStorage.java:read(255)) - (No such job)
2553 [main] ERROR org.apache.sqoop.tool.JobTool - I/O error performing job operation: java.io.IOException: Cannot restore missing job streamsummary_incremental
at org.apache.sqoop.metastore.hsqldb.HsqldbJobStorage.read(HsqldbJobStorage.java:256)
at org.apache.sqoop.tool.JobTool.execJob(JobTool.java:198)
at org.apache.sqoop.tool.JobTool.run(JobTool.java:283)
at org.apache.sqoop.Sqoop.run(Sqoop.java:148)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:184)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:226)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:235)
at org.apache.sqoop.Sqoop.main(Sqoop.java:244)
at org.apache.oozie.action.hadoop.SqoopMain.runSqoopJob(SqoopMain.java:197)
at org.apache.oozie.action.hadoop.SqoopMain.run(SqoopMain.java:177)
at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:47)
at org.apache.oozie.action.hadoop.SqoopMain.main(SqoopMain.java:46)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:241)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
sqoop-site.xml:
<property>
<name>sqoop.metastore.client.enable.autoconnect</name>
<value>false</value>
<description>If true, Sqoop will connect to a local metastore for job management when no other metastore arguments are provided.
</description>
</property>
<property>
<name>sqoop.metastore.client.autoconnect.url</name>
<value>jdbc:hsqldb:hsql://<myhost>:12345</value>
<description>The connect string to use when connecting to a job-management metastore. If unspecified, uses ~/.sqoop/. You can specify a different path here.
</description>
</property>
<property>
<name>sqoop.metastore.client.autoconnect.username</name>
<value>SA</value>
<description>The username to bind to the metastore.
</description>
</property>
<property>
<name>sqoop.metastore.client.autoconnect.password</name>
<value></value>
<description>The password to bind to the metastore.
</description>
</property>
<property>
<name>sqoop.metastore.server.location</name>
<value>/tmp/sqoop-metastore/shared.db</value>
<description>Path to the shared metastore database files. If this is not set, it will be placed in ~/.sqoop/.
</description>
</property>
<property>
<name>sqoop.metastore.server.port</name>
<value>12345</value>
<description>Port that this metastore should listen on.
</description>
</property>
workflow.xml:
<action name="sqoop-import-job">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${outputDir}"/>
</prepare>
<arg>job</arg>
<arg>--exec</arg>
<arg>${jobId}</arg>
</sqoop>
<ok to="hive-load"/>
<error to="kill-sqoop"/>
</action>
Additional info: We're only running a single-node cluster. Only the Sqoop client is installed. I'm thinking maybe Oozie isn't able to connect to the metastore because we don't have Sqoop server? Could anyone confirm this? If not that, could I have missed anything else? Thanks!
02-03-2016
02:03 AM
2 Kudos
At the same time that I was getting this issue, I was also dealing with a network problem when issuing Sqoop commands via the CLI. Although the network problem was resolved and I stopped seeing this IOException, I kept running into new errors that I never managed to resolve. In the end, I decided to work around it by breaking the Hive import into a two-step workflow:
1. A sqoop action to import into HDFS
2. A hive action to load the data from HDFS into Hive
UPDATE: It turns out that the "new errors" were because the "yarn" user doesn't belong to the "hdfs" group and so couldn't perform the hive-import part. Adding the user to the group now allows me to use hive-import in my workflows instead of the two-step workflow I used before.
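For reference, a typical way to apply that group change on the node (run as root); this is my assumption of how it's usually done, and the exact user and group names may vary by install:
usermod -a -G hdfs yarn
id yarn   # verify that 'hdfs' now shows up in the yarn user's groups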
02-03-2016
01:56 AM
2 Kudos
@Artem Ervits I never got sqoop's hive-import to work in an Oozie workflow, so I came up with a workaround instead. Will provide my workaround as an answer. Thanks.
02-03-2016
12:05 AM
@Artem Ervits yes, this has been resolved, but I didn't accept my own answer because we never found out what exactly was wrong. The issue merely stopped occurring after we reinstalled HDP on a server with no virtualization - I detail this in my answer. In any case, I've accepted my own answer as a reference, for the benefit of others. Thanks!
01-28-2016
02:39 AM
2 Kudos
@David Tam no, it was in "Running" state before getting killed. The yarn.resourcemanager.address setting in our yarn configs is set to port 8050, so I'm not really sure why there was an attempt to connect to 8032. I tried yarn-client mode, but I still get the same error.
01-26-2016
08:00 AM
I never heard back from our network team with regards to firewall logs, but our NameNode's OS got corrupted and had to be reformatted, and HDP reinstalled. For some reason we're not encountering this error any longer. One difference between the original cluster and the new installation is that we had 4 nodes (1 name node and 3 data nodes) which were virtualized in a single server. Now, we're running a single-node cluster (HDP 2.3.4) with no virtualization on the server.
01-20-2016
12:54 AM
@Scott Shaw I removed mapred.task.timeout like you suggested, but I'm still getting the same results. As for the JDBC driver, it's the same one that we downloaded and placed in the Sqoop lib. I'm able to connect to the SQL Server with no problems via SQL Server Management Studio on my workstation, which is on the same corporate network as the SQL Server, whereas the Hadoop cluster is on a separate network. I actually had to explicitly request that our network guys allow traffic from the cluster into the SQL Server, so I'm starting to suspect there's something in the network that's causing this issue.
01-19-2016
01:17 AM
@Neeraj Sabharwal thanks for your response. I checked out the link you shared, but it doesn't seem to be the same problem. I'm able to connect to the SQL Server; it's just that I always see that error, particularly in between map tasks. In any case, I'm coordinating with our network guys as well. I can't be sure, but I do believe the SQL Server resides in a different corporate network from the Hadoop cluster. I'll update my post once I get further details.
01-19-2016
01:10 AM
@Scott Shaw I pinged the SQL Server and saw no dropped packets. I can also telnet to port 1433, and nmap says the port is open. However, trying it the other way around, I'm unable to ping the Hadoop cluster from the SQL Server. I'm not really sure if this is significant.