Support Questions

sim6 · ‎07-02-2016

I have asked this question over and over and am starting to believe that I am missing out on something really basic here. Not a lot of people seem to have come across this and I am really stuck on this one:

I get this error when I specify merge key argument with incremental import lastmodified in sqoop. If I run the job through command line, it works alright but not when I submit it to oozie. I submit my jobs through oozie. Not sure if oozie is the problem or hue, but sqoop job is not since it really works fine when executed through command line including the merge step.

My sqoop job looks like this:

sqoop job --meta-connect jdbc:hsqldb:hsql://FQDN:16000/sqoop 
--create test_table -- import --driver com.mysql.jdbc.Driver --connect
jdbc:mysql://IP/DB?zeroDateTimeBehavior=convertToNull --username 
USER_NAME --password 'PASSWORD' --table test_table --merge-key id --
split-by id --target-dir LOCATION --incremental lastmodified 
--last-value 0 --check-column updated_at

The first import works alright .Starting second import I get:

I created a small test table to test with an int, datetime and varchar , without any NULL or invalid chars in the data and yet I faced the same issue:

# id, updated_at, name
'1', '2016-07-02 17:16:53', 'l'
'3', '2016-06-29 14:12:53', 'f'

There were only 2 rows in the data and yet I got this:

Error: java.lang.IllegalArgumentException
at java.nio.ByteBuffer.allocate(ByteBuffer.java:330)
at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:51)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1848)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1508)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:723)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

Error: java.io.IOException: Illegal partition for 3 (-2)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1083)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:715)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at org.apache.sqoop.mapreduce.MergeMapperBase.processRecord(MergeMapperBase.java:82)
at org.apache.sqoop.mapreduce.MergeTextMapper.map(MergeTextMapper.java:58)
at org.apache.sqoop.mapreduce.MergeTextMapper.map(MergeTextMapper.java:34)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

Error: java.lang.IllegalArgumentException
at java.nio.ByteBuffer.allocate(ByteBuffer.java:330)
at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:51)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1848)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1508)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:723)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

Error: java.io.IOException: Illegal partition for 1 (-2)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1083)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:715)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at org.apache.sqoop.mapreduce.MergeMapperBase.processRecord(MergeMapperBase.java:82)
at org.apache.sqoop.mapreduce.MergeTextMapper.map(MergeTextMapper.java:58)
at org.apache.sqoop.mapreduce.MergeTextMapper.map(MergeTextMapper.java:34)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

I get this error only in OOZIE and submit the job through HUE and this works just fine including the Merge mapreduce when I run the sqoop job through command line

Taken from oozie launcher, This is what my mapreduce job logs look like:

>>> Invoking Sqoop command line now >>>
5373 [uber-SubtaskRunner] WARN org.apache.sqoop.tool.SqoopTool - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
5407 [uber-SubtaskRunner] INFO org.apache.sqoop.Sqoop - Running Sqoop version: 1.4.6-cdh5.7.0
5702 [uber-SubtaskRunner] WARN org.apache.sqoop.tool.BaseSqoopTool - Setting your password on the command-line is insecure. Consider using -P instead.
5715 [uber-SubtaskRunner] WARN org.apache.sqoop.ConnFactory - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
5740 [uber-SubtaskRunner] WARN org.apache.sqoop.ConnFactory - Parameter --driver is set to an explicit driver however appropriate connection manager is not being set (via --connection-manager). Sqoop is going to fall back to org.apache.sqoop.manager.GenericJdbcManager. Please specify explicitly which connection manager should be used next time.
5754 [uber-SubtaskRunner] INFO org.apache.sqoop.manager.SqlManager - Using default fetchSize of 1000
5754 [uber-SubtaskRunner] INFO org.apache.sqoop.tool.CodeGenTool - Beginning code generation
6091 [uber-SubtaskRunner] INFO org.apache.sqoop.manager.SqlManager - Executing SQL statement: SELECT t.* FROM test_table AS t WHERE 1=0
6098 [uber-SubtaskRunner] INFO org.apache.sqoop.manager.SqlManager - Executing SQL statement: SELECT t.* FROM test_table AS t WHERE 1=0
6118 [uber-SubtaskRunner] INFO org.apache.sqoop.orm.CompilationManager - HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop-mapreduce
8173 [uber-SubtaskRunner] INFO org.apache.sqoop.orm.CompilationManager - Writing jar file: /tmp/sqoop-yarn/compile/454902ac78d49b783a1f51b7bfe0a2be/test_table.jar
8185 [uber-SubtaskRunner] INFO org.apache.sqoop.manager.SqlManager - Executing SQL statement: SELECT t.* FROM test_table AS t WHERE 1=0
8192 [uber-SubtaskRunner] INFO org.apache.sqoop.tool.ImportTool - Incremental import based on column updated_at
8192 [uber-SubtaskRunner] INFO org.apache.sqoop.tool.ImportTool - Lower bound value: '2016-07-02 17:13:24.0'
8192 [uber-SubtaskRunner] INFO org.apache.sqoop.tool.ImportTool - Upper bound value: '2016-07-02 17:16:56.0'
8194 [uber-SubtaskRunner] INFO org.apache.sqoop.mapreduce.ImportJobBase - Beginning import of test_table
8214 [uber-SubtaskRunner] INFO org.apache.sqoop.manager.SqlManager - Executing SQL statement: SELECT t.* FROM test_table AS t WHERE 1=0
8230 [uber-SubtaskRunner] WARN org.apache.sqoop.mapreduce.JobBase - SQOOP_HOME is unset. May not be able to find all job dependencies.
8716 [uber-SubtaskRunner] INFO org.apache.sqoop.mapreduce.db.DBInputFormat - Using read commited transaction isolation
8717 [uber-SubtaskRunner] INFO org.apache.sqoop.mapreduce.db.DataDrivenDBInputFormat - BoundingValsQuery: SELECT MIN(id), MAX(id) FROM test_table WHERE ( updated_at >= '2016-07-02 17:13:24.0' AND updated_at < '2016-07-02 17:16:56.0' )
8721 [uber-SubtaskRunner] INFO org.apache.sqoop.mapreduce.db.IntegerSplitter - Split size: 0; Num splits: 4 from: 1 to: 1
25461 [uber-SubtaskRunner] INFO org.apache.sqoop.mapreduce.ImportJobBase - Transferred 26 bytes in 17.2192 seconds (1.5099 bytes/sec)
25471 [uber-SubtaskRunner] INFO org.apache.sqoop.mapreduce.ImportJobBase - Retrieved 1 records.
25536 [uber-SubtaskRunner] WARN org.apache.sqoop.mapreduce.ExportJobBase - IOException checking input file header: java.io.EOFException
25550 [uber-SubtaskRunner] WARN org.apache.sqoop.mapreduce.JobBase - SQOOP_HOME is unset. May not be able to find all job dependencies.
Heart beat
Heart beat
70628 [uber-SubtaskRunner] ERROR org.apache.sqoop.tool.ImportTool - Merge MapReduce job failed!
70628 [uber-SubtaskRunner] INFO org.apache.sqoop.tool.ImportTool - Saving incremental import state to the metastore
70831 [uber-SubtaskRunner] INFO org.apache.sqoop.tool.ImportTool - Updated data for job: test_table

yshi · ‎07-05-2016

@sim6

Sorry for the late response! I am currently on a holiday, and thus can only check this problem occasionally.

It doesn't make sense that the merge job doesn't have a partitioner class, because it does need a reduce phase to merge the records. Could you please:

1) Make sure that you are checking the merge job, not the import job.

2) Please check the number of reducers of the failed merge job

3) Please try to add parameter "-Dmapred.reduce.tasks=1" to your Sqoop import job and see whether it helps

4) If nothing above helps, let's try to narrow down the problem a bit by removing the impact of Sqoop metastore. Please run the command directly in Oozie Sqoop action instead of storing it in metastore. The command should be something like below (I have removed the --driver parameter as well to use the MySQL connector instead of the generic JDBC connector, although it shouldn't make any difference for our problem).

sqoop import -Dmapred.reduce.tasks=1 --connect
jdbc:mysql://IP/DB?zeroDateTimeBehavior=convertToNull --username 
USER_NAME --password 'PASSWORD' --table test_table --merge-key id --
split-by id --target-dir LOCATION --incremental lastmodified 
--last-value 0 --check-column updated_at

When I have some time, I will try to reproduce your problem. Please let me know the CDH and CM version you are using.

View solution in original post

sim6 · ‎07-04-2016

Thank you.

I do get the logs now but none of the jobs have the two properties you asked for, Shall I share rest of the properties from the logs ? @yshi

Also, from the command line, I execute the job with both root user and hue user and the execution succedes in both the cases without any problem. My job is submitted to oozie using hue user only and that one fails.

sim6 · ‎07-04-2016

@yshi: I did sqoop job --show test_table

I get the following results:

verbose = false
incremental.last.value = 2016-07-05 05:29:33.0
db.connect.string = jdbc:mysql://10.10.11.3/testdb?zeroDateTimeBehavior=convertToNull
codegen.output.delimiters.escape = 0
codegen.output.delimiters.enclose.required = false
codegen.input.delimiters.field = 0
split.limit = null
hbase.create.table = false
db.require.password = false
hdfs.append.dir = false
db.table = test_table
codegen.input.delimiters.escape = 0
db.password = St@N!$$r3789L0vE
accumulo.create.table = false
import.fetch.size = null
codegen.input.delimiters.enclose.required = false
db.username = root
reset.onemapper = false
codegen.output.delimiters.record = 10
import.max.inline.lob.size = 16777216
hbase.bulk.load.enabled = false
hcatalog.create.table = false
db.clear.staging.table = false
incremental.col = updated_at
codegen.input.delimiters.record = 0
enable.compression = false
hive.overwrite.table = false
hive.import = false
codegen.input.delimiters.enclose = 0
accumulo.batch.size = 10240000
hive.drop.delims = false
customtool.options.jsonmap = {}
codegen.output.delimiters.enclose = 0
hdfs.delete-target.dir = false
codegen.output.dir = .
codegen.auto.compile.dir = true
relaxed.isolation = false
mapreduce.num.mappers = 4
accumulo.max.latency = 5000
import.direct.split.size = 0
codegen.output.delimiters.field = 44
export.new.update = UpdateOnly
incremental.mode = DateLastModified
hdfs.file.format = TextFile
codegen.compile.dir = /tmp/sqoop-root/compile/58af2027880638f91ea875a1c51b5de8
direct.import = false
db.split.column = id
hdfs.target.dir = /user/hue/ERP/testinc
hive.fail.table.exists = false
merge.key.col = id
jdbc.driver.class = com.mysql.jdbc.Driver
db.batch = false

Value of merge.key.col as expected is id.

But I still do not get the value other attribute you asked for. @yshi

yshi · ‎07-05-2016

@sim6

Sorry for the late response! I am currently on a holiday, and thus can only check this problem occasionally.

It doesn't make sense that the merge job doesn't have a partitioner class, because it does need a reduce phase to merge the records. Could you please:

1) Make sure that you are checking the merge job, not the import job.

2) Please check the number of reducers of the failed merge job

3) Please try to add parameter "-Dmapred.reduce.tasks=1" to your Sqoop import job and see whether it helps

4) If nothing above helps, let's try to narrow down the problem a bit by removing the impact of Sqoop metastore. Please run the command directly in Oozie Sqoop action instead of storing it in metastore. The command should be something like below (I have removed the --driver parameter as well to use the MySQL connector instead of the generic JDBC connector, although it shouldn't make any difference for our problem).

sqoop import -Dmapred.reduce.tasks=1 --connect
jdbc:mysql://IP/DB?zeroDateTimeBehavior=convertToNull --username 
USER_NAME --password 'PASSWORD' --table test_table --merge-key id --
split-by id --target-dir LOCATION --incremental lastmodified 
--last-value 0 --check-column updated_at

When I have some time, I will try to reproduce your problem. Please let me know the CDH and CM version you are using.

sim6 · ‎07-05-2016

Awesome. Using

-Dmapred.reduce.tasks=1

worked. I used it in my job with --meta-connect arguement as well and it still works 🙂 I would really like to understand what goes behind the scenes. Could you please explain why exactly I needed to use this argument? @yshi

Also, I posted an issue here and it seems to not have gained any attention, Could you please have a look at it and let me know if this is supported at all?

http://community.cloudera.com/t5/Data-Ingestion-Integration/change-sqoop-metastore-timezone-to-GMT/m...

Other than that, Thank you for keeping up all this time and helping out 🙂 . Happy holiday 🙂 🙂

yshi · ‎07-07-2016

@sim6

I am not 100% sure how my solutions works at this moment. And because I am on holiday, I cannot reproduce this problem to figure out the exact root cause.

I worked out this workaround with below observations:

1) Sqoop import job has no reducers, but Sqoop merge job does have merge job

2) According to all the information you have provided, your Oozie Sqoop job seems to fail because lack of partition class definition in job configuration.

3) I was suspecting that Oozie Sqoop reuse the jobConf object between Sqoop import and merge job, and thus accidentally set the number of reducers in merge job to 0. By explicitly setting the number of reducers to 1, we might be able to workaround this problem. And we are lucky that this does work. 🙂

I will later try to reproduce this problem and check whether my theory is correct or not.

yshi · ‎07-11-2016

@sim6

I cannot reproduce your problem on my local cluster. Not sure how this happened to you.

Please remember to increase the number of reducers when your data grow. 1 reducer is definitely not enough for big tables.

I will have a look at the other problem you asked here when I have some time.

sim6 · ‎07-13-2016

Thank you. I see that but I have always had this issue and is likely to happen again on increasing the number of clusters.Could it be some configurations or something that are causing this particular issue?

Also, look forward to a response on the other question. Thank you 🙂 @yshi

sim6 · ‎07-18-2016

@yshi: Hey, was hoping to hear back on the other issue I had created 🙂 . Is there a way I can bump that question on the forum to get some attention to it?

yshi · ‎07-19-2016

@sim6

Sorry, I am very busy these days to catch up with my work after the holiday.

I will find some time to look at the other problem tomorrow.

cjervis · ‎07-20-2016

@sim6

Thanks for asking about the ability to bump up posts. The forums are a peer to peer platform but Clouderan's do participate from time to time. If you need a something more, perhaps the Cloudera Developer Program would be a match to your needs.

Cy Jervis, Manager, Community Program
Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.

Cloudera Community

Support Questions

IllegalArgumentException and Illegal partition for 'val' in sqoop