09-10-2016 12:18 AM
Thank you for sharing your script on GitHub. I made some changes to the sqoop import command and ran the script with 4 concurrent imports from MySQL to Hive (a simplified sketch of the wrapper loop is at the end of this post):

sqoop import \
  -D mapreduce.job.queuename=$queue \
  -D mapreduce.job.ubertask.enable=true \
  --connect jdbc:mysql://$origServer:3306/$origDatabase \
  --username=$myUser --password=$myPassword \
  --driver com.mysql.jdbc.Driver \
  --connection-manager org.apache.sqoop.manager.GenericJdbcManager \
  --query "select a.* from $origTable a where \$CONDITIONS" -m 1 \
  --fields-terminated-by '\t' \
  --outdir $dirJavaGeneratedCode \
  --hcatalog-home $hCatalogHome \
  --hcatalog-database $hiveDatabase \
  --hcatalog-table $myTable \
  --create-hcatalog-table \
  --hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="ZLIB")' \
  >> $logFileRaw 2>> $logFileRaw

I'm importing 971 tables from MySQL, for a total size on disk of about 5 GB. After about 130 tables have imported successfully, I get the following error:

16/09/09 15:52:27 INFO mapreduce.Job: Task Id : attempt_1473456475868_0168_m_000000_0, Status : FAILED
Error: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /apps/hive/warehouse/development_testing.db/financialprograms/_SCRATCH0.047695628142107815/_temporary/1/_temporary/attempt_1473456475868_0168_m_000000_0/part-m-00000 could only be replicated to 0 nodes instead of minReplication (=1). There are 4 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1592)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3158)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3082)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:822)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2206)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2200)

It's not related to the DataNodes being down: I tried the import several times and it always stops around the same point, as if it were hitting a limit somewhere. The script itself doesn't stop; it processes all the tables. For each failing table, the table is created in Hive, Hadoop makes 2 attempts to execute the job (the MySQL table import) and finally fails with the error above, and the Hive table remains empty. I noticed that a temporary file holding the table's records (_SCRATCH...) is created in Hive, but it is deleted after the job fails.

I also checked that the DataNodes are healthy, and they are. I tried doubling the NameNode Java heap size and the DataNode maximum Java heap size (from 1024 MB to 2048 MB) in HDFS, and also doubling the 'Hadoop maximum Java heap size' to 2048 MB, with no success. It looks like a configuration issue (more heap size required? Where?), but I don't know where to look at this point. There is disk space on the DataNodes: less than 45% of the space is used, and there is enough room to import the MySQL tables into Hive. The 4 DataNodes have 15 GB of memory each.

Any suggestion to troubleshoot the issue is highly appreciated.
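For reference, this is roughly how I'm checking DataNode health and free space, using the standard HDFS CLI (a minimal sketch; the warehouse path below is just an example):

# Live/dead DataNodes plus configured and remaining capacity per node
hdfs dfsadmin -report

# Make sure the NameNode is not stuck in safe mode
hdfs dfsadmin -safemode get

# Overall filesystem health (missing or under-replicated blocks)
hdfs fsck /

# Space already used under the Hive warehouse (path is an example)
hdfs dfs -du -s -h /apps/hive/warehouse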
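And for context, the pattern I'm using to keep 4 imports running at a time is roughly the sketch below. It is simplified: import_table stands in for the full sqoop command shown at the top of the post, tables.txt (one MySQL table name per line) is a placeholder, and the connection variables ($queue, $origServer, $origDatabase, $myUser, $myPassword, $hiveDatabase, $logFileRaw) are assumed to be set as in the original script.

max_jobs=4

# Placeholder wrapper around the sqoop command shown above; $1 is the table name
import_table() {
  local origTable=$1
  sqoop import -D mapreduce.job.queuename=$queue \
    --connect jdbc:mysql://$origServer:3306/$origDatabase \
    --username=$myUser --password=$myPassword \
    --query "select a.* from $origTable a where \$CONDITIONS" -m 1 \
    --hcatalog-database $hiveDatabase --hcatalog-table $origTable \
    --create-hcatalog-table \
    --hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="ZLIB")' \
    >> $logFileRaw 2>> $logFileRaw
}

while read -r table; do
  import_table "$table" < /dev/null &            # run each import in the background
  while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]; do
    sleep 5                                      # simple throttle: wait for a free slot
  done
done < tables.txt
wait                                             # wait for the last imports to finish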