java: page allocation failure during hbase import


I'm running this command to import ~25G of data on a new cluster (the original data came from another cluster that can't be restarted to enable snapshotting, etc.):

 

time sudo su -s /bin/bash -c 'hbase org.apache.hadoop.hbase.mapreduce.Driver import my_table /hbase/my_table_export' hdfs
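
In case the size matters, here is roughly how I'd measure the export directory with a plain HDFS du (just a sketch; adjust the user and path to your setup):

# human-readable summary of the export directory size
sudo -u hdfs hadoop fs -du -s -h /hbase/my_table_export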

 

During the load, all seems to be going well, but Cloudera Manager put all of my data nodes into a warning state with this message:

 

The health of this role's host was concerning. The following health checks were concerning: frame er...

 

I jumped on the boxes and ran sar -n EDEV, and sure enough there were some rx drops, but not many:

 

10:20:01 AM      eth0      0.00      0.00      0.00      5.87      0.00      0.00      0.58      0.00      0.00
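
The header row got lost when I pasted; assuming the usual sysstat layout, the columns for sar -n EDEV are:

# IFACE   rxerr/s  txerr/s  coll/s  rxdrop/s  txdrop/s  txcarr/s  rxfram/s  rxfifo/s  txfifo/s

which would make the 5.87 rxdrop/s and the 0.58 rxfram/s (the latter presumably being the "frame er..." the health check message is truncating).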

 

 

This is in syslog:

 

Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: java: page allocation failure. order:5, mode:0x20
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: Pid: 51190, comm: java Not tainted 2.6.32-358.14.1.el6.x86_64 #1
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: Call Trace:
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff8112c197>] ? __alloc_pages_nodemask+0x757/0x8d0
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff81166b42>] ? kmem_getpages+0x62/0x170
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff8116775a>] ? fallback_alloc+0x1ba/0x270
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff811671af>] ? cache_grow+0x2cf/0x320
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff811674d9>] ? ____cache_alloc_node+0x99/0x160
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff811686a0>] ? kmem_cache_alloc_node_trace+0x90/0x200
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff811688bd>] ? __kmalloc_node+0x4d/0x60
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff8143db5d>] ? __alloc_skb+0x6d/0x190
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff8143ec80>] ? skb_copy+0x40/0xb0
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffffa014227c>] ? tg3_start_xmit+0xa8c/0xd50 [tg3]
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff81449168>] ? dev_hard_start_xmit+0x308/0x530
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff81449168>] ? dev_hard_start_xmit+0x308/0x530
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff814674ea>] ? sch_direct_xmit+0x15a/0x1c0
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff8144ce70>] ? dev_queue_xmit+0x3b0/0x550
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff814855a8>] ? ip_finish_output+0x148/0x310
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff81485828>] ? ip_output+0xb8/0xc0
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff81484b25>] ? ip_local_out+0x25/0x30
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff81485000>] ? ip_queue_xmit+0x190/0x420
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff8112bb53>] ? __alloc_pages_nodemask+0x113/0x8d0
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff81499d0e>] ? tcp_transmit_skb+0x40e/0x7b0
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff8149c11b>] ? tcp_write_xmit+0x1fb/0xa20
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff81166869>] ? ksize+0x19/0x80
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff8143db71>] ? __alloc_skb+0x81/0x190
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff8149c970>] ? tcp_push_one+0x30/0x40
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff8148d60c>] ? tcp_sendmsg+0x9cc/0xa20
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff8143804b>] ? sock_aio_write+0x19b/0x1c0
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff81180e1a>] ? do_sync_write+0xfa/0x140
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff81096cc0>] ? autoremove_wake_function+0x0/0x40
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff811c6be5>] ? ep_remove+0xa5/0xc0
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff8121bce6>] ? security_file_permission+0x16/0x20
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff811811e4>] ? vfs_write+0x184/0x1a0
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff81181a11>] ? sys_write+0x51/0x90
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff810dc5b5>] ? __audit_syscall_exit+0x265/0x290
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
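
If I'm reading the first line right, order:5 with mode:0x20 means a GFP_ATOMIC request for 2^5 contiguous pages (32 x 4 KB = 128 KB) failed, and the trace shows it coming from the tg3 driver copying a socket buffer in its transmit path (skb_copy -> __alloc_skb). That sounds like memory fragmentation rather than running out of memory, so I'm guessing /proc/buddyinfo is the thing to watch while the import runs:

# each row is a memory zone; columns are counts of free blocks of order 0..10 (4 KB up to 4 MB)
# if the order-5-and-up columns are near zero, atomic 128 KB allocations can fail even with plenty of free memory
watch -n 5 cat /proc/buddyinfo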

 

 

This is the output from free, so I'm not running out of swap.  In fact the box is not swapping at all:

 

# free
             total       used       free     shared    buffers     cached
Mem:      32838572   26035112    6803460          0     310072   14287708
-/+ buffers/cache:   11437332   21401240
Swap:      4194296          0    4194296
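
From what I've read, free memory and swap being fine doesn't necessarily rule this out, since the failure above is about finding physically contiguous pages for an atomic allocation in the network driver, not about total memory. In case they're relevant, these show the kernel's reserve for atomic allocations and the NIC ring/offload settings (eth0 is the interface from the sar output above):

# reserve the kernel keeps free for atomic allocations
sysctl vm.min_free_kbytes
# tg3 ring sizes and offload settings on the interface
ethtool -g eth0
ethtool -k eth0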

 

 

Any ideas as to what could be misconfigured? I'm new to Hadoop/HBase, so I'm not sure where to look to figure out what is going on.

I also see these in the HBase log around that time:

 

2013-09-17 10:18:45,280 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 3 on 60020' on region tsdb,\x00\x02\x17Q\xA4\x9C@\x00\x00\x02\x00\x00e\x00\x00\x0A\x00\x00/\x00\x00\x0B\x00\x00S\x00\x00\x0C\x00\x00?\x00\x00\x0D\x00\x00<,1379438210353.9303046b348edf3d495e93e4ebf7fa80.: memstore size 271.5 M is >= than blocking 256 M size
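
If I'm reading that message right, the memstore for that region hit the blocking threshold, which I believe defaults to hbase.hregion.memstore.flush.size (128 MB) times hbase.hregion.memstore.block.multiplier (2), matching the 256 M in the log, so writes to the region are paused until a flush catches up. This should confirm whether either setting is overridden on my cluster (the config path is the default location and may differ):

# check for overrides of the memstore flush size / block multiplier (path may vary by install)
grep -A1 -E "hbase\.hregion\.memstore\.(flush\.size|block\.multiplier)" /etc/hbase/conf/hbase-site.xml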

 

Some of these, too:

 

2013-09-17 10:19:05,480 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":20186,"call":"multi(Region=tsdb,\u0000\u0002\u0017Q��@\u0000\u0000\u0002\u0000\u0000e\u0000\u0000\n\u0000\u0000/\u0000\u0000\u000B\u0000\u0000S\u0000\u0000\f\u0000\u0000?\u0000\u0000\r\u0000\u0000<,1379438210353.9303046b348edf3d495e93e4ebf7fa80. of 935 actions and first row key= \u0000\u0002\u0017Q�Q0\u0000\u0000\u0002\u0000\u0001�\u0000\u0000\n\u0000\u0000\\\u0000\u0000\u000B\u0000\u0000>\u0000\u0000\f\u0000\u0000V\u0000\u0000\r\u0000\u0000U), rpc version=1, client version=29, methodsFingerPrint=-1368823753","client":"10.110.11.23:34862","starttimems":1379438325276,"queuetimems":2336,"class":"HRegionServer","responsesize":0,"method":"multi"}
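
To get a feel for how often that happens, a rough count from the region server log should do (the path here is a guess based on a default layout):

# count slow-RPC warnings in the region server log (log location may differ)
grep -c responseTooSlow /var/log/hbase/*regionserver*.log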

 

 

2013-09-17 10:19:06,102 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction interrupted
java.io.InterruptedIOException: Aborting compaction of store t in region tsdb,\x00\x02\x17Q\xA4\x9C@\x00\x00\x02\x00\x00e\x00\x00\x0A\x00\x00/\x00\x00\x0B\x00\x00S\x00\x00\x0C\x00\x00?\x00\x00\x0D\x00\x00<,1379438210353.9303046b348edf3d495e93e4ebf7fa80. because it was interrupted.
at org.apache.hadoop.hbase.regionserver.Compactor.isInterrupted(Compactor.java:230)
at org.apache.hadoop.hbase.regionserver.Compactor.compact(Compactor.java:203)
at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:1081)
at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1335)
at org.apache.hadoop.hbase.regionserver.compactions.CompactionRequest.run(CompactionRequest.java:303)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

 

 

I do see some short-circuit read requests in the HDFS log, but nothing stands out to me in there as an issue:

 

2013-09-17 10:18:13,785 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_FDS, blockid: -4973681609466960260, srvID: DS-165836706-10.110.11.23-50010-1379371457984, success: true