
java: page allocation failure during hbase import



I'm running this command to import ~25 GB of data into a new cluster (the original data came from another cluster that can't be restarted to enable snapshotting, etc.):

 

time sudo su -s /bin/bash -c 'hbase org.apache.hadoop.hbase.mapreduce.Driver import my_table /hbase/my_table_export' hdfs
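For context, the usual Export/Import pairing looks roughly like this. The Export command is only a sketch of how a directory like /hbase/my_table_export is typically produced on the source cluster; the table name and path are just the ones from this post:

# Sketch only: typical Export on the source cluster that produces the
# sequence-file directory later consumed by Import.
hbase org.apache.hadoop.hbase.mapreduce.Export my_table /hbase/my_table_export

# Import on the destination cluster, run as the hdfs user
# (equivalent to the sudo/su form above):
sudo -u hdfs hbase org.apache.hadoop.hbase.mapreduce.Driver import my_table /hbase/my_table_export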

 

During the load everything seems to run fine, but Cloudera Manager flagged all of my DataNodes in a warning state with this message:

 

The health of this role's host was concerning. The following health checks were concerning: frame er...

 

I jumped on the boxes and ran sar -n EDEV and sure enough there were some rx drops, but not many:

 

10:20:01 AM      eth0      0.00      0.00      0.00      5.87      0.00      0.00      0.58      0.00      0.00
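If I'm reading the sysstat column order right, the fields after the interface name in sar -n EDEV are rxerr/s, txerr/s, coll/s, rxdrop/s, txdrop/s, txcarr/s, rxfram/s, rxfifo/s and txfifo/s, so the 5.87 above is rxdrop/s and the 0.58 is rxfram/s. A rough sketch of other places to cross-check the same counters (eth0 is just the interface from the line above; ethtool stat names vary by driver):

# Cross-check the interface drop/error counters from the kernel and the driver.
cat /sys/class/net/eth0/statistics/rx_dropped
ethtool -S eth0 | grep -i -E 'drop|err|fifo'
# Watch them live at a 5-second interval during the import:
sar -n EDEV 5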

 

 

This is in syslog:

 

Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: java: page allocation failure. order:5, mode:0x20
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: Pid: 51190, comm: java Not tainted 2.6.32-358.14.1.el6.x86_64 #1
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: Call Trace:
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff8112c197>] ? __alloc_pages_nodemask+0x757/0x8d0
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff81166b42>] ? kmem_getpages+0x62/0x170
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff8116775a>] ? fallback_alloc+0x1ba/0x270
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff811671af>] ? cache_grow+0x2cf/0x320
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff811674d9>] ? ____cache_alloc_node+0x99/0x160
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff811686a0>] ? kmem_cache_alloc_node_trace+0x90/0x200
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff811688bd>] ? __kmalloc_node+0x4d/0x60
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff8143db5d>] ? __alloc_skb+0x6d/0x190
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff8143ec80>] ? skb_copy+0x40/0xb0
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffffa014227c>] ? tg3_start_xmit+0xa8c/0xd50 [tg3]
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff81449168>] ? dev_hard_start_xmit+0x308/0x530
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff81449168>] ? dev_hard_start_xmit+0x308/0x530
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff814674ea>] ? sch_direct_xmit+0x15a/0x1c0
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff8144ce70>] ? dev_queue_xmit+0x3b0/0x550
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff814855a8>] ? ip_finish_output+0x148/0x310
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff81485828>] ? ip_output+0xb8/0xc0
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff81484b25>] ? ip_local_out+0x25/0x30
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff81485000>] ? ip_queue_xmit+0x190/0x420
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff8112bb53>] ? __alloc_pages_nodemask+0x113/0x8d0
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff81499d0e>] ? tcp_transmit_skb+0x40e/0x7b0
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff8149c11b>] ? tcp_write_xmit+0x1fb/0xa20
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff81166869>] ? ksize+0x19/0x80
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff8143db71>] ? __alloc_skb+0x81/0x190
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff8149c970>] ? tcp_push_one+0x30/0x40
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff8148d60c>] ? tcp_sendmsg+0x9cc/0xa20
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff8143804b>] ? sock_aio_write+0x19b/0x1c0
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff81180e1a>] ? do_sync_write+0xfa/0x140
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff81096cc0>] ? autoremove_wake_function+0x0/0x40
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff811c6be5>] ? ep_remove+0xa5/0xc0
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff8121bce6>] ? security_file_permission+0x16/0x20
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff811811e4>] ? vfs_write+0x184/0x1a0
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff81181a11>] ? sys_write+0x51/0x90
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff810dc5b5>] ? __audit_syscall_exit+0x265/0x290
Sep 17 10:18:36 sfo-test-ops-hbase-03 kernel: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
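As far as I understand it, order:5 means the tg3 NIC driver asked for 2^5 = 32 contiguous 4 KB pages (128 KB) and mode:0x20 is an atomic (GFP_ATOMIC) allocation, so the kernel couldn't wait or reclaim; that points at memory fragmentation rather than the box being out of memory. A sketch of the checks/tunables that seem relevant (the min_free_kbytes value below is illustrative only, not a recommendation):

# Check how fragmented each memory zone's free pages are:
cat /proc/buddyinfo

# Current reserve of free pages the kernel tries to keep:
sysctl vm.min_free_kbytes
# Raising it is a common mitigation so the driver can find contiguous blocks
# under heavy network load (value is illustrative only):
# sysctl -w vm.min_free_kbytes=131072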

 

 

This is the output from free, so I'm not running out of memory or swap; in fact the box is not swapping at all:

 

# free
             total       used       free     shared    buffers     cached
Mem:      32838572   26035112    6803460          0     310072   14287708
-/+ buffers/cache:   11437332   21401240
Swap:      4194296          0    4194296
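My reading of that (correct me if I'm wrong): the -/+ buffers/cache row says about 20 GB is still available once reclaimable page cache is counted, so plain memory exhaustion doesn't look like the problem. What I've been watching during the import:

# Memory and swap activity during the import (free in MB, vmstat every 5 seconds;
# the si/so columns in vmstat confirm whether the box ever actually swaps).
free -m
vmstat 5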

 

 

Any ideas as to what could be misconfigured? I'm new to Hadoop/HBase, so I'm not sure what to look at to figure out what is going on.

I also see these in the HBase region server log around that time:

 

2013-09-17 10:18:45,280 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 3 on 60020' on region tsdb,\x00\x02\x17Q\xA4\x9C@\x00\x00\x02\x00\x00e\x00\x00\x0A\x00\x00/\x00\x00\x0B\x00\x00S\x00\x00\x0C\x00\x00?\x00\x00\x0D\x00\x00<,1379438210353.9303046b348edf3d495e93e4ebf7fa80.: memstore size 271.5 M is >= than blocking 256 M size
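From what I can tell, that 256 M blocking threshold is the product of two settings, hbase.hregion.memstore.flush.size (128 MB by default) and hbase.hregion.memstore.block.multiplier (2 by default), so the region is simply taking writes faster than it can flush. A quick way to check what the region servers are actually configured with (the path below is the usual CDH client config location, adjust if yours differs):

# Check the effective memstore settings on a region server host.
grep -A1 'memstore' /etc/hbase/conf/hbase-site.xml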

 

Some of these too:

 

2013-09-17 10:19:05,480 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":20186,"call":"multi(Region=tsdb,\u0000\u0002\u0017Q��@\u0000\u0000\u0002\u0000\u0000e\u0000\u0000\n\u0000\u0000/\u0000\u0000\u000B\u0000\u0000S\u0000\u0000\f\u0000\u0000?\u0000\u0000\r\u0000\u0000<,1379438210353.9303046b348edf3d495e93e4ebf7fa80. of 935 actions and first row key= \u0000\u0002\u0017Q�Q0\u0000\u0000\u0002\u0000\u0001�\u0000\u0000\n\u0000\u0000\\\u0000\u0000\u000B\u0000\u0000>\u0000\u0000\f\u0000\u0000V\u0000\u0000\r\u0000\u0000U), rpc version=1, client version=29, methodsFingerPrint=-1368823753","client":"10.110.11.23:34862","starttimems":1379438325276,"queuetimems":2336,"class":"HRegionServer","responsesize":0,"method":"multi"}

 

 

2013-09-17 10:19:06,102 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction interrupted
java.io.InterruptedIOException: Aborting compaction of store t in region tsdb,\x00\x02\x17Q\xA4\x9C@\x00\x00\x02\x00\x00e\x00\x00\x0A\x00\x00/\x00\x00\x0B\x00\x00S\x00\x00\x0C\x00\x00?\x00\x00\x0D\x00\x00<,1379438210353.9303046b348edf3d495e93e4ebf7fa80. because it was interrupted.
at org.apache.hadoop.hbase.regionserver.Compactor.isInterrupted(Compactor.java:230)
at org.apache.hadoop.hbase.regionserver.Compactor.compact(Compactor.java:203)
at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:1081)
at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1335)
at org.apache.hadoop.hbase.regionserver.compactions.CompactionRequest.run(CompactionRequest.java:303)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

 

 

I do see some short-circuit read requests in the HDFS DataNode log, but nothing in there stands out to me as an issue:

 

2013-09-17 10:18:13,785 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_FDS, blockid: -4973681609466960260, srvID: DS-165836706-10.110.11.23-50010-1379371457984, success: true
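For completeness, my understanding is that REQUEST_SHORT_CIRCUIT_FDS lines just mean the region server is reading local blocks directly via short-circuit reads (dfs.client.read.shortcircuit), which is normal and generally desirable rather than a symptom. One way to confirm it's deliberately enabled (again assuming the standard CDH config path):

# Confirm short-circuit reads are configured on purpose.
grep -A1 'read.shortcircuit' /etc/hadoop/conf/hdfs-site.xml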