Created 06-06-2017 07:30 AM
We are having an HDFS small-files problem, so we used this code from GitHub for merging Parquet files:
https://github.com/Parquet/parquet-mr/tree/master/parquet-tools
Step 1 - Performed a local Maven build of parquet-tools.
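For reference, the build was roughly the following (a sketch; exact flags may vary by parquet-mr version, and a plain package build is enough when running via 'hadoop jar'):

# clone the repo and build the parquet-tools module
git clone https://github.com/Parquet/parquet-mr.git
cd parquet-mr/parquet-tools
# produces target/parquet-tools-1.9.1-SNAPSHOT.jar
mvn clean package

Step 2 - Ran the merge: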
hadoop jar parquet-tools-1.9.1-SNAPSHOT.jar merge /user/hive/warehouse/final_parsing.db/02day02/ /user/hive/warehouse/final_parsing.db/02day02/merged.parquet
Total size: 1796526652 B
Total dirs: 1
Total files: 4145
Total symlinks: 0
Total blocks (validated): 4146 (avg. block size 433315 B)
Minimally replicated blocks: 4146 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 3
Number of racks: 1
FSCK ended at Tue Jun 06 13:04:57 IST 2017 in 82 milliseconds
The filesystem under path '/user/hive/warehouse/final_parsing.db/02day02/' is HEALTHY
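(The report above is the summary that hdfs fsck prints; it can be reproduced with:)

hdfs fsck /user/hive/warehouse/final_parsing.db/02day02/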
Below are the current configs, which didn't help us. We bumped them up, but that didn't help either; we got the same error.

dfs.blocksize = 134217728 (128 MB)
dfs.client-write-packet-size = 65536
dfs.client.read.shortcircuit.streams.cache.expiry.ms = 300000
dfs.stream-buffer-size = 4096
dfs.client.read.shortcircuit.streams.cache.size = 256
Linux ulimit = 3024
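We checked and bumped the open-file limit like this (the user name and target value below are just examples):

# check the current max open files for the session
ulimit -n
# raise it for the current shell before re-running the merge
ulimit -n 15000
# to make it permanent, add nofile entries for the job's user in /etc/security/limits.conf, e.g.:
#   UD1  soft  nofile  15000
#   UD1  hard  nofile  15000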
[UD1@slave1 target]# hadoop jar parquet-tools-1.9.1-SNAPSHOT.jar merge /user/hive/warehouse/final_parsing.db/02day02/ /user/hive/warehouse/final_parsing.db/02day02/merged.parquet
17/06/06 12:48:16 WARN hdfs.BlockReaderFactory: BlockReaderFactory(fileName=/user/hive/warehouse/final_parsing.db/02day02/part-r-00000-377a9cc1-841a-4ec6-9e0f-0c009f44f6b3.snappy.parquet, block=BP-1780335730-192.168.200.234-1492815207875:blk_1074179291_439068): error creating ShortCircuitReplica.
java.io.IOException: Illegal seek
    at sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
    at sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:52)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:220)
    at sun.nio.ch.IOUtil.read(IOUtil.java:197)
    at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:699)
    at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:684)
    at org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.preadHeader(BlockMetadataHeader.java:124)
    at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitReplica.<init>(ShortCircuitReplica.java:126)
    at org.apache.hadoop.hdfs.BlockReaderFactory.requestFileDescriptors(BlockReaderFactory.java:619)
    at org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:551)
    at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:784)
    at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:718)
    at org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:484)
    at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:354)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:652)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:879)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:937)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:732)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at org.apache.parquet.hadoop.util.H2SeekableInputStream.read(H2SeekableInputStream.java:64)
    at org.apache.parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:83)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:480)
    at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:580)
    at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:565)
    at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:496)
    at org.apache.parquet.hadoop.ParquetFileWriter.appendFile(ParquetFileWriter.java:494)
    at org.apache.parquet.tools.command.MergeCommand.execute(MergeCommand.java:79)
    at org.apache.parquet.tools.Main.main(Main.java:223)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
17/06/06 12:48:16 WARN shortcircuit.ShortCircuitCache: ShortCircuitCache(0x32442dd0): failed to load 1074179291_BP-1780335730-192.168.200.234-1492815207875
17/06/06 12:48:16 WARN hdfs.BlockReaderFactory: I/O error constructing remote block reader.
java.net.SocketException: Too many open files
    at sun.nio.ch.Net.socket0(Native Method)
    at sun.nio.ch.Net.socket(Net.java:423)
    at sun.nio.ch.Net.socket(Net.java:416)
    at sun.nio.ch.SocketChannelImpl.<init>(SocketChannelImpl.java:104)
    at sun.nio.ch.SelectorProviderImpl.openSocketChannel(SelectorProviderImpl.java:60)
    at java.nio.channels.SocketChannel.open(SocketChannel.java:142)
    at org.apache.hadoop.net.StandardSocketFactory.createSocket(StandardSocketFactory.java:62)
    at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3526)
    at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:840)
    at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:755)
    at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:376)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:652)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:879)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:937)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:732)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at org.apache.parquet.hadoop.util.H2SeekableInputStream.read(H2SeekableInputStream.java:64)
    at org.apache.parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:83)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:480)
    at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:580)
    at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:565)
    at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:496)
    at org.apache.parquet.hadoop.ParquetFileWriter.appendFile(ParquetFileWriter.java:494)
    at org.apache.parquet.tools.command.MergeCommand.execute(MergeCommand.java:79)
    at org.apache.parquet.tools.Main.main(Main.java:223)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
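The trace fails inside the HDFS short-circuit read path, and the merge touches every input file, so with 4,145 files we suspect the descriptor limit is simply being exhausted. A quick way to watch the descriptor count of the merge JVM while it runs (replace <pid> with the actual process id):

# descriptors currently held by the merge process
ls /proc/<pid>/fd | wc -l
# or equivalently
lsof -p <pid> | wc -l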
Created 06-07-2017 05:02 AM
Parquet - Merge
No. of files | Ulimit | Size (GB)
5000         | 15000  | 2.1
4140         | 10000  | 1.4
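(A row of the table can be reproduced roughly like this; the time wrapper and limit value are illustrative, paths as in the first post:)

ulimit -n 15000
time hadoop jar parquet-tools-1.9.1-SNAPSHOT.jar merge /user/hive/warehouse/final_parsing.db/02day02/ /user/hive/warehouse/final_parsing.db/02day02/merged.parquet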
System Configuration
model name : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
cpu MHz    : 1257.457
cache size : 20480 KB
cpu cores  : 8
processor  : 15
cpu MHz    : 1212.914
cache size : 20480 KB
siblings   : 16
core id    : 7
cpu cores  : 8
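(The summary above comes from /proc/cpuinfo, e.g.:)

grep -E 'model name|cpu MHz|cache size|siblings|core id|cpu cores' /proc/cpuinfo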
The above error got fixed, as per your suggestion, by increasing the ulimit.
But do you think the above benchmark is good, or should we decrease the number of files in the folder?
Please let me know your thoughts.
Can't thank you enough for the help.
Created on 06-13-2017 02:12 AM - edited 06-13-2017 06:31 AM
@mbigelow Sorry for the late response, I was on vacation. :))
Below are my answers to the questionnaire:
Is size the amount after the merge?
Yes, it is.

What was the average size before?
Between 50 KB and 100 KB.

How long did it take to run?
10-15 minutes.
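For completeness, these numbers can be sanity-checked straight from HDFS (paths as in the first post):

# file count and total size of the small files
hdfs dfs -count /user/hive/warehouse/final_parsing.db/02day02/
hdfs dfs -du -s -h /user/hive/warehouse/final_parsing.db/02day02/
# size of the merged output
hdfs dfs -ls -h /user/hive/warehouse/final_parsing.db/02day02/merged.parquet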