
error creating ShortCircuitReplica - Merge Parquet

Champion

We have an HDFS small-files problem, so we are using this code from GitHub to merge Parquet files.

 

https://github.com/Parquet/parquet-mr/tree/master/parquet-tools

 

Step 1 - Performed a local Maven build:

mvn clean package
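
For reference, the full build looked roughly like this (a sketch; the -pl/-am flags that limit the build to the parquet-tools module are optional, and -DskipTests is only there to speed things up):

# clone the repository and build only the parquet-tools module
git clone https://github.com/Parquet/parquet-mr.git
cd parquet-mr
mvn clean package -pl parquet-tools -am -DskipTests
# the jar ends up under parquet-tools/target/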
 
Step 2 - Ran the merge command
Note - file size: ~50 KB each, 2,500 files; total folder size: 2.5 GB

hadoop jar <jar file name> merge <input path> <output file>
 
hadoop jar parquet-tools-1.9.1-SNAPSHOT.jar merge /user/hive/warehouse/final_parsing.db/02day02/ /user/hive/warehouse/final_parsing.db/02day02/merged.parquet
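
For completeness, the file count and total folder size can be checked with the standard HDFS shell before merging (a sketch; the -count output columns are DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME):

hdfs dfs -count /user/hive/warehouse/final_parsing.db/02day02/
hdfs dfs -du -s -h /user/hive/warehouse/final_parsing.db/02day02/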
 
The HDFS fsck report comes back healthy:
 
 Total size:    1796526652 B
 Total dirs:    1
 Total files:    4145
 Total symlinks:        0
 Total blocks (validated):    4146 (avg. block size 433315 B)
 Minimally replicated blocks:    4146 (100.0 %)
 Over-replicated blocks:    0 (0.0 %)
 Under-replicated blocks:    0 (0.0 %)
 Mis-replicated blocks:        0 (0.0 %)
 Default replication factor:    3
 Average block replication:    3.0
 Corrupt blocks:        0
 Missing replicas:        0 (0.0 %)
 Number of data-nodes:        3
 Number of racks:        1
FSCK ended at Tue Jun 06 13:04:57 IST 2017 in 82 milliseconds


The filesystem under path '/user/hive/warehouse/final_parsing.db/02day02/' is HEALTHY

Our current client configuration is below. These are the settings that did not help us; we thought we should bump them up, but that did not help either and we get the same error.

dfs.blocksize = 134217728 (128 MB)
dfs.client-write-packet-size = 65536
dfs.client.read.shortcircuit.streams.cache.expiry.ms = 300000
dfs.stream-buffer-size = 4096
dfs.client.read.shortcircuit.streams.cache.size = 256
Linux ulimit = 3024
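
For what it is worth, this is how we double-check which values the client actually resolves for these keys, plus the shell's open-file limit (a sketch; hdfs getconf -confKey prints the effective client-side value):

hdfs getconf -confKey dfs.blocksize
hdfs getconf -confKey dfs.client.read.shortcircuit.streams.cache.size
hdfs getconf -confKey dfs.client.read.shortcircuit.streams.cache.expiry.ms
ulimit -Sn    # soft limit on open files for the current shell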
 
Error stack trace:
 
[UD1@slave1 target]# hadoop jar parquet-tools-1.9.1-SNAPSHOT.jar merge /user/hive/warehouse/final_parsing.db/02day02/ /user/hive/warehouse/final_parsing.db/02day02/merged.parquet

17/06/06 12:48:16 WARN hdfs.BlockReaderFactory: BlockReaderFactory(fileName=/user/hive/warehouse/final_parsing.db/02day02/part-r-00000-377a9cc1-841a-4ec6-9e0f-0c009f44f6b3.snappy.parquet, block=BP-1780335730-192.168.200.234-1492815207875:blk_1074179291_439068): error creating ShortCircuitReplica.
java.io.IOException: Illegal seek
	at sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
	at sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:52)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:220)
	at sun.nio.ch.IOUtil.read(IOUtil.java:197)
	at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:699)
	at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:684)
	at org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.preadHeader(BlockMetadataHeader.java:124)
	at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitReplica.<init>(ShortCircuitReplica.java:126)
	at org.apache.hadoop.hdfs.BlockReaderFactory.requestFileDescriptors(BlockReaderFactory.java:619)
	at org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:551)
	at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:784)
	at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:718)
	at org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:484)
	at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:354)
	at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:652)
	at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:879)
	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:937)
	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:732)
	at java.io.FilterInputStream.read(FilterInputStream.java:83)
	at org.apache.parquet.hadoop.util.H2SeekableInputStream.read(H2SeekableInputStream.java:64)
	at org.apache.parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:83)
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:480)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:580)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:565)
	at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:496)
	at org.apache.parquet.hadoop.ParquetFileWriter.appendFile(ParquetFileWriter.java:494)
	at org.apache.parquet.tools.command.MergeCommand.execute(MergeCommand.java:79)
	at org.apache.parquet.tools.Main.main(Main.java:223)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
17/06/06 12:48:16 WARN shortcircuit.ShortCircuitCache: ShortCircuitCache(0x32442dd0): failed to load 1074179291_BP-1780335730-192.168.200.234-1492815207875
17/06/06 12:48:16 WARN hdfs.BlockReaderFactory: I/O error constructing remote block reader.
java.net.SocketException: Too many open files
	at sun.nio.ch.Net.socket0(Native Method)
	at sun.nio.ch.Net.socket(Net.java:423)
	at sun.nio.ch.Net.socket(Net.java:416)
	at sun.nio.ch.SocketChannelImpl.<init>(SocketChannelImpl.java:104)
	at sun.nio.ch.SelectorProviderImpl.openSocketChannel(SelectorProviderImpl.java:60)
	at java.nio.channels.SocketChannel.open(SocketChannel.java:142)
	at org.apache.hadoop.net.StandardSocketFactory.createSocket(StandardSocketFactory.java:62)
	at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3526)
	at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:840)
	at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:755)
	at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:376)
	at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:652)
	at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:879)
	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:937)
	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:732)
	at java.io.FilterInputStream.read(FilterInputStream.java:83)
	at org.apache.parquet.hadoop.util.H2SeekableInputStream.read(H2SeekableInputStream.java:64)
	at org.apache.parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:83)
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:480)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:580)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:565)
	at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:496)
	at org.apache.parquet.hadoop.ParquetFileWriter.appendFile(ParquetFileWriter.java:494)
	at org.apache.parquet.tools.command.MergeCommand.execute(MergeCommand.java:79)
	at org.apache.parquet.tools.Main.main(Main.java:223)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
 
Correct me if I am wrong, but I believe this happens because the cache entry expires before the client comes back to check the block.

Could anyone recommend a solution for this?
5 REPLIES

Champion
parquet-tools is also shipped with CDH, and I recommend using that build since it matches your CDH version. Check under /opt/cloudera/parcels/CDH/lib/parquet/parquet-tools.jar.
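
For example, the same merge can be run with the bundled jar (a sketch; the exact parcel path can differ slightly between CDH versions):

hadoop jar /opt/cloudera/parcels/CDH/lib/parquet/parquet-tools.jar merge /user/hive/warehouse/final_parsing.db/02day02/ /user/hive/warehouse/final_parsing.db/02day02/merged.parquet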

The warning indicates that parquet-tools is trying to use the short-circuit read feature to read block files directly from local disk, bypassing the DataNode, but it does seem to fall back to the normal remote block reader after that fails.

The actual error is "too many open files". Try ulimit -a, ulimit -Hn, or ulimit -Sn; these show the limits on the number of open files the logged-in user can have. The default on RHEL/CentOS has been 1024 for some time. You are trying to open 2,500 files at once, so increase the limit to something above that, or reduce the number of files you merge together at once.
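
A minimal sketch of checking and raising the limit (the 16384 value is only an example; the persistent change via /etc/security/limits.conf takes effect after the user logs in again):

ulimit -Sn            # current soft limit on open files
ulimit -Hn            # current hard limit
ulimit -n 16384       # raise the soft limit for the current shell (up to the hard limit)

# persistent change for the user running the merge, e.g. in /etc/security/limits.conf:
#   <username>   soft   nofile   16384
#   <username>   hard   nofile   16384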

Champion

@mbigelow

 

Thanks for the quick turnaround. I will try it and let you know the results.

 

 

Champion

@mbigelow 

 

Parquet - Merge

 

No. of files    ulimit    Merged size (GB)
5000            15000     2.1
4140            10000     1.4

System configuration (excerpt from /proc/cpuinfo):

model name : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
cpu cores  : 8
siblings   : 16   (logical processors 0-15)
cache size : 20480 KB
cpu MHz    : ~1213-1257 (current clock at time of capture)

 

The above error was fixed by increasing the ulimit, as you suggested.

But do you think the above benchmark is reasonable, or should we decrease the number of files in the folder?
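
If reducing it is the better option, something like the batched merge below is what we have in mind (a rough sketch; the batch size of 500 and the merged_part_N output names are placeholders, and it assumes the merge command accepts several input files before the output path):

DIR=/user/hive/warehouse/final_parsing.db/02day02

# list the small part files and split the list into batches of 500 paths each
hdfs dfs -ls "$DIR" | awk '$NF ~ /\.snappy\.parquet$/ {print $NF}' | split -l 500 - batch_

i=0
for b in batch_*; do
    hadoop jar parquet-tools-1.9.1-SNAPSHOT.jar merge $(cat "$b") "$DIR/merged_part_${i}.parquet"
    i=$((i + 1))
done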

Please let me know your thoughts

 

Can't thank you enough for the help.

 

Champion
Is size the amount after the merge? What was the average size before? How long did it take to run?

Champion

@mbigelow Sorry for the late response, I was on vacation. :)

Below are my answers to your questions:

 

Is size the amount after the merge?

Yes, it is.

What was the average size before?

Between 50 KB and 100 KB.

How long did it take to run?

10-15 minutes.