Support Questions

csguna · ‎06-06-2017

We are having hdfs small size problem , so we were using this code from Github for merging parquet file .

https://github.com/Parquet/parquet-mr/tree/master/parquet-tools

Step 1 - Performed Local Maven build

- maven build

- mvn clean package

Step 2 - Command

Note - File size - 50kb (2500 number of files ) - Total folder size is 2.5 GB

hadoop jar <jar file name> merge <input path> <output file>

hadoop jar parquet-tools-1.9.1-SNAPSHOT.jar merge /user/hive/warehouse/final_parsing.db/02day02/ /user/hive/warehouse/final_parsing.db/02day02/merged.parquet

HDFS Report is Healthy

 Total size:    1796526652 B
 Total dirs:    1
 Total files:    4145
 Total symlinks:        0
 Total blocks (validated):    4146 (avg. block size 433315 B)
 Minimally replicated blocks:    4146 (100.0 %)
 Over-replicated blocks:    0 (0.0 %)
 Under-replicated blocks:    0 (0.0 %)
 Mis-replicated blocks:        0 (0.0 %)
 Default replication factor:    3
 Average block replication:    3.0
 Corrupt blocks:        0
 Missing replicas:        0 (0.0 %)
 Number of data-nodes:        3
 Number of racks:        1
FSCK ended at Tue Jun 06 13:04:57 IST 2017 in 82 milliseconds


The filesystem under path '/user/hive/warehouse/final_parsing.db/02day02/' is HEALTHY

Our current client configuration .

Below are the current configs that didn't help us . 

so we should bump it up But it didnt help us either Same error . 

dfs.blocksize - 134217728 equals 64 MB

dfs.client-write-packet-size
65536


dfs.client.read.shortcircuit.streams.cache.expiry.ms
300000

dfs.stream-buffer-size
4096
dfs.client.read.shortcircuit.streams.cache.size 
is 256

Linux Ulimit  - 3024

Error stack race

[UD1@slave1 target]# hadoop jar parquet-tools-1.9.1-SNAPSHOT.jar merge /user/hive/warehouse/final_parsing.db/02day02/ /user/hive/warehouse/final_parsing.db/02day02/merged.parquet

17/06/06 12:48:16 WARN hdfs.BlockReaderFactory: BlockReaderFactory(fileName=/user/hive/warehouse/final_parsing.db/02day02/part-r-00000-377a9cc1-841a-4ec6-9e0f-0c009f44f6b3.snappy.parquet, block=BP-1780335730-192.168.200.234-1492815207875:blk_1074179291_439068): error creating ShortCircuitReplica.
java.io.IOException: Illegal seek
	at sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
	at sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:52)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:220)
	at sun.nio.ch.IOUtil.read(IOUtil.java:197)
	at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:699)
	at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:684)
	at org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.preadHeader(BlockMetadataHeader.java:124)
	at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitReplica.<init>(ShortCircuitReplica.java:126)
	at org.apache.hadoop.hdfs.BlockReaderFactory.requestFileDescriptors(BlockReaderFactory.java:619)
	at org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:551)
	at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:784)
	at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:718)
	at org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:484)
	at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:354)
	at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:652)
	at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:879)
	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:937)
	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:732)
	at java.io.FilterInputStream.read(FilterInputStream.java:83)
	at org.apache.parquet.hadoop.util.H2SeekableInputStream.read(H2SeekableInputStream.java:64)
	at org.apache.parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:83)
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:480)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:580)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:565)
	at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:496)
	at org.apache.parquet.hadoop.ParquetFileWriter.appendFile(ParquetFileWriter.java:494)
	at org.apache.parquet.tools.command.MergeCommand.execute(MergeCommand.java:79)
	at org.apache.parquet.tools.Main.main(Main.java:223)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
17/06/06 12:48:16 WARN shortcircuit.ShortCircuitCache: ShortCircuitCache(0x32442dd0): failed to load 1074179291_BP-1780335730-192.168.200.234-1492815207875
17/06/06 12:48:16 WARN hdfs.BlockReaderFactory: I/O error constructing remote block reader.
java.net.SocketException: Too many open files
	at sun.nio.ch.Net.socket0(Native Method)
	at sun.nio.ch.Net.socket(Net.java:423)
	at sun.nio.ch.Net.socket(Net.java:416)
	at sun.nio.ch.SocketChannelImpl.<init>(SocketChannelImpl.java:104)
	at sun.nio.ch.SelectorProviderImpl.openSocketChannel(SelectorProviderImpl.java:60)
	at java.nio.channels.SocketChannel.open(SocketChannel.java:142)
	at org.apache.hadoop.net.StandardSocketFactory.createSocket(StandardSocketFactory.java:62)
	at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3526)
	at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:840)
	at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:755)
	at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:376)
	at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:652)
	at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:879)
	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:937)
	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:732)
	at java.io.FilterInputStream.read(FilterInputStream.java:83)
	at org.apache.parquet.hadoop.util.H2SeekableInputStream.read(H2SeekableInputStream.java:64)
	at org.apache.parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:83)
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:480)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:580)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:565)
	at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:496)
	at org.apache.parquet.hadoop.ParquetFileWriter.appendFile(ParquetFileWriter.java:494)
	at org.apache.parquet.tools.command.MergeCommand.execute(MergeCommand.java:79)
	at org.apache.parquet.tools.Main.main(Main.java:223)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

Correct me if I am wrong this comes because the Cache is getting expired soon before it comes back to check the block .

Could anyone recommend a solution for this

mbigelow · ‎06-06-2017

The parquet-tools are available with CDH and I recommend using that one as it is built for your version of CDH. Check under /opt/cloudera/parcels/CDH/lib/parquet/parquet-tools.jar.

The warning seems to indicate that the parquet tools are trying to use the short circuit feature to bypass the NN and DN, but it does seem to failback to normal block access methods after failing that.

The actual error is too many open files. Try the ulimit -a or ulimit -Hn or ulimit -Sn. This will show the limits on the number of open files the logged in user can have. The default for RHEL/Centos has been 1024 for some time. You are trying to open 2500 files at once. Increase it to 2500+ or reduce the number of files you are trying to merge together at once.

csguna · ‎06-06-2017

@mbigelow

Thanks for the quick trurn around . Will try and let you know the results.

csguna · ‎06-07-2017

@mbigelow

Parquet - Merge

No. of files 	Ulimit 	       Size (GB)
5000 	        15000 	            2.1
4140 	        10000 	            1.4

System Configuration

model name	: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
cpu MHz		: 1257.457
cache size	: 20480 KB
cpu cores	: 8
processor	: 15
cpu MHz		: 1212.914
cache size	: 20480 KB
siblings	: 16
core id		: 7
cpu cores	: 8

The above error got fixed as per your suggestion by increasing the Ulimit .

But do you think the above benchmark is good or should we decrease the number of files in the folder .

Please let me know your thoughts

Cant thank you enough for the help .

mbigelow · ‎06-07-2017

Is size the amount after the merge? What was the average size before? How long did it take to run?

csguna · ‎06-13-2017

@mbigelow Sorry for the late response , was on Vacation :))

bleow is my answer to the questionaire

is size the amount after the merge?

Yes it .

What was the average size before?

between 50kb to 100 kb .

How long did it take to run?

10-15 minutes

Cloudera Community

Support Questions

error creating ShortCircuitReplica - Merge Parquet