
Why does Hive OrcWriter performance / throughput not increase with concurrent writers?


New Contributor

I am using org.apache.hadoop.hive.ql.io.orc.Writer to write ORC files directly to HDFS. I have noticed that I can write up to 100 MB worth of records/rows (about 100K rows, based on my Java POJO) in about 22 seconds with 1 writer in 1 thread.

However, if I increase the number of concurrent writers to, say, 2, writing 2 x 100 MB worth of records/rows (100 MB in each file) takes about 45 seconds. Ideally I would expect this to also complete in about 22 seconds, thereby doubling the throughput.

Is there some locking/throttling involved here with Hive Orc Writer or HDFS?

Here are configs for my OrcFile Writer Options:

- bufferSize = 10000

- stripeSize = 100000

- compress = SNAPPY

- version = OrcFile.Version.CURRENT

- blockSize = 67108864

I am using Hive version 0.14. Any help appreciated.
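
For context, here is a minimal sketch of how I time the concurrent writers (class and method names here are illustrative; writeOrcFile stands in for the actual writer code, which I have posted in a reply below):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ConcurrentOrcBenchmark {
        public static void main(String[] args) throws Exception {
            int numWriters = Integer.parseInt(args[0]); // e.g. 1 or 2
            ExecutorService pool = Executors.newFixedThreadPool(numWriters);
            long start = System.nanoTime();
            for (int i = 0; i < numWriters; i++) {
                // each task writes one ~100 MB ORC file (about 100K rows)
                pool.submit(() -> writeOrcFile(100_000));
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.MINUTES);
            System.out.printf("Wrote %d files in %d ms%n",
                    numWriters, (System.nanoTime() - start) / 1_000_000);
        }

        private static void writeOrcFile(int rows) {
            // one org.apache.hadoop.hive.ql.io.orc.Writer per thread,
            // configured with the options listed above
        }
    }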

3 REPLIES

Re: Why does Hive OrcWriter performance / throughput not increase with concurrent writers?

So you write directly from a Java POJO? Could you tell us the reason for that? Normally ORC throughput is increased by increasing the number of tasks in the cluster: reducers/mappers/Storm bolts ...

It's actually an interesting question. Can you share the code showing how you did it? ORC writers share a central memory manager to make sure they do not overrun the memory of the Java task, so there may be a central bottleneck within a single JVM.
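
One JVM-level knob worth checking, if I remember the ORC internals correctly (a sketch only; whether this helps in your standalone setup is an assumption):

    import org.apache.hadoop.conf.Configuration;

    // All ORC writers in one JVM draw their stripe buffers from a shared pool
    // sized as a fraction of the heap (hive.exec.orc.memory.pool, default 0.5).
    // With several concurrent writers, the memory manager scales each writer's
    // allowance down, so stripes may be flushed earlier and smaller.
    Configuration conf = new Configuration();
    conf.setFloat("hive.exec.orc.memory.pool", 0.8f); // give writers 80% of heap
    // then pass this conf to OrcFile.writerOptions(conf) when creating each writer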

Re: Why does Hive OrcWriter performance / throughput not increase with concurrent writers?

New Contributor

Yes, I write a Java POJO directly via the ORC Writer.

Below is sample pseudocode:

    import java.util.UUID;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.io.orc.CompressionKind;
    import org.apache.hadoop.hive.ql.io.orc.OrcFile;
    import org.apache.hadoop.hive.ql.io.orc.Writer;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;

    try {
        // Derive the ORC schema from the POJO fields via reflection
        ObjectInspector inspector = ObjectInspectorFactory.getReflectionObjectInspector(
                JavaPojo.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
        Configuration hdfsConfig = new Configuration();
        hdfsConfig.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
        hdfsConfig.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
        OrcFile.WriterOptions options = OrcFile.writerOptions(hdfsConfig)
                .inspector(inspector)
                .stripeSize(orcFileStripSizeBytes)
                .bufferSize(orcFileWriterBufferSizeBytes)
                .compress(CompressionKind.valueOf(orcFileCompression))
                .blockSize(blockSize)
                .version(OrcFile.Version.CURRENT);
        Writer writer = OrcFile.createWriter(new Path("hdfs://my-dev-box:8020/tmp/"
                + UUID.randomUUID().toString().substring(0, 6) + "-test.data"), options);
        for (int i = 0; i < 100_000; i++) {
            JavaPojo pojo = getNewRecord();
            writer.addRow(pojo); // addRow appends one row in the Hive 0.14 Writer API
        }
        writer.close();
    } catch (Exception e) {
        throw new RuntimeException(e);
    }

The values for block size, stripe size, and buffers are as mentioned in my first post. Is there a better way to write ORC to HDFS, or should I use a different writer?


Re: Why does Hive OrcWriter performance / throughput not increase with concurrent writers?

Mentor

I would refer to the source code of storm-hive or the Flume Hive sink, as these projects achieve pretty fast throughput. They use Hive streaming, though: https://github.com/apache/flume/tree/trunk/flume-ng-sinks/flume-hive-sink/src
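
For reference, a minimal sketch of the Hive streaming API those sinks are built on (a sketch only: the metastore URI, database, table, and column names are placeholders, and the target table must be a bucketed, transactional ORC table):

    import java.util.Arrays;

    import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
    import org.apache.hive.hcatalog.streaming.HiveEndPoint;
    import org.apache.hive.hcatalog.streaming.StreamingConnection;
    import org.apache.hive.hcatalog.streaming.TransactionBatch;

    public class HiveStreamingSketch {
        public static void main(String[] args) throws Exception {
            // Point at the metastore and the target ACID table (+ partition values)
            HiveEndPoint endPoint = new HiveEndPoint(
                    "thrift://my-dev-box:9083", "mydb", "mytable",
                    Arrays.asList("2015"));
            StreamingConnection conn = endPoint.newConnection(true); // auto-create partition

            // Maps delimited records onto the table's columns
            DelimitedInputWriter writer = new DelimitedInputWriter(
                    new String[]{"col1", "col2"}, ",", endPoint);

            // Rows are written inside transaction batches and become visible on commit
            TransactionBatch txnBatch = conn.fetchTransactionBatch(10, writer);
            txnBatch.beginNextTransaction();
            txnBatch.write("a,1".getBytes());
            txnBatch.write("b,2".getBytes());
            txnBatch.commit();
            txnBatch.close();
            conn.close();
        }
    }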
