How to set the orc.stripe.size value for a Hive table that stores hundreds of millions of rows?

New Contributor

I am trying to create several tables, each with hundreds of millions of rows.

I create the tables and load data while changing the orc.stripe.size value (64MB or 256MB) per table, because the tables differ in size. (For tables with a lot of data, an error occurred at 64MB, so I increased the value to 256MB.)


However, if orc.stripe.size is set to a large value (e.g. 256MB) for a table with relatively little data, the following error occurs.
I don't know how to choose this value when creating the tables. Is there a way to set it according to the table size?

 

SessionState: Vertex failed, vertexName=Map 1, vertexId=vertex_1666932355626_0180_236_01, diagnostics=[Task failed, taskId=task_1666932355626_0180_236_01_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1666932355626_0180_236_01_000000_0:java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: java.io.EOFException: Can't finish byte read from uncompressed stream DATA position: 81920 length: 81920 range: 4 offset: 65536 position: 16384 limit: 16384
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:348)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:276)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:381)
    at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:82)
    at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:69)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:69)
    at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:39)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:118)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
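
For context, a per-table stripe size is normally given in bytes through TBLPROPERTIES when the table is created. A minimal sketch, assuming hypothetical table and column names (events_large, events_small, id, payload); only the orc.stripe.size property and the 64MB/256MB values come from this thread:

```sql
-- Hypothetical large table: 256 MB stripes (256 * 1024 * 1024 bytes)
CREATE TABLE events_large (id BIGINT, payload STRING)
STORED AS ORC
TBLPROPERTIES ("orc.stripe.size"="268435456");

-- Hypothetical small table: the 64 MB default (64 * 1024 * 1024 bytes)
CREATE TABLE events_small (id BIGINT, payload STRING)
STORED AS ORC
TBLPROPERTIES ("orc.stripe.size"="67108864");

-- Check which properties a table was created with
SHOW TBLPROPERTIES events_large;
```

This only shows where the value is set; it does not by itself explain the EOFException above.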

Guru

@octor I see you have already raised https://issues.apache.org/jira/browse/HIVE-26720

May I ask why you want to change the stripe size?

New Contributor

If the table holds a large amount of data (billions of rows or more), a different error occurred when creating the table with the default stripe size of 64MB. When I increased it to 256MB, I confirmed that it works normally.

Guru

@octor You can tweak the split sizes if you are using Tez.

 

  1. set tez.grouping.min-size=16777216;--16 MB min split
  2. set tez.grouping.max-size=64000000;--64 GB max split

New Contributor

The current tez settings are as follows.

- tez.grouping.min-size=52428800; -- 50MB
- tez.grouping.max-size=1073741824; -- 1GB

 

I think the max-size you mentioned is 64MB, not 64GB.

set tez.grouping.min-size=16777216;--16 MB min split
set tez.grouping.max-size=64000000;--64 GB max split -> 64MB

 

From my experience creating tables and loading data, I can create tables and add data smoothly when orc.stripe.size is set between 64MB and 256MB. Can you tell me roughly what range I should use for tez.grouping.min-size and tez.grouping.max-size?

Guru

You can set tez.grouping.max-size to 1 GB.

Please also increase the following settings:

set hive.tez.container.size=10240;

set tez.runtime.io.sort.mb=4096; -- 40% of container size
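
Put together, the session settings suggested in this thread might look like the sketch below. The values are the ones quoted above, not tuned recommendations, and keeping the 16 MB min-size from the earlier reply is an assumption. The byte arithmetic is spelled out in the comments.

```sql
-- Split grouping bounds, in bytes
SET tez.grouping.min-size=16777216;    -- 16 * 1024 * 1024 = 16 MB
SET tez.grouping.max-size=1073741824;  -- 1024 * 1024 * 1024 = 1 GB

-- Container and sort buffer sizes, in MB
SET hive.tez.container.size=10240;     -- 10 GB container
SET tez.runtime.io.sort.mb=4096;       -- 4096 / 10240 = 40% of the container
```

These are session-level settings, so they would need to be run in the same session as the INSERT or CREATE TABLE statement they are meant to affect.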