
Unable to set stripe size for the orc file using python spark


I have configured my SparkSession to set the ORC stripe size to 128 MB, but the Spark DataFrame write is producing small files (~5 MB each).


# Create the Spark session with ORC stripe-size settings
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local") \
    .appName("test-optimal-orc") \
    .config("spark.sql.orc.stripe.size", "134217728") \
    .config("spark.sql.orc.impl", "native") \
    .config("spark.sql.hive.convertMetastoreOrc", "true") \
    .config("orc.stripe.size", "134217728") \
    .getOrCreate()


df.write.mode("overwrite").option("orc.stripe.size", "134217728").orc(<S3://location>)


When I checked the dump of one output ORC file, it contains only one stripe with 300,000 rows in it.
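My understanding, which may be wrong, is that orc.stripe.size is an upper bound on stripe size rather than a target file size, and that each Spark task writes its own file. If that is right, a task holding only ~5.4 MB of data would always produce a single stripe well under the 128 MB ceiling. A quick sanity check on the numbers from my dump (plain Python, sizes taken from the output below):

```python
# Rough check: can one stripe hold all of this task's data under the
# configured 128 MB (134217728-byte) stripe-size ceiling?
STRIPE_SIZE = 134217728          # value passed via orc.stripe.size
data_bytes = 5446727             # "data:" field of Stripe 1 in the dump

stripes_needed = -(-data_bytes // STRIPE_SIZE)   # ceiling division
print(stripes_needed)                            # -> 1: all rows fit in one stripe
print(round(data_bytes / 1024 / 1024, 1))        # -> 5.2 (MB of stripe data)
```

So the single stripe seems expected given the small amount of data per task, and the real question may be how to get more data into each write task.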


Snippet of my orc file dump:

File Version: 0.12 with ORC_135
Rows: 300000
Compression: SNAPPY
Compression size: 262144
.
.
.

Stripe Statistics:
Stripe 1:
Column 0: count: 300000 hasNull: false
Column 1: count: 300000 hasNull: false min: max: 84151168997 sum: 2000000
Column 2: count: 300000 hasNull: false min: max: 800016871046 sum: 1730000
Column 3: count: 300000 hasNull: false min: 8059509582 max: 8065279467 sum: 3000000
.
.
.

File Statistics:
Column 0: count: 300000 hasNull: false
Column 1: count: 300000 hasNull: false min: max: 84151168997 sum: 2000000
Column 2: count: 300000 hasNull: false min: max: 800016871046 sum: 1730000
Column 3: count: 300000 hasNull: false min: 8059509582 max: 8065279467 sum: 3000000
.
.
.

Stripes:
Stripe: offset: 3 data: 5446727 rows: 300000 tail: 808 index: 13970
Stream: column 0 section ROW_INDEX start: 3 length 29
Stream: column 1 section ROW_INDEX start: 32 length 537
Stream: column 2 section ROW_INDEX start: 569 length 368
Stream: column 3 section ROW_INDEX start: 937 length 550

File length: 5463339 bytes
Padding length: 0 bytes
Padding ratio: 0%


I'm pointing these file locations (S3) at a Hive table. Could you please advise how to set the stripe and row-index stride sizes when writing ORC files with PySpark?
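For context, the workaround I'm considering (unverified, and the helper name is my own) is to repartition so that each task carries roughly a stripe's worth of data before writing, since the stripe size only caps how large a stripe can grow:

```python
# Hypothetical helper: choose a partition count so each output file
# holds roughly one full 128 MB stripe of raw data.
def partitions_for(total_bytes, target_bytes=134217728):
    """Ceiling of total_bytes / target_bytes, with a floor of 1."""
    return max(1, -(-total_bytes // target_bytes))

# Example: ~1.6 GB of input data -> 12 output files of ~128 MB each.
print(partitions_for(1_600_000_000))  # -> 12

# Intended PySpark usage (sketch, not yet tested):
# df.repartition(partitions_for(estimated_bytes)) \
#   .write.mode("overwrite") \
#   .option("orc.stripe.size", "134217728") \
#   .orc("s3://bucket/path")
```

Is this the right approach, or is there a setting I'm missing that makes the ORC writer itself produce larger files?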