Unable to set the stripe size for an ORC file using PySpark

I have configured my SparkSession to set the ORC stripe size to 128 MB, but the Spark DataFrame writer still produces small files (~5 MB each).

 

# Create the Spark session
from pyspark.sql import SparkSession


spark = (
    SparkSession.builder.master("local")
    .appName("test-optimal-orc")
    .config("spark.sql.orc.stripe.size", "134217728")
    .config("spark.sql.orc.impl", "native")
    .config("spark.sql.hive.convertMetastoreOrc", "true")
    .config("orc.stripe.size", "134217728")
    .getOrCreate()
)

 

df.write.mode("overwrite").option("orc.stripe.size", "134217728").orc("<S3://location>")

 

When I dumped one of the output ORC files, it contained only a single stripe holding all 300000 rows.

Snippet of my ORC file dump:

File Version: 0.12 with ORC_135
Rows: 300000
Compression: SNAPPY
Compression size: 262144
.
.
.

Stripe Statistics:
Stripe 1:
Column 0: count: 300000 hasNull: false
Column 1: count: 300000 hasNull: false min: max: 84151168997 sum: 2000000
Column 2: count: 300000 hasNull: false min: max: 800016871046 sum: 1730000
Column 3: count: 300000 hasNull: false min: 8059509582 max: 8065279467 sum: 3000000
.
.
.

File Statistics:
Column 0: count: 300000 hasNull: false
Column 1: count: 300000 hasNull: false min: max: 84151168997 sum: 2000000
Column 2: count: 300000 hasNull: false min: max: 800016871046 sum: 1730000
Column 3: count: 300000 hasNull: false min: 8059509582 max: 8065279467 sum: 3000000
.
.
.

Stripes:
Stripe: offset: 3 data: 5446727 rows: 300000 tail: 808 index: 13970
Stream: column 0 section ROW_INDEX start: 3 length 29
Stream: column 1 section ROW_INDEX start: 32 length 537
Stream: column 2 section ROW_INDEX start: 569 length 368
Stream: column 3 section ROW_INDEX start: 937 length 550

File length: 5463339 bytes
Padding length: 0 bytes
Padding ratio: 0%
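
As a quick sanity check on the numbers in this dump, here is a small sketch that compares the single stripe's on-disk size with the 128 MB I requested. The constant names are mine; the figures are copied from the dump above. It is only a rough comparison: as far as I understand, `orc.stripe.size` applies to the writer's in-memory buffer, whereas the dump reports compressed bytes, and the rows-to-fill estimate assumes the same compressed bytes per row would hold for more data.

```python
# Figures copied from the orcfiledump output above.
STRIPE_INDEX_BYTES = 13970
STRIPE_DATA_BYTES = 5446727
STRIPE_TAIL_BYTES = 808
FILE_LENGTH_BYTES = 5463339
ROWS = 300000

# Stripe size requested through the writer options (128 MB).
REQUESTED_STRIPE_BYTES = 134217728

# On-disk size of the only stripe: index + data + tail sections.
stripe_bytes = STRIPE_INDEX_BYTES + STRIPE_DATA_BYTES + STRIPE_TAIL_BYTES

# Fraction of the requested stripe size that this file's data occupies.
fill_ratio = stripe_bytes / REQUESTED_STRIPE_BYTES

# Rough estimate only: assumes the compressed bytes-per-row stays constant.
bytes_per_row = stripe_bytes / ROWS
rows_to_fill = int(REQUESTED_STRIPE_BYTES / bytes_per_row)

print(f"stripe bytes on disk : {stripe_bytes}")
print(f"share of 128 MB      : {fill_ratio:.1%}")
print(f"approx rows to fill  : {rows_to_fill}")
```

If this reading is right, the whole file holds far less data than one requested stripe, so a single stripe per file would be expected no matter what stripe size is configured.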

I'm pointing a Hive table at this S3 file location. Could you please advise how to set the stripe size and the row index stride when writing ORC files with PySpark?
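
For reference, here is a minimal sketch of what I am trying to achieve, not a confirmed solution. It assumes that Spark's native ORC writer forwards `orc.*` writer options such as `orc.stripe.size` and `orc.row.index.stride` to the underlying ORC writer. The helper names `orc_writer_options` and `write_orc` and the `num_files` parameter are hypothetical. The optional `repartition` call reflects my understanding that each Spark task writes its own file, so the output file size depends on how much data each partition holds and not on the stripe size.

```python
def orc_writer_options(stripe_size_bytes, row_index_stride):
    """Build ORC writer options as strings (hypothetical helper)."""
    return {
        "orc.stripe.size": str(stripe_size_bytes),
        "orc.row.index.stride": str(row_index_stride),
    }


def write_orc(df, path, stripe_size_bytes=128 * 1024 * 1024,
              row_index_stride=10000, num_files=None):
    """Write df as ORC with explicit stripe and stride settings (sketch).

    Assumption: the ORC data source passes these orc.* options through
    to the ORC writer. num_files, if given, repartitions the DataFrame
    so that fewer, larger files are written.
    """
    if num_files is not None:
        df = df.repartition(num_files)

    writer = df.write.mode("overwrite")
    for key, value in orc_writer_options(stripe_size_bytes, row_index_stride).items():
        writer = writer.option(key, value)
    writer.orc(path)
```

It would be called as `write_orc(df, "<S3://location>", num_files=1)`, with `df` being the DataFrame from my job. Even if the options are honored, a file cannot contain a stripe larger than the data written to it, so a 128 MB stripe would only show up once each output file holds at least that much data.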
