Member since: 10-06-2019
Posts: 3
Kudos Received: 0
Solutions: 0
10-07-2019 04:47 PM
Hi, I'm using Hive 2.3.4 and Spark 2.4.4 with Hadoop 2.8.5, but my PySpark code is still not picking up my stripe size parameter for ORC creation. I have posted a new question in this community as well: https://community.cloudera.com/t5/Support-Questions/Unable-to-set-stripe-size-for-the-orc-file-using-python/td-p/278918 Could you please advise on this? Thanks, Sai
10-06-2019 06:15 PM
I have configured my SparkSession to set the ORC file stripe size to 128 MB, but the Spark DataFrame is writing the files with small file sizes (~5 MB).
# Creating Spark Session
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local") \
    .appName("test-optimal-orc") \
    .config("spark.sql.orc.stripe.size", "134217728") \
    .config("spark.sql.orc.impl", "native") \
    .config("spark.sql.hive.convertMetastoreOrc", "true") \
    .config("orc.stripe.size", "134217728") \
    .getOrCreate()

df.write.mode("overwrite").option("orc.stripe.size", "134217728").orc(<S3://location>)
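For reference, here is a minimal sketch of an alternative way to pass the property, reusing the spark session and df from the snippet above: setting orc.stripe.size on the underlying Hadoop configuration rather than only on the session config. This assumes the native ORC writer reads the key from the job's Hadoop configuration, which is not confirmed here, and the output path is hypothetical.

# Sketch: set the ORC writer property on the Hadoop configuration
# (assumption: the native ORC writer picks this key up from the job conf).
spark.sparkContext._jsc.hadoopConfiguration().set("orc.stripe.size", "134217728")

# Hypothetical output path for illustration only.
df.write.mode("overwrite").orc("s3://bucket/table-path/")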
When I checked the dump of one output ORC file, it has only one stripe with 300000 rows in it.
Snippet of my ORC file dump:
File Version: 0.12 with ORC_135
Rows: 300000
Compression: SNAPPY
Compression size: 262144
. . .
Stripe Statistics:
  Stripe 1:
    Column 0: count: 300000 hasNull: false
    Column 1: count: 300000 hasNull: false min: max: 84151168997 sum: 2000000
    Column 2: count: 300000 hasNull: false min: max: 800016871046 sum: 1730000
    Column 3: count: 300000 hasNull: false min: 8059509582 max: 8065279467 sum: 3000000
    . . .
File Statistics:
  Column 0: count: 300000 hasNull: false
  Column 1: count: 300000 hasNull: false min: max: 84151168997 sum: 2000000
  Column 2: count: 300000 hasNull: false min: max: 800016871046 sum: 1730000
  Column 3: count: 300000 hasNull: false min: 8059509582 max: 8065279467 sum: 3000000
  . . .
Stripes:
  Stripe: offset: 3 data: 5446727 rows: 300000 tail: 808 index: 13970
    Stream: column 0 section ROW_INDEX start: 3 length 29
    Stream: column 1 section ROW_INDEX start: 32 length 537
    Stream: column 2 section ROW_INDEX start: 569 length 368
    Stream: column 3 section ROW_INDEX start: 937 length 550

File length: 5463339 bytes
Padding length: 0 bytes
Padding ratio: 0%
I'm pointing the Hive table at these file locations in S3. Could you please advise how to set the stripe size and row index stride when writing ORC files using PySpark?
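As a side note, orc.stripe.size acts as an upper bound on stripe size rather than a minimum, so a file that only receives about 5 MB of data will contain a single small stripe whatever the setting. Below is a minimal sketch, reusing df from above, of reducing the number of output files so each one receives more data and of passing both the stripe size and the row index stride as writer options; it assumes the native ORC writer forwards these keys, and the path and partition count are hypothetical.

# Sketch: fewer output files so each holds enough data to fill a stripe,
# with stripe size and row index stride passed as writer options.
# Assumptions: the native ORC writer forwards these keys; path and
# partition count are hypothetical.
df.repartition(1) \
    .write.mode("overwrite") \
    .option("orc.stripe.size", "134217728") \
    .option("orc.row.index.stride", "10000") \
    .orc("s3://bucket/optimal-orc/")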