Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Creating ORC table - using new ORC file, hive and a column partitioning

avatar

I 'm trying to create an ORC-table, which can store orc-files as 'native' OrcFileFormat. Unfortunately, I 'm having an issue with using partitions and saving orc-files as 'native' OrcFileFormat.


With Partitioning

1. For partitioning, I'm using the 'PARTITION BY COLUMN' clause. This results into the following query:

CREATE TABLE livedm.events  (
     day DATE,
     datetime TIMESTAMP,
     description STRING,
     eventtype STRING
)
USING HIVE OPTIONS (fileFormat 'ORC')
PARTITIONED BY (eventtype )


2. Further on, I'm using (Py)Spark to insert data into the table. In the spark context, I added some configurations, such as spark.sql.orc.impl=native.

# Create SparkContext with configurations for new OrcFormat
session_builder = pyspark.sql.SparkSession.builder \
    .enableHiveSupport() \
    .appName("test-orc") \
    .config("spark.sql.hive.convertMetastoreOrc", "true") \
    .config("spark.sql.orc.cache.stripe.details.size", "10000") \
    .config("spark.sql.orc.enabled", "true") \
    .config("spark.sql.orc.filterPushdown", "true") \
    .config("spark.sql.orc.splits.include.file.footer", "true") \
    .config("spark.sql.hive.metastorePartitionPruning", "true") \
    .config("spark.sql.orc.impl","native") \
    .config("spark.sql.orc.enableVectorizedReader", "true")   
spark = session_builder.getOrCreate()


# event_df is a dataframe with eventdata
event_df = spark.sql("select * from tempTable")
event_df.write.mode("append").format("org.apache.spark.sql.execution.datasources.orc").insertInto("livedm.events")


3. Next, I'm using the following command to check the OrcFileFormat:

%sh
java -jar /opt/orc-test/orc-tools-1.4.3-uber.jar meta hdfs://root/apps/hive/warehouse/livedm.db/events 2> /dev/null | grep Version | head -n1

Which results into:

File Version: 0.12 with HIVE_8732

Unfortunately, orc-files have been stored as 'hive' orcFileFormat


Without Partitioning

When I 'm repeating the earlier steps, without using any partitioning on step 1, the orc-files are successfully being stored as 'native' orcFileFormat. By using the command from step 3, the following are being returned:

File Version: 0.12 with ORC_135


Question

How can I use partitioning (properly) when creating an Orc-table, and store data as 'native ' orcFileFormat?

2 REPLIES 2

avatar

Solved it by replacing the "USING HIVE OPTIONS (fileFormat 'ORC')"-clause with "USING ORC"-clause.

avatar
New Contributor

I´m really wondering about if there something in which works well in cloudera platform without being annoying!