I 'm trying to create an ORC-table, which can store orc-files as 'native' OrcFileFormat. Unfortunately, I 'm having an issue with using partitions and saving orc-files as 'native' OrcFileFormat.
1. For partitioning, I'm using the 'PARTITION BY COLUMN' clause. This results into the following query:
CREATE TABLE livedm.events ( day DATE, datetime TIMESTAMP, description STRING, eventtype STRING ) USING HIVE OPTIONS (fileFormat 'ORC') PARTITIONED BY (eventtype )
2. Further on, I'm using (Py)Spark to insert data into the table. In the spark context, I added some configurations, such as spark.sql.orc.impl=native.
# Create SparkContext with configurations for new OrcFormat session_builder = pyspark.sql.SparkSession.builder \ .enableHiveSupport() \ .appName("test-orc") \ .config("spark.sql.hive.convertMetastoreOrc", "true") \ .config("spark.sql.orc.cache.stripe.details.size", "10000") \ .config("spark.sql.orc.enabled", "true") \ .config("spark.sql.orc.filterPushdown", "true") \ .config("spark.sql.orc.splits.include.file.footer", "true") \ .config("spark.sql.hive.metastorePartitionPruning", "true") \ .config("spark.sql.orc.impl","native") \ .config("spark.sql.orc.enableVectorizedReader", "true") spark = session_builder.getOrCreate() # event_df is a dataframe with eventdata event_df = spark.sql("select * from tempTable") event_df.write.mode("append").format("org.apache.spark.sql.execution.datasources.orc").insertInto("livedm.events")
3. Next, I'm using the following command to check the OrcFileFormat:
%sh java -jar /opt/orc-test/orc-tools-1.4.3-uber.jar meta hdfs://root/apps/hive/warehouse/livedm.db/events 2> /dev/null | grep Version | head -n1
Which results into:
File Version: 0.12 with HIVE_8732
Unfortunately, orc-files have been stored as 'hive' orcFileFormat
When I 'm repeating the earlier steps, without using any partitioning on step 1, the orc-files are successfully being stored as 'native' orcFileFormat. By using the command from step 3, the following are being returned:
File Version: 0.12 with ORC_135
How can I use partitioning (properly) when creating an Orc-table, and store data as 'native ' orcFileFormat?
Solved it by replacing the "USING HIVE OPTIONS (fileFormat 'ORC')"-clause with "USING ORC"-clause.
I´m really wondering about if there something in which works well in cloudera platform without being annoying!