In this tutorial, we will learn how to create Apache Ozone volumes, buckets, and keys. After that, we will see how to create an Apache Hive table backed by Apache Ozone, and finally how to insert and read data using Apache Spark.
Ozone
Create a volume named vol1.
# ozone sh volume create /vol1
21/08/25 06:23:27 INFO rpc.RpcClient: Creating Volume: vol1, with root as owner.
Create a bucket named bucket1 under vol1.
# ozone sh bucket create /vol1/bucket1
21/08/25 06:24:09 INFO rpc.RpcClient: Creating Bucket: vol1/bucket1, with Versioning false and Storage Type set to DISK and Encryption set to false
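The introduction also mentions keys. As a quick check, you can write a key into the new bucket and list it; the key name key1 and the local file /tmp/test.txt below are placeholders for illustration:
# ozone sh key put /vol1/bucket1/key1 /tmp/test.txt
# ozone sh key list /vol1/bucket1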
Hive
Launch the beeline shell.
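The exact connection string depends on your cluster; a minimal sketch, assuming a HiveServer2 host of hiveserver2.example.com (a placeholder) on the default port:
# beeline -u "jdbc:hive2://hiveserver2.example.com:10000"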
Create the employee table in Hive.
Note: Replace om.host.example.com with your Ozone Manager host.
CREATE DATABASE IF NOT EXISTS ozone_db;
USE ozone_db;
CREATE EXTERNAL TABLE IF NOT EXISTS `employee`(
`id` bigint,
`name` string,
`age` smallint)
STORED AS parquet
LOCATION 'o3fs://bucket1.vol1.om.host.example.com/employee';
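To optionally confirm that the bucket is reachable at the LOCATION used above, the Hadoop-compatible ozone fs command can list it (same placeholder OM host as in the DDL):
# ozone fs -ls o3fs://bucket1.vol1.om.host.example.com/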
Spark
Spark2:
Launch the spark-shell.
spark-shell
Run the following queries to insert and read data from the Hive employee table.
spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (1, "Ranga", 33)""")
spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (2, "Nishanth", 3)""")
spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (3, "Raja", 59)""")
spark.sql("SELECT * FROM ozone_db.employee").show()
Spark3:
Launch the spark3-shell.
spark3-shell
Run the following queries to insert and read data from the Hive employee table.
spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (1, "Ranga", 33)""")
spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (2, "Nishanth", 3)""")
spark.sql("""INSERT INTO TABLE ozone_db.employee VALUES (3, "Raja", 59)""")
spark.sql("SELECT * FROM ozone_db.employee").show()
Notes:
If you get java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.ozone.OzoneFileSystem not found, add the /opt/cloudera/parcels/CDH/jars/hadoop-ozone-filesystem-hadoop3-*.jar to the Spark classpath using the --jars option (see the example after these notes).
In a Kerberized environment, we must specify the spark.yarn.access.hadoopFileSystems configuration; otherwise, Spark cannot obtain a delegation token for Ozone and the job fails with an authentication error (see the example after these notes).
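Putting both notes together, a minimal sketch of a launch command (the JAR wildcard and the OM host are placeholders from the steps above; on Spark 3, the equivalent setting is named spark.kerberos.access.hadoopFileSystems):
spark-shell \
  --jars /opt/cloudera/parcels/CDH/jars/hadoop-ozone-filesystem-hadoop3-*.jar \
  --conf spark.yarn.access.hadoopFileSystems=o3fs://bucket1.vol1.om.host.example.com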