Created on 08-25-2021 08:51 PM - edited on 09-08-2021 08:31 PM by subratadas
In this tutorial, we will learn how to create Apache Ozone volumes, buckets, and keys. After that, we will see how to access Apache Ozone data from Apache Spark.
# ozone sh volume create /vol1
21/08/25 06:23:27 INFO rpc.RpcClient: Creating Volume: vol1, with root as owner.
# ozone sh bucket create /vol1/bucket1
21/08/25 06:24:09 INFO rpc.RpcClient: Creating Bucket: vol1/bucket1, with Versioning false and Storage Type set to DISK and Encryption set to false
# vi /tmp/employee.csv
id,name,age
1,Ranga,33
2,Nishanth,4
3,Raja,60
# ozone sh key put /vol1/bucket1/employee.csv /tmp/employee.csv
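To confirm the upload, the key can be listed and inspected (a quick check; the ozone sh key list and key info subcommands are assumed to be available in your Ozone CLI version):
# ozone sh key list /vol1/bucket1
# ozone sh key info /vol1/bucket1/employee.csv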
To access Ozone using the o3fs scheme, ensure the filesystem implementation is configured in core-site.xml:
<property>
    <name>fs.o3fs.impl</name>
    <value>org.apache.hadoop.fs.ozone.OzoneFileSystem</value>
</property>
hdfs dfs -ls o3fs://bucket1.vol1.om-host.example.com/
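As a quick sanity check, the uploaded key can also be read back over o3fs (reusing the same example Ozone Manager host name):
hdfs dfs -cat o3fs://bucket1.vol1.om-host.example.com/employee.csv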
spark-shell
scala> val omHost="om.host.example.com"
scala> val df=spark.read.option("header", "true").option("inferSchema", "true").csv(s"o3fs://bucket1.vol1.${omHost}/employee.csv")
scala> df.show()
+---+--------+---+
| id| name|age|
+---+--------+---+
| 1| Ranga| 33|
| 2|Nishanth| 4|
| 3| Raja| 60|
+---+--------+---+
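Once the DataFrame is loaded from Ozone, it behaves like any other Spark DataFrame. For example, a small sketch that registers a temporary view and queries it with Spark SQL (the view name employee is just an illustration):
scala> df.createOrReplaceTempView("employee")
scala> spark.sql("SELECT name, age FROM employee WHERE age > 30").show()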
In a Kerberized environment, launch spark-shell with a keytab, a principal, and the Ozone filesystem listed in spark.yarn.access.hadoopFileSystems:
spark-shell \
--keytab ${KEY_TAB} \
--principal ${PRINCIPAL} \
--conf spark.yarn.access.hadoopFileSystems=o3fs://bucket1.vol1.om.host.example.com:9862
Note: In a Kerberized environment, the spark.yarn.access.hadoopFileSystems configuration must be specified; otherwise, access fails with the following error:
java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
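On Spark 3 based releases, the same access list may also be configured as spark.kerberos.access.hadoopFileSystems (the newer name for this setting; check the documentation for your Spark version), for example:
--conf spark.kerberos.access.hadoopFileSystems=o3fs://bucket1.vol1.om.host.example.com:9862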
scala> val omHost="om.host.example.com"
scala> val df=spark.read.option("header", "true").option("inferSchema", "true").csv(s"o3fs://bucket1.vol1.${omHost}/employee.csv")
scala> df.show()
+---+--------+---+
| id| name|age|
+---+--------+---+
| 1| Ranga| 33|
| 2|Nishanth| 4|
| 3| Raja| 60|
+---+--------+---+
scala> val age30DF = df.filter(df("age") > 30)
scala> val outputPath = s"o3fs://bucket1.vol1.${omHost}/employee_age30.csv"
scala> age30DF.write.option("header", "true").mode("overwrite").csv(outputPath)
scala> val df2=spark.read.option("header", "true").option("inferSchema", "true").csv(outputPath)
scala> df2.show()
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1|Ranga| 33|
| 3| Raja| 60|
+---+-----+---+
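Other Spark data sources work against the same bucket as well. For example, a sketch that writes the filtered DataFrame as Parquet and reads it back (the employee_age30_parquet key name is only an illustration):
scala> val parquetPath = s"o3fs://bucket1.vol1.${omHost}/employee_age30_parquet"
scala> age30DF.write.mode("overwrite").parquet(parquetPath)
scala> spark.read.parquet(parquetPath).show()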
Note: If you get java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.ozone.OzoneFileSystem not found, add /opt/cloudera/parcels/CDH/jars/hadoop-ozone-filesystem-hadoop3-*.jar to the Spark classpath using the --jars option.
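For example, a launch command along these lines (the exact jar file name depends on your CDH/CDP parcel version, so adjust the wildcard or spell out the full path):
spark-shell \
--keytab ${KEY_TAB} \
--principal ${PRINCIPAL} \
--conf spark.yarn.access.hadoopFileSystems=o3fs://bucket1.vol1.om.host.example.com:9862 \
--jars /opt/cloudera/parcels/CDH/jars/hadoop-ozone-filesystem-hadoop3-*.jar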
Thanks for reading this article. If you liked this article, you can give kudos.
Created on 07-27-2022 01:32 AM - edited 07-27-2022 01:35 AM
Hi,
I am looking for a solution and help regarding Apache Ozone. I am facing an issue while installing Apache Ozone on my CDP PvC Base 7.1.6 cluster.
Steps that I followed:
1) Add the service
2) Click on the Ozone service
3) Give the ozone.service.id
4) Choose all dependencies, including Ranger and HDFS
I have checked the logs of the Ozone Manager and the DataNodes, and on both machines the exception is the same:
log4j:ERROR Could not instantiate class [org.cloudera.log4j.redactor.RedactorAppender]. java.lang.ClassNotFoundException: org.cloudera.log4j.redactor.RedactorAppender
Please let me know if there's any solution available.
Thank you,
Parimal
Created on 01-23-2024 07:10 AM
Hi @parimalpatil
The RedactorAppender error can usually be ignored; it has nothing to do with the real failure unless the stack trace at the bottom points to one of the Ozone roles. This Log4j appender redacts log messages using redaction rules before delegating to other appenders. Please share the complete failure log so that we can check and update you.
The workaround is to add the jar file to the classpath of the roles where you see the RedactorAppender error.
You can do this through the CM UI -> Configuration -> search for "role_env_safety_valve" for the role that reports the error, and set:
OZONE_CLASSPATH=$OZONE_CLASSPATH:/opt/cloudera/parcels/CDH/jars/logredactor-2.0.8.jar
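To double-check the jar location before adding it (the logredactor version shipped in the parcel may differ from 2.0.8 depending on your CDP release):
# ls /opt/cloudera/parcels/CDH/jars/ | grep logredactor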
Created on 03-12-2025 02:06 AM
Hi,
Any idea how to access the Ozone FS when submitting a Spark job through Livy?
Caused by: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:808) ~[hadoop-common-3.1.1.7.1.9.14-2.jar:?]
at java.security.AccessController.doPrivileged(AccessController.java:712) ~[?:?]
at javax.security.auth.Subject.doAs(Subject.java:439) ~[?:?]
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) ~[hadoop-common-3.1.1.7.1.9.14-2.jar:?]
at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:771) ~[hadoop-common-3.1.1.7.1.9.14-2.jar:?]
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:866) ~[hadoop-common-3.1.1.7.1.9.14-2.jar:?]
at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:430) ~[hadoop-common-3.1.1.7.1.9.14-2.jar:?]
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1681) ~[hadoop-common-3.1.1.7.1.9.14-2.jar:?]
at org.apache.hadoop.ipc.Client.call(Client.java:1506) ~[hadoop-common-3.1.1.7.1.9.14-2.jar:?]
... 48 more
Created on 03-13-2025 11:13 PM - edited 03-13-2025 11:14 PM
Hi @jaris
How are you passing the Kerberos ticket?
Did you run a kinit before executing the Livy statement?
Obtain a Kerberos ticket via kinit and then try to execute the sample job below:
1. Copy the JAR to HDFS:
# hdfs dfs -put /opt/cloudera/parcels/CDH/jars/spark-examples<VERSION>.jar /tmp
2. Make sure the JAR is present.
# hdfs dfs -ls /tmp/
3. CURL command to run the Spark job using Livy API.
# curl -v -u: --negotiate -X POST --data '{"className": "org.apache.spark.examples.SparkPi", "jars": ["/tmp/spark-examples<VERSION>.jar"], "name": "livy-test", "file": "hdfs:///tmp/spark-examples<VERSION>.jar", "args": [10]}' -H "Content-Type: application/json" -H "X-Requested-By: User" http://<LIVY_NODE>:<PORT>/batches
4. Check for the running and completed Livy sessions.
# curl http://<LIVY_NODE>:<PORT>/batches/ | python -m json.tool
NOTE:
* Change the JAR version ( <VERSION> ) according to your CDP version.
* Replace LIVY_NODE and PORT with the actual values.
* If you are running the cluster in secure mode, make sure you have a valid Kerberos ticket and use Kerberos authentication in the curl command.
Created on 03-17-2025 03:46 AM - edited 03-17-2025 03:50 AM
Hi @haridjh
Thanks for the reply. The procedure you described uses HDFS as the store for the JAR files used by the Spark job. We have no problem using HDFS in a Spark job; the problem is accessing the Ozone FS (e.g., ofs) when the job is submitted via Livy.
1. Access files on Ozone in the Spark job, e.g.:
df = spark.read.parquet("ofs://ozone-service/volume/bucket/parquet")
2. Python job submitted via Livy:
kinit user
curl --negotiate -k -v -u : -X POST \
-H "Content-Type: application/json" \
--data '{ "file": "ozone_access.py"}' \
https://livy:28998/batches
3. Job is failing with:
Caused by: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
When we try to access Ozone directly via spark-shell or spark-submit, everything works fine, e.g.:
spark-shell \
--keytab ${KEY_TAB} \
--principal ${PRINCIPAL} \
--conf spark.yarn.access.hadoopFileSystems=o3fs://bucket1.vol1.om.host.example.com:9862
Setting a keytab and principal is not possible when submitting the job via Livy, because we are using proxy users with Livy.
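For reference, a request along these lines would pass the filesystem access configuration in the Livy batch payload (a sketch only, assuming the Livy batches API accepts a "conf" map; whether the proxy user can obtain Ozone delegation tokens this way is unverified):
curl --negotiate -k -v -u : -X POST \
-H "Content-Type: application/json" \
--data '{ "file": "ozone_access.py", "conf": { "spark.yarn.access.hadoopFileSystems": "ofs://ozone-service" } }' \
https://livy:28998/batches
Thanks.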