Community Articles

In this tutorial, we will learn how to create Apache Ozone volumes, buckets, and keys. After that, we will see how we can access Apache Ozone data in Apache Spark.

Ozone

  1. Create the volume with the name vol1 in Apache Ozone.
    # ozone sh volume create /vol1
    21/08/25 06:23:27 INFO rpc.RpcClient: Creating Volume: vol1, with root as owner.
  2. Create the bucket with the name bucket1 under vol1.
    # ozone sh bucket create /vol1/bucket1
    21/08/25 06:24:09 INFO rpc.RpcClient: Creating Bucket: vol1/bucket1, with Versioning false and Storage Type set to DISK and Encryption set to false
  3. Create the employee.csv file to upload to Ozone.
    # vi /tmp/employee.csv
    
    id,name,age
    1,Ranga,33
    2,Nishanth,4
    3,Raja,60
  4. Upload the employee.csv file to Ozone
    # ozone sh key put /vol1/bucket1/employee.csv /tmp/employee.csv
  5. Add the fs.o3fs.impl property to core-site.xml
    • Go to Cloudera Manager > HDFS > Configuration > search for core-site.xml > Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml
      <property>
        <name>fs.o3fs.impl</name>
        <value>org.apache.hadoop.fs.ozone.OzoneFileSystem</value>
      </property>
  6. List the files created earlier using the hdfs command.
    Note: Before running the following command, update the om-host.example.com value.
    hdfs dfs -ls o3fs://bucket1.vol1.om-host.example.com/
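The path in step 6 follows the pattern o3fs://<bucket>.<volume>.<om-host>/<key>. As a minimal sketch, the helper below assembles that URI from its parts so the same values can be reused in later commands. The function name o3fs_uri and the OM_HOST variable are hypothetical and not part of Ozone. The listing only runs when the hdfs client is on the PATH.

```shell
# Hypothetical helper: build an o3fs URI from bucket, volume, OM host, and optional key.
o3fs_uri() {
  printf 'o3fs://%s.%s.%s/%s\n' "$1" "$2" "$3" "${4:-}"
}

# Placeholder host; replace with your Ozone Manager host.
OM_HOST="${OM_HOST:-om-host.example.com}"
BUCKET_URI="$(o3fs_uri bucket1 vol1 "$OM_HOST")"
echo "$BUCKET_URI"

# List the bucket only when the hdfs client is available on this machine.
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -ls "$BUCKET_URI" || echo "listing failed; check the OM host and fs.o3fs.impl" >&2
fi
```

Calling o3fs_uri bucket1 vol1 "$OM_HOST" employee.csv yields the full path to the uploaded key.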

Spark

  1. Launch spark-shell
    spark-shell
  2. Run the following command to print the employee.csv file content.
    Note: Update the omHost value.
    scala> val omHost="om.host.example.com"
    
    scala> val df=spark.read.option("header", "true").option("inferSchema", "true").csv(s"o3fs://bucket1.vol1.${omHost}/employee.csv")
    
    scala> df.show()
    +---+--------+---+
    | id|    name|age|
    +---+--------+---+
    |  1|   Ranga| 33|
    |  2|Nishanth|  4|
    |  3|    Raja| 60|
    +---+--------+---+
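To avoid retyping the read in the REPL, the same statements from step 2 can be fed to spark-shell on standard input. The sketch below is one way to do that. The function name employee_read_script and the OM_HOST variable are hypothetical. If spark-shell is not on the PATH, the script is only printed for review.

```shell
# Hypothetical helper: print the Scala statements from step 2 for a given OM host.
employee_read_script() {
  cat <<EOF
val omHost = "$1"
val df = spark.read.option("header", "true").option("inferSchema", "true").csv(s"o3fs://bucket1.vol1.\${omHost}/employee.csv")
df.show()
EOF
}

# Placeholder host; replace with your Ozone Manager host.
OM_HOST="${OM_HOST:-om.host.example.com}"

if command -v spark-shell >/dev/null 2>&1; then
  # spark-shell reads the statements from stdin and exits at end of input.
  employee_read_script "$OM_HOST" | spark-shell
else
  employee_read_script "$OM_HOST"
fi
```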

Kerberized environment

Pre-requisites:

  1. Create a user and grant it the Ranger permissions needed to create Ozone volumes, buckets, and keys.
  2. Run kinit as that user.

Steps:

  1. Create the Ozone volume, bucket, and key as described in the Ozone section.
  2. Launch spark-shell.
  3. Replace KEY_TAB, PRINCIPAL, and om.host.example.com in the following spark-shell command:
    spark-shell \
    	--keytab ${KEY_TAB} \
    	--principal ${PRINCIPAL} \
    	--conf spark.yarn.access.hadoopFileSystems=o3fs://bucket1.vol1.om.host.example.com:9862
    Note: In a Kerberized environment, the spark.yarn.access.hadoopFileSystems configuration is mandatory. Without it, spark-shell fails with the following error:
    java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
  4. Run the following command to print the employee.csv file content.
    Note: Update the omHost value.
    scala> val omHost="om.host.example.com"
    
    scala> val df=spark.read.option("header", "true").option("inferSchema", "true").csv(s"o3fs://bucket1.vol1.${omHost}/employee.csv")
    
    scala> df.show()
    +---+--------+---+
    | id|    name|age|
    +---+--------+---+
    |  1|   Ranga| 33|
    |  2|Nishanth|  4|
    |  3|    Raja| 60|
    +---+--------+---+
    
    scala> val age30DF = df.filter(df("age") > 30)
    
    scala> val outputPath = s"o3fs://bucket1.vol1.${omHost}/employee_age30.csv"
    
    scala> age30DF.write.option("header", "true").mode("overwrite").csv(outputPath)
    
    scala> val df2=spark.read.option("header", "true").option("inferSchema", "true").csv(outputPath)
    
    scala> df2.show()
    +---+-----+---+
    | id| name|age|
    +---+-----+---+
    |  1|Ranga| 33|
    |  3| Raja| 60|
    +---+-----+---+

    Note: If you get java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.ozone.OzoneFileSystem not found, add /opt/cloudera/parcels/CDH/jars/hadoop-ozone-filesystem-hadoop3-*.jar to the Spark classpath using the --jars option.
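The Kerberos options and the optional --jars workaround can be assembled in one place. The sketch below prints the spark-shell arguments for review rather than launching anything. The function name spark_ozone_args and the KEY_TAB, PRINCIPAL, and OM_HOST defaults are placeholders, not values from a real cluster. The bucket, volume, and port 9862 come from the steps above.

```shell
# Hypothetical helper: print the spark-shell arguments, one per line.
# Arguments: keytab path, principal, OM host, optional Ozone filesystem jar path.
spark_ozone_args() {
  printf '%s\n' --keytab "$1" --principal "$2" \
    --conf "spark.yarn.access.hadoopFileSystems=o3fs://bucket1.vol1.$3:9862"
  # Add --jars only when a jar path was supplied.
  if [ -n "${4:-}" ]; then
    printf '%s\n' --jars "$4"
  fi
}

# Placeholders: replace with your own keytab, principal, and OM host.
KEY_TAB="${KEY_TAB:-/path/to/user.keytab}"
PRINCIPAL="${PRINCIPAL:-user@EXAMPLE.COM}"
OM_HOST="${OM_HOST:-om.host.example.com}"

# Print the arguments; pass them to spark-shell once they look right.
spark_ozone_args "$KEY_TAB" "$PRINCIPAL" "$OM_HOST"
```

Pass the jar path as a fourth argument only if the ClassNotFoundException from the note above appears.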

Thanks for reading this article. If you liked this article, you can give kudos.

Comments

Hi,

I am looking for help with Apache Ozone. I am facing an issue while installing Apache Ozone on my CDP PvC Base 7.1.6 cluster.
Steps that I followed:

1) Add service

2) Click on the Ozone service

3) Set ozone.service.id

4) I chose all dependencies, including Ranger and HDFS.

I checked the logs of the Ozone Manager and the DataNodes. The exception is the same on the DataNode and Ozone Manager machines:

log4j:ERROR Could not instantiate class [org.cloudera.log4j.redactor.RedactorAppender].
java.lang.ClassNotFoundException: org.cloudera.log4j.redactor.RedactorAppender

Please let me know if there's any solution available.

 

Thank you,

Parimal

Hi @parimalpatil 

You can mostly ignore the RedactorAppender error. It has nothing to do with the real failure unless the stack traces at the bottom of the log point to something related to an Ozone role.
This Log4j appender redacts log messages using redaction rules before delegating to other appenders. Please share the complete failure log so that we can check and update you.

The workaround is to add the jar file to the classpath of the roles where you see the RedactorAppender error.
You can add it through the CM UI: go to Configuration, search for "role_env_safety_valve" for the role that reports the error, and set:
OZONE_CLASSPATH=$OZONE_CLASSPATH:/opt/cloudera/parcels/CDH/jars/logredactor-2.0.8.jar