07-15-2024
01:16 AM
Prerequisites: 1. The fsimage of the cluster, to run this analysis against. 2. Access to git.

What are small files?
A small file is one that is significantly smaller than the default Apache Hadoop HDFS block size (128 MB by default in CDH/CDP). Note that it is expected and inevitable to have some small files on HDFS: library jars, XML configuration files, temporary staging files, and so on. The problems arise when small files become a significant part of datasets.

Problems with small files and HDFS
Small files are a common challenge in the Apache Hadoop world and, when not handled with care, they can lead to a number of complications. The Apache Hadoop Distributed File System (HDFS) was developed to store and process large data sets in the range of terabytes and petabytes. However, HDFS stores small files inefficiently, leading to wasted NameNode memory and RPC calls, degraded block-scanning throughput, and reduced application-layer performance. Every file, directory, and block in HDFS is represented as an object in the NameNode's memory, each of which occupies about 150 bytes as a rule of thumb. So 10 million files, each using a block, would use about 3 gigabytes of memory, and scaling much beyond this level is a problem with current hardware; a billion files is certainly not feasible. Furthermore, HDFS is not geared towards efficiently accessing small files: it is primarily designed for streaming access to large files. Reading through small files normally causes lots of seeks and lots of hopping from DataNode to DataNode to retrieve each small file, all of which is an inefficient data access pattern.

Why Worry About Small Files?
The HDFS NameNode architecture documentation mentions that "the NameNode keeps an image of the entire file system namespace and file Blockmap in memory." This means that every file in HDFS adds pressure to the memory capacity of the NameNode process, so a larger maximum heap for the NameNode Java process is required as the file system grows.

Problems with small files and MapReduce
Map tasks usually process a block of input at a time (using the default FileInputFormat). If the files are very small and there are a lot of them, then each map task processes very little input, and there are many more map tasks, each of which imposes extra bookkeeping overhead. Compare a 1 GB file broken into 16 64 MB blocks with 10,000 or so 100 KB files: the 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent job with a single input file.

How to use this script
Clone the repository: git clone https://github.com/clukasikhw/small_file_offenders
1. Get the fsimage of the cluster you would like to analyze for small files.
2. Convert the fsimage to TSV format using the "Delimited" processor of the Offline Image Viewer (OIV) tool. Go to the perl directory of the cloned source (small_file_offenders/src/main/perl) and run the OIV tool with the Delimited processor, pointing it at your fsimage location. This generates the TSV file required by the Perl script.
perl]# hadoop oiv -i /path/XXXXX/current/fsimage_0000000000000019171 -o fsimage-delimited.tsv -p Delimited
The TSV file is generated:
perl]# ls -lart
-rwxr-xr-x. 1 root root 2335 May 24 15:11 fsimage_users.pl
-rw-r--r--. 1 root root 531226 May 24 15:12 fsimage-delimited.tsv
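Optionally, sanity-check the generated TSV before running the script; this is just an illustrative check (the exact column layout of the Delimited output may vary slightly by Hadoop version):
perl]# head -2 fsimage-delimited.tsv
perl]# wc -l fsimage-delimited.tsv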
3. Now invoke the Perl script fsimage_users.pl, pointing it at the TSV file:
perl]# ./fsimage_users.pl ./fsimage-delimited.tsv
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LC_CTYPE = "UTF-8",
LANG = "en_US.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
Limiting output to top 10 items per list.
A small file is considered anything less than 134217728. Edit the script to adjust these values
Average File Size (bytes): 0; Users:
hue (total size: 0; number of files: 535)
hdfs (total size: 0; number of files: 13)
mapred (total size: 0; number of files: 4)
spark (total size: 0; number of files: 3)
impala (total size: 0; number of files: 2)
solr (total size: 0; number of files: 1)
systest (total size: 0; number of files: 1)
schemaregistry (total size: 0; number of files: 1)
admin (total size: 0; number of files: 1)
kafka (total size: 0; number of files: 1)
Average File Size (bytes): 2526.42660550459; Users: hbase (total size: 550761; number of files: 218)
Average File Size (bytes): 28531.875; Users: livy (total size: 456510; number of files: 16)
Average File Size (bytes): 2438570.8308329; Users: oozie (total size: 4713757416; number of files: 1933)
Average File Size (bytes): 4568935.69565217; Users: hive (total size: 735598647; number of files: 161)
Average File Size (bytes): 209409621.25; Users: yarn (total size: 1675276970; number of files: 8)
Average File Size (bytes): 469965475; Users: tez (total size: 939930950; number of files: 2)
Users with most small files:
oozie: 1927 small files
hue: 535 small files
hbase: 218 small files
hive: 161 small files
livy: 16 small files
hdfs: 13 small files
yarn: 6 small files
mapred: 4 small files
spark: 3 small files
impala: 2 small files
The threshold of 134217728 bytes corresponds to the default block size of 128 MB. In this way, the script identifies which users own the highest number of small files, so you can decide, based on cluster behaviour, which users are contributing most to a small-file problem if one is present.
Reference: https://blog.cloudera.com/small-files-big-foils-addressing-the-associated-metadata-and-application-challenges/
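As an aside, the whole flow above can be scripted end to end; a minimal sketch, assuming HDFS admin privileges to fetch the fsimage (all paths below are placeholders):
# Fetch the latest fsimage from the active NameNode
hdfs dfsadmin -fetchImage /tmp/fsimage_analysis
# Convert it to TSV with the OIV Delimited processor
hadoop oiv -i /tmp/fsimage_analysis/fsimage_XXXXXXXXXXXXXXX -o /tmp/fsimage_analysis/fsimage-delimited.tsv -p Delimited
# Rank users by small-file count
./fsimage_users.pl /tmp/fsimage_analysis/fsimage-delimited.tsv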
05-03-2024
02:36 AM
We will use the Ozone "ozone debug ldb" CLI tool to view RocksDB, which stores all Ozone metadata in various tables (column families). Here I will show how to check the tables that Ozone has created for the file system. On this cluster, the SCM and OM metadata directories are:
[root@ccycloud-1 data]# pwd
/var/lib/hadoop-ozone/scm/data
[root@ccycloud-1 data]# pwd
/var/lib/hadoop-ozone/om/data
To find the Ozone metadata directory locations on your own cluster, go to the Cloudera Manager UI > Configuration and search for "metadata".
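Note that the --db value used below is a path to the RocksDB directory, so either run the commands from the metadata directory shown above or pass an absolute path; for example:
[root@ccycloud-1 ~]# cd /var/lib/hadoop-ozone/om/data
Now list the tables (column families) in om.db: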
[root@ccycloud-1 data]# ozone debug ldb --db=om.db ls
default
fileTable
principalToAccessIdsTable
deletedTable
userTable
s3SecretTable
transactionInfoTable
openKeyTable
snapshotInfoTable
directoryTable
prefixTable
multipartInfoTable
volumeTable
tenantStateTable
deletedDirectoryTable
tenantAccessIdTable
openFileTable
snapshotRenamedTable
dTokenTable
metaTable
keyTable
bucketTable
Similarly, under the directory /var/lib/hadoop-ozone/scm/data, run ls against scm.db to list its tables:
[root@ccycloud-1 data]# ozone debug ldb --db=scm.db ls
default
sequenceId
revokedCertsV2
pipelines
crls
crlSequenceId
meta
containers
validCerts
validSCMCerts
scmTransactionInfos
deletedBlocks
statefulServiceConfig
move
revokedCerts
Use the --column_family filter to scan the contents of any table under scm.db and om.db. For example, for the userTable under om.db, use "--column_family=userTable":
[root@ccycloud-1 data]# ozone debug ldb --db=om.db scan --column_family=userTable
24/01/19 09:19:28 INFO helpers.OmKeyInfo: OmKeyInfo.getCodec ignorePipeline = true
24/01/19 09:19:28 INFO helpers.OmKeyInfo: OmKeyInfo.getCodec ignorePipeline = true
24/01/19 09:19:28 INFO helpers.OmKeyInfo: OmKeyInfo.getCodec ignorePipeline = true
24/01/19 09:19:28 INFO helpers.OmKeyInfo: OmKeyInfo.getCodec ignorePipeline = true
24/01/19 09:19:28 INFO helpers.OmKeyInfo: OmKeyInfo.getCodec ignorePipeline = true
{ "hdfs": {
"unknownFields": {
"fields": {}
},
"bitField0_": 3,
"volumeNames_": [
"s3v"
],
"objectID_": -4611686018427388160,
"updateID_": 18014398509481983,
"memoizedIsInitialized": 1,
"memoizedSerializedSize": -1,
"memoizedHashCode": 0,
"memoizedSize": -1
} }
For example, for the containers table under scm.db, use "--column_family=containers":
[root@ccycloud-1 data]# ozone debug ldb --db=scm.db scan --column_family=containers
24/04/15 11:42:40 INFO helpers.OmKeyInfo: OmKeyInfo.getCodec ignorePipeline = true
24/04/15 11:42:40 INFO helpers.OmKeyInfo: OmKeyInfo.getCodec ignorePipeline = true
24/04/15 11:42:40 INFO helpers.OmKeyInfo: OmKeyInfo.getCodec ignorePipeline = true
24/04/15 11:42:40 INFO helpers.OmKeyInfo: OmKeyInfo.getCodec ignorePipeline = true
24/04/15 11:42:40 INFO helpers.OmKeyInfo: OmKeyInfo.getCodec ignorePipeline = true
{ {
"id": 1
}: {
"state": "OPEN",
"stateEnterTime": {
"seconds": 1713164197,
"nanos": 579000000
},
"pipelineID": {
"id": "8d897f9a-0d35-42e1-b3a7-e010c56198f0"
},
"replicationConfig": {
"replicationFactor": "THREE"
},
"clock": {
"zone": {
"totalSeconds": 0
}
},
"usedBytes": 2772,
"numberOfKeys": 44,
"lastUsed": {
"seconds": 1713181360,
"nanos": 301000000
},
"owner": "om81",
"containerID": {
"id": 1
},
"deleteTransactionId": 267,
"sequenceId": 514
}, {
"id": 2
}: {
"state": "OPEN",
"stateEnterTime": {
"seconds": 1713164247,
"nanos": 612000000
},
"pipelineID": {
"id": "3a045dc0-ad58-4710-8d06-70fa1b8b9f7f"
},
"replicationConfig": {
"replicationFactor": "THREE"
},
"clock": {
"zone": {
"totalSeconds": 0
}
},
"usedBytes": 2520,
"numberOfKeys": 40,
"lastUsed": {
"seconds": 1713181360,
"nanos": 307000000
},
"owner": "om81",
"containerID": {
"id": 2
},
"deleteTransactionId": 265,
"sequenceId": 441
}, {
"id": 3
}: {
"state": "OPEN",
"stateEnterTime": {
"seconds": 1713164309,
"nanos": 369000000
},
"pipelineID": {
"id": "8d897f9a-0d35-42e1-b3a7-e010c56198f0"
},
"replicationConfig": {
"replicationFactor": "THREE"
},
"clock": {
"zone": {
"totalSeconds": 0
}
},
"usedBytes": 2772,
"numberOfKeys": 44,
"lastUsed": {
"seconds": 1713181360,
"nanos": 308000000
},
"owner": "om81",
"containerID": {
"id": 3
},
"deleteTransactionId": 268,
"sequenceId": 518
}, {
"id": 4
}: {
"state": "OPEN",
"stateEnterTime": {
"seconds": 1713164366,
"nanos": 431000000
},
"pipelineID": {
"id": "3a045dc0-ad58-4710-8d06-70fa1b8b9f7f"
},
"replicationConfig": {
"replicationFactor": "THREE"
},
"clock": {
"zone": {
"totalSeconds": 0
}
},
"usedBytes": 2520,
"numberOfKeys": 40,
"lastUsed": {
"seconds": 1713181360,
"nanos": 310000000
},
"owner": "om81",
"containerID": {
"id": 4
},
"deleteTransactionId": 266,
"sequenceId": 437
} }
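Since the scan output can be large on a busy cluster, it is often convenient to redirect it to a file and filter it with standard tools; a small illustrative sketch (file names are arbitrary):
[root@ccycloud-1 data]# ozone debug ldb --db=scm.db scan --column_family=containers > /tmp/containers.json
[root@ccycloud-1 data]# grep -c '"state": "OPEN"' /tmp/containers.json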
04-04-2024
05:50 AM
In this article, I will demonstrate how to access the Ozone file system using the Java API and walk through the steps to validate volume, bucket, and key creation on the Ozone cluster.
Let's begin by creating a Java project with a Maven build.
Create a Java project with the name AccessOFS using IntelliJ from File > New > Project.
Modify your pom.xml and add the dependency for ozone-client. The full pom.xml is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.demo</groupId>
<artifactId>AccessOFS</artifactId>
<version>1.0-SNAPSHOT</version>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>8</source>
<target>8</target>
</configuration>
</plugin>
</plugins>
</build>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.8.0</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.10</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>3.1.1</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdds-common</artifactId>
<version>1.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-ozone-filesystem-hadoop3</artifactId>
<version>1.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-ozone-filesystem-hadoop2</artifactId>
<version>1.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdds-container-service</artifactId>
<version>1.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdds-client</artifactId>
<version>1.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.ozone</groupId>
<artifactId>ozone-client</artifactId>
<version>1.4.0</version>
</dependency>
<dependency>
<groupId>com.jayway.jsonpath</groupId>
<artifactId>json-path</artifactId>
<version>2.5.0</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.10</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.10</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>jul-to-slf4j</artifactId>
<version>1.7.10</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
</dependency>
</dependencies>
</project>
I have added the ozone-client dependency below:
<dependency>
<groupId>org.apache.ozone</groupId>
<artifactId>ozone-client</artifactId>
<version>1.4.0</version>
</dependency>
Create your AccessOzoneFS main class, which reads a file from the local file system and stores it in an Ozone bucket. The AccessOzoneFS class is as follows:
package org.demo;
import org.apache.commons.io.FileUtils;
import org.apache.hadoop.hdds.conf.OzoneConfiguration;
import org.apache.hadoop.ozone.client.*;
import org.apache.hadoop.ozone.client.io.OzoneInputStream;
import org.apache.hadoop.ozone.client.io.OzoneOutputStream;
import org.apache.hadoop.security.UserGroupInformation;
import java.io.File;
public class AccessOzoneFS {
public static void main(String[] args) {
try {
OzoneConfiguration ozoneConfiguration = new OzoneConfiguration();
String omLeaderAddress = args[0];
String omPrincipal = args[1];
String keytabPathLocal = args[2];
String volume = args[3];
String bucket = args[4];
String key = args[5];
String sourceFilePath = args[6];
Long dataSize = Long.parseLong(args[7]);
//set om leader node
ozoneConfiguration.set("ozone.om.address", omLeaderAddress);
// Setting kerberos authentication
ozoneConfiguration.set("ozone.om.kerberos.principal.pattern", "*");
ozoneConfiguration.set("ozone.security.enabled", "true");
ozoneConfiguration.set("hadoop.rpc.protection", "privacy");
ozoneConfiguration.set("hadoop.security.authentication", "kerberos");
ozoneConfiguration.set("hadoop.security.authorization", "true");
//Passing keytab for Authentication
UserGroupInformation.setConfiguration(ozoneConfiguration);
UserGroupInformation.loginUserFromKeytab(omPrincipal, keytabPathLocal);
OzoneClient ozClient = OzoneClientFactory.getRpcClient(ozoneConfiguration);
ObjectStore objectStore = ozClient.getObjectStore();
// Let us create a volume to store buckets.
objectStore.createVolume(volume);
// Let us verify that the volume got created.
OzoneVolume assets = objectStore.getVolume(volume);
// Let us create a bucket called bucket.
assets.createBucket(bucket);
OzoneBucket video = assets.getBucket(bucket);
// read data from the file, this is assumed to be a user provided function.
byte[] videoData = FileUtils.readFileToByteArray(new File(sourceFilePath));
// Create an output stream and write data.
OzoneOutputStream videoStream = video.createKey(key, dataSize.longValue());
videoStream.write(videoData);
// Close the stream when it is done.
videoStream.close();
// We can use the same bucket to read the file that we just wrote, by creating an input Stream.
// Let us allocate a byte array to hold the video first.
byte[] data = new byte[(int) dataSize.longValue()];
OzoneInputStream introStream = video.readKey(key);
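// Note: a single read() may return fewer bytes than requested for large keys; a more robust
// client would loop until the buffer is filled, e.g. with org.apache.commons.io.IOUtils.readFully(introStream, data).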
introStream.read(data);
// Close the stream when it is done.
introStream.close();
// Close the client.
ozClient.close();
} catch (Exception ex) {
ex.printStackTrace();
}
}
}
Now we have the project set up in IntelliJ. Let's run a Maven build from the directory containing pom.xml:
AccessOFS % mvn clean install
The build creates a jar file under the project's target directory, which we copy to the Ozone cluster host under the /tmp directory:
target % scp AccessOFS-1.0-SNAPSHOT.jar user@ozoneclusterHost:/tmp
Create a file "demotest.mp4" of size 1 GB on the local file system under /tmp directory which you want to read and store under the ozone bucket.
Now, execute the jar file from the Ozone cluster host, passing values for the parameters below:
omLeaderAddress - OzoneManager leader address.
omPrincipal - OzoneManager principal.
keytabPathLocal - OzoneManager keytab file with full path.
volume - Ozone volume name you want to create.
bucket - Ozone bucket name you want to create.
key - Ozone key name you want to create.
sourceFilePath - Full path of the file you want to read.
dataSize - Size of the data to write, in bytes.
Use the command below and execute it from the Ozone cluster host, passing the suggested parameter values (make sure your JDK path is correct):
~]# /usr/java/jdk1.8.0_232-cloudera/bin/java -classpath /tmp/AccessOFS-1.0-SNAPSHOT.jar:`ozone classpath hadoop-ozone-ozone-manager` org.demo.AccessOzoneFS omLeaderAddress omPrincipal keytabPathLocal volume bucket key sourceFilePath dataSize
Example:
~]# /usr/java/jdk1.8.0_232-cloudera/bin/java -classpath /tmp/AccessOFS-1.0-SNAPSHOT.jar:`ozone classpath hadoop-ozone-ozone-manager` org.demo.AccessOzoneFS 10.140.204.133 om/ccycloud-2.ni-717x-yn.root.comops.site@ROOT.COMOPS.SITE /var/run/cloudera-scm-agent/process/67-ozone-OZONE_MANAGER/ozone.keytab volume-sample bucket-sample key-sample /tmp/demotest.mp4 40749910
Now, let's validate volume, bucket, and key creation.
Validate volume creation:
ozone sh volume list o3://omLeaderHost:9862/
]# ozone sh volume list o3://10.140.204.133:9862/
{
"metadata" : { },
"name" : "s3v",
"admin" : "om",
"owner" : "om",
"quotaInBytes" : -1,
"quotaInNamespace" : -1,
"usedNamespace" : 0,
"creationTime" : "2024-04-02T08:21:12.980Z",
"modificationTime" : "2024-04-02T08:21:12.980Z",
"acls" : [ {
"type" : "USER",
"name" : "om",
"aclScope" : "ACCESS",
"aclList" : [ "ALL" ]
} ]
}
{
"metadata" : { },
"name" : "volume-sample",
"admin" : "om/ccycloud-2.ni-717x-yn.root.comops.site@ROOT.COMOPS.SITE",
"owner" : "om/ccycloud-2.ni-717x-yn.root.comops.site@ROOT.COMOPS.SITE",
"quotaInBytes" : -1,
"quotaInNamespace" : -1,
"usedNamespace" : 1,
"creationTime" : "2024-04-04T05:28:52.580Z",
"modificationTime" : "2024-04-04T05:28:52.580Z",
"acls" : [ {
"type" : "USER",
"name" : "om/ccycloud-2.ni-717x-yn.root.comops.site@ROOT.COMOPS.SITE",
"aclScope" : "ACCESS",
"aclList" : [ "ALL" ]
}, {
"type" : "GROUP",
"name" : "om",
"aclScope" : "ACCESS",
"aclList" : [ "ALL" ]
} ]
}
From the output, we can see that the volume "volume-sample" was created.
Validate bucket creation:
ozone sh bucket list o3://omLeaderHost:9862/volume-sample/
~]# ozone sh bucket list o3://10.140.204.133:9862/volume-sample/
{
"metadata" : { },
"volumeName" : "volume-sample",
"name" : "bucket-sample",
"storageType" : "DISK",
"versioning" : false,
"usedBytes" : 3221225472,
"usedNamespace" : 1,
"creationTime" : "2024-04-04T05:28:52.649Z",
"modificationTime" : "2024-04-04T05:28:52.649Z",
"quotaInBytes" : -1,
"quotaInNamespace" : -1
}
From the output, we can see that the bucket "bucket-sample" was created.
Validate key creation:
ozone sh key list o3://omLeaderHost:9862/volume-sample/bucket-sample/
~]# ozone sh key list o3://10.140.204.133:9862/volume-sample/bucket-sample/
{
"volumeName" : "volume-sample",
"bucketName" : "bucket-sample",
"name" : "key-sample",
"dataSize" : 1073741824,
"creationTime" : "2024-04-04T05:29:01.410Z",
"modificationTime" : "2024-04-04T05:29:40.720Z",
"replicationType" : "RATIS",
"replicationFactor" : 3
}
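You can also inspect a single key directly with the key info subcommand; for example, using the same sample volume, bucket, and key as above:
~]# ozone sh key info o3://10.140.204.133:9862/volume-sample/bucket-sample/key-sample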
In this way, we can validate volume, bucket, and key creation using the AccessOFS jar, which reads a file from the local file system and stores it in the Ozone bucket.
References
Java API Creating an Ozone client
Reference SourceCode AccessOFS
01-23-2024
07:10 AM
Hi @parimalpatil, the RedactorAppender message can mostly be ignored; it has nothing to do with the real failure unless the stack trace at the bottom points to something related to the Ozone roles.
This Log4j appender redacts log messages using redaction rules before delegating to other appenders. Please share the complete failure log so that we can check it and get back to you. The workaround is to add the logredactor jar to the classpath of the roles where you see the RedactorAppender error. You can do this through the CM UI > Configuration by searching for "role_env_safety_valve" for the role that reports the error, and adding:
OZONE_CLASSPATH=$OZONE_CLASSPATH:/opt/cloudera/parcels/CDH/jars/logredactor-2.0.8.jar
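Before adding the safety valve, you may want to confirm that the logredactor jar is actually present in the parcel; the version in the file name can differ by CDH/CDP release:
]# ls /opt/cloudera/parcels/CDH/jars/ | grep -i logredactor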