Member since: 07-31-2013
Posts: 1924
Kudos Received: 462
Solutions: 311

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1969 | 07-09-2019 12:53 AM |
| | 11881 | 06-23-2019 08:37 PM |
| | 9147 | 06-18-2019 11:28 PM |
| | 10136 | 05-23-2019 08:46 PM |
| | 4581 | 05-20-2019 01:14 AM |
04-16-2018
07:09 AM
Record counting depends on understanding the format of the file (text, Avro, Parquet, etc.). HDFS and S3, being storage systems, are format-agnostic and store no information about a file's contents beyond its size. To find record counts, you will need to query the files directly with a program suited to reading that format. If they are simple text files, a very trivial example would be 'hadoop fs -text FILE_URI | wc -l'. This of course does not scale to a large group of files as it is single-threaded - ideally you'd want to use MR or Spark to generate the counts quicker.

Another trick to consider for speed: Parquet files carry a footer area with stats about the written file and can give you record counts without having to read the whole file: https://github.com/apache/parquet-format#metadata and https://github.com/apache/parquet-mr/tree/master/parquet-tools#meta-legend. This does not apply to all file formats, however.
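As a rough sketch of those two approaches (the paths and the parquet-tools jar name are placeholders, not from the original question):

# Count records in simple text files (single-threaded; fine for a handful of files):
~> hadoop fs -text /data/logs/part-* | wc -l

# Parquet: read the record count from the footer metadata instead of scanning the data
# (use whichever parquet-tools jar ships with your distribution):
~> hadoop jar parquet-tools.jar meta /data/events/part-00000.parquet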
04-15-2018
06:31 PM
1 Kudo
There are merits in both approaches, but the path to follow depends on your requirements. While running all of them together would be quicker than running them separately [1], it becomes inflexible if you run into failures at any step - you would have to handle retries for the whole script instead of just the failed statements. Keeping them as separate actions can become a maintenance issue once the number grows large, making refactoring arduous when the need arises. Conversely, running them together makes troubleshooting a bit more involved, since you'll have to dig through logs to find precisely which step failed within the large batch of statements.

I'd advise approaching your workflow business-wise. Split out the parts that can exist as independent steps, and group the parts that are more "atomic" or naturally belong together into a single entity. Get them running, then observe whether any parts need to go quicker. Worrying about performance too early can get painful very quickly.

[1] There is overhead (expected to reduce after https://issues.apache.org/jira/browse/OOZIE-1770 is ready and in a future CDH, mostly CDH 6.x) in running many small and independent actions, since each action spins up a whole 1-map launcher job on YARN. This can cause a slowdown.
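If you want to see the per-action launcher overhead from [1] for yourself, a quick (hedged) check is to list the running YARN applications while the workflow executes; each action typically shows up as its own small launcher job:

# Launcher job names usually contain "oozie:launcher", though the exact format can vary by version:
~> yarn application -list -appStates RUNNING | grep -i oozie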
04-10-2018
02:53 AM
1 Kudo
> Which "request"? What is this "request"? What does it contain? Why is it 76026807?

A request in this context is a basic call from an HDFS client. A few example requests from a client could be "list this directory", "create a file", etc. The request type is not determined at the point where the error is thrown, because the 64 MB length-limit safety check fires before we can deserialize/interpret the request. As to what it contains, it's not clear from the error. What is definitely odd is its size: most client requests carry very simple attributes, such as a path, a list of locations, a flag and so on. Nothing in a regular client's request should be this large, unless perhaps the client in question is using enormously long paths.

In the other cases where I've seen this message, the port in the request is usually 8022, which is where DataNodes send their heartbeats and block reports. Those sorts of 'requests' can be large depending on the number of blocks or other datasets being sent. Assuming you are running a configuration that uses both 8020 and 8022, it is quite odd to observe this error over 8020. It could be a rogue client, such as a network scanner sending bogus or specially crafted data for vulnerability checks (in which case this is normal to see, and the NameNode is acting as designed in rejecting such requests). You can find out more by trying to spot the program running on the client IP shown in the error, and seeing what form of API calls it is trying to make (or whether it even is a valid client).
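A hedged starting point for that investigation (generic commands, not from the original thread): confirm the configured request-size limit on the NameNode, then on the client host find the process holding the connection to the RPC port.

# The 64 MB safety limit is governed by ipc.maximum.data.length (default 67108864 bytes):
~> hdfs getconf -confKey ipc.maximum.data.length

# On the host shown in the error: which process owns the connection to port 8020?
~> ss -tnp | grep ':8020'
~> lsof -iTCP:8020 -sTCP:ESTABLISHED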
04-03-2018
05:20 AM
2 Kudos
To be precise, the issues will appear on the DataNode due to parallel use of the disks by the NodeManager and other daemons sharing the host (and disk mount paths). The NameNode by itself keeps track of how much space each DataNode has and avoids full DNs if they cannot accommodate an entire block, along with a host of other checks (such as load average, recency of heartbeats, etc.): https://github.com/cloudera/hadoop-common/blob/cdh5.14.0-release/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeDescriptor.java#L623-L645 and https://github.com/cloudera/hadoop-common/blob/cdh5.14.0-release/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java#L808-L851. Additionally, the DataNodes "hide" the configured reserved space that HDFS never considers available (and the NameNode hits its "deselect" DN criteria well before the disk fills up).

However, keep in mind that you may be running YARN's NodeManagers on the same set of disks (in different directories). These carry their own usage and selection policies. A rogue app can fill a disk temporarily very quickly, compared to your regular HDFS write rates. This can cause the DNs to suddenly find themselves lacking space in the middle of a write, even though they appear fine again later. You'll ideally want to ensure YARN also marks such NodeManagers as 'unhealthy' and does not assign more tasks when this happens - this can be done with the NodeManager health-check script feature (a sketch follows below).

All this said, HDFS clients will optimistically retry their work on the same DN or on another DN (or the remaining DNs in the immediate replication pipeline) should they run into a space-caused issue, and will try not to let the writing application fail unless it's a very extreme scenario.
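As a rough sketch of that health-check idea (the threshold, mount list and script path are assumptions you would adapt): a script that YARN runs periodically and that prints a line starting with "ERROR" when a disk crosses a usage threshold will cause the NodeManager to be marked unhealthy.

#!/bin/bash
# Hypothetical NodeManager health-check script: report unhealthy when any listed
# mount is too full. YARN treats output lines beginning with "ERROR" as unhealthy.
THRESHOLD=90
for mount in /data/1 /data/2 /data/3; do
  usage=$(df -P "$mount" | awk 'NR==2 {gsub("%","",$5); print $5}')
  if [ "$usage" -ge "$THRESHOLD" ]; then
    echo "ERROR: $mount is ${usage}% full"
  fi
done
exit 0

The script is then wired in via the NodeManager health-checker script settings (yarn.nodemanager.health-checker.script.path and its interval/timeout companions).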
04-02-2018
01:53 AM
The commitId here references the source commit ID from which the Kafka jar was built. It does not reference any Kafka usage-related terms such as 'commit offsets'. The reason it appears as "unknown" is tied to the way we build it (outside of a git repository). The field being unknown does not affect the Kafka client's functionality in any manner.

Are you facing an issue with your Kafka clients? Is there another error or behaviour you observe that is breaking your app?
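If you'd like to see where the value comes from, a hedged check (the jar name is a placeholder, and the resource path can differ across versions) is to read the version properties file bundled inside the client jar, which carries the version and commitId keys logged at client startup:

~> unzip -p kafka-clients-*.jar kafka/kafka-version.properties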
03-26-2018
11:22 PM
1 Kudo
Thank you for clarifying! When recreating the log directories, ensure they are owned by the "cloudera-scm" user so the service can write into them. Once logging is available, it can be referenced for further diagnosis.

Also, please do not change db.properties unless you are actually migrating to a new DB. Revert to the old config after confirming that the "scm" Postgres DB on port 7342 is live and available (via psql commands, etc.).
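A hedged sketch of those two checks (paths, user/group and database name are assumptions based on a default embedded-DB layout):

# Recreate the log directory with the expected owner so the service can write to it:
~> mkdir -p /var/log/cloudera-scm-server-db
~> chown cloudera-scm:cloudera-scm /var/log/cloudera-scm-server-db

# Confirm the "scm" database answers on the port referenced in db.properties:
~> psql -h localhost -p 7342 -U scm -d scm -c 'SELECT 1;'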
03-26-2018
07:05 PM
1 Kudo
To troubleshoot your cloudera-scm-server-db startup, take a look at the log files PostgreSQL writes under the following paths: /var/log/cloudera-scm-server-db/* and /var/lib/cloudera-scm-server-db/*. I'd also strongly recommend migrating to a managed RDBMS so that this sort of troubleshooting is straightforward in future.
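For example (a minimal sketch; file names under these directories vary by installation):

~> service cloudera-scm-server-db status
~> ls -lt /var/log/cloudera-scm-server-db/ /var/lib/cloudera-scm-server-db/
~> tail -n 100 /var/log/cloudera-scm-server-db/*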
03-26-2018
06:48 PM
I am not sure what 'FA' means in this context, but CM APIs offer license information via the following REST end-points:

http://cloudera.github.io/cm_api/apidocs/v18/path__cm_license.html
http://cloudera.github.io/cm_api/apidocs/v18/path__cm_licensedFeatureUsage.html

An example query:

# For licensee details and UIDs:
~> curl -u admin http://cm-hostname.organization.com:7180/api/v18/cm/license
…
{
  "owner" : "Licensee Name",
  "uuid" : "12345678-abcd-1234-abcd-1234abcd1234"
}

# For license usage details:
~> curl -u admin http://cm-hostname.organization.com:7180/api/v18/cm/licensedFeatureUsage
…
{
  "totals" : { "Core" : 8, "HBase" : 8, "Impala" : 8, "Search" : 5, "Spark" : 3, "Accumulo" : 4, "Navigator" : 8 },
  "clusters" : {
    "Cluster 1" : { "Core" : 4, "HBase" : 4, "Impala" : 4, "Search" : 4, "Spark" : 2, "Accumulo" : 4, "Navigator" : 4 },
    "Cluster 2" : { "Core" : 4, "HBase" : 4, "Impala" : 4, "Search" : 1, "Spark" : 1, "Accumulo" : 0, "Navigator" : 4 }
  }
}
03-26-2018
06:30 PM
Are you facing this in a CDH QuickStart VM, or in a Cloudera Manager installation of CDH? For manual Apache HBase tarball setups, following the upstream guide to the letter should give you a working environment.

Specifically for @UjjwalRana's issue, merely setting up HBase is not enough. First set up the ZooKeeper quorum that HBase relies on, ensure it works (via zkCli.sh or zookeeper-client commands), and then point the HBase configuration at it.
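A hedged sanity check of the ZooKeeper side before touching HBase (host and port are placeholders for your own quorum):

# "ruok" should answer "imok" if the four-letter-word commands are enabled:
~> echo ruok | nc zk-host.example.com 2181
# Browse the root znode to confirm the client tooling can connect:
~> zookeeper-client -server zk-host.example.com:2181 ls /

HBase is then pointed at the same quorum via hbase.zookeeper.quorum (and hbase.zookeeper.property.clientPort if non-default) in hbase-site.xml.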
03-23-2018
11:30 PM
1 Kudo
Thank you for confirming the CDH version. Do you also have a KMS service in the cluster? If yes, you're definitely hitting the aforementioned bug.

You're partially right about the "OS running out of PIDs". More specifically, the YARN RM process runs into its 'number of processes' (nproc) ulimit, which should be set to a high default (32k processes) if you are running Cloudera Manager. There is no reason the YARN RM should normally be using anywhere near 32k threads.
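To confirm which limit is being hit, a hedged check on the ResourceManager host (the pgrep pattern is an assumption about the process's main class):

# Thread count of the RM JVM versus its "max processes" ulimit
# (on Linux, threads count against the nproc limit):
~> RM_PID=$(pgrep -f 'org.apache.hadoop.yarn.server.resourcemanager.ResourceManager')
~> ps -o nlwp= -p "$RM_PID"
~> grep 'Max processes' /proc/"$RM_PID"/limits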