Member since: 07-31-2013
Posts: 1924
Kudos Received: 462
Solutions: 311

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1969 | 07-09-2019 12:53 AM |
| | 11881 | 06-23-2019 08:37 PM |
| | 9147 | 06-18-2019 11:28 PM |
| | 10136 | 05-23-2019 08:46 PM |
| | 4581 | 05-20-2019 01:14 AM |
04-16-2018
07:09 AM
Record counting depends on understanding the format of the file (text, Avro, Parquet, etc.). HDFS and S3, being storage systems, are format-agnostic and store no information about a file's contents beyond its size. To find record counts, you will need to query the files directly with a program suited to reading that format. If they are simple text files, a very trivial example would be 'hadoop fs -text FILE_URI | wc -l'. This of course does not scale to a large group of files as it is single-threaded - ideally you'd want to use MR or Spark to generate the counts quicker.

Another trick to consider for speed: Parquet files carry a footer area with stats about the written file and can give you record counts without having to read the whole file: https://github.com/apache/parquet-format#metadata and https://github.com/apache/parquet-mr/tree/master/parquet-tools#meta-legend. This does not apply to all file formats, however.
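As a rough sketch of those two approaches (the paths and the parquet-tools jar name are placeholders, not from the original question):

# Count records in simple text files (single-threaded; fine for a handful of files):
~> hadoop fs -text /data/logs/part-* | wc -l

# Parquet: read the record count from the footer metadata instead of scanning the data
# (use whichever parquet-tools jar ships with your distribution):
~> hadoop jar parquet-tools.jar meta /data/events/part-00000.parquet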
04-15-2018
06:31 PM
1 Kudo
There are merits in both approaches, but the path to follow depends on your requirements. While running all of them together would be quicker than running them separately [1], it becomes inflexible if you run into failures at any step - you would have to handle retries for the whole script instead of just the failed statements. Keeping them as separate actions can become a maintenance issue once the number grows large, making refactoring arduous when the need arises. Conversely, running them together makes troubleshooting a bit more involved, since you'll have to dig through logs to find precisely which step failed within the large batch of statements.

I'd advise approaching your workflow business-wise. Split out the parts that can exist as independent steps, and group the parts that are more "atomic" or naturally belong together into a single entity. Get them running, then observe whether any parts need to go quicker. Worrying about performance too early can get painful very quickly.

[1] There is overhead (expected to reduce after https://issues.apache.org/jira/browse/OOZIE-1770 is ready and in a future CDH, mostly CDH 6.x) in running many small and independent actions, since each action spins up a whole 1-map launcher job on YARN. This can cause a slowdown.
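If you want to see the per-action launcher overhead from [1] for yourself, a quick (hedged) check is to list the running YARN applications while the workflow executes; each action typically shows up as its own small launcher job:

# Launcher job names usually contain "oozie:launcher", though the exact format can vary by version:
~> yarn application -list -appStates RUNNING | grep -i oozie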
04-10-2018
02:53 AM
1 Kudo
> Which "request"? What is this "request"? What does it contain? Why is it 76026807?

A request in this context is a basic call from an HDFS client. A few example requests from a client could be "list this directory", "create a file", etc. The request type is not determined at the point where the error is thrown, because the 64 MB length-limit safety check fires before we can deserialize/interpret the request. As to what it contains, it's not clear from the error. What is definitely odd is its size: most client requests carry very simple attributes, such as a path, a list of locations, a flag and so on. Nothing in a regular client's request should be this large, unless perhaps the client in question is using enormously long paths.

In the other cases where I've seen this message, the port in the request is usually 8022, which is where DataNodes send their heartbeats and block reports. Those sorts of 'requests' can be large depending on the number of blocks or other datasets being sent. Assuming you are running a configuration that uses both 8020 and 8022, it is quite odd to observe this error over 8020. It could be a rogue client, such as a network scanner sending bogus or specially crafted data for vulnerability checks (in which case this is normal to see, and the NameNode is acting as designed in rejecting such requests). You can find out more by trying to spot the program running on the client IP shown in the error, and seeing what form of API calls it is trying to make (or whether it even is a valid client).
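A hedged starting point for that investigation (generic commands, not from the original thread): confirm the configured request-size limit on the NameNode, then on the client host find the process holding the connection to the RPC port.

# The 64 MB safety limit is governed by ipc.maximum.data.length (default 67108864 bytes):
~> hdfs getconf -confKey ipc.maximum.data.length

# On the host shown in the error: which process owns the connection to port 8020?
~> ss -tnp | grep ':8020'
~> lsof -iTCP:8020 -sTCP:ESTABLISHED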
04-03-2018
05:20 AM
2 Kudos
To be precise, the issues will appear on the DataNode due to parallel use of the disks by the NodeManager and other daemons sharing the host (and disk mount paths). The NameNode by itself keeps track of how much space each DataNode has and avoids full DNs if they cannot accommodate an entire block, along with a host of other checks (such as load average, recency of heartbeats, etc.): https://github.com/cloudera/hadoop-common/blob/cdh5.14.0-release/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeDescriptor.java#L623-L645 and https://github.com/cloudera/hadoop-common/blob/cdh5.14.0-release/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java#L808-L851. Additionally, the DataNodes "hide" the configured reserved space that HDFS never considers available (and the NameNode hits its "deselect" DN criteria well before the disk fills up).

However, keep in mind that you may be running YARN's NodeManagers on the same set of disks (in different directories). These carry their own usage and selection policies. A rogue app can fill a disk temporarily very quickly, compared to your regular HDFS write rates. This can cause the DNs to suddenly find themselves lacking space in the middle of a write, even though they appear fine again later. You'll ideally want to ensure YARN also marks such NodeManagers as 'unhealthy' and does not assign more tasks when this happens - this can be done with the NodeManager health-check script feature (a sketch follows below).

All this said, HDFS clients will optimistically retry their work on the same DN or on another DN (or the remaining DNs in the immediate replication pipeline) should they run into a space-caused issue, and will try not to let the writing application fail unless it's a very extreme scenario.
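As a rough sketch of that health-check idea (the threshold, mount list and script path are assumptions you would adapt): a script that YARN runs periodically and that prints a line starting with "ERROR" when a disk crosses a usage threshold will cause the NodeManager to be marked unhealthy.

#!/bin/bash
# Hypothetical NodeManager health-check script: report unhealthy when any listed
# mount is too full. YARN treats output lines beginning with "ERROR" as unhealthy.
THRESHOLD=90
for mount in /data/1 /data/2 /data/3; do
  usage=$(df -P "$mount" | awk 'NR==2 {gsub("%","",$5); print $5}')
  if [ "$usage" -ge "$THRESHOLD" ]; then
    echo "ERROR: $mount is ${usage}% full"
  fi
done
exit 0

The script is then wired in via the NodeManager health-checker script settings (yarn.nodemanager.health-checker.script.path and its interval/timeout companions).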
04-02-2018
01:53 AM
The commitId here references the source commit ID from which the Kafka jar was built. It does not reference any Kafka usage-related terms such as 'commit offsets'. The reason it appears as "unknown" is tied to the way we build it (outside of a git repository). The field being unknown does not affect the Kafka client's functionality in any manner.

Are you facing an issue with your Kafka clients? Is there another error or behaviour you observe that is breaking your app?
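If you'd like to see where the value comes from, a hedged check (the jar name is a placeholder, and the resource path can differ across versions) is to read the version properties file bundled inside the client jar, which carries the version and commitId keys logged at client startup:

~> unzip -p kafka-clients-*.jar kafka/kafka-version.properties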
03-26-2018
11:22 PM
1 Kudo
Thank you for clarifying! When recreating the log directories, ensure they are owned by the "cloudera-scm" user so the service can write into them. Once logging is available, it can be referenced for further diagnosis.

Also, please do not change db.properties unless you are actually migrating to a new DB. Revert to the old config after confirming that the "scm" Postgres DB on port 7342 is live and available (via psql commands, etc.).
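A hedged sketch of those two checks (paths, user/group and database name are assumptions based on a default embedded-DB layout):

# Recreate the log directory with the expected owner so the service can write to it:
~> mkdir -p /var/log/cloudera-scm-server-db
~> chown cloudera-scm:cloudera-scm /var/log/cloudera-scm-server-db

# Confirm the "scm" database answers on the port referenced in db.properties:
~> psql -h localhost -p 7342 -U scm -d scm -c 'SELECT 1;'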
03-26-2018
07:05 PM
1 Kudo
To troubleshoot your cloudera-scm-server-db startup, take a look at the log files PostgreSQL writes under the following paths: /var/log/cloudera-scm-server-db/* and /var/lib/cloudera-scm-server-db/*. I'd also strongly recommend migrating to a managed RDBMS so that this sort of troubleshooting is straightforward in future.
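For example (a minimal sketch; file names under these directories vary by installation):

~> service cloudera-scm-server-db status
~> ls -lt /var/log/cloudera-scm-server-db/ /var/lib/cloudera-scm-server-db/
~> tail -n 100 /var/log/cloudera-scm-server-db/*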
03-26-2018
06:48 PM
I am not sure what 'FA' means in this context, but CM APIs offer license information via the following REST end-points:

http://cloudera.github.io/cm_api/apidocs/v18/path__cm_license.html
http://cloudera.github.io/cm_api/apidocs/v18/path__cm_licensedFeatureUsage.html

An example query:

# For licensee details and UIDs:
~> curl -u admin http://cm-hostname.organization.com:7180/api/v18/cm/license
…
{
  "owner" : "Licensee Name",
  "uuid" : "12345678-abcd-1234-abcd-1234abcd1234"
}

# For license usage details:
~> curl -u admin http://cm-hostname.organization.com:7180/api/v18/cm/licensedFeatureUsage
…
{
  "totals" : { "Core" : 8, "HBase" : 8, "Impala" : 8, "Search" : 5, "Spark" : 3, "Accumulo" : 4, "Navigator" : 8 },
  "clusters" : {
    "Cluster 1" : { "Core" : 4, "HBase" : 4, "Impala" : 4, "Search" : 4, "Spark" : 2, "Accumulo" : 4, "Navigator" : 4 },
    "Cluster 2" : { "Core" : 4, "HBase" : 4, "Impala" : 4, "Search" : 1, "Spark" : 1, "Accumulo" : 0, "Navigator" : 4 }
  }
}
03-26-2018
06:30 PM
Are you facing this in a CDH QuickStart VM, or in a Cloudera Manager installation of CDH? For manual Apache HBase tarball setups, following the upstream guide to the letter should give you a working environment.

Specifically for @UjjwalRana's issue, merely setting up HBase is not enough. First set up the ZooKeeper quorum that HBase relies on, ensure it works (via zkCli.sh or zookeeper-client commands), and then point the HBase configuration at it.
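A hedged sanity check of the ZooKeeper side before touching HBase (host and port are placeholders for your own quorum):

# "ruok" should answer "imok" if the four-letter-word commands are enabled:
~> echo ruok | nc zk-host.example.com 2181
# Browse the root znode to confirm the client tooling can connect:
~> zookeeper-client -server zk-host.example.com:2181 ls /

HBase is then pointed at the same quorum via hbase.zookeeper.quorum (and hbase.zookeeper.property.clientPort if non-default) in hbase-site.xml.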
03-23-2018
11:30 PM
1 Kudo
Thank you for confirming the CDH version. Do you also have a KMS service in the cluster? If yes, you're definitely hitting the aforementioned bug.

You're partially right about the "OS running out of PIDs". More specifically, the YARN RM process runs into its 'number of processes' (nproc) ulimit, which should be set to a high default (32k processes) if you are running Cloudera Manager. There is no reason the YARN RM should normally be using anywhere near 32k threads.
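To confirm which limit is being hit, a hedged check on the ResourceManager host (the pgrep pattern is an assumption about the process's main class):

# Thread count of the RM JVM versus its "max processes" ulimit
# (on Linux, threads count against the nproc limit):
~> RM_PID=$(pgrep -f 'org.apache.hadoop.yarn.server.resourcemanager.ResourceManager')
~> ps -o nlwp= -p "$RM_PID"
~> grep 'Max processes' /proc/"$RM_PID"/limits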