About Harsh J

Harsh J · ‎03-05-2019

The CMS + ParNew (Parallel New) collector combination does a decent job for mid-to-large heaps. It remains our default, although that may change once support for newer JDK (LTS) releases is introduced. G1 is a newer collector that has been improving steadily since its inception, and if you are going to use it, I advise using the latest JDK 8 version available. Most of the cases where we've had to recommend G1 GC over the stock defaults arose out of specific workload and heap pattern analysis.

Are you facing long pauses with the CMS + ParNew collectors? What are the pauses caused by, according to your GC logging (allocation failure? too small a new-size? etc.)? It's worth measuring what is impacting the current collector configuration first, as a simple switch will bring only limited improvements and may not automatically solve your existing problems.

Harsh J · ‎03-05-2019

You will be connecting to a remote cluster, so you need a machine that can run a browser and a terminal, with a stable internet connection. Check out the exam-specific FAQs for more insight: https://www.cloudera.com/about/training/certification/faq.html#launch

Harsh J · ‎02-19-2019

> 1. What about connection-manager

When there's no specialized connection manager, Sqoop will use its built-in generic one. There's a chance this may be adequate.

> 2. "driver" looks like
> jdbc:mysql://:/
> So "mysql" is strange part of it. How to replace it?

Your JDBC driver download should have come with a manual that shows what to use as its connection string. A web search turns up http://support.sas.com/documentation/cdl/en/jdbcref/63713/HTML/default/viewer.htm#p1a7nrsg36gaf8n1j33tfwf7nsj7.htm which seems to suggest jdbc:sasiom://host:port (and other options).

> 3. Where must i put drives (jar-files) to use it from sqoop?

Place them directly under /var/lib/sqoop/, preferably on all hosts (if your Sqoop command may run from any host).

Harsh J · ‎02-15-2019

The share-lib in Oozie is modular, so it only adds the jars necessary for each action type. The java action is the most generic of all action types, and therefore receives none of the other action types' dependencies (such as hive, pig, distcp, spark, etc.).

The article you've linked to carries an answer to the question of 'how do I further include jars from action type X into my action type Y', which I've quoted below for convenience:

"""
For example, if you want all Pig actions in one of your Workflows to include the HCatalog ShareLib, you would add oozie.action.sharelib.for.pig=pig,hcatalog to your job.properties.
"""

So in your case, you may want to try adding:

oozie.action.sharelib.for.java=java,hive

Harsh J · ‎02-15-2019

If you are dealing with unordered partitioning from a data source, you can end up creating a lot of files in parallel as the partitioning is attempted.

In HDFS, when a file (or more specifically, its block) is open, the DataNode performs a logical reservation of its target block size. So if your configured block size is 128 MiB, then every concurrently open block will deduct that value (logically) from the available remaining space the DataNode publishes to the NameNode.

This reservation helps manage space and guarantees a full block write to a client, so that a client that has begun writing its file never runs into an out-of-space exception midway. Note: when the file is closed, only the actual length is persisted, and the reservation calculation is adjusted to reflect the real used and available space. However, while the file block remains open, it's always considered to be holding a full block size.

Furthermore, the NameNode will only select a DataNode for a write if it can guarantee the full target block size. It will ignore any DataNodes it deems (based on their reported values and metrics) unfit for the requested write's parameters. Your error shows that the NameNode has stopped considering your only live DataNode when trying to allocate a new block.

As an example, 70 GiB of available space will prove insufficient once there are more than 560 concurrently open files (70 GiB divided into 128 MiB blocks). So the DataNode will 'appear full' at the point of ~560 open files, and will no longer serve as a valid target for further file requests.

Per your description of the insert, this is likely what is happening: each of the 300 chunks of the dataset may still carry varied IDs, resulting in a lot of open files requested per parallel task, for insert into several different partitions.
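The reservation arithmetic above can be sketched in a few lines of Python. This is a deliberately simplified model that assumes every open block reserves exactly one full block size and ignores replication, non-DFS reserved space and other details of real DataNode accounting; the function name is made up for illustration.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def max_open_blocks(available_bytes: int, block_size_bytes: int) -> int:
    """Number of concurrently open blocks whose full-size reservations
    fit into the available space (simplified model, see assumptions)."""
    return available_bytes // block_size_bytes


available = 70 * GIB

# With the 128 MiB block size from the example above
print(max_open_blocks(available, 128 * MIB))  # 560

# With a much smaller requested block size, e.g. 8 MiB
print(max_open_blocks(available, 8 * MIB))    # 8960
```

Under this model, the number of files that can be open at once scales inversely with the requested block size, which is why the block size a writer requests changes when the DataNode starts to 'appear full'.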
You could 'hack' your way around this by reducing the requested block size within the query (set dfs.blocksize to 8 MiB, for example), which influences the reservation calculation. However, this may not be a good idea for larger datasets as you scale, since it will drive up the file:block count and increase memory costs for the NameNode.

A better approach would be to perform a pre-partitioned insert (sort first by partition and then insert in a partitioned manner). Hive, for example, provides this as an option: hive.optimize.sort.dynamic.partition [1]. If you use plain Spark or MapReduce, their default partitioning strategy does exactly this.

[1] - https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.optimize.sort.dynamic.partition

Harsh J · ‎02-13-2019

The feature in C6.x is implicit and is aimed at supporting easier rolling upgrades: when the job jars ship exclusively with the job, changes to locally installed binaries will not affect it during upgrades. A release note item documents this here: https://www.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cm_600_new_features.html#concept_qpj_jrq_v2b

Harsh J · ‎02-12-2019

> Does the latter overwrite the former for mapreduce applications?

No, at least as of CDH 5.x, the two are additive. The yarn.application.classpath value goes on early (adding Common, HDFS and YARN), followed by mapreduce.application.classpath (adding just MR2). The reason they are separate is tied to another feature (available in CM 6.x) that lets you supply all framework jars as an archive along with the job, rather than rely on local, pre-installed locations on all worker hosts that are subject to change at any time outside of a container's runtime.

> There is also variable MR2_CLASSPATH that is included by default in mapreduce.application.classpath. Where is taken from?

This is exclusive to Cloudera Manager managed environments, and is a reserved env-var name used to assist Parcels that may choose to supply some jars as 'plugins' to an app or a service. All such env-vars are listed here: https://github.com/cloudera/cm_ext/wiki/Plugin-parcel-environment-variables. In most cases you can ignore this env-var, as it will usually be empty.

> Is the mapreduce.application.classpath relevant only for gateways from were application is submitted to yarn?

No, the values are just variable names, and are not substituted at the gateway. They are substituted only on the NodeManager, when the prepared container command/script actually executes. This lets you manage different install paths on different worker hosts, where each local environment points to the actual locations of the jars.
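To illustrate the additive ordering and the late substitution described above, here is a small Python sketch. It only models the idea and is not how YARN itself is implemented. The classpath entries, the variable names other than MR2_CLASSPATH, and the install paths are all assumptions chosen for the example.

```python
from collections import defaultdict
from string import Template

# Hypothetical config values; real defaults vary by distribution and version.
YARN_APPLICATION_CLASSPATH = [
    "$HADOOP_COMMON_HOME/*",
    "$HADOOP_HDFS_HOME/*",
    "$HADOOP_YARN_HOME/*",
]
MAPREDUCE_APPLICATION_CLASSPATH = ["$HADOOP_MAPRED_HOME/*", "$MR2_CLASSPATH"]


def submitted_classpath() -> str:
    """What leaves the gateway: variable names only, YARN entries first."""
    return ":".join(YARN_APPLICATION_CLASSPATH + MAPREDUCE_APPLICATION_CLASSPATH)


def expand_on_worker(classpath: str, env: dict) -> str:
    """Resolve variables against one worker's environment at launch time.
    Unset variables expand to an empty string and are then dropped."""
    lookup = defaultdict(str, env)
    entries = [Template(entry).substitute(lookup) for entry in classpath.split(":")]
    return ":".join(entry for entry in entries if entry)


# Two hypothetical workers with different local install paths
worker_a = {
    "HADOOP_COMMON_HOME": "/opt/a/hadoop",
    "HADOOP_HDFS_HOME": "/opt/a/hadoop-hdfs",
    "HADOOP_YARN_HOME": "/opt/a/hadoop-yarn",
    "HADOOP_MAPRED_HOME": "/opt/a/hadoop-mapreduce",
}
worker_b = {name: path.replace("/opt/a", "/srv/b") for name, path in worker_a.items()}

print(submitted_classpath())
print(expand_on_worker(submitted_classpath(), worker_a))
print(expand_on_worker(submitted_classpath(), worker_b))
```

The same submitted string resolves to different paths on each worker, and the unset MR2_CLASSPATH simply drops out of the result.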

Harsh J · ‎09-10-2018

The fencing config requirement still exists, and you could configure a valid fencer if you wish to. With JournalNodes involved, however, you can simply use the following as your fencer, as the QJM's single elected writer model already fences the NameNodes by crashing them:

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/bin/true)</value>
</property>

Harsh J · ‎09-08-2018

@phaothu,

> My system have 2 datanode, 2 namenode, 3 journalnode, 3 zookeeper service

To repeat, you need to run the ZKFailoverController daemons in addition to this setup. Please see the guide linked in my previous post and follow it entirely for the command-line setup.

Running just ZK will not give you an HDFS HA solution - you are missing a crucial daemon that interfaces between ZK and HDFS.

Harsh J · ‎09-05-2018

An HA HDFS installation requires you to run Failover Controllers on each of the NameNode hosts, along with a ZooKeeper service. These controllers take care of transitioning NameNodes such that only one is active and the other becomes standby.

It appears that you're using a CDH package based (non-CM) installation here, so please follow the guide starting at https://www.cloudera.com/documentation/enterprise/5-14-x/topics/cdh_hag_hdfs_ha_intro.html#topic_2_1_3__section_jnx_jzp_15, following the instructions under the 'Command-Line' parts instead of the Cloudera Manager ones.

@phaothu wrote:
> But the problem is how start the namenode which I had stop again ? I do the following :
> sudo -u hdfs hdfs namenode -bootstrapStandby -force
> /etc/init.d/hadoop-hdfs-namenode start
> With above process sometime namenode start ok with standby mode , but sometime it start with active mode and then I have 2 active node (split brain !!) So what I have wrong , what is the right process to start a namenode had stop again

Just simply start it up. The bootstrap command must only be run if it's a fresh new NameNode, not on every restart of a previously running NameNode.

It's worth noting that Standby and Active are just states of the very same NameNode. The StandbyNameNode is not a special daemon, it's just a state of the NameNode.

Member Since	‎07-31-2013 07:21 AM
Posts	1,924
Kudos received	461

Cloudera Community

Re: S3Guard Suggested to help fix Consistency

Re: Failed to start namenode. java.io.FileNotFound...

Re: sqoop import issue

Re: Efficient ways to store many images files

Re: S3 loading into HDFS

Re: Using G1GC of JDK 8 on Cloudera

Re: Minimum hardware configuration for CCA 175 Exa...

Re: Sqoop: connect to random dbms?

Re: does sharelib jars needs to be explicitly incl...

Re: Why can't I partitioned a 1 gigabyte dataset i...

Re: difference between 'mapreduce.application.clas...

Re: difference between 'mapreduce.application.clas...

Re: Process to Start StandBy NameNode

Re: Process to Start StandBy NameNode

Re: Process to Start StandBy NameNode