Member since: 07-31-2013
Posts: 1924
Kudos Received: 462
Solutions: 311
09-11-2018
06:04 PM
2 Kudos
Adding the Region Servers to the cluster itself is fairly simple: just use the Add Role Instances button on the HBase -> Instances page, or the Add New Hosts button on the Hosts page of Cloudera Manager. No configuration changes are required beyond this wizard process.

In most situations, adding a new Region Server host also means adding a new DataNode host. So before you start the newly added Region Servers, ensure you've first run an HDFS Balancer to completion so the new DataNodes are populated (up to an acceptable threshold of imbalance, say, ±10%).

When you start the new Region Servers, they begin empty and wait for the Master to assign them regions. For that to happen, the existing regions need to be rebalanced across the now-larger cluster. The HBase Master runs the region balancer regularly, but only if there are no regions stuck in assignment (no regions in transition). Ensure via HBCK or via the active HMaster Web UI that there are no regions in transition before you start the newly added Region Servers.

Once you observe the HBase region balancer kick in and complete (it typically takes ~5 minutes to begin, and you can follow along in the active HMaster Web UI), you should start seeing some regions being served by your new Region Servers, and the average number of regions on the older Region Servers should drop slightly. At this point the regions on the new servers will each have poor data locality (a value under 90-95% can be considered poor, especially for tables used by low-latency applications), so it is worth running the major_compact HBase shell command on at least the most important tables in your environment.
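If it helps, the sequence above roughly translates to the commands below (a sketch only; the table name is a placeholder, and on newer HBase releases HBCK is replaced by hbck2):
~> sudo -u hdfs hdfs balancer -threshold 10    # populate the new DataNodes before starting the new Region Servers
~> hbase hbck                                  # check the summary for inconsistencies / regions in transition (or use the HMaster Web UI)
~> echo "balancer" | hbase shell               # ask the Master to run the region balancer now rather than waiting
~> echo "major_compact 'my_important_table'" | hbase shell   # restore locality on the tables that matter most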
09-10-2018
09:55 PM
Please open a new topic as your issue is unrelated to this topic. This helps keep issues separate and improves your search experience. Effectively your issue is that your YARN Resource Manager is either (1) down, due to a crash explained in the /var/log/hadoop-yarn/*.out files, or (2) not serving on the external address that quickstart.cloudera runs on, for which you need to ensure that 'nslookup $(hostname -f)' resolves to the external address in your VM and not localhost/127.0.0.1.
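For reference, the checks above look roughly like this inside the VM (the service name assumes the CDH packaging of YARN; adjust if yours differs):
~> nslookup $(hostname -f)                            # should resolve to the VM's external address, not localhost/127.0.0.1
~> sudo service hadoop-yarn-resourcemanager status    # is the Resource Manager actually running?
~> ls -l /var/log/hadoop-yarn/*.out                   # if it is down, these files should explain the crash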
09-10-2018
08:30 PM
The fencing config requirement still exists, and you could configure a valid fencer if you wish to, but with JournalNodes involved you can simply use the following as your fencer, since the QJM's single-elected-writer model already fences a stale NameNode by making it abort once it can no longer write to the journal quorum:
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/bin/true)</value>
</property>
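With that in place you can sanity-check failover from the command line, for example (nn1/nn2 are placeholders for whatever NameNode IDs you have defined under dfs.ha.namenodes.<nameservice>):
~> hdfs haadmin -getServiceState nn1
~> hdfs haadmin -getServiceState nn2
~> hdfs haadmin -failover nn1 nn2   # a manual failover should succeed with the shell(/bin/true) fencer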
09-10-2018
08:00 PM
Here's what I do to build Apache Oozie 5.x from the CDH6 (6.0.0) sources:
~> git clone https://github.com/cloudera/oozie.git
~> cd oozie/ && git checkout cdh6.0.0
~> bin/mkdistro.sh -DskipTests -Puber
… (takes ~15+ minutes if building for the first time) …
~> ls -lh distro/target/
# Look for oozie-5.0.0-cdh6.0.0-distro.tar.gz
09-10-2018
07:12 PM
1 Kudo
> but getLocations() is fine, we are using fair scheduler, which should take getLocations() and schedule the job according to data locality, right?

I'm not sure I entirely follow. Do you mean to say you return constant values from getLocations() in your implementation of AccInputSplit? If so, then yes, I believe it should work. In that case, could you share masked snippets of your toString() and getLocations() implementations for a closer look?

However, if your getLocations() is intended to be dynamic, then you must implement the write/readFields serialization methods correctly. This is because the resource requests are made by the Application Master after it reads and parses the splits file prepared by the client (the client calls write to serialize the split objects, and the AM calls readFields to deserialize them). If your readFields is a dummy method, the objects reconstructed in the AM runtime will not carry all the data you intended them to carry from the client end.
09-09-2018
07:48 PM
1 Kudo
Postgres is sensitive to how you connect to it. You should be using the exact address it listens on, as only that will be allowed by the default configuration. Your command carries the IP 0.0.0.0. While I'm uncertain whether you've masked it or are truly using a wildcard IP as the server address, you should ideally be using the exact hostname/IP that the Postgres service is listening on. Your Postgres server's configuration (postgresql.conf) carries the listen_addresses entry that designates this. Take a look at this thread over at Stack Exchange: https://dba.stackexchange.com/a/84002
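As an illustration (the config path, host, user and database names below are placeholders for a typical installation; adjust them to yours):
~> grep listen_addresses /var/lib/pgsql/data/postgresql.conf   # what the server actually listens on
~> psql -h db-host.example.com -p 5432 -U myuser -d mydb       # connect using that exact host, not 0.0.0.0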
09-09-2018
07:32 PM
As the error notes, support for writing from a stream to a JDBC sink is not present in Spark yet: https://issues.apache.org/jira/browse/SPARK-19478
Take a look at this past thread where an alternative, more direct approach is discussed: http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Spark-Streaming-save-output-to-mysql-DB/td-p/25607
09-09-2018
07:26 PM
> Soft memory limit exceeded

This is classically caused by exhaustion of the memory granted to Kudu via the Kudu -> Configuration -> 'Kudu Tablet Server Hard Memory Limit' property. What is it set to, and has this repeated after raising it?

> I'm also seeing things in the logs about removing servers from a tablet's cache and WebSocket queue is full, discarding 'status' message.

Could you share some of these log snippets so we can analyse them more specifically? They don't sound directly related to your issue, so having the full log lines would help ascertain whether they are the cause behind the server-side rejection of the inserts.
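If you want to check memory pressure on a tablet server outside of Cloudera Manager, its web UI exposes the memory trackers (assuming the default tablet server web port of 8050; the hostname is a placeholder, and as far as I recall the underlying gflag behind that CM property is memory_limit_hard_bytes):
~> curl -s http://tserver-host.example.com:8050/mem-trackers   # current usage vs. the configured limits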
09-08-2018
05:29 AM
@phaothu,

> My system have 2 datanode, 2 namenode, 3 journalnode, 3 zookeeper service

To repeat: you need to run the ZKFailoverController daemons in addition to this setup. Please see the guide linked in my previous post and follow it entirely for the command-line setup. Running just ZK will not grant you an HDFS HA solution - you are missing a crucial daemon that interfaces between ZK and HDFS.
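Concretely, for a command-line setup that means something like the following in addition to your existing daemons (the exact start command depends on your Hadoop version/packaging, and nn1 is a placeholder for your configured NameNode ID):
~> hdfs zkfc -formatZK                 # one-time: initialise the HA state znode in ZooKeeper
~> hdfs --daemon start zkfc            # on each NameNode host (Hadoop 3 syntax; older releases use hadoop-daemon.sh or a packaged hadoop-hdfs-zkfc service)
~> hdfs haadmin -getServiceState nn1   # verify which NameNode is active/standby once both ZKFCs are up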
09-07-2018
09:14 AM
Are you certain your custom split class' readFields method is initialising the locations correctly when deserialising? I can only guess at what's wrong for this specific situation without the relevant source bits, sorry. The tasks all receive the same splits file you've inspected. Does a local job runner test work fine?
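For the local job runner test, I mean something along these lines (the jar, driver class and paths are placeholders, and this assumes your driver uses ToolRunner so the -D option is picked up):
~> hadoop jar your-job.jar com.example.YourDriver -Dmapreduce.framework.name=local /input /output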