Member since: 09-17-2015
Posts: 70
Kudos Received: 79
Solutions: 20

My Accepted Solutions
Views | Posted
---|---
2298 | 02-27-2018 08:03 AM
2119 | 02-27-2018 08:00 AM
2455 | 10-09-2016 07:59 PM
862 | 10-03-2016 07:27 AM
930 | 06-17-2016 03:30 PM
02-07-2016
11:28 AM
8 Kudos
In the big data and distributed systems world you may have done a world-class job in dev and unit testing, but chances are you still did it on sample data, and in dev you might not have run it on a truly distributed system so much as on one machine simulating distribution. So on the prod cluster you could rely on logs, but sometimes you would also like to connect a remote debugger and the other tools you are used to. In this blog post I will go over an example in Java and Eclipse, partly because those are what I use, and also because I see mostly Scala examples, so let's give Java a little love.

Setting up

I have installed a Hortonworks sandbox for this exercise; you can download it here. You will also need to open up a port to bind on; in my last post I used port 7777, and I'll stick with it here as well. First we will use a very standard word count example you can get in all the Spark tutorials (a minimal Java sketch of it is included at the end of this post). Set up a quick Maven project to bundle your jar and any dependent libs you might be using. Notice the breakpoint on the 26th line; it will make sense later on. Once your code is done, your unit tests have passed and you are ready to deploy to the cluster, let's go ahead and build it and push it out to the cluster.

Spark deployment mode

Spark can run on three cluster managers: Standalone, YARN and Mesos. Most examples talk about Standalone, so I will focus on YARN. In YARN every application has an Application Master process, started in the application's first container, which is responsible for requesting resources for the application and for driving the application as a whole. For Spark this means you get to choose whether the YARN Application Master runs the whole application, and hence the Spark driver, or whether your client stays active and keeps the Spark driver.

Remote debug launch

Now that your jar is on the cluster and you are ready to debug, you need to submit your Spark job with debug options so your IDE can bind to it. Depending on your Spark version there are different ways to go about this; I am on Spark > 1.0, so I will pass the JDWP agent string through the driver's Java options (a reconstructed example is sketched at the end of this post). Notice address=7777, back to the port I talked about earlier, and suspend=y to have the process wait for my IDE to bind. Let's now launch our normal spark-submit command. If we look at the line a little bit closer, I have specified low memory settings for my specific sandbox context and given an input file, Hortonworks, and an output file, sparkyarn11.log. Notice we are using the --master yarn-client deployment mode here. As the prompt shows us, the system is now waiting on port 7777 for the IDE to bind.

IDE

Now back in our IDE we will use the remote debug function to bind to our Spark cluster. Once you run the debug configuration, the prompt on your cluster will show your IDE binding, and you can now run your work from Eclipse directly. Back in Eclipse you will be prompted to switch to the debug view, and you can go ahead and skip from breakpoint to breakpoint. Remember the breakpoint I pointed out in the code: your IDE is sitting on this breakpoint waiting to move on. In the variables panel on the right you can see the file input variable we get from the command line; it is set to hdfs://sandbox.hortonworks.com/user/guest/Hortonworks, exactly like our command line input.

Great, we have just set up our first remote debug session on Spark. You should go ahead and try it with --master yarn-cluster and see what it changes.
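For reference, here is a minimal sketch of the kind of Java word count referred to above, assuming the Spark 1.x Java API; the class name and argument convention are placeholders, and the line numbering will not match the original class (whose breakpoint sat on its own line 26).

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        // args[0] = HDFS input path, args[1] = HDFS output path
        SparkConf conf = new SparkConf().setAppName("JavaWordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile(args[0]);

        // Spark 1.x FlatMapFunction returns an Iterable, so a List works here.
        JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")));

        JavaPairRDD<String, Integer> counts = words
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        // A breakpoint around the action below is a good candidate for the remote debug session.
        counts.saveAsTextFile(args[1]);

        sc.stop();
    }
}
```

The submit line itself was shown as a screenshot, so here is only a plausible reconstruction, assuming SPARK_SUBMIT_OPTS to attach the JDWP agent to the client-side driver; the jar name, class name and memory settings are placeholders.

```sh
# Have the spark-submit JVM (the driver, in yarn-client mode) wait for the IDE on port 7777.
export SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=7777"

# Low memory settings for the sandbox; jar and class names are placeholders.
spark-submit \
  --class WordCount \
  --master yarn-client \
  --driver-memory 512m \
  --executor-memory 512m \
  --num-executors 2 \
  wordcount-example.jar \
  hdfs://sandbox.hortonworks.com/user/guest/Hortonworks \
  sparkyarn11.log
```

With suspend=y the driver JVM reports that it is listening on port 7777 and waits there until the Eclipse remote debug configuration (same host and port) attaches.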
01-14-2016
06:58 PM
1 Kudo
Hello vijaya, I suppose you will find most of the information you need on how to set it up and configure it in our documentation: http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.0.0/bk_ambari_views_guide/content/ch_using_ambari_views.html. Of course this documentation also depends on your specific version, so please check it before using the link.
01-13-2016
10:14 PM
2 Kudos
Hello aaron, Jeff Sposetti (Ambari PM) mentioned: "we made "non-hdfs" clusters (i.e. storm or kafka-only clusters) possible in Ambari 2.1.x". So a minimum viable cluster of ZooKeeper only plus Storm has been possible since Ambari 2.1.x.
12-14-2015
03:38 PM
8 Kudos
Hey Andrew, I am not aware of any "hard" limitation in Hive in regards to column count; there are some on column size though. That being said, a practical restriction on column count would also depend on the file format: ORC, having indexes and predicate pushdown, does not behave as a text file would. ORC has configuration for the number of rows that are grouped together for an index entry (see the sketch at the end of this answer). In Hive issue https://issues.apache.org/jira/browse/HIVE-7250, for example, more than 1K columns created memory pressure in ORC, resulting in OOM; in tests, 15K columns were loaded and OOM was only seen at 20K columns.

Regarding the HBase/Phoenix tradeoff, I would not base my decision on this metric. HBase and Phoenix really shine on rowkey lookups with a custom-built rowkey, and by custom I mean containing some of your search logic. Regardless of whether you have 1 or 1 million columns, if you force HBase to do a full scan over millions of rows, performance will not shine; in this scenario the granular-lookup pattern matters more than the number of columns. Once you have established that you can search on a rowkey logic, then yes, the column-count/performance tradeoff can be dealt with differently in HBase/Phoenix, using column families, filters, etc. If you were however to put a Hive external table on it, you would come back to your initial situation.
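As a side note on that index grouping, the stride is set per table; a minimal sketch, assuming a hypothetical wide table and the standard orc.row.index.stride table property:

```sql
-- Hypothetical wide table; orc.row.index.stride controls how many rows
-- are grouped together per row-index entry (the default is 10000).
CREATE TABLE wide_events (
  event_id BIGINT,
  col_0001 STRING,
  col_0002 STRING
  -- ... potentially thousands more columns
)
STORED AS ORC
TBLPROPERTIES ("orc.row.index.stride" = "10000");
```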
11-02-2015
09:02 PM
1 Kudo
Hey, you can bulk load into HBase in several different manners. The importTsv tool has been out there for a while. However, if your data is in ORC with a Hive table on top, the Hive bulk load is an easier option with fewer moving parts. This slide deck from Nick has a lot of info, http://fr.slideshare.net/HBaseCon/ecosystem-session-3a; slide 12 is the one you want to look at. Essentially:

set hive.hbase.generatehfiles=true
set hfile.family.path=/tmp/somewhere (this can also be a table property)

This allows you to do an insert with the result of a SQL statement (sketched below), which is a little more agile than having to go down the CSV way. Careful: the hbase user will be picking up the generated files.
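To make the flow concrete, a minimal sketch; the table names, the column family f and the staging path are assumptions, and the target is assumed to be a Hive table mapped onto HBase via the HBaseStorageHandler:

```sql
-- Generate HFiles from a Hive query instead of writing through the HBase API.
SET hive.hbase.generatehfiles=true;
-- The last path element must match the HBase column family name (here: f).
SET hfile.family.path=/tmp/hfile_staging/f;

-- hbase_events is the HBase-backed Hive table, source_orc the existing ORC table.
INSERT OVERWRITE TABLE hbase_events
SELECT rowkey, col1, col2
FROM source_orc;

-- The HFiles left under /tmp/hfile_staging/f are then bulk-loaded
-- (e.g. with completebulkload), which is where the hbase user needs to be able to read them.
```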
11-01-2015
10:20 PM
1 Kudo
The first thing to take into account when using a coprocessor is that it can break your HBase in case of error, so this configuration can help you: hbase.coprocessor.abortonerror; setting it to false will still allow your HBase cluster to start. When you decide to use coprocessors:

- hbase.coprocessor.enabled: Enables or disables coprocessor loading globally
- hbase.coprocessor.user.enabled: Enables or disables user coprocessor loading, i.e. coprocessors declared from the shell at table creation
- hbase.coprocessor.region.classes: A comma-separated list of coprocessors that are loaded by default on all tables

These settings set the ground rules for how coprocessors can be used, and it might be a good decision to be restrictive in how they can be added. Beyond that, a coprocessor is an extension of the HBase cluster or table functionality, so there is no extra security on top of the standard HBase security. You are however allowed to put your own logic in the coprocessor if it makes sense: a coprocessor at a higher priority may preempt action by those at lower priority by throwing an IOException (or a subclass of it). The coprocessor blog has an example of an access-control coprocessor: https://blogs.apache.org/hbase/entry/coprocessor_introduction. If you are inclined to build more access logic, this is a good starting point; a minimal sketch of such an observer follows below. Hope this helps.
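A minimal sketch of that kind of access logic, assuming the HBase 1.x BaseRegionObserver API; the class and the "restricted" column family are hypothetical, and the observer would be registered through hbase.coprocessor.region.classes or on the table itself.

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Rejects writes to a hypothetical "restricted" column family.
public class RestrictedFamilyObserver extends BaseRegionObserver {

    private static final byte[] RESTRICTED = Bytes.toBytes("restricted");

    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                       Put put, WALEdit edit, Durability durability) throws IOException {
        // Throwing an IOException here aborts the Put before it reaches the region.
        if (put.getFamilyCellMap().containsKey(RESTRICTED)) {
            throw new IOException("Writes to the 'restricted' column family are not allowed");
        }
    }
}
```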
10-17-2015
07:37 AM
Is this for all RegionServers or just one "abnormal" one? Are you sure there are no bad disks, and that all the standard OS tunings have been done? This JIRA ticket was meant to help investigate these issues: https://issues.apache.org/jira/browse/HBASE-11240
10-17-2015
07:24 AM
Hey, this is a tough one. HDFS does not give you much control over data placement on a distributed cluster, for many reasons like latency, data transport across all the nodes, etc. HBase, being "distributed" on top of HDFS, gets the same issues and a couple of others. Add to this that HDFS replicates data around the cluster, including HBase data, so respecting "region policies" would not work; in some situations HBase data is not even local to the region server servicing it. So as you can see, guaranteeing a custom, specific locality of data in a Hadoop cluster is no easy feat.
10-14-2015
09:39 PM
3 Kudos
Would paged queries be a solution? https://phoenix.apache.org/paged.html
10-11-2015
12:28 PM
1 Kudo
Hey Wes, these are a good start. The HBase SME wiki has a couple of links as well for HBase tuning and staffing. This link also has info: http://fr.slideshare.net/lhofhansl/h-base-tuninghbasecon2015ok. Then there are element-specific considerations like cache algorithms, number of versions, number of HFiles, etc. that can vary depending on your usage.