Member since: 05-30-2018
Posts: 1322
Kudos Received: 715
Solutions: 148
08-16-2016
09:56 PM
4 Kudos
I am a junkie for faster and cheaper data processing, which is exactly why I love IaaS. My personal real-world experience with the typical IaaS providers has been that performance is generally slow. That is not to say Hadoop/HBase/Spark/etc. jobs will not perform; however, you need to be familiar with what you're getting into and set realistic expectations. Recently I met the IaaS vendor Bigstep. Their liquid-metal offering provides all the greatness that comes with bare-metal on-prem installations, but in the cloud. Options for bonded NICs and DAS had me at hello. I decided to run the same performance test I ran on AWS (article here) on Bigstep. All the details of the scripts I ran are in that article. Just a quick note: these performance articles do not advocate for or against any specific IaaS provider, nor do they reflect on the HDP software. I simply want to run a repeatable processing test on near-identical IaaS hardware profiles and gather performance statistics. Interpret the numbers as you wish.

1x Master Node Hardware Profile
CPU: 2x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz (8 x 2.40 GHz)
RAM: 128 GB DDR3 ECC
Local storage disks: 1 NVMe
Disk size: 745 GB
Network bandwidth: 40 Gbps

3x Data Nodes Hardware Profile
CPU: 2x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz (8 x 2.40 GHz)
RAM: 256 GB DDR3 ECC
Local storage disks: 12 HDD
Disk size: 1863 GB
Network bandwidth: 40 Gbps

TeraGen results: 11 mins 49 secs. I want to remain as objective as possible, but WOW. That is simply one of the fastest TeraGen results I have ever seen.

TeraSort results: 51 mins 12 secs. Fastest I have seen in the cloud so far. On-prem with one additional node I was able to get it down to 40 mins, so 51 mins with one fewer node is pretty good.

TeraValidate results: 4 mins 42 secs. This again was the fastest performance I have seen on 1 TB using TeraValidate.

I hope this provides some basic insight into the similar tests I have performed so far on various IaaS providers. In the coming weeks/months I plan on publishing performance test results using Azure and GCP.
It is extremely important to understand that zero performance tweaking has been done. This does not reflect how HDP runs on IaaS providers in general, nor does it reflect anything about the IaaS provider itself. I simply want to run minimally tweaked TeraGen/TeraSort/TeraValidate tests, with the same parameters and similar hardware profiles, and document the results. That's it. Keep it simple.
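For reference, here is a minimal sketch of the kind of commands behind these runs. The exact scripts and parameters are in the AWS article and on my GitHub page; the jar path and output directories below are illustrative assumptions, not the exact values used.

# Generate 1 TB of data: 10 billion rows x 100 bytes per row
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
  teragen 10000000000 /tmp/teragen

# Sort the generated data
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
  terasort /tmp/teragen /tmp/terasort

# Validate that the sorted output is globally ordered
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
  teravalidate /tmp/terasort /tmp/teravalidate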
... View more
Labels:
11-16-2017
10:07 PM
Hi Smanjee, I think the link is missing here: "This is well documented here so no reason for me to regurgitate that."
... View more
07-20-2016
09:46 PM
6 Kudos
HBase is just awesome. Yes, I will start with that. HBase tuning, like tuning any other service within the ecosystem, requires an understanding of the configurations and the impact (good or bad) they may have on the service. I will share some simple quick hitters you can check for to increase performance on your Phoenix/HBase cluster.
What is your HFile locality? A simple check in Ambari should show you the percentage; the metric is Regionserver.Server.percentFilesLocal. What does this metric mean? More info here. Thanks to @Predrag Minovic: "That's the ratio of HFiles associated with regions served by an RS having one replica stored on the local DataNode in HDFS. The RS can access local files directly from the local disk if short-circuit reads are enabled. If you, for example, run an RS on a machine without a DN, its locality will be zero. You can find more details here, and on the HBase Web UI page you can find locality per table." So if your percentage is lower than 70, it's time to run a major compaction; this will bring your HFile locality back in order (a sketch of the command is just below).

How many regions are being hosted per RegionServer? Generally I don't go over 200 regions per RegionServer; I personally set this to 150-175. This allows for the failover scenario: when an RS dies, its regions need to be redistributed to the available RegionServers, and the surviving RSs take on the additional load until the failed RS is back online. To allow for this failover I don't like to go over 150-175 regions per RS. Others may tell you different; from real-world experience I don't go over that limit. Do as you wish. A simple way to check how many regions you have hosted per RS is to go to the HBase Master UI through Ambari; on the front page you will find the number.
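If locality has dropped, a major compaction can be kicked off from the hbase shell. A minimal sketch, assuming a hypothetical table named my_table:

# Rewrite the table's HFiles locally; locality should climb back up afterwards
echo "major_compact 'my_table'" | hbase shell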
What are your region sizes? As a general practice I don't go over a 10 GB region size. Based on your use case it may make sense to increase or decrease this size, but as a general starting point 10 GB has worked for me. From there you can have at it.
Phoenix: are you using indexes? Look at your query. Does it suffer from terrible performance? You have been told Phoenix/HBase queries are extremely fast. The first place to look is your WHERE clause: every field in your WHERE clause must be indexed using secondary indexes. More info here. Use global indexes for read-heavy use cases. Use local indexes for write-heavy use cases. Leverage covered indexes when you are not fetching all the columns from the table. So you may ask: what if I have balanced reads/writes? What type of index should I use? Take a look at my post here. (A sketch of each index type is at the end of this article.)

Is your cluster sized properly? This goes with the theme of RegionServer count and region size. At the end of it all you may need more nodes. Remember, this is not the database we are all familiar with where you just beef up the box; with HBase, adding more cheap nodes DOES increase performance. It is all in the architecture.

What does your GC activity look like when you suffer from slow performance? This one I find many don't have a clue about, and it is very important. Analyze the type of GC activity happening on your HBase RegionServers. Too much GC? Tune the JVM params, use G1, etc. How to monitor GC? I wrote a post on it here.

I hope this helps with the awesomeness HBase/Phoenix provides. This is just the beginning. As I engage with more customers I will continue to add more patterns to this article. Now go tune your cluster!
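To make the index types concrete, here is a minimal sketch of each, assuming a hypothetical table MY_TABLE with columns CITY and POPULATION and an HDP sandbox ZooKeeper quorum; adjust the names and the sqlline path to your environment:

# Launch the Phoenix client and create one index of each type
/usr/hdp/current/phoenix-client/bin/sqlline.py localhost:2181:/hbase-unsecure <<'EOF'
-- Global index: served from its own table, best for read-heavy workloads
CREATE INDEX city_idx ON MY_TABLE (CITY);
-- Local index: co-located with the data regions, best for write-heavy workloads
CREATE LOCAL INDEX city_local_idx ON MY_TABLE (CITY);
-- Covered index: INCLUDE the selected columns so reads never touch the data table
CREATE INDEX city_covered_idx ON MY_TABLE (CITY) INCLUDE (POPULATION);
EOF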
... View more
Labels:
07-09-2016
01:30 AM
6 Kudos
Short Description: TeraGen and TeraSort performance testing on AWS.

This article should be used with extreme care. Do not use it as a benchmark. I performed this test simply to run a quick 1-terabyte TeraGen on AWS and determine what type of performance I can get from MapReduce with VERY LITTLE configuration tweaking/tuning. On my GitHub page here you will find the teragen script and the hadoop, yarn, mapred, and capacity scheduler configurations used during testing.

Hardware (Master & Data nodes): 1 master, 3 data nodes; d2.4xlarge, 16 vCPU, 122 GB RAM, 12 x 2000 GB storage (max)

TeraGen results: 1 hr 6 mins 38 secs
TeraSort results: 1 hr 34 mins 20 secs
TeraValidate results: 25 mins 27 secs
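"Very little tweaking" here means the handful of per-job settings you can pass on the command line. The flags below are real MapReduce options, but the values and paths are illustrative assumptions, not the exact configurations from the repo:

# Example of passing per-job tuning on the command line (values are assumptions)
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
  teragen \
  -Dmapreduce.job.maps=96 \
  -Dmapreduce.map.memory.mb=4096 \
  -Dmapreduce.map.java.opts=-Xmx3276m \
  10000000000 /tmp/teragen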
... View more
Labels:
07-06-2016
10:28 PM
5 Kudos
This tutorial will show how to export data out of an HBase table into CSV format. We will use airport data from the American Statistical Association, available here. Assuming you have a sandbox up and running, let's start. First, ssh into your sandbox and switch user to hdfs:

sudo su - hdfs

Then grab the airport data by issuing a wget:

wget http://stat-computing.org/dataexpo/2009/airports.csv

For my example the file is located at /home/hdfs/airports.csv. Now let's create an HBase table called "airports" with column family "info". Do this in the hbase shell. (The elided hbase shell and HDFS commands are sketched at the end of this tutorial.) Now that the table is created, let's load it. Get out of the hbase shell, and as user hdfs run the following to load the table:

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns="HBASE_ROW_KEY,info:iata,info:airport,info:city,info:country,info:lat,info:long" airports hdfs://sandbox.hortonworks.com:/tmp/airports.csv

That will kick off a MapReduce job to load the airports table in HBase. Once that is done you can do a quick verification in the hbase shell by running:

count 'airports'

You should see 3368 records in the table. Now let's log into the pig shell. We will create a variable called airport_data, into which we will load our HBase table, by issuing:

airport_data = LOAD 'hbase://airports'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'info:iata,info:airport,info:city,info:country,info:lat,info:long', '-loadKey true')
AS (iata,airport,city,country,lat,long);

Now that we have our data in a variable, let's dump it to HDFS in CSV format by issuing:

store airport_data into 'airportData/export' using PigStorage(',');

So we have dumped the export into the HDFS directory airportData/export. Let's go view it. And there you go: we have loaded data into an HBase table and exported data from the table using Pig in CSV format. Happy pigging.
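For completeness, a minimal sketch of the steps the walkthrough references but does not spell out; the HDFS /tmp upload path matches the ImportTsv command above, but treat the details as assumptions:

# Upload the CSV to HDFS so ImportTsv can read it (the command above points at /tmp)
hdfs dfs -put /home/hdfs/airports.csv /tmp/airports.csv

# In the hbase shell: create the table with column family "info"
echo "create 'airports', 'info'" | hbase shell

# After the Pig store: view the exported CSV part files
hdfs dfs -cat 'airportData/export/part-*' | head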
... View more
Labels:
07-08-2016
01:28 AM
@slachterman Good catch. Fixed.
... View more
06-28-2016
10:19 PM
5 Kudos
How to get a docker image up and running which encapsulates a PyCharm IDE integrated with Spark and PyBuilder. The IDE resides in the docker container and is displayed on your laptop/machine. The point is an isolated development environment which has PyCharm integrated with Spark. Why? I am a Spark developer and spend significant time trying to build an integrated environment. I am spending way too much time on integration before doing what I get paid to do: develop! Creating an isolated environment which is integrated with Spark and a CI tool, easily spun up and down, and repeatable is something which would accelerate my efficiency.

Download the latest VirtualBox from here. To run docker containers or build images, a docker machine is required; download Docker Machine from here. Download XQuartz to display the IDE on your laptop. View my docker page for information on the docker image here. Clone my PyCharm GitHub repo; you are doing this to bootstrap the sample code I have built into your docker container during launch. For example, I performed the git clone in /Users/smanjee/docktest:

git clone https://github.com/sunileman/pycharm.git

To start this tutorial, start the docker machine in a new terminal. For example, on my laptop the start script is:

/Applications/Docker/Docker*app/Contents/Resources/Scripts/start.sh

Run docker-machine env to check the IP your machine is assigned (informational only).

Pull the image:

docker pull sunileman/pycharm

Build the image:

docker build -t sunileman/pycharm .

Open another terminal and start port forwarding:

socat TCP-LISTEN:6000,reuseaddr,fork UNIX-CLIENT:\"$DISPLAY\"

Get your IP address (not the docker machine's), then run the image:

docker run -it -v /tmp/.X11-unix/:/tmp/.X11-unix/ -v ~/docktest/pycharm/PycharmProjects:/root/PycharmProjects -v ~/docktest/pycharm/.Pycharm40:/root/.PyCharm40 -e DISPLAY=XXX.XX.XX.X:0 --rm sunileman/pycharm

Replace XXX.XX.XX.X with your IP, and replace ~/docktest/pycharm/PycharmProjects and ~/docktest/pycharm/.Pycharm40 with your paths to the PyCharm project and settings directories you cloned from my GitHub repo. Click on "I do not have a previous version", click OK, then click OPEN to open the project you mounted into the docker container, and find the PyCharm project to open.

Now the project has been imported. So you have the project imported into your IDE, which is running within the docker container. To prove the IDE is connected/integrated with Spark, simply run the python file and you will see the Spark modules have been imported.
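If you'd rather smoke-test the Spark integration from a shell instead of the IDE, something like this should work inside the container; that pyspark is importable by the container's python is an assumption about this image:

# Inside the container: build a local SparkContext and print its version
python -c "from pyspark import SparkContext; sc = SparkContext('local', 'smoke-test'); print(sc.version); sc.stop()"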
... View more
Labels:
05-24-2016
03:53 PM
@Constantin Stanca, what do you mean by "Additionally, you need to start the JVM with something like this in order to be able to truly access the JVM remotely"? JVMs start as they normally do when using this tool.
... View more
05-20-2016
03:20 PM
I will give this a try and I'll post the results. For Windows and DBVisualizer, there's an article with step-by-step details: DBVisualizer Windows. For Tableau: http://kb.tableau.com/articles/knowledgebase/connecting-to-hive-server-2-in-secure-mode For SQuirreL SQL: https://community.hortonworks.com/questions/17381/hive-with-dbvisualiser-or-squirrel-sql-client.html
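For any of these JDBC clients the key piece is the connection URL. A minimal sketch of the URL format for a Kerberos-secured HiveServer2, shown here via beeline; the host, port, and realm are placeholders, not values from the linked articles:

# Swap in your HiveServer2 host and your cluster's Kerberos realm
beeline -u "jdbc:hive2://hs2-host.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM"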
... View more