Member since
05-30-2018
1322
Posts
715
Kudos Received
148
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| 9930 | 07-20-2016 07:06 PM |
09-13-2016
08:52 AM
4 Kudos
I often hear stories of wanting faster performance from Hadoop & spark without knowing basic statistics within ones environment. One of the first questions I ask is whether the hardware can perform at the level which is being expected. The software is still bound to the physics of the hardware. If your IO disk speed is 10MB per sec, Hadoop/Spark nor any other software will magically make that disk speed faster. Again we are bound to the physical limits of the hardware we choose. What makes Hadoop and other distributed processing engines amazing is its ability to add more "cheap" nodes to the cluster to increase performance. However we should be aware the maximum throughput per node. This will help level set expectations before committing to any SLA bound to performance. Typically I love to use the sysbench tool. SysBench is a modular & multi-threaded benchmark tool
for evaluating OS parameters ie. CPU, ram, IO, and mutex. I use sysbench prior to installing any software outside the kernel and pre/post Hadoop/Spark upgrades. Pre/post upgrades should not have any impact to your OS benchmarks but I play it safe. My neck is on the line when I commit to a SLA so I rather play it safe. The below tests I generally wrap in a shell script for ease of execution. For this article I call out each test for clarification. RAM test I start with testing RAM performance. This test can be used to benchmark sequential memory reads or writes. I test both. To test read performance I set memory block size to HDFS block size, number-threads = approx concurrency you expect on your cluster, and memory total size the avg size of each work load. sysbench --test=memory --memory-block-size=128M --memory-oper=read --num-threads=4 --memory-total-size=10G run To test write performance I set memory block size to HDFS block size, number of threads = approx concurrency you expect on your cluster, and memory total size the avg size of each work load. sysbench --test=memory --memory-block-size=128M --memory-oper=write --num-threads=4 --memory-total-size=10G run CPU test Next I grab the CPU performance numbers. This test consists in calculation of prime numbers up to a value specified
by the --cpu-max-primes option. I set the number of threads = approx concurrency you expect on your cluster. sysbench --test=cpu --cpu-max-prime=20000 --num-threads=2 run IO test Lastly I fetch the IO performance numbers. When using fileio, you will need to create a set of test files to work on. It is recommended that the size is larger than the available memory to ensure that file caching does not influence the workload too much - https://wiki.gentoo.org/wiki/Sysbench#Using_the_fileio_workload Run this command to prepare a file which is larger then the available memory (Ram) on the box. In this example my box has 128GB of ram. I set the file size to 150G. I named the file here fileio. sysbench --test=fileio --file-total-size=150G prepare Next I run the io test using the file I just created (fileio). file-test-mode is the type of workload to produce. Possible values: seqwr
sequential write
seqrewr
sequential rewrite
seqrd
sequential read
rndrd
random read
rndwr
random write
rndrw
combined random read/write
init-rng - specifies if random numbers generator
should be initialized from timer before the
test start - http://imysql.com/wp-content/uploads/2014/10/sysbench-manual.pdf max-time - is the limit for the total execution time in seconds. 0 means unlimited. be careful. set a limit. max-request - is the limit for the total request. 0 means unlimited sysbench --test=fileio --file-total-size=150G --file-test-mode=rndrw --init-rng=on --max-time=300 --max-requests=0 run
... View more
Labels:
08-25-2016
02:58 PM
@kishore sanchina you will need to use a protocol. If you simply want to "push" local files to nifi, you can use the ListenHTTP processor. Then simply curl the file to nifi.
... View more
08-25-2016
02:08 PM
6 Kudos
In the recent weeks I have tested Hadoop on various IaaS providers in hope to find additional performance insights. BigStep blew away my expectation in terms of Hadoop performance on IaaS. I wanted to take the testing a step further. Lets quantify performance measures by adding nodes to a cluster. Even for a small 1TB data set, would 5 nodes perform far greater then 3? I have heard a few times when it comes to small datasets adding more nodes may not have a impact. So this led me to test a 3 node cluster vs a 5 node cluster using 1TB dataset. Does the extra 2 nodes increase performance with processing and IO? Lets find out. Started the testing with dfsio which is a distributed IO benchmark tool. Here are results: From 3 to 5 data nodes IO read performance increased approx. 36% From 3 to 5 data nodes IO write performance increased approx. 49% With 2 additional data nodes a performance IO throughput of 49%! Wish I had more boxes to play with. Can't image where this would take the measures! Now lets compare running TeraGen on 3 and 5 data nodes. TeraGen is a map/reduce program to generate the data. From 3 to 5 data nodes TeraGen performance increased approx. 65% Now lets compare running TeraSort on 3 and 5 data nodes. TeraSort samples the input data and uses map/reduce to sort the data into a total order. From 3 to 5 data nodes TeraSort performance increased approx. 54%. Now lets compare running TeraValidate on 3 and 5 data nodes. TeraValidate is a map/reduce program that validates the output is sorted. From 3 to 5 data nodes TeraValidate performance increased approx. 64%. DFSIO read/write, TeraGen,TeraSort, andTeraValidate test all experienced minimum 50% performance increase. So the theory of throwing more nodes at hadoop increases performance seems to be justified. And yes that is with a small dataset. You do have to consider your use case before using a blanket statement like that. However the physics and software engineering principles of Hadoop support the idea of horizontal scalability and therefore the test make complete sense to me. Hope this provided some insights in terms of # of nodes to possible performance increase expectations. All my test results are here.
... View more
Labels:
08-16-2016
09:56 PM
4 Kudos
I am a junkie for faster & cheaper data processing. Exactly why I love IaaS. My personal REAL WORLD experience with the typically IaaS providers has been generally slow on performance. Not to say hadoop/hbase/spark/etc jobs will not perform; however, you need to be familiar with what you're getting into and set realistic expectations. Recently I meet the IaaS vendor Their liquid metal offering which provides all the greatness which comes with bare metal on-prem installations but in the cloud. Options for bonded NICs & DAS had me at hello. I decided to run the same performance test I ran on AWS (article here) on bigstep. All the details of the scripts I ran are in that article. Just a quick note - these performance articles do not advocate for or against any specific IaaS provider. Nor does it reflect the HDP software. I simply want to run the repeatable processing test with near/similar IaaS hardware profiles and gather performance statistics. Interrupt the numbers as you wish. 1xMaster Node Hardware Profile CPU: 2 xIntel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz(8 x 2.40 GHz)
RAM: 128 GB DDR3 ECCLocal storage disks: 1 NVMEDisk size: 745 GBNetwork bandwidth: 40 gbps
3xData Nodes Hardware ProfileCPU: 2 xIntel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz(8 x 2.40 GHz)
RAM: 256 GB DDR3 ECCLocal storage disks: 12 HDDDisk size: 1863 GBNetwork bandwidth: 40 gbps Teragen results: 11 Mins 49 Secs I want to remain as objective as possible but WOW. That is simply one of the fastest teragen results I have ever seen. TeraSort results 51 Mins 12 secs Fastest I have seen on the cloud so far. On-prem with 1 additional node I was able to get it down to 40 mins. So 51 mins on 1 less nodes is pretty good. TeraValidate Results 4 mins 42 seconds This again was the faster performance I have seen on 1TB using teravalidate. I hope this helps with some basical insights into similar test I have performed so far on various IaaS providers. In the coming weeks/months I plan on publishing performance test result using azure and GCP.
It is extremely important to understand zero performance tweaking as been done. Nor does this reflect how HDP runs on IaaS providers. This does not reflect anything about the IaaS provider as well. I simply want to run with minimum tweaking teragen/terasort/teravalidate test, with same parameters, and similar hardware profiles and document results. That's it. Keep it simple.
... View more
Labels:
11-16-2017
10:07 PM
Hi Smanjee, I think you missed the right link here: "This is well documented here so no reason for me to regurgitate that."
... View more
07-20-2016
09:46 PM
6 Kudos
HBase is just awesome. Yes I will start with that. HBase tuning like any other service within the ecosystem requires understanding of the configurations and the impact (good or bad) it may have on the service. I will share some simple quick hitters which you can check for to increase performance on your phoenix/hbase cluster
What is your hfile locality? Do a simple check on ambari should show you the percentage. The metric is Regionserver.Server.percentFilesLocal What does this metric mean? More info here. Thanks to @Predrag Minovic it means "That's the ratio of HFiles associated with regions served by an RS having one replica stored on the local Data Node in HDFS. RS can access local files directly from a local disk if short-circuit is enabled. And If you, for example run a RS on a machine without DN, its locality will be zero. You can find more details here. And on HBase Web UI page you can find locality per table." So if your percentage is lower then 70.. time to run major compaction. This will bring your hfile locality back in order How many regions are being hosted per RegionServer? Generally I don't go over 200 regions per RegionServer. I personally set this to 150-175. This allows for failover scenario when RS dies its regions need to be redistributed to available RegionSevers. When this happens the existing RS will take on the additional load until the failed RS is back online. To allow for this failover I don't like to go over 150-175 regions per RS. Others may tell you different. From real work experience I don't go over that limit. Do as you wish. Simple way to check how many regions you have hosted per RS is by going to the HBase Master UI through ambari. On the front page you will find the number:
What are you region sizes? As a general practice I don't go over 10gig region size. Based on your use case it may make sense to increase or decrease this size. As a general starting point 10gig has worked from me. From there you can have at it.
Phoenix - Are you using indexes? Look are your query. Does suffer from terrible performance? You have been told phoenix/hbase queries are extremely fast. First place to look is your where clause. EVERY field in your where clause must be indexed using secondard indexes. More info here. Use global indexes for read heavy use case. Use local indexes for write heavy use cases. Leverage covered index when you are not fetching all the columns from the table. So you may ask what if I have balanced reads/write?. What type of index should I use? Take a look at my post here. Is your cluster sized properly This goes with the theme of regionserver and region size. At the end of it all you may need more nodes. Remember this is not the database we are all familiar with where you just beef up the box. With hbase add more cheap nodes DOES increase performance. It is all in the architecture. What does your GC activity look like when you suffer from slow performance? This one I find many/most don't have a clue. This is very important. Analyze the type of GC activity happening on your hbase region servers. Too much GC? Tune the JVM params. Use G1. etc. How to monitor GC? I wrote a post on it here. I hope this helps with the awesomeness hbase/phoenix provides. This is just the begining. As I engage with many customers I will continue to add more patterns to this article. Now go tune your cluster!
... View more
Labels:
07-22-2016
07:05 AM
you can check this $hadoop dfsadmin -report as below . You can check without "ROOT" user as well. :~> hadoop dfsadmin -report Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it. Configured Capacity: 95930 (87.50 TB)
Present Capacity: 95869819 (87.93 TB) DFS Remaining: 37094235 (33.37 TB) DFS Used: 587755833 (53.56 TB) DFS Used%: 61.31%
Under replicated blocks: 0 Blocks with corrupt replicas: 5 Missing blocks: 0
------------------------------------------------- report: Access denied for user "username". Superuser privilege is required
:~>
... View more
07-09-2016
01:30 AM
6 Kudos
Short Description: Teragen and Terasort Performance testing on AWS Article This article should be used with extreme care. Do not use as benchmark. I performed this test to simply run a quick 1 Terabype teragen test on AWS to determine what type of performance I can get from mapreduce on AWS with VERY LITTLE configuration tweaking/tuning On my github page here you will find the following:
teragen script hadoop,yarn,mapred,capacity scheduler configurations used during testing Hardware: (Master & Datanode) 1 Master, 3 Data nodes d2.4xlarge, 16vCPU, 122GB ram, (max) 12x2000 Storage TeraGen Results: 1hrs, 6mins, 38sec Job Counters: Terasort Results: 1hrs, 34mins, 20sec Teravalidate Results: 25mins, 27sec
... View more
Labels:
07-06-2016
10:28 PM
5 Kudos
This tutorial will show how to export data out of hbase table into csv format. We will use airport data from american statical association available here. Assume you have a sandbox up and running lets start. First ssh into your sandbox and switch user to hdfs sudo su - hdfs Then grab the airport data by issues a wget wget http://stat-computing.org/dataexpo/2009/airports.csv For my example the file is located /home/hdfs/airports.csv Now lets create a hbase table called "airports" with column family "info". Do this in hbase shell Now that the table is created lets load it. Get out of hbase shell. as user hdfs run the following to load the table hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns="HBASE_ROW_KEY,info:iata,info:airport,info:city,info:country,info:lat,info:long" airports hdfs://sandbox.hortonworks.com:/tmp/airports.csv That will kick off map reduce job to load airport table in hbase. once that is done you can do a quick verify in hbase shell by running counts 'airports' You should see 3368 records in the table. Now lets log into pig shell. We will create a variable called airport_data which we will load our hbase table into by issuing: airport_data = LOAD 'hbase://airports'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'info:iata,info:airport,info:city,info:country,info:lat,info:long', '-loadKey true')
AS (iata,airport,city,country,lat,long); Now that we have our data in a variable lets dump it to hdfs using csv format by issuing: store airport_data into 'airportData/export' using PigStorage(','); So we have dumped the export into hdfs directory airportData/export. Lets go view it And there you go. We have loaded data into hbase table. Exported data from the table using pig in csv format. Happy pigging.
... View more
Labels:
07-08-2016
01:28 AM
@slachterman Good catch. fixed.
... View more
- « Previous
- Next »