Member since: 05-30-2018
Posts: 1322
Kudos Received: 715
Solutions: 148
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4035 | 08-20-2018 08:26 PM |
| | 1935 | 08-15-2018 01:59 PM |
| | 2368 | 08-13-2018 02:20 PM |
| | 4098 | 07-23-2018 04:37 PM |
| | 5004 | 07-19-2018 12:52 PM |
09-13-2016
08:52 AM
4 Kudos
I often hear stories of people wanting faster performance from Hadoop and Spark without knowing the basic statistics of their own environment. One of the first questions I ask is whether the hardware can perform at the level being expected. The software is still bound to the physics of the hardware: if your disk IO speed is 10 MB per second, neither Hadoop/Spark nor any other software will magically make that disk faster. Again, we are bound to the physical limits of the hardware we choose. What makes Hadoop and other distributed processing engines amazing is the ability to add more "cheap" nodes to the cluster to increase performance. However, we should be aware of the maximum throughput per node. This helps level-set expectations before committing to any SLA bound to performance.

Typically I love to use the sysbench tool. SysBench is a modular, multi-threaded benchmark tool for evaluating OS parameters, i.e. CPU, RAM, IO, and mutex performance. I use sysbench prior to installing any software outside the kernel, and before and after Hadoop/Spark upgrades. Pre/post upgrades should not have any impact on your OS benchmarks, but I play it safe. My neck is on the line when I commit to an SLA, so I'd rather play it safe. The tests below I generally wrap in a shell script for ease of execution (a sketch of such a wrapper is included at the end of this article). For this article I call out each test for clarity.

RAM test

I start by testing RAM performance. This test can be used to benchmark sequential memory reads or writes; I test both. To test read performance, I set the memory block size to the HDFS block size, the number of threads to the approximate concurrency you expect on your cluster, and the total memory size to the average size of each workload.

sysbench --test=memory --memory-block-size=128M --memory-oper=read --num-threads=4 --memory-total-size=10G run

To test write performance, I use the same settings with the write operation:

sysbench --test=memory --memory-block-size=128M --memory-oper=write --num-threads=4 --memory-total-size=10G run

CPU test

Next I grab the CPU performance numbers. This test consists of calculating prime numbers up to the value specified by the --cpu-max-prime option. I set the number of threads to the approximate concurrency you expect on your cluster.

sysbench --test=cpu --cpu-max-prime=20000 --num-threads=2 run

IO test

Lastly I fetch the IO performance numbers. When using fileio, you will need to create a set of test files to work on. It is recommended that their total size be larger than the available memory, to ensure that file caching does not influence the workload too much - https://wiki.gentoo.org/wiki/Sysbench#Using_the_fileio_workload. Run this command to prepare the test files, with a total size larger than the available memory (RAM) on the box. In this example my box has 128 GB of RAM, so I set the total file size to 150 GB.

sysbench --test=fileio --file-total-size=150G prepare

Next I run the IO test (--test=fileio) using the files just prepared. file-test-mode is the type of workload to produce. Possible values:
seqwr - sequential write
seqrewr - sequential rewrite
seqrd - sequential read
rndrd - random read
rndwr - random write
rndrw - combined random read/write
init-rng - specifies whether the random number generator should be initialized from the timer before the test starts - http://imysql.com/wp-content/uploads/2014/10/sysbench-manual.pdf
max-time - the limit on total execution time in seconds. 0 means unlimited; be careful and set a limit.
max-requests - the limit on the total number of requests. 0 means unlimited.

sysbench --test=fileio --file-total-size=150G --file-test-mode=rndrw --init-rng=on --max-time=300 --max-requests=0 run
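As mentioned above, I wrap these tests in a shell script for ease of execution. Here is a minimal sketch of such a wrapper, reusing the commands and values from this article; the thread counts, block size, and file size are assumptions to tune for your own hardware.

```bash
#!/usr/bin/env bash
# Sketch of a sysbench wrapper: runs the RAM, CPU, and fileio tests described
# above and saves each test's output. Thread counts and sizes are placeholders.
set -euo pipefail

THREADS=4
OUTDIR=./sysbench-results
mkdir -p "$OUTDIR"

# RAM: sequential read and write throughput
sysbench --test=memory --memory-block-size=128M --memory-oper=read \
  --num-threads=$THREADS --memory-total-size=10G run > "$OUTDIR/memory-read.txt"
sysbench --test=memory --memory-block-size=128M --memory-oper=write \
  --num-threads=$THREADS --memory-total-size=10G run > "$OUTDIR/memory-write.txt"

# CPU: prime-number calculation
sysbench --test=cpu --cpu-max-prime=20000 --num-threads=$THREADS run > "$OUTDIR/cpu.txt"

# IO: prepare test files larger than RAM, run random read/write, then clean up
sysbench --test=fileio --file-total-size=150G prepare
sysbench --test=fileio --file-total-size=150G --file-test-mode=rndrw \
  --init-rng=on --max-time=300 --max-requests=0 run > "$OUTDIR/fileio-rndrw.txt"
sysbench --test=fileio --file-total-size=150G cleanup
```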
... View more
09-06-2016
11:51 PM
@Rendiyono Wahyu Saputro I recommend you look at Storm vs. Spark in a different manner. If your stream response can tolerate some latency (as little as half a second), then Spark may be the way to go. This is just my opinion, as Spark Streaming is so darn easy. Storm is a POWERFUL engine with virtually zero latency; it has been clocked at millions of tuples per second per node. So you have to ask yourself whether your use case needs near-zero latency or can handle micro-batching (Spark Streaming).
... View more
12-30-2016
05:24 AM
When I update a column that is part of an index built with the MR bulk load, the index does not get updated, and I then get the old result when the query condition uses the index. https://issues.apache.org/jira/browse/PHOENIX-2521
... View more
09-06-2016
09:34 AM
1 Kudo
I don't think Ranger policies will be enforced, because HCatalog doesn't use HiveServer2, and the Ranger Hive plugin is supported for HiveServer2 only.
... View more
09-06-2016
09:35 AM
1 Kudo
Ranger policies will be enforced if the call to the metastore goes through HiveServer2.
... View more
09-01-2016
12:58 PM
@Sunile Manjee I understand Knox currently does not support HAWQ, since HAWQ does not expose a web REST API at this moment. HAWQ handles authentication like any other database (Oracle, etc.), or internally.
... View more
08-25-2016
02:58 PM
@kishore sanchina you will need to use a protocol. If you simply want to "push" local files to NiFi, you can use the ListenHTTP processor and then simply curl the file to NiFi.
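For illustration, a minimal sketch of that curl push, assuming a ListenHTTP processor configured with Listening Port 8011 and a Base Path of contentListener (host, port, path, and file name are all assumptions; match them to your processor's configuration):

```bash
# Push a local file to NiFi's ListenHTTP processor with curl.
# Host, port (8011), base path (contentListener), and the file path are
# placeholders; set them to match your ListenHTTP configuration.
curl -v -X POST \
  --data-binary @/path/to/local/file.csv \
  -H "Content-Type: application/octet-stream" \
  http://nifi-host:8011/contentListener
```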
... View more
08-25-2016
02:08 PM
6 Kudos
In recent weeks I have tested Hadoop on various IaaS providers in the hope of finding additional performance insights. BigStep blew away my expectations in terms of Hadoop performance on IaaS, so I wanted to take the testing a step further and quantify the performance gained by adding nodes to a cluster. Even for a small 1 TB data set, would 5 nodes perform far better than 3? I have heard a few times that when it comes to small datasets, adding more nodes may not have an impact. So this led me to test a 3-node cluster vs. a 5-node cluster using a 1 TB dataset. Do the extra 2 nodes increase processing and IO performance? Let's find out.

I started the testing with DFSIO, a distributed IO benchmark tool. Here are the results:

From 3 to 5 data nodes, IO read performance increased approx. 36%.
From 3 to 5 data nodes, IO write performance increased approx. 49%.

With 2 additional data nodes, a 49% gain in IO write throughput! Wish I had more boxes to play with. Can't imagine where this would take the measures!

Now let's compare running TeraGen on 3 and 5 data nodes. TeraGen is a map/reduce program that generates the data. From 3 to 5 data nodes, TeraGen performance increased approx. 65%.

Now let's compare running TeraSort on 3 and 5 data nodes. TeraSort samples the input data and uses map/reduce to sort the data into a total order. From 3 to 5 data nodes, TeraSort performance increased approx. 54%.

Now let's compare running TeraValidate on 3 and 5 data nodes. TeraValidate is a map/reduce program that validates that the output is sorted. From 3 to 5 data nodes, TeraValidate performance increased approx. 64%.

DFSIO read/write, TeraGen, TeraSort, and TeraValidate all saw substantial performance increases, ranging from roughly 36% to 65%. So the theory that throwing more nodes at Hadoop increases performance seems to be justified, and yes, that is with a small dataset. You do have to consider your use case before applying a blanket statement like that. However, the physics and software engineering principles behind Hadoop support the idea of horizontal scalability, and therefore the test results make complete sense to me. Hope this provided some insight into the relationship between the number of nodes and the performance increase you can expect. All my test results are here.
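For reference, the DFSIO and Tera* benchmarks used here ship with Hadoop's MapReduce test and example jars. A minimal sketch of how such a run can be launched, assuming HDP-style jar paths; the file counts, sizes, and HDFS output paths are placeholders to scale up for a real test:

```bash
# DFSIO write then read benchmark (jar path and sizes are assumptions;
# -fileSize is in MB here, so scale nrFiles/fileSize up for a real run).
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar \
  TestDFSIO -write -nrFiles 10 -fileSize 1000
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar \
  TestDFSIO -read -nrFiles 10 -fileSize 1000

# TeraGen / TeraSort / TeraValidate on ~1 TB (10 billion 100-byte rows).
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
  teragen 10000000000 /benchmarks/teragen
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
  terasort /benchmarks/teragen /benchmarks/terasort
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
  teravalidate /benchmarks/terasort /benchmarks/teravalidate
```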
... View more
08-25-2016
02:43 PM
@Rajib Mandal for example, on Red Hat (IdM/FreeIPA) you can run: ipa krbtpolicy-mod theuser --maxlife=3600
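A minimal sketch of setting and then verifying the per-user ticket policy with the FreeIPA CLI; the username and the 3600-second lifetime are just the placeholder values from the example above:

```bash
# Set the maximum Kerberos ticket lifetime (in seconds) for one user.
# "theuser" and 3600 are placeholders; substitute your own values.
ipa krbtpolicy-mod theuser --maxlife=3600

# Display the ticket policy now in effect for that user.
ipa krbtpolicy-show theuser
```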
... View more
01-09-2017
08:57 AM
Please advise whether event-driven scheduling will now be available/implemented in NiFi 1.x and later versions?
... View more