Member since: 05-30-2018
Posts: 1322
Kudos Received: 715
Solutions: 148
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4058 | 08-20-2018 08:26 PM |
| | 1954 | 08-15-2018 01:59 PM |
| | 2380 | 08-13-2018 02:20 PM |
| | 4116 | 07-23-2018 04:37 PM |
| | 5026 | 07-19-2018 12:52 PM |
09-13-2016
07:08 PM
3 Kudos
I read the Atlas HA doc here: http://dev.hortonworks.com.s3.amazonaws.com/HDPDocuments/Ambari-Trunk/bk_Ambari_Users_Guide/content/apache_atlas_high_availability.html

Does Atlas fail over to a secondary metastore? How many metastores are allowed?
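For context, the HA setup described in that doc is driven by a handful of properties in atlas-application.properties, along these lines (the server ids and hostnames below are placeholders I made up, not values from the doc):

```properties
# Enable Atlas server HA and list the participating server instances
atlas.server.ha.enabled=true
atlas.server.ids=id1,id2
atlas.server.address.id1=host1.example.com:21000
atlas.server.address.id2=host2.example.com:21000
# ZooKeeper ensemble used for active/passive leader election
atlas.server.ha.zookeeper.connect=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
```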
Labels:
- Apache Atlas
09-13-2016
03:31 PM
Never mind, I found the issue: HBase on the sandbox must be up and running.
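In case it helps anyone else: besides the Ambari UI, you can check and start HBase on the sandbox through the Ambari REST API. A rough sketch, assuming the HDP sandbox defaults (admin/admin credentials, cluster name "Sandbox", Ambari on port 8080); adjust if yours differ:

```bash
# Check the current state of the HBase service
curl -u admin:admin \
  http://sandbox.hortonworks.com:8080/api/v1/clusters/Sandbox/services/HBASE

# Ask Ambari to start HBase if it is stopped
curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
  -d '{"RequestInfo":{"context":"Start HBase"},"Body":{"ServiceInfo":{"state":"STARTED"}}}' \
  http://sandbox.hortonworks.com:8080/api/v1/clusters/Sandbox/services/HBASE
```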
09-13-2016
03:29 PM
Atlas UI is not coming up on my HDP 2.5 sandbox. I have started Atlas, Log Search, and Kafka. I see the following error in the Atlas log:

Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions:
Wed Aug 24 22:48:43 UTC 2016, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=68415: row 'atlas_titan,,' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=sandbox.hortonworks.com,16020,1470279629513, seqNum=0

Does HBase have to be up and running? Does Atlas use an embedded version or the one I start from Ambari?
09-13-2016
08:52 AM
4 Kudos
I often hear stories of people wanting faster performance from Hadoop and Spark without knowing the basic statistics of their own environment. One of the first questions I ask is whether the hardware can perform at the level being expected. The software is still bound by the physics of the hardware: if your disk IO speed is 10 MB per second, neither Hadoop, Spark, nor any other software will magically make that disk faster. We are bound to the physical limits of the hardware we choose. What makes Hadoop and other distributed processing engines amazing is the ability to add more "cheap" nodes to the cluster to increase performance. However, we should be aware of the maximum throughput per node. This helps level-set expectations before committing to any SLA bound to performance.

Typically I love to use the sysbench tool. SysBench is a modular, multi-threaded benchmark tool for evaluating OS parameters, i.e. CPU, RAM, IO, and mutex performance. I run sysbench before installing any software outside the kernel, and again pre/post Hadoop/Spark upgrades. Upgrades should not have any impact on your OS benchmarks, but my neck is on the line when I commit to an SLA, so I would rather play it safe. I generally wrap the tests below in a shell script for ease of execution (a sketch of such a wrapper is at the end of this article); here I call out each test for clarity.

RAM test

I start with testing RAM performance. This test can benchmark sequential memory reads or writes; I test both. To test read performance, I set the memory block size to the HDFS block size, the number of threads to the approximate concurrency I expect on the cluster, and the total memory size to the average size of each workload:

sysbench --test=memory --memory-block-size=128M --memory-oper=read --num-threads=4 --memory-total-size=10G run

To test write performance, I use the same settings with the operation flipped to write:

sysbench --test=memory --memory-block-size=128M --memory-oper=write --num-threads=4 --memory-total-size=10G run

CPU test

Next I grab the CPU performance numbers. This test calculates prime numbers up to the value specified by the --cpu-max-prime option. Again, I set the number of threads to the approximate concurrency I expect on the cluster:

sysbench --test=cpu --cpu-max-prime=20000 --num-threads=2 run

IO test

Lastly I fetch the IO performance numbers. When using fileio, you need to create a set of test files to work on. It is recommended that the total size be larger than the available memory to ensure that file caching does not influence the workload too much (https://wiki.gentoo.org/wiki/Sysbench#Using_the_fileio_workload). Run this command to prepare test files whose total size is larger than the available RAM on the box. In this example my box has 128 GB of RAM, so I set the file total size to 150G:

sysbench --test=fileio --file-total-size=150G prepare

Next I run the IO test against the files I just prepared. file-test-mode is the type of workload to produce. Possible values:
- seqwr: sequential write
- seqrewr: sequential rewrite
- seqrd: sequential read
- rndrd: random read
- rndwr: random write
- rndrw: combined random read/write
A few other options used in the run command:

- init-rng: specifies whether the random number generator should be initialized from the timer before the test starts (http://imysql.com/wp-content/uploads/2014/10/sysbench-manual.pdf).
- max-time: the limit for total execution time in seconds. 0 means unlimited; be careful and set a limit.
- max-requests: the limit for the total number of requests. 0 means unlimited.

sysbench --test=fileio --file-total-size=150G --file-test-mode=rndrw --init-rng=on --max-time=300 --max-requests=0 run
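As mentioned above, I generally wrap these tests in a shell script. Here is a minimal sketch of such a wrapper, assuming the same sysbench syntax used throughout this article; the thread count, block size, and file size are placeholders to tune for your own hardware:

```bash
#!/usr/bin/env bash
# node_bench.sh: minimal sysbench wrapper sketch (values are placeholders; tune per node)
set -e

THREADS=4        # approximate concurrency expected on the cluster
BLOCK=128M       # HDFS block size
MEM_TOTAL=10G    # average workload size
FILE_TOTAL=150G  # must exceed RAM so the page cache doesn't skew IO numbers

echo "== RAM: sequential read =="
sysbench --test=memory --memory-block-size=$BLOCK --memory-oper=read \
         --num-threads=$THREADS --memory-total-size=$MEM_TOTAL run

echo "== RAM: sequential write =="
sysbench --test=memory --memory-block-size=$BLOCK --memory-oper=write \
         --num-threads=$THREADS --memory-total-size=$MEM_TOTAL run

echo "== CPU: primes up to 20000 =="
sysbench --test=cpu --cpu-max-prime=20000 --num-threads=$THREADS run

echo "== File IO: combined random read/write =="
sysbench --test=fileio --file-total-size=$FILE_TOTAL prepare
sysbench --test=fileio --file-total-size=$FILE_TOTAL --file-test-mode=rndrw \
         --init-rng=on --max-time=300 --max-requests=0 run

# prepare leaves large test files behind; cleanup removes them
sysbench --test=fileio --file-total-size=$FILE_TOTAL cleanup
```

The fileio cleanup step at the end removes the test files that prepare created, so the 150 GB does not linger on disk.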
09-06-2016
11:51 PM
@Rendiyono Wahyu Saputro I recommend looking at Storm vs. Spark in a different manner: if your stream response can handle some latency (as little as half a second), then Spark may be the way to go. This is just my opinion, as Spark Streaming is so darn easy. Storm is a POWERFUL engine with virtually zero latency; it has been clocked at millions of tuples per node per second. So you have to ask yourself whether your use case needs zero latency, or whether you can handle micro-batching (Spark Streaming).
09-02-2016
07:57 PM
@Randy Gelhausen and @ssoldatov thank you for your responses.
09-02-2016
04:41 PM
Does Phoenix update a global index during a bulk load? I am curious whether this is supported and how it works.
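For reference, the kind of bulk load I am asking about is the MapReduce CSV loader, invoked along these lines (the jar path, table name, and input path are placeholders, not my actual values):

```bash
# Hypothetical invocation of the Phoenix CSV bulk load tool; the question is
# whether this also updates EXAMPLE_TABLE's global secondary indexes.
hadoop jar /usr/hdp/current/phoenix-client/phoenix-client.jar \
  org.apache.phoenix.mapreduce.CsvBulkLoadTool \
  --table EXAMPLE_TABLE \
  --input /tmp/example.csv
```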
Labels:
- Apache Phoenix
09-02-2016
05:23 AM
@ARUN If this is for production, then add an HBase Master HA node. For PQS, you can start with an install on one node; once the load increases, you can add more nodes for load balancing.
08-29-2016
05:06 PM
Are Ranger policies enforced during HCatalog API calls? Does it make any difference if I am using an embedded or a remote metastore?
Labels:
- Apache HCatalog
- Apache Ranger
08-29-2016
05:05 PM
Are Ranger policies enforced for the HCatalog embedded metastore? I have one metastore for HiveServer2 and another for HCatalog. If I create a Hive policy via Ranger, will the policy be enforced on both metastores?
Labels:
- Apache HCatalog
- Apache Ranger