Member since: 05-30-2018
Posts: 1322
Kudos Received: 715
Solutions: 148
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4046 | 08-20-2018 08:26 PM |
| | 1946 | 08-15-2018 01:59 PM |
| | 2373 | 08-13-2018 02:20 PM |
| | 4106 | 07-23-2018 04:37 PM |
| | 5012 | 07-19-2018 12:52 PM |
07-21-2016
06:43 PM
I have a few steps in my ETL process. In some steps in the middle of the process I do a join between one large table and another small table. The performance is GREAT (under 2 minutes) when I only use a few columns from the big table. However, once I increase the number of columns (almost all columns from the big table), the performance suffers BIG TIME. The tables are compressed using zlib, and they are partitioned and bucketed. The select statement filters on the partitioned and bucketed fields. Yes, we have experimented with map-side join, bucket join, sort-merge bucket join, etc. So clearly the number of columns selected from ORC has an impact on performance: few columns and performance is awesome; more columns and performance is not so good. Any interesting workarounds? Should I be using a different table format... Avro (not a fan) or SequenceFile...?
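To make the pattern concrete, here is a minimal sketch of the two query shapes described above; the table, column, and partition names are entirely hypothetical:

```sh
# Hypothetical names for illustration only. ORC is columnar, so the first
# query only reads a handful of column streams; the second forces a read of
# nearly every stream in big_table, which matches the slowdown described.
hive -e "
-- fast: only a few columns are pruned out of the big table
SELECT b.id, b.amount, s.label
FROM big_table b JOIN small_table s ON b.id = s.id
WHERE b.dt = '2016-07-01';
"

hive -e "
-- slow: selecting (almost) every column defeats ORC column pruning
SELECT b.*, s.label
FROM big_table b JOIN small_table s ON b.id = s.id
WHERE b.dt = '2016-07-01';
"
```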
Labels:
- Apache Hive
07-21-2016
04:54 AM
@david serafini I can't answer about the impact of the difference. However, I can recommend staying consistent with the supported versions: if a version is not listed, I would not use it. Otherwise you will run into issues, i.e., like the ones you are running into now. Just my .02
07-21-2016
04:42 AM
ok, I understand. I have only seen it implemented like this:

// Instantiate the committer class configured under
// mapred.output.committer.class, defaulting to DirectFileOutputCommitter.
committer = ReflectionUtils.newInstance(
    conf.getClass("mapred.output.committer.class",
        DirectFileOutputCommitter.class,
        org.apache.hadoop.mapred.OutputCommitter.class),
    conf);
07-21-2016
03:34 AM
@Dayou Zhou can you please set this property as well to make it work: mapreduce.use.directfileoutputcommitter=true
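If the job driver goes through ToolRunner/GenericOptionsParser, one hedged way to pass the property per job is the generic -D syntax; the jar, class, and paths below are placeholders, and this assumes the committer class is already on your classpath:

```sh
# Illustrative only: -D properties are only picked up when the driver
# uses ToolRunner/GenericOptionsParser.
hadoop jar my-etl-job.jar com.example.MyDriver \
  -Dmapreduce.use.directfileoutputcommitter=true \
  /user/me/input /user/me/output
```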
07-20-2016
09:46 PM
6 Kudos
HBase is just awesome. Yes, I will start with that. HBase tuning, like tuning any other service in the ecosystem, requires understanding the configurations and the impact (good or bad) each may have on the service. I will share some simple quick hitters you can check to increase performance on your Phoenix/HBase cluster:
1. What is your HFile locality? A simple check in Ambari will show you the percentage; the metric is Regionserver.Server.percentFilesLocal. What does this metric mean? More info here. Thanks to @Predrag Minovic, it means: "That's the ratio of HFiles associated with regions served by an RS having one replica stored on the local Data Node in HDFS. RS can access local files directly from a local disk if short-circuit is enabled. And if you, for example, run a RS on a machine without a DN, its locality will be zero. You can find more details here. And on the HBase Web UI page you can find locality per table." So if your percentage is lower than 70, it is time to run a major compaction (see the command sketches after this list); that will bring your HFile locality back in order.

2. How many regions are being hosted per RegionServer? Generally I don't go over 200 regions per RegionServer; I personally set this to 150-175. This leaves headroom for the failover scenario: when an RS dies, its regions need to be redistributed to the available RegionServers, and the surviving RSs take on the additional load until the failed RS is back online. To allow for this failover I don't like to go over 150-175 regions per RS. Others may tell you different; from real-world experience I don't go over that limit. Do as you wish. A simple way to check how many regions you have hosted per RS is to open the HBase Master UI through Ambari; the number is shown on the front page.
3. What are your region sizes? As a general practice I don't go over a 10 GB region size. Based on your use case it may make sense to increase or decrease this, but as a general starting point 10 GB has worked for me. From there you can have at it.
4. Phoenix: are you using indexes? Look at your query. Does it suffer from terrible performance even though you have been told Phoenix/HBase queries are extremely fast? The first place to look is your WHERE clause: EVERY field in your WHERE clause must be indexed using secondary indexes. More info here. Use global indexes for read-heavy use cases and local indexes for write-heavy use cases, and leverage a covered index when you are not fetching all the columns from the table. So you may ask: what if I have balanced reads/writes? What type of index should I use? Take a look at my post here.

5. Is your cluster sized properly? This goes with the theme of RegionServers and region size. At the end of it all you may need more nodes. Remember, this is not the database we are all familiar with, where you just beef up the box. With HBase, adding more cheap nodes DOES increase performance; it is all in the architecture.

6. What does your GC activity look like when you suffer from slow performance? This one I find many/most don't have a clue about, and it is very important. Analyze the type of GC activity happening on your HBase RegionServers. Too much GC? Tune the JVM params, use G1, etc. How to monitor GC? I wrote a post on it here, and there is a quick jstat example in the sketches below.

I hope this helps with the awesomeness HBase/Phoenix provides. This is just the beginning. As I engage with more customers I will continue to add more patterns to this article. Now go tune your cluster!
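And the hedged command sketches referenced in the items above. Every table name, path, ZooKeeper quorum, and SQL file here is illustrative, and the paths assume typical HDP defaults; adjust everything to your cluster:

```sh
# 1) HFile locality low? Major-compact the affected table (name is made up):
echo "major_compact 'MY_TABLE'" | hbase shell

# 3) Eyeball per-region on-disk sizes; assumes the default HDP hbase.rootdir
#    (/apps/hbase/data) and the default namespace:
hdfs dfs -du -h /apps/hbase/data/data/default/MY_TABLE

# 4) Create a Phoenix global secondary index on a WHERE-clause column.
#    create_index.sql would contain something like:
#    CREATE INDEX my_idx ON my_table (customer_id) INCLUDE (order_total);
/usr/hdp/current/phoenix-client/bin/sqlline.py zk1:2181:/hbase-unsecure create_index.sql

# 6) Sample RegionServer GC utilization every 5 seconds:
jstat -gcutil $(pgrep -f HRegionServer) 5000
```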
07-20-2016
09:34 PM
@Artem Ervits @Mehrdad Niasari I believe we can lose this question. I have opened a new one on the default namespace here.
07-20-2016
07:13 PM
Another option: Ambari has most of the metrics you are looking for. Simply call the Ambari API with curl from your edge node to fetch the stats you need. Here is more on the Ambari API: https://cwiki.apache.org/confluence/display/AMBARI/Ambari+Metrics+API+specification For example: http://<ambari-host>:8080/api/v1/clusters/<cluster>/hosts/<host>/host_components/NAMENODE?fields=metrics/dfs/FSNamesystem/CapacityUsed or you can view all metrics: http://<ambari-host>:8080/api/v1/clusters/<cluster>/hosts/<host>/host_components/NAMENODE?fields=metrics
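A hedged curl sketch of that first call; the host, cluster name, and credentials are placeholders:

```sh
# Placeholder host/cluster/credentials; adjust to your environment.
curl -u admin:admin \
  "http://ambari-host:8080/api/v1/clusters/MyCluster/hosts/nn-host.example.com/host_components/NAMENODE?fields=metrics/dfs/FSNamesystem/CapacityUsed"
```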
07-20-2016
07:08 PM
And one more:

$ hadoop fs -df -h
Filesystem                                 Size     Used  Available  Use%
hdfs://host-192-168-114-48.td.local:8020  7.0 G  467.5 M     18.3 M    7%
07-20-2016
07:07 PM
Report example:

$ sudo -u hdfs hdfs dfsadmin -report
Configured Capacity: 7504658432 (6.99 GB)
Present Capacity: 527142912 (502.72 MB)
DFS Remaining: 36921344 (35.21 MB)
DFS Used: 490221568 (467.51 MB)
DFS Used%: 93.00%
Under replicated blocks: 128
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
-------------------------------------------------
Live datanodes (1):
Name: 192.168.114.48:50010 (host-192-168-114-48.td.local)
Hostname: host-192-168-114-48.td.local
Decommission Status : Normal
Configured Capacity: 7504658432 (6.99 GB)
DFS Used: 490221568 (467.51 MB)
Non DFS Used: 6977515520 (6.50 GB)
DFS Remaining: 36921344 (35.21 MB)
DFS Used%: 6.53%
DFS Remaining%: 0.49%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 2