Member since: 05-30-2018
Posts: 1322
Kudos Received: 715
Solutions: 148
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4046 | 08-20-2018 08:26 PM |
| | 1946 | 08-15-2018 01:59 PM |
| | 2373 | 08-13-2018 02:20 PM |
| | 4106 | 07-23-2018 04:37 PM |
| | 5012 | 07-19-2018 12:52 PM |
07-21-2016
06:43 PM
I have a few steps in my ETL process. In some steps in the middle of the process I do a join between one large table and another small table. The performance is GREAT (under 2 minutes) when I only use a few columns from the big table. However, once I increase the number of columns (almost all columns from the big table), the performance suffers BIG TIME. The tables are compressed using zlib, and they are partitioned and bucketed. The select statement filters on the partitioned and bucketed fields. Yes, we have experimented with map-side join, bucket join, sort-merge bucket join, etc. So clearly the number of columns selected from ORC has an impact on performance: few columns and performance is awesome; more columns and performance is not so good. Any interesting workarounds? Should I be using a different table format... Avro (not a fan) or SequenceFile...?
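To make the pattern concrete, here is a minimal sketch of the two query shapes described above; the table, column, and partition names are entirely hypothetical:

```sh
# Hypothetical names for illustration only. ORC is columnar, so the first
# query only reads a handful of column streams; the second forces a read of
# nearly every stream in big_table, which matches the slowdown described.
hive -e "
-- fast: only a few columns are pruned out of the big table
SELECT b.id, b.amount, s.label
FROM big_table b JOIN small_table s ON b.id = s.id
WHERE b.dt = '2016-07-01';
"

hive -e "
-- slow: selecting (almost) every column defeats ORC column pruning
SELECT b.*, s.label
FROM big_table b JOIN small_table s ON b.id = s.id
WHERE b.dt = '2016-07-01';
"
```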
Labels:
- Apache Hive
07-21-2016
04:54 AM
@david serafini I can't answer about the impact of the difference. However, I can recommend staying consistent with the supported versions: if a version is not listed, I would not use it. Otherwise you will run into issues, i.e., like the ones you are running into now. Just my .02
07-21-2016
04:42 AM
ok, I understand. I have only seen it implemented like this:

// Instantiate the committer class configured under
// mapred.output.committer.class, defaulting to DirectFileOutputCommitter.
committer = ReflectionUtils.newInstance(
    conf.getClass("mapred.output.committer.class",
        DirectFileOutputCommitter.class,
        org.apache.hadoop.mapred.OutputCommitter.class),
    conf);
07-21-2016
03:34 AM
@Dayou Zhou can you please set this property as well to make it work: mapreduce.use.directfileoutputcommitter=true
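If the job driver goes through ToolRunner/GenericOptionsParser, one hedged way to pass the property per job is the generic -D syntax; the jar, class, and paths below are placeholders, and this assumes the committer class is already on your classpath:

```sh
# Illustrative only: -D properties are only picked up when the driver
# uses ToolRunner/GenericOptionsParser.
hadoop jar my-etl-job.jar com.example.MyDriver \
  -Dmapreduce.use.directfileoutputcommitter=true \
  /user/me/input /user/me/output
```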
07-20-2016
09:46 PM
6 Kudos
HBase is just awesome. Yes, I will start with that. HBase tuning, like tuning any other service in the ecosystem, requires understanding the configurations and the impact (good or bad) each may have on the service. I will share some simple quick hitters you can check to increase performance on your Phoenix/HBase cluster:
1. What is your HFile locality? A simple check in Ambari will show you the percentage; the metric is Regionserver.Server.percentFilesLocal. What does this metric mean? More info here. Thanks to @Predrag Minovic, it means: "That's the ratio of HFiles associated with regions served by an RS having one replica stored on the local Data Node in HDFS. RS can access local files directly from a local disk if short-circuit is enabled. And if you, for example, run a RS on a machine without a DN, its locality will be zero. You can find more details here. And on the HBase Web UI page you can find locality per table." So if your percentage is lower than 70, it is time to run a major compaction (see the command sketches after this list); that will bring your HFile locality back in order.

2. How many regions are being hosted per RegionServer? Generally I don't go over 200 regions per RegionServer; I personally set this to 150-175. This leaves headroom for the failover scenario: when an RS dies, its regions need to be redistributed to the available RegionServers, and the surviving RSs take on the additional load until the failed RS is back online. To allow for this failover I don't like to go over 150-175 regions per RS. Others may tell you different; from real-world experience I don't go over that limit. Do as you wish. A simple way to check how many regions you have hosted per RS is to open the HBase Master UI through Ambari; the number is shown on the front page.
3. What are your region sizes? As a general practice I don't go over a 10 GB region size. Based on your use case it may make sense to increase or decrease this, but as a general starting point 10 GB has worked for me. From there you can have at it.
4. Phoenix: are you using indexes? Look at your query. Does it suffer from terrible performance even though you have been told Phoenix/HBase queries are extremely fast? The first place to look is your WHERE clause: EVERY field in your WHERE clause must be indexed using secondary indexes. More info here. Use global indexes for read-heavy use cases and local indexes for write-heavy use cases, and leverage a covered index when you are not fetching all the columns from the table. So you may ask: what if I have balanced reads/writes? What type of index should I use? Take a look at my post here.

5. Is your cluster sized properly? This goes with the theme of RegionServers and region size. At the end of it all you may need more nodes. Remember, this is not the database we are all familiar with, where you just beef up the box. With HBase, adding more cheap nodes DOES increase performance; it is all in the architecture.

6. What does your GC activity look like when you suffer from slow performance? This one I find many/most don't have a clue about, and it is very important. Analyze the type of GC activity happening on your HBase RegionServers. Too much GC? Tune the JVM params, use G1, etc. How to monitor GC? I wrote a post on it here, and there is a quick jstat example in the sketches below.

I hope this helps with the awesomeness HBase/Phoenix provides. This is just the beginning. As I engage with more customers I will continue to add more patterns to this article. Now go tune your cluster!
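And the hedged command sketches referenced in the items above. Every table name, path, ZooKeeper quorum, and SQL file here is illustrative, and the paths assume typical HDP defaults; adjust everything to your cluster:

```sh
# 1) HFile locality low? Major-compact the affected table (name is made up):
echo "major_compact 'MY_TABLE'" | hbase shell

# 3) Eyeball per-region on-disk sizes; assumes the default HDP hbase.rootdir
#    (/apps/hbase/data) and the default namespace:
hdfs dfs -du -h /apps/hbase/data/data/default/MY_TABLE

# 4) Create a Phoenix global secondary index on a WHERE-clause column.
#    create_index.sql would contain something like:
#    CREATE INDEX my_idx ON my_table (customer_id) INCLUDE (order_total);
/usr/hdp/current/phoenix-client/bin/sqlline.py zk1:2181:/hbase-unsecure create_index.sql

# 6) Sample RegionServer GC utilization every 5 seconds:
jstat -gcutil $(pgrep -f HRegionServer) 5000
```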
07-20-2016
09:34 PM
@Artem Ervits @Mehrdad Niasari I believe we can lose this question. I have opened a new one on the default namespace here.
07-20-2016
07:13 PM
Another option: Ambari has most of the metrics you are looking for. Simply call the Ambari API with curl from your edge node to fetch the stats you need. Here is more on the Ambari API: https://cwiki.apache.org/confluence/display/AMBARI/Ambari+Metrics+API+specification For example: http://<ambari-host>:8080/api/v1/clusters/<cluster>/hosts/<host>/host_components/NAMENODE?fields=metrics/dfs/FSNamesystem/CapacityUsed or you can view all metrics: http://<ambari-host>:8080/api/v1/clusters/<cluster>/hosts/<host>/host_components/NAMENODE?fields=metrics
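A hedged curl sketch of that first call; the host, cluster name, and credentials are placeholders:

```sh
# Placeholder host/cluster/credentials; adjust to your environment.
curl -u admin:admin \
  "http://ambari-host:8080/api/v1/clusters/MyCluster/hosts/nn-host.example.com/host_components/NAMENODE?fields=metrics/dfs/FSNamesystem/CapacityUsed"
```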
07-20-2016
07:08 PM
And one more:

$ hadoop fs -df -h
Filesystem                                 Size     Used  Available  Use%
hdfs://host-192-168-114-48.td.local:8020  7.0 G  467.5 M     18.3 M    7%
07-20-2016
07:07 PM
Report example:

$ sudo -u hdfs hdfs dfsadmin -report
Configured Capacity: 7504658432 (6.99 GB)
Present Capacity: 527142912 (502.72 MB)
DFS Remaining: 36921344 (35.21 MB)
DFS Used: 490221568 (467.51 MB)
DFS Used%: 93.00%
Under replicated blocks: 128
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
-------------------------------------------------
Live datanodes (1):
Name: 192.168.114.48:50010 (host-192-168-114-48.td.local)
Hostname: host-192-168-114-48.td.local
Decommission Status : Normal
Configured Capacity: 7504658432 (6.99 GB)
DFS Used: 490221568 (467.51 MB)
Non DFS Used: 6977515520 (6.50 GB)
DFS Remaining: 36921344 (35.21 MB)
DFS Used%: 6.53%
DFS Remaining%: 0.49%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 2