Member since: 10-18-2017
Posts: 52
Kudos Received: 2
Solutions: 5
My Accepted Solutions
Views | Posted |
---|---|
1131 | 01-27-2022 01:11 AM |
8146 | 05-03-2021 08:03 AM |
4686 | 02-06-2018 02:32 AM |
6163 | 01-26-2018 07:36 AM |
4014 | 01-25-2018 01:29 AM |
01-27-2022
01:11 AM
For future reference: I am on an HBase cluster and also need access to the Hive metastore. It seems that if hive-site.xml contains wrong values, you can get this behavior.
08-13-2021
12:51 AM
Maybe you are still asking for more than what is available? It really depends on what kind of cluster you have. It depends on the following parameters:
1) Cloudera Manager -> YARN -> Configuration -> yarn.nodemanager.resource.memory-mb (amount of physical memory, in MiB, that can be allocated for containers, i.e. all the memory that YARN can use on one worker node)
2) yarn.scheduler.minimum-allocation-mb (container memory minimum: every container gets at least this much memory)
3) yarn.nodemanager.resource.cpu-vcores (container virtual CPU cores)
4) How many worker nodes does the cluster have?
I noticed you are also requesting a lot of cores. Maybe you can try to reduce these a bit? This might also be a bottleneck.
08-12-2021
01:32 AM
I see 2 things that would be good to understand:
1) Why do the YARN containers exceed their size?
2) Why does it not provide the number of executors that you request?

1) It seems like you are exceeding the YARN container size of 10 GB. The executors run in YARN containers. Maybe you need to increase the minimum YARN container size a bit? I think the message suggests the minimum container size for YARN is 10 GB. If you request an 8 GB executor and there is some (2 GB) overhead, it might hit the ceiling of what was assigned to it, and this executor will exit.

2) It looks like your cluster is not capable of providing the requested 10 executors of 8 GB. Other relevant info to share would be: how many nodes do you have, and for each node, how much memory is assigned to YARN, and what is the YARN minimum container size?

Example: suppose the YARN container size is 3 GB, you have 9 nodes, your executor memory is 1 GB, and 10 GB on each node is allocated to YARN. This means each node has enough memory to start 3 containers (3 x 3 GB < 10 GB). Therefore, when dynamic allocation is enabled, it will start 27 executors. Even if you ask for more than these 27, it will only be capable of providing 27. Maybe this helps?
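The arithmetic in the example above can be sketched in a few lines of Java. This is only an illustration using the numbers from the post. The class and method names (`YarnCapacitySketch`, `containersPerNode`, `maxExecutors`, `fitsInContainer`) are hypothetical, and the sketch assumes each executor occupies exactly one YARN container of the given size, ignoring the application master container and any vcore limits.

```java
class YarnCapacitySketch {

    // How many containers of the given size fit into the memory YARN may use on one node.
    // Integer division rounds down, so leftover memory smaller than one container is unused.
    static int containersPerNode(int yarnMemoryPerNodeGb, int containerSizeGb) {
        return yarnMemoryPerNodeGb / containerSizeGb;
    }

    // Upper bound on executors across the cluster (assumption: one executor per container).
    static int maxExecutors(int nodes, int yarnMemoryPerNodeGb, int containerSizeGb) {
        return nodes * containersPerNode(yarnMemoryPerNodeGb, containerSizeGb);
    }

    // True when executor memory plus overhead still fits inside one container.
    static boolean fitsInContainer(int executorMemoryGb, int overheadGb, int containerSizeGb) {
        return executorMemoryGb + overheadGb <= containerSizeGb;
    }

    public static void main(String[] args) {
        // Numbers from the example: 9 nodes, 10 GB for YARN per node, 3 GB containers.
        System.out.println(containersPerNode(10, 3)); // 3
        System.out.println(maxExecutors(9, 10, 3));   // 27
        // 8 GB executor + 2 GB overhead in a 10 GB container: fits, but with no headroom.
        System.out.println(fitsInContainer(8, 2, 10)); // true
    }
}
```

With these inputs, requesting more than 27 executors cannot be satisfied, and an 8 GB executor with 2 GB of overhead sits exactly at the 10 GB ceiling, so any additional memory use would push the container over its limit.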
08-11-2021
07:02 AM
Thank you for this very valuable input! (I had somehow missed the response.) I do indeed see increased latencies, but I see that they should be negligible for hot data. I have observed this, but I think there is a limit to how much data you can keep 'hot'. This depends on a combination of settings at the level of the HBase catalog properties and the HBase cluster. We have also discussed this in the following thread: https://community.cloudera.com/t5/Support-Questions/simplest-method-to-read-a-full-hbase-table-so-it-is-stored/m-p/317194#M227055
It would be very interesting if a more in-depth study were ever conducted and reported, as this is very relevant for applications with HBase as a back-end that require more advanced querying of the data (in my case, aggregations to compute a heatmap from a high volume of data points).
08-11-2021
06:52 AM
Dear experts, I notice that when I try to load HBase data in PySpark, it tells me:
java.io.IOException: Expecting at least one region for table : myhbasetable
at org.apache.hadoop.hbase.mapreduce.MultiTableInputFormatBase.getSplits(MultiTableInputFormatBase.java:195)
at org.locationtech.geomesa.hbase.jobs.GeoMesaHBaseInputFormat.getSplits(GeoMesaHBaseInputFormat.scala:43)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:130)
It looks like it is telling me the table has to have at least some data in at least one region. This is the relevant piece of code: https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/MultiTableInputFormatBase.java
try (Connection conn = ConnectionFactory.createConnection(context.getConfiguration())) {
  while (iter.hasNext()) {
    Map.Entry<TableName, List<Scan>> entry = (Map.Entry<TableName, List<Scan>>) iter.next();
    TableName tableName = entry.getKey();
    List<Scan> scanList = entry.getValue();
    try (Table table = conn.getTable(tableName);
        RegionLocator regionLocator = conn.getRegionLocator(tableName)) {
      RegionSizeCalculator sizeCalculator = new RegionSizeCalculator(
          regionLocator, conn.getAdmin());
      Pair<byte[][], byte[][]> keys = regionLocator.getStartEndKeys();
      for (Scan scan : scanList) {
        if (keys == null || keys.getFirst() == null || keys.getFirst().length == 0) {
          throw new IOException("Expecting at least one region for table : "
              + tableName.getNameAsString());
        }
I can see in the HBase master that this table has data spread out over 4 regions. And in the HBase shell, I can scan the data with no error. This is on HBase 2.1. It seems it is not detecting that there are regions for this table. I wonder what could cause this. Did anyone ever encounter this error?
Labels:
- Apache HBase
- Apache Spark
05-26-2021
09:04 AM
Would you by any chance know of a page that clearly describes the fields shown in the HBase master UI? I find lots of blogs about it, but would be interested in seeing one that brings clarity to all the metrics.
05-26-2021
07:17 AM
Many thanks for this detailed answer, which again far exceeded my expectations. I marked the whole discussion as the correct solution, as it will be of interest to others. I only wanted to add: I confirm that I do see an effect of the row count. If I run my aggregation queries after running the row count MapReduce job, the time improves by a factor of 3-4. Possibly this is the consequence of one of the other settings I apply (like the bloom filter, or something else)?
When I look at the difference between the L1 and L2 cache, I see the following behavior after running the MR job for the row count:
AFTER MR JOB, ON 1 REGION 12K BLOCKS IN L1 CACHE (BLOOM/INDEX)
AFTER MR JOB, ON 1 REGION MANY MORE BLOCKS IN L2 CACHE (DATA)
This shows that most of the caching happens in the L2 cache. According to this article, https://blog.cloudera.com/hbase-blockcache-101/, for me (using CombinedBlockCache) the L1 is used to store index/bloom filter blocks and the L2 is used to store data blocks.
Now when I really use my application (and fire heavy aggregations, which now go much faster thanks to the MR job), I see similar behavior. As I assume the true aggregations make use of caching, I am almost inclined to think that the MR job for the row count does help me fill the necessary L2 cache:
After running my aggregate queries (now faster thanks to MR), the L1 cache is again slightly filled.
After my aggregate queries, the L2 cache is also heavily filled.
05-21-2021
08:23 AM
Thank you for this very useful answer. Some remarks:
1) I have the feeling the row counter does help. When I let it run, the read time for that table afterwards is reduced by a factor of 3. I did not look into the details and would expect it to just count the rows without accessing the data, but my test suggests it does also store some of the data.
2) Isn't there any MapReduce job available that will access all the data once, then? If this were SQL, I would write an ORDER BY or SORT (something that kicks off a full table scan). I was also trying CellCounter, but this overloads my region servers.
3) I have already set IN_MEMORY = true for my tables.
4) For my use case I need to do lots of heavy aggregation queries. It is always on the same couple of tables. I would prefer to have these cached completely. It would be very useful if there were a method to permanently cache them. I guess the best thing I can do is IN_MEMORY = true, but it seems like it doesn't permanently store the data. I have the feeling that after a while the cache is cleaned randomly. I guess this is what you mentioned about the LRU cache doing this... I think I am just using the standard implementation (I only raised hfile.block.cache.size to 0.55 in Cloudera Manager). I did not understand this comment: "We can use RegionServer Grouping for the concerned Table's Region & ensure the BlockCache of the concerned RegionServer Group is used for Selective Tables' Regions, thereby reducing the impact of LRU." So, is there a way to prevent the cache from being cleaned again for these tables? I can tell from the HBase master that the on-heap memory is only partially used (still more than 50% available after I have accessed all my tables once). I have 180 GB available and the tables would fit in there. Too bad I cannot permanently store them in memory.
Relevant settings are:
HBase: hbase.bucketcache.size = 10 GB
HBase: hfile.block.cache.size = 0.55
HBase: Java heap size of the HBase RegionServer = 20 GB
and at the catalog level:
CACHE_DATA_ON_WRITE true
CACHE_INDEX_ON_WRITE true
PREFETCH_BLOCKS_ON_OPEN true
IN_MEMORY true
BLOOMFILTER ROWCOL
5) The article is very interesting; I am still digesting it. One important thing I see is that the cached data is decompressed (I was not aware of this). I see that it is recommended to have block sizes comparable to the row size of the data you are interested in. Otherwise you are caching a majority of data you are not interested in. But I only use this cluster to access the same few tables, all of which I would like to have completely cached, so for me all data is relevant and can be cached, and maybe this will not help me. Finally, I should figure out how to determine the size of 1 row of data.
Once again, your responses are very advanced and helpful, so thank you!
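As a rough check on the settings listed in this post, the cache capacity of a single region server can be estimated with a little arithmetic. The following is a minimal sketch, not an HBase API: the class and method names are hypothetical, and it assumes that the on-heap (L1) block cache is sized as the region server heap times hfile.block.cache.size and that the 10 GB bucket cache (L2) is separate from it.

```java
class BlockCacheSizingSketch {

    // On-heap (L1) block cache in GB: region server heap times the configured heap fraction.
    static double onHeapCacheGb(double heapGb, double blockCacheFraction) {
        return heapGb * blockCacheFraction;
    }

    // Combined L1 + L2 capacity of one region server in GB
    // (assumption: the bucket cache is not carved out of the on-heap cache).
    static double perServerCacheGb(double heapGb, double blockCacheFraction, double bucketCacheGb) {
        return onHeapCacheGb(heapGb, blockCacheFraction) + bucketCacheGb;
    }

    public static void main(String[] args) {
        // Values from the post: 20 GB heap, fraction 0.55, 10 GB bucket cache.
        System.out.printf("L1 per server: %.1f GB%n", onHeapCacheGb(20, 0.55));
        System.out.printf("L1 + L2 per server: %.1f GB%n", perServerCacheGb(20, 0.55, 10));
    }
}
```

Under these assumptions a region server has roughly 11 GB of L1 and 21 GB of combined cache. Since data blocks go to L2 with CombinedBlockCache, the amount of table data that can stay cached is bounded mainly by the bucket cache size multiplied by the number of region servers.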
05-19-2021
08:50 AM
I have the feeling this is a good guess:
hbase org.apache.hadoop.hbase.mapreduce.RowCounter
It counts rows, but I am not entirely sure it will cache everything.
05-19-2021
04:04 AM
Dear Experts, I am interested in accessing all the data in an HBase table once, so that it will be cached. I am using an application that reads from HBase, and I notice it is a lot faster when the data has been read before, so I would like to "preheat" the data: access it all once, so that it is stored in the cache. I am quite familiar with the cluster configuration necessary to enable caching, but wonder what the easiest way to do this is. Is there a command that accesses all the data in all regions once?
Labels:
- Apache HBase