Member since: 09-09-2016
Posts: 15
Kudos Received: 0
Solutions: 0
03-08-2019 06:10 PM
You set it to 24 hours, right? We also set it to 24 hours; at least we do not have hundreds of thousands of disk rowsets yet, but we just started last week.
03-01-2019 12:58 PM
We hit the exact same issue, which was totally unexpected and came a month before we were due to go live. @huaj, how is the fix working for you so far?
02-27-2018 09:03 PM
I wanted to mention my post on this subject: How to run Sqoop from NiFi.
02-27-2018 08:58 PM
Hi @regie canada, check my blog post on this subject: How to run Sqoop from NiFi.
Boris
08-24-2017 05:09 PM
Thanks for your response, @msumbul. I tried to use Sqoop and also the HBase MR export tool, and both were really slow. I am just curious, conceptually, how other companies deal with this, because I have seen it is a very popular design.
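For reference, the HBase MR export tool I mention is the built-in MapReduce Export job. A minimal sketch of how I invoked it; the table name and output path are placeholders, not my actual names:

    # Built-in HBase MapReduce Export job: dumps a table to SequenceFiles on HDFS.
    # "encounter" and the HDFS path are illustrative placeholders.
    hbase org.apache.hadoop.hbase.mapreduce.Export encounter /tmp/hbase-export/encounter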
08-24-2017 10:39 AM
The problem is that when I did the SELECT from HBase to Hive, it would not use many containers. The first time I ran it, it used only 1 mapper. We allocate 1 CPU core and 4 GB of RAM per YARN container and can run ~240 containers at once, but until I did the split in the HBase CLI, it would only use 1 mapper.

Q: Do you mean that when executing the query "CREATE anothertable AS SELECT * FROM hbasetable", the YARN application launched uses only one mapper for processing the query?
A: Exactly.

Q: By the way, which parameters differ between your several tests? The number of mappers shouldn't be that different between executions unless you change something.
A: I ran the split 'mytable' command in the HBase shell. The first time I did the split, it started using 2 mappers. I did a split again, and it started using 5 or 6. I kept splitting and re-running the tests every time 🙂

Q: On the tests with 33 and 158 mappers, how was the CPU behaving on the worker nodes? Were they at 10% CPU usage or 100%?
A: It was barely using the CPUs.

Q: Was there some contention (pending containers)?
A: No, this was the only job running on our cluster, and it was barely using a few containers.

Do you have any benchmarks for a similar type of job? What numbers in terms of rows per second should one expect? Or some examples with larger tables? I cannot find any numbers on the web, just pictures of this design. I am going to Strata next month and I will just stop people and ask them this question 🙂
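For anyone who wants to reproduce the splitting trick, this is roughly what I ran in the HBase shell; 'mytable' and the split key are placeholders:

    # Each split increases the region count, and each region becomes
    # one mapper for the Hive-over-HBase scan.
    split 'mytable'              # let HBase pick the split points
    split 'mytable', 'rowkey42'  # or split at an explicit row key (placeholder)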
08-24-2017 06:02 AM
And I forgot to mention why I need to materialize them: we use Impala a lot for analytical queries and we like to keep our queries running fast. While I can use Impala directly against HBase tables, our typical queries join 7-10 tables, and that does not work well with HBase.
08-24-2017 05:59 AM
Hi Mathieu, thanks for your response. I am just curious, conceptually, how other companies are doing this and what the real-world numbers are for materializing tables from HBase to Hive.

Our test cluster has 6 nodes, each with 88 hyperthreaded cores and 256 GB of RAM. All 6 are workers. When we tested HBase, I think YARN could use 1 TB of RAM total and 400 cores; the rest went to HBase. My test tables were also very small: one was 1.7M rows and the other 22M rows. I would load the initial data with Sqoop directly into HBase (which was also slower, by the way, because it would use a bunch of reducers) and then start pushing incremental changes to HBase with Sqoop. I would then create an external table in Hive pointing at the HBase table directly or at a snapshot (which was 2x faster) and run CREATE anothertable AS SELECT * FROM hbasetable. I noticed it would use only one mapper (!), so I split the table in HBase a few times:

Encounter table, 22.8M rows:
| Test                         | Time                     | Notes                      |
| Initial load, Sqoop >> HBase | 5,237 sec                | Reducers took a while      |
| HBase to Hive, 3 mappers     | 11,700 sec (3.5 hrs)     | Took forever!              |
| HBase to Hive, 33 mappers    | 3,635 sec                | Still a lot, but 3x faster |
| HBase to Hive, 158 mappers   | Killed after 1 hr 40 min |                            |

And these are my tests with the smaller table, checking Hive over HBase snapshots.

Organization table, 1.7M rows:
| Test                                                                                    | Time    |
| No snapshot, count(1)                                                                   | 53 sec  |
| No snapshot, CTAS (create table organization_hbase_copy as select * from organization)  | 148 sec |
| Snapshot, count(1)                                                                      | 43 sec  |
| Snapshot, CTAS (drop table if exists organization_hbase_copy, then recreate as above)   | 63 sec  |

Does this help?
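For context, the Hive side of the test looked roughly like this. A sketch only: the column list, column family, and mappings are illustrative placeholders, not my actual schema:

    -- External Hive table over an existing HBase table, via the Hive HBase
    -- storage handler. Names and column mappings are placeholders.
    CREATE EXTERNAL TABLE encounter (
      rowkey     STRING,
      patient_id STRING,
      visit_date STRING
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
      'hbase.columns.mapping' = ':key,d:patient_id,d:visit_date'
    )
    TBLPROPERTIES ('hbase.table.name' = 'encounter');

    -- Materialize the HBase-backed table into a native Hive table (CTAS).
    -- One mapper per HBase region, hence the splitting experiments above.
    CREATE TABLE encounter_hive_copy AS SELECT * FROM encounter;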
08-23-2017 07:45 PM
One of the popular techniques to upsert (update/delete) data on Hadoop is to use a combination of HBase and Hive. I have seen a bunch of architecture slides from various companies that stream data from Kafka to HBase and then materialize the HBase tables once a day to Hive or Impala. We tested this approach on our 6-node cluster, and while it works, the last piece (persisting the HBase tables to Hive) is extremely slow. For example, one of my tables had 22 million rows, and it took 1 hour (!) to persist that table to Hive using the Hive HBase storage handler. I also checked the Hive-over-HBase-snapshots feature, and it was 2 times faster but still took a long time. Is it supposed to be that slow? It is hard to imagine how this is going to work with billion-row tables...
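For the snapshot variant, the flow was roughly: take a snapshot in the HBase shell, then point the Hive session at it before running the CTAS. A sketch assuming the session settings from the Hive-over-HBase-snapshots feature (HIVE-6584); the exact property names may differ by version, and all names and paths here are placeholders:

    -- In the HBase shell first: snapshot 'encounter', 'encounter_snap'
    -- Then, in Hive, scan the snapshot instead of the live table:
    SET hive.hbase.snapshot.name=encounter_snap;
    SET hive.hbase.snapshot.restoredir=/tmp/hbase-snapshot-restore;
    CREATE TABLE encounter_hive_copy AS SELECT * FROM encounter;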
Labels:
- Apache HBase
- Apache Hive
07-14-2017 09:30 AM
We just learned the hard way that convert_legacy_hive_parquet_utc_timestamps=true may cause huge performance issues; read the comments here: https://issues.apache.org/jira/browse/IMPALA-3316
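For anyone checking their own cluster: in the versions we ran, this is an impalad startup flag, not a per-query option. A sketch of what to look for; how you set impalad flags (Cloudera Manager safety valve, /etc/default/impala, etc.) varies by install:

    # impalad startup flag; true triggers a per-value local-time conversion
    # when reading Parquet timestamps written by Hive, which is what hurt us.
    # false reads those timestamps as UTC and avoids the conversion cost.
    --convert_legacy_hive_parquet_utc_timestamps=false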