Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 7357 | 08-12-2016 01:02 PM |
| | 2708 | 08-08-2016 10:00 AM |
| | 3667 | 08-03-2016 04:44 PM |
| | 7209 | 08-03-2016 02:53 PM |
| | 1863 | 08-01-2016 02:38 PM |
04-25-2016
09:34 AM
Tez tasks are mostly single-threaded; parallelism is achieved by running more tasks, not more threads per task. So increasing the number of cores per task will not help you.
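Since parallelism comes from the task count, the knobs to turn are the ones that control how many tasks Tez creates. A sketch of the usual settings (the parameter names are standard Hive/Tez settings, but the values below are purely illustrative):

```sql
-- Smaller split groups => more mapper tasks (values are examples only)
SET tez.grouping.max-size = 134217728;   -- 128 MB upper bound per task
SET tez.grouping.min-size = 16777216;    -- 16 MB lower bound per task
-- Fewer bytes per reducer => more reducer tasks
SET hive.exec.reducers.bytes.per.reducer = 67108864;
```

Tune these against your data volume rather than adding cores to a single container.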
04-22-2016
05:11 PM
2 Kudos
100,000 partitions would be too much. However, what do you mean by "related files"? A good best practice is to keep partitions under a couple of thousand (until we have the HBase-backed metastore). A date partition would normally stay in that range anyway: even ten years of daily partitions is only 3,650. Just choose the granularity level accordingly.

It's hard to give a general design, since it depends on your queries and data. Lots of partitions can be OK if you always query only ONE partition (the Hive metastore will do a key lookup of the partition). However, if you had tens of thousands of partitions and did a full table scan? Then you run into severe problems, because planning will take a couple of seconds in HiveServer and can fail in Tez because the execution graph gets too big. As usual, these are things that need to be looked at in a (and I hate this word) holistic manner; it is hard to give more than simple standard rules without understanding your data model and the queries you want to run.

The hash approach amalay mentioned can be good as well if you always only need to query one partition. I have seen a setup with two-level partitioning by date and customerid % xxx, which gave Hive the ability to select only a small subset of the data. Hopefully we will have bucket pruning soon, which could be a good alternative. I know I am shameless about it, but for table setup I have made a pretty good reference here; it includes hints for optimal predicate pushdown as well: http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data
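The two-level layout described above can be sketched roughly like this (table, column names, and the modulus are made up for illustration):

```sql
-- Two-level partitioning: date plus a customer hash bucket.
-- Pick the modulus so the total partition count stays in the low thousands.
CREATE TABLE events (
  event_id    BIGINT,
  customer_id BIGINT,
  payload     STRING
)
PARTITIONED BY (event_date INT, customer_bucket INT)
STORED AS ORC;

-- A query filtering on both levels touches only one partition, e.g.:
-- SELECT ... FROM events
-- WHERE event_date = 20160412 AND customer_bucket = <customer_id % modulus>;
```

The bucket value has to be computed on the way in (during the load), so this only pays off when queries can supply both predicates.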
04-22-2016
01:47 PM
1 Kudo
I think you need to look into the HiveServer2 log and see if it gives any additional information. It sounds more like your LDAP configuration is not correct; perhaps the baseDN, the search filter, an SSL requirement, etc. needs changing. You could try ldapsearch to see if you can connect at all.
04-22-2016
08:17 AM
Just a little comment: while in old versions of HDP, Snappy provided performance benefits over ZLIB for ORC files, this is not true anymore. ZLIB has up to three times better compression AND is now as fast as or faster than Snappy for most tables.
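For reference, the codec can be chosen per table via the `orc.compress` table property; a minimal sketch (the table itself is hypothetical):

```sql
-- Explicitly request ZLIB compression for an ORC table
CREATE TABLE sales_orc (
  id     BIGINT,
  amount DOUBLE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB");
```

ZLIB is the default ORC codec anyway, so this only matters when overriding an existing Snappy setting.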
04-21-2016
03:43 PM
Totally agreed: pulling a big table out into the client and doing the diff there will not scale.
04-21-2016
03:38 PM
1 Kudo
I am sure that there is no outage during a major compaction. Compactions write new store files while the old files still exist, and then the files are switched out; I don't think that basic process changes between minor and major compactions. The difference is that major compactions take ALL store files and also remove deleted rows, so they have more impact on the cluster. Sometimes, when all files happen to be selected for a minor compaction, HBase will do a major compaction anyhow. So no: unless an HBase committer jumps in and tells me otherwise, there is no outage during a major compaction. http://www.slideshare.net/cloudera/hbasecon-2013-compaction-improvements-in-apache-hbase
04-21-2016
03:17 PM
So essentially you have three options:

a) You know that you only have data from that one day in the temp table. Then you can just drop the date column completely and do a static partition load, specifying the date in the format you want; essentially you just delete that one column during the insert:

```sql
INSERT OVERWRITE TABLE final PARTITION ( daily_date=20160412 )
SELECT id, name, <all columns but daily_date> FROM staging;
```

b) You have one or two days in there (a mid-day load), so you filter on the daily_date column but otherwise do the same thing. In this case you don't want OVERWRITE, since the edge days would be loaded twice:

```sql
INSERT INTO TABLE final PARTITION ( daily_date=20160412 )
SELECT id, name, <all columns but daily_date> FROM staging WHERE daily_date = "12/04/2016";
INSERT INTO TABLE final PARTITION ( daily_date=20160411 )
SELECT id, name, <all columns but daily_date> FROM staging WHERE daily_date = "11/04/2016";
```

c) You use dynamic partitioning and let Hive figure it out:

```sql
INSERT INTO TABLE final PARTITION ( daily_date )
SELECT id, name, <all columns but daily_date>, regex... AS daily_date FROM staging;
```

The only thing is that your partition column needs to be last. (Also, I like ints instead of strings; i.e. 20160412 as an integer makes for easier computations like day + 7 or something.) You also need to enable some parameters for dynamic partitioning: http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data
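The dynamic-partitioning parameters mentioned above are typically the following (the limits shown are illustrative; tune them to how many partitions one load creates):

```sql
-- Enable dynamic partitioning (needed for the dynamic-partition insert)
SET hive.exec.dynamic.partition = true;
-- nonstrict allows every partition column to be dynamic
SET hive.exec.dynamic.partition.mode = nonstrict;
-- Raise the limits if a single statement creates many partitions
SET hive.exec.max.dynamic.partitions = 10000;
SET hive.exec.max.dynamic.partitions.pernode = 1000;
```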
04-21-2016
01:16 PM
2 Kudos
I am pretty sure that it is not correct that HBase is blocked during major compactions. I tried to find a definitive statement and didn't find any, but I am very sure that you can still read from and write to a region during a major compaction. However, there will be a heavy impact on IO and CPU on the region servers as the store files are rewritten, which is why major compactions are normally scheduled during the night. If I am wrong on this one, please clarify. Region splits are essentially immediate, since regions are logically split in two and the files are rewritten during the next compactions; so there may be some impact, but it should be very quick. Region merging is an interesting question; I am not aware of the details of that process.
04-21-2016
01:03 PM
This! The Query Server is essentially built on Apache Calcite and has a JSON API interface. https://phoenix.apache.org/server.html http://calcite.apache.org/avatica/docs/json_reference.html I would think, however, that it's way too much work to write a client using a REST API over this. If you want to create a web application, something like jQuery together with the JDBC client sounds much more appealing. (Or go against HBase directly.)
04-21-2016
11:36 AM
1 Kudo
You cannot overwrite one column; you need to recreate the whole table, like in the CTAS discussion we had. So if your employees table has 10 columns, you need something like:

```sql
INSERT OVERWRITE TABLE employees
SELECT e.<all columns but salary_date>, s.salary_date
FROM employees e JOIN salary s
  ON s.employee_number = e.employee_number;
```

You also need to order the columns correctly, like: select e.id, e.date, s.salary_date, e.name, e.lastname, ...