Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 7357 | 08-12-2016 01:02 PM |
| | 2708 | 08-08-2016 10:00 AM |
| | 3667 | 08-03-2016 04:44 PM |
| | 7209 | 08-03-2016 02:53 PM |
| | 1863 | 08-01-2016 02:38 PM |
04-25-2016
09:34 AM
Tez tasks are mostly single-threaded; parallelism is achieved by running more tasks, not more threads per task. So increasing the number of cores per task will not help you.
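Since parallelism comes from the task count, the knobs to turn are the ones that control how many tasks Tez creates. A sketch of the usual settings (the parameter names are standard Hive/Tez settings, but the values below are purely illustrative):

```sql
-- Smaller split groups => more mapper tasks (values are examples only)
SET tez.grouping.max-size = 134217728;   -- 128 MB upper bound per task
SET tez.grouping.min-size = 16777216;    -- 16 MB lower bound per task
-- Fewer bytes per reducer => more reducer tasks
SET hive.exec.reducers.bytes.per.reducer = 67108864;
```

Tune these against your data volume rather than adding cores to a single container.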
04-22-2016
05:11 PM
2 Kudos
100,000 partitions would be too much. However, what do you mean by "related files"? A good best practice is to keep partitions under a couple of thousand (until we have the HBase-backed metastore). A date partition would normally stay in that range anyway: even ten years of daily partitions is only 3,650. Just choose the granularity level accordingly.

It's hard to give a general design, since it depends on your queries and data. Lots of partitions can be OK if you always query only ONE partition (the Hive metastore will do a key lookup of the partition). However, if you had tens of thousands of partitions and did a full table scan? Then you run into severe problems, because planning will take a couple of seconds in HiveServer and can fail in Tez because the execution graph gets too big. As usual, these are things that need to be looked at in a (and I hate this word) holistic manner; it is hard to give more than simple standard rules without understanding your data model and the queries you want to run.

The hash approach amalay mentioned can be good as well if you always only need to query one partition. I have seen a setup with two-level partitioning by date and customerid % xxx, which gave Hive the ability to select only a small subset of the data. Hopefully we will have bucket pruning soon, which could be a good alternative. I know I am shameless about it, but for table setup I have made a pretty good reference here; it includes hints for optimal predicate pushdown as well: http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data
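The two-level layout described above can be sketched roughly like this (table, column names, and the modulus are made up for illustration):

```sql
-- Two-level partitioning: date plus a customer hash bucket.
-- Pick the modulus so the total partition count stays in the low thousands.
CREATE TABLE events (
  event_id    BIGINT,
  customer_id BIGINT,
  payload     STRING
)
PARTITIONED BY (event_date INT, customer_bucket INT)
STORED AS ORC;

-- A query filtering on both levels touches only one partition, e.g.:
-- SELECT ... FROM events
-- WHERE event_date = 20160412 AND customer_bucket = <customer_id % modulus>;
```

The bucket value has to be computed on the way in (during the load), so this only pays off when queries can supply both predicates.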
04-22-2016
01:47 PM
1 Kudo
I think you need to look into the HiveServer2 log and see if it gives any additional information. It sounds more like your LDAP configuration is not correct; perhaps the baseDN, the search filter, an SSL requirement, etc. needs changing. You could try ldapsearch to see if you can connect at all.
04-22-2016
08:17 AM
Just a little comment: while in old versions of HDP, Snappy provided performance benefits over ZLIB for ORC files, this is not true anymore. ZLIB has up to three times better compression AND is now as fast as or faster than Snappy for most tables.
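For reference, the codec can be chosen per table via the `orc.compress` table property; a minimal sketch (the table itself is hypothetical):

```sql
-- Explicitly request ZLIB compression for an ORC table
CREATE TABLE sales_orc (
  id     BIGINT,
  amount DOUBLE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB");
```

ZLIB is the default ORC codec anyway, so this only matters when overriding an existing Snappy setting.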
04-21-2016
03:43 PM
Totally agreed: pulling a big table out into the client and doing the diff there will not scale.
04-21-2016
03:38 PM
1 Kudo
I am sure that there is no outage during a major compaction. Compactions write new store files while the old files still exist, and then the files are switched out; I don't think that basic process changes between minor and major compactions. The difference is that major compactions take ALL store files and also remove deleted rows, so they have more impact on the cluster. Sometimes, when all files happen to be selected for a minor compaction, HBase will do a major compaction anyhow. So no: unless an HBase committer jumps in and tells me otherwise, there is no outage during a major compaction. http://www.slideshare.net/cloudera/hbasecon-2013-compaction-improvements-in-apache-hbase
04-21-2016
03:17 PM
So essentially you have three options:

a) You know that you only have data from that one day in the temp table. Then you can just drop the date column completely and do a static partition load, specifying the date in the format you want; essentially you just delete that one column during the insert:

```sql
INSERT OVERWRITE TABLE final PARTITION ( daily_date=20160412 )
SELECT id, name, <all columns but daily_date> FROM staging;
```

b) You have one or two days in there (a mid-day load), so you filter on the daily_date column but otherwise do the same thing. In this case you don't want OVERWRITE, since the edge days would be loaded twice:

```sql
INSERT INTO TABLE final PARTITION ( daily_date=20160412 )
SELECT id, name, <all columns but daily_date> FROM staging WHERE daily_date = "12/04/2016";
INSERT INTO TABLE final PARTITION ( daily_date=20160411 )
SELECT id, name, <all columns but daily_date> FROM staging WHERE daily_date = "11/04/2016";
```

c) You use dynamic partitioning and let Hive figure it out:

```sql
INSERT INTO TABLE final PARTITION ( daily_date )
SELECT id, name, <all columns but daily_date>, regex... AS daily_date FROM staging;
```

The only thing is that your partition column needs to be last. (Also, I like ints instead of strings; i.e. 20160412 as an integer makes for easier computations like day + 7 or something.) You also need to enable some parameters for dynamic partitioning: http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data
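The dynamic-partitioning parameters mentioned above are typically the following (the limits shown are illustrative; tune them to how many partitions one load creates):

```sql
-- Enable dynamic partitioning (needed for the dynamic-partition insert)
SET hive.exec.dynamic.partition = true;
-- nonstrict allows every partition column to be dynamic
SET hive.exec.dynamic.partition.mode = nonstrict;
-- Raise the limits if a single statement creates many partitions
SET hive.exec.max.dynamic.partitions = 10000;
SET hive.exec.max.dynamic.partitions.pernode = 1000;
```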
04-21-2016
01:16 PM
2 Kudos
I am pretty sure that it is not correct that HBase is blocked during major compactions. I tried to find a definitive statement and didn't find any, but I am very sure that you can still read from and write to a region during a major compaction. However, there will be a heavy impact on IO and CPU on the region servers as the store files are rewritten, which is why major compactions are normally scheduled during the night. If I am wrong on this one, please clarify. Region splits are essentially immediate, since regions are logically split in two and the files are rewritten during the next compactions; so there may be some impact, but it should be very quick. Region merging is an interesting question; I am not aware of the details of that process.
04-21-2016
01:03 PM
This! The Query Server is essentially built on Apache Calcite and has a JSON API interface. https://phoenix.apache.org/server.html http://calcite.apache.org/avatica/docs/json_reference.html I would think, however, that it's way too much work to write a client using a REST API over this. If you want to create a web application, something like jQuery together with the JDBC client sounds much more appealing. (Or go against HBase directly.)
04-21-2016
11:36 AM
1 Kudo
You cannot overwrite one column; you need to recreate the whole table, like in the CTAS discussion we had. So if your employees table has 10 columns, you need something like:

```sql
INSERT OVERWRITE TABLE employees
SELECT e.<all columns but salary_date>, s.salary_date
FROM employees e JOIN salary s
  ON s.employee_number = e.employee_number;
```

You also need to order the columns correctly, like: select e.id, e.date, s.salary_date, e.name, e.lastname, ...