Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 5430 | 08-12-2016 01:02 PM |
| | 2204 | 08-08-2016 10:00 AM |
| | 2613 | 08-03-2016 04:44 PM |
| | 5517 | 08-03-2016 02:53 PM |
| | 1428 | 08-01-2016 02:38 PM |
06-27-2016
10:27 PM
2 Kudos
Hive transactions? Or a normal insert? An insert doesn't change anything, since a new ORC file will be created and every ORC file has its own bloom filter index. I am pretty sure the same is true for ACID tables as well, since the compactor effectively creates a new ORC file.
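For context (my own illustration, not part of the original reply): bloom filters are attached per column as each ORC file is written, which is why every newly written file carries its own index. Below is a minimal sketch with the ORC core Java API; the path, schema, and values are invented, and in Hive the same effect comes from the table property orc.bloom.filter.columns.

```java
// Minimal sketch with the ORC Java API (orc-core): every file written like this
// carries its own bloom filter index for the listed columns, so an INSERT (or the
// ACID compactor) simply produces another file with a fresh index.
// Path, schema, and values are made-up examples.
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcBloomFilterSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        TypeDescription schema = TypeDescription.fromString("struct<id:bigint,name:string>");

        Writer writer = OrcFile.createWriter(new Path("/tmp/bloom_example.orc"),
                OrcFile.writerOptions(conf)
                        .setSchema(schema)
                        .bloomFilterColumns("id")   // per-file bloom filter on "id"
                        .bloomFilterFpp(0.05));     // false-positive probability

        VectorizedRowBatch batch = schema.createRowBatch();
        LongColumnVector id = (LongColumnVector) batch.cols[0];
        BytesColumnVector name = (BytesColumnVector) batch.cols[1];
        int row = batch.size++;
        id.vector[row] = 42L;
        name.setVal(row, "example".getBytes(StandardCharsets.UTF_8));

        writer.addRowBatch(batch);
        writer.close();
    }
}
```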
06-27-2016
02:41 PM
1 Kudo
Different options. Do you mostly run aggregations, but only on a small subset of columns? Then Hive with ORC: it is a compressed columnar format, so only the columns you need will actually be read. This means you would have to say goodbye to the JSON format, but that is fine as long as your data model is pretty flat (there are lists and arrays in Hive as well). If you also restrict by a column, employ partitioning/sorting/predicate pushdown (see the sketch below). Do you mostly run aggregations on a small number of rows (thousands to millions)? Then HBase/Phoenix sounds like a good choice.
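To make the ORC route concrete (my own sketch, not the original poster's setup): a hedged example over Hive JDBC, with the HiveServer2 URL, credentials, and all table/column names invented. The aggregation only reads the referenced columns, and the partition filter lets Hive prune whole directories.

```java
// Hedged sketch: partitioned ORC table plus an aggregation over Hive JDBC.
// URL, credentials, and all table/column names are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class OrcAggregationSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // Columnar ORC storage: the aggregation below reads only user_id and amount.
            // Partitioning by event_date lets the WHERE clause skip whole partitions.
            stmt.execute("CREATE TABLE IF NOT EXISTS events (" +
                    " user_id BIGINT, amount DOUBLE)" +
                    " PARTITIONED BY (event_date STRING)" +
                    " STORED AS ORC");

            try (ResultSet rs = stmt.executeQuery(
                    "SELECT user_id, SUM(amount) FROM events" +
                    " WHERE event_date = '2016-06-01' GROUP BY user_id")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1) + " -> " + rs.getDouble(2));
                }
            }
        }
    }
}
```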
06-19-2016
02:23 PM
Sounds more like your map task is not very efficient. What are you doing in it? The second thing I could see is the sort memory being too small. But I would mostly look at your map code. http://stackoverflow.com/questions/5839359/java-lang-outofmemoryerror-gc-overhead-limit-exceeded
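If it does turn out to be heap or sort-buffer pressure rather than the map code itself, these are the properties usually involved. A hedged sketch of setting them on a MapReduce job; the sizes below are arbitrary examples, not recommendations.

```java
// Hedged sketch: bumping map-task heap and sort buffer for a MapReduce job.
// The concrete sizes are arbitrary examples, not tuning advice.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapMemoryConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Container size for each map task (MB).
        conf.set("mapreduce.map.memory.mb", "4096");
        // JVM heap inside the container; keep it below the container size.
        conf.set("mapreduce.map.java.opts", "-Xmx3276m");
        // In-memory sort buffer used on the map side of the shuffle (MB).
        conf.set("mapreduce.task.io.sort.mb", "512");

        Job job = Job.getInstance(conf, "example-job");
        // ... set mapper/reducer/input/output as usual, then job.waitForCompletion(true);
    }
}
```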
06-17-2016
08:14 AM
7 Kudos
"How many concurrent queries can be executed on Hive?" There is no hard limit. You have two potential bottlenecks:

a) HiveServer2. I could reach around 20-30 queries per second with the newest HiveServer2 version, but you can add more HiveServers. You should also increase the heap to 8 GB, and you can increase the handler threads a bit.

b) Tez on YARN. I once reached up to 200 concurrent Hive sessions (Application Masters) on a 50-node cluster. But it depends much more on the complexity of the queries you are running: if your queries need 200 containers each on a cluster with 1000 slots, you can only run 5 queries in parallel.

"How to configure Hadoop/Hive for scale of queries hitting it from the API layer?"

a) The most important part is to configure your queries to use as few resources (containers) as possible (partitioning, predicate pushdown, ...).

b) Container reuse. Tez is already configured for session and container reuse, but make sure this works properly. You can also tweak the container release settings in Tez (tez.am.session.min.held-containers, tez.am.container.idle.release-timeout-max.millis, tez.am.container.idle.release-timeout-min.millis).

c) Number of sessions. Essentially each JDBC connection is one session. Or you can initialize sessions and HiveServer2 will distribute queries among them (hive.server2.tez.sessions.per.default.queue, hive.server2.tez.initialize.default.sessions). I prefer to keep the default, but then you need to make sure your number of JDBC connections fits the concurrency of your system.

Always keep an eye on the cluster. Also use the newest version of HDP, or try disabling the timeline server (in the Hive settings, not just switching it off).

"What data access layer then does one use with other platforms from the service layer into Hive/Hadoop? Is it just a JDBC connection at that point or something else?" A JDBC pool with x open connections that pick up queries from a queue is best (see the sketch below). As said, in default mode each JDBC connection will be one Application Master. Also, JDBC is faster than beeline/hive shell when it comes to data transfer.

"How to configure Hadoop/Hive for scale of queries hitting it from the API layer? How many concurrent connections to WebHDFS can be supported?" No idea what you mean by WebHDFS here; Tez reads data directly from HDFS, and that will not be the limit. Your limit will be either IO or, much more likely, CPU power in your cluster, or some configuration bottleneck. Finally, it all depends on your queries and your workload, so nobody can give you a magical recommendation.
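Here is a minimal sketch of the "JDBC pool" idea, assuming a HiveServer2 on localhost:10000; the URL, credentials, pool size, and query are placeholders, not from the original answer. In the default mode described above, the number of open connections is what maps to concurrent Tez sessions (Application Masters), so the pool size effectively caps concurrency from this client. A real deployment would more likely use an existing pool such as HikariCP or DBCP rather than hand-rolling one.

```java
// Hedged sketch: a fixed pool of Hive JDBC connections; in the default setup each
// open connection corresponds to one Tez session (Application Master).
// URL, credentials, pool size, and query are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class HiveConnectionPool {
    private final BlockingQueue<Connection> pool;

    public HiveConnectionPool(String url, int size) throws Exception {
        pool = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            pool.add(DriverManager.getConnection(url, "hive", ""));
        }
    }

    /** Borrow a connection, run the query, count the rows, return the connection. */
    public long countRows(String sql) throws Exception {
        Connection conn = pool.take();
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            long rows = 0;
            while (rs.next()) {
                rows++;
            }
            return rows;
        } finally {
            pool.put(conn);
        }
    }

    public static void main(String[] args) throws Exception {
        HiveConnectionPool hive =
                new HiveConnectionPool("jdbc:hive2://localhost:10000/default", 10);
        System.out.println(hive.countRows("SELECT * FROM some_table LIMIT 100"));
    }
}
```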
06-16-2016
10:18 PM
If the files are small, Pig will group them together; you could disable that if you wanted to (sketched below). Now if you compress them, each compressed file will get one mapper (you mean GZ, not ZIP, I hope; the latter will not work), since they cannot be split. Anyway, if Pig groups them into one map task, it sounds like they are not too big, so the question is why they take so long. I would still look into the ResourceManager logs and see what is going on. You can see how many bytes go in and out of each mapper and check the logs for what is going on.
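For reference (my own sketch, not from the original reply), this is where the small-file grouping is controlled. The input path, split size, and script are invented; setting pig.splitCombination to false is what disables the grouping mentioned above.

```java
// Hedged sketch: controlling Pig's small-file split combining from Java via PigServer.
// Input/output paths and sizes are made up.
import java.util.Properties;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigSplitCombineSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Pig combines small input files into fewer map tasks by default;
        // set this to "false" to give every file its own mapper instead.
        props.setProperty("pig.splitCombination", "true");
        // Upper bound (in bytes) on how much data one combined split may hold.
        props.setProperty("pig.maxCombinedSplitSize", String.valueOf(256L * 1024 * 1024));

        PigServer pig = new PigServer(ExecType.MAPREDUCE, props);
        pig.registerQuery("logs = LOAD '/data/many_small_files' USING PigStorage(',');");
        pig.registerQuery("grp = GROUP logs ALL;");
        pig.registerQuery("cnt = FOREACH grp GENERATE COUNT(logs);");
        pig.store("cnt", "/tmp/line_count");
    }
}
```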
06-16-2016
05:16 PM
That is amazing!
06-16-2016
10:50 AM
What happens in the YARN ResourceManager? Do you see 80 mappers in the YARN application that is kicked off? Have a look at the logs of one of them to see what is going on.
06-15-2016
01:23 PM
4 Kudos
There are pros and cons for both. VMs have a negative impact on performance, so we would normally go for bare metal. MapReduce is good at scaling to lots of disks/processes even on a single data node. However, there are limits on VERY big nodes (there are new Apollo servers with 24 drives): you want to increase the HDFS DataNode memory, and you may have issues with very big block reports being sent around. In that case, logically splitting a node into multiple smaller VMs might solve these issues. But normally I would say go bare metal.
06-14-2016
12:25 PM
1 Kudo
Thanks for the Facebook link; I had never used it before. I like the Brickhouse ones: https://github.com/klout/brickhouse
06-14-2016
09:34 AM
@Daniel Perry Oh, one last idea if you use Sqoop: if you cannot use the Sqoop metastore, you might be able to run Sqoop in a shell/ssh action, grep the output of the incremental import (which displays the last value), save it in HDFS, and use it next time (rough sketch below). Of course a bit ugly, but ...
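A rough sketch of that workaround (my own illustration): run the incremental import, grep its console output for the --last-value hint Sqoop prints for the next run, and stash it in HDFS. The connection string, table, check column, HDFS path, and the exact log wording are assumptions that may vary with your Sqoop version.

```java
// Rough sketch of the workaround above: run a sqoop incremental import, grep its
// output for the reported last value, and store it in HDFS for the next run.
// The sqoop arguments, HDFS path, and "--last-value" log format are assumptions.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SqoopLastValueCapture {
    public static void main(String[] args) throws Exception {
        // args[0] holds the last value captured by the previous run.
        ProcessBuilder pb = new ProcessBuilder(
                "sqoop", "import",
                "--connect", "jdbc:mysql://dbhost/db",
                "--table", "orders",
                "--incremental", "append",
                "--check-column", "id",
                "--last-value", args[0]);
        pb.redirectErrorStream(true);
        Process p = pb.start();

        String lastValue = null;
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(line);
                // Sqoop echoes the value to supply to the next incremental run.
                if (line.contains("--last-value")) {
                    lastValue = line.substring(
                            line.indexOf("--last-value") + "--last-value".length()).trim();
                }
            }
        }
        p.waitFor();

        if (lastValue != null) {
            FileSystem fs = FileSystem.get(new Configuration());
            try (OutputStreamWriter out = new OutputStreamWriter(
                    fs.create(new Path("/tmp/sqoop_last_value"), true))) {
                out.write(lastValue);
            }
        }
    }
}
```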