Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 5430 | 08-12-2016 01:02 PM |
| | 2204 | 08-08-2016 10:00 AM |
| | 2613 | 08-03-2016 04:44 PM |
| | 5517 | 08-03-2016 02:53 PM |
| | 1428 | 08-01-2016 02:38 PM |
06-27-2016
10:27 PM
2 Kudos
Hive transactions? Or a normal insert? An insert doesn't change anything, since a new ORC file will be created and every ORC file has its own bloom filter index. I am pretty sure the same is true for ACID tables as well, since the compactor effectively creates a new ORC file.
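For context (my own illustration, not part of the original reply): bloom filters are attached per column as each ORC file is written, which is why every newly written file carries its own index. Below is a minimal sketch with the ORC core Java API; the path, schema, and values are invented, and in Hive the same effect comes from the table property orc.bloom.filter.columns.

```java
// Minimal sketch with the ORC Java API (orc-core): every file written like this
// carries its own bloom filter index for the listed columns, so an INSERT (or the
// ACID compactor) simply produces another file with a fresh index.
// Path, schema, and values are made-up examples.
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcBloomFilterSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        TypeDescription schema = TypeDescription.fromString("struct<id:bigint,name:string>");

        Writer writer = OrcFile.createWriter(new Path("/tmp/bloom_example.orc"),
                OrcFile.writerOptions(conf)
                        .setSchema(schema)
                        .bloomFilterColumns("id")   // per-file bloom filter on "id"
                        .bloomFilterFpp(0.05));     // false-positive probability

        VectorizedRowBatch batch = schema.createRowBatch();
        LongColumnVector id = (LongColumnVector) batch.cols[0];
        BytesColumnVector name = (BytesColumnVector) batch.cols[1];
        int row = batch.size++;
        id.vector[row] = 42L;
        name.setVal(row, "example".getBytes(StandardCharsets.UTF_8));

        writer.addRowBatch(batch);
        writer.close();
    }
}
```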
06-27-2016
02:41 PM
1 Kudo
Different options. Do you mostly run aggregations, but only on a small subset of columns? Then Hive with ORC: it is a compressed columnar format, so only the columns you need will actually be read. This means you would have to say goodbye to the JSON format, but that is fine as long as your data model is pretty flat (there are lists and arrays in Hive as well). If you also restrict by a column, employ partitioning/sorting/predicate pushdown (see the sketch below). Do you mostly run aggregations on a small number of rows (thousands to millions)? Then HBase/Phoenix sounds like a good choice.
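To make the ORC route concrete (my own sketch, not the original poster's setup): a hedged example over Hive JDBC, with the HiveServer2 URL, credentials, and all table/column names invented. The aggregation only reads the referenced columns, and the partition filter lets Hive prune whole directories.

```java
// Hedged sketch: partitioned ORC table plus an aggregation over Hive JDBC.
// URL, credentials, and all table/column names are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class OrcAggregationSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // Columnar ORC storage: the aggregation below reads only user_id and amount.
            // Partitioning by event_date lets the WHERE clause skip whole partitions.
            stmt.execute("CREATE TABLE IF NOT EXISTS events (" +
                    " user_id BIGINT, amount DOUBLE)" +
                    " PARTITIONED BY (event_date STRING)" +
                    " STORED AS ORC");

            try (ResultSet rs = stmt.executeQuery(
                    "SELECT user_id, SUM(amount) FROM events" +
                    " WHERE event_date = '2016-06-01' GROUP BY user_id")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1) + " -> " + rs.getDouble(2));
                }
            }
        }
    }
}
```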
06-19-2016
02:23 PM
Sounds more like your map task is not very efficient. What are you doing in it? The second thing I could see is the sort memory being too small. But I would mostly look at your map code. http://stackoverflow.com/questions/5839359/java-lang-outofmemoryerror-gc-overhead-limit-exceeded
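If it does turn out to be heap or sort-buffer pressure rather than the map code itself, these are the properties usually involved. A hedged sketch of setting them on a MapReduce job; the sizes below are arbitrary examples, not recommendations.

```java
// Hedged sketch: bumping map-task heap and sort buffer for a MapReduce job.
// The concrete sizes are arbitrary examples, not tuning advice.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapMemoryConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Container size for each map task (MB).
        conf.set("mapreduce.map.memory.mb", "4096");
        // JVM heap inside the container; keep it below the container size.
        conf.set("mapreduce.map.java.opts", "-Xmx3276m");
        // In-memory sort buffer used on the map side of the shuffle (MB).
        conf.set("mapreduce.task.io.sort.mb", "512");

        Job job = Job.getInstance(conf, "example-job");
        // ... set mapper/reducer/input/output as usual, then job.waitForCompletion(true);
    }
}
```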
06-17-2016
08:14 AM
7 Kudos
"How many concurrent queries can be executed on Hive?" There is no hard limit. You have two potential bottlenecks:

a) HiveServer2. I could reach around 20-30 queries per second with the newest HiveServer2 version, but you can add more HiveServers. You should also increase the heap to 8 GB, and you can increase the handler threads a bit.

b) Tez on YARN. I once reached up to 200 concurrent Hive sessions (Application Masters) on a 50-node cluster. But it depends much more on the complexity of the queries you are running: if your queries need 200 containers each on a cluster with 1000 slots, you can only run 5 queries in parallel.

"How to configure Hadoop/Hive for scale of queries hitting it from the API layer?"

a) The most important part is to configure your queries to use as few resources (containers) as possible (partitioning, predicate pushdown, ...).

b) Container reuse. Tez is already configured for session and container reuse, but make sure this works properly. You can also tweak the container release settings in Tez (tez.am.session.min.held-containers, tez.am.container.idle.release-timeout-max.millis, tez.am.container.idle.release-timeout-min.millis).

c) Number of sessions. Essentially each JDBC connection is one session. Or you can initialize sessions and HiveServer2 will distribute queries among them (hive.server2.tez.sessions.per.default.queue, hive.server2.tez.initialize.default.sessions). I prefer to keep the default, but then you need to make sure your number of JDBC connections fits the concurrency of your system.

Always keep an eye on the cluster. Also use the newest version of HDP, or try disabling the timeline server (in the Hive settings, not just switching it off).

"What data access layer then does one use with other platforms from the service layer into Hive/Hadoop? Is it just a JDBC connection at that point or something else?" A JDBC pool with x open connections that pick up queries from a queue is best (see the sketch below). As said, in default mode each JDBC connection will be one Application Master. Also, JDBC is faster than beeline/hive shell when it comes to data transfer.

"How to configure Hadoop/Hive for scale of queries hitting it from the API layer? How many concurrent connections to WebHDFS can be supported?" No idea what you mean by WebHDFS here; Tez reads data directly from HDFS, and that will not be the limit. Your limit will be either IO or, much more likely, CPU power in your cluster, or some configuration bottleneck. Finally, it all depends on your queries and your workload, so nobody can give you a magical recommendation.
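Here is a minimal sketch of the "JDBC pool" idea, assuming a HiveServer2 on localhost:10000; the URL, credentials, pool size, and query are placeholders, not from the original answer. In the default mode described above, the number of open connections is what maps to concurrent Tez sessions (Application Masters), so the pool size effectively caps concurrency from this client. A real deployment would more likely use an existing pool such as HikariCP or DBCP rather than hand-rolling one.

```java
// Hedged sketch: a fixed pool of Hive JDBC connections; in the default setup each
// open connection corresponds to one Tez session (Application Master).
// URL, credentials, pool size, and query are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class HiveConnectionPool {
    private final BlockingQueue<Connection> pool;

    public HiveConnectionPool(String url, int size) throws Exception {
        pool = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            pool.add(DriverManager.getConnection(url, "hive", ""));
        }
    }

    /** Borrow a connection, run the query, count the rows, return the connection. */
    public long countRows(String sql) throws Exception {
        Connection conn = pool.take();
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            long rows = 0;
            while (rs.next()) {
                rows++;
            }
            return rows;
        } finally {
            pool.put(conn);
        }
    }

    public static void main(String[] args) throws Exception {
        HiveConnectionPool hive =
                new HiveConnectionPool("jdbc:hive2://localhost:10000/default", 10);
        System.out.println(hive.countRows("SELECT * FROM some_table LIMIT 100"));
    }
}
```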
06-16-2016
10:18 PM
If the files are small, Pig will group them together; you could disable that if you wanted to (sketched below). Now if you compress them, each compressed file will get one mapper (you mean GZ, not ZIP, I hope; the latter will not work), since they cannot be split. Anyway, if Pig groups them into one map task, it sounds like they are not too big, so the question is why they take so long. I would still look into the ResourceManager logs and see what is going on. You can see how many bytes go in and out of each mapper and check the logs for what is going on.
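For reference (my own sketch, not from the original reply), this is where the small-file grouping is controlled. The input path, split size, and script are invented; setting pig.splitCombination to false is what disables the grouping mentioned above.

```java
// Hedged sketch: controlling Pig's small-file split combining from Java via PigServer.
// Input/output paths and sizes are made up.
import java.util.Properties;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigSplitCombineSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Pig combines small input files into fewer map tasks by default;
        // set this to "false" to give every file its own mapper instead.
        props.setProperty("pig.splitCombination", "true");
        // Upper bound (in bytes) on how much data one combined split may hold.
        props.setProperty("pig.maxCombinedSplitSize", String.valueOf(256L * 1024 * 1024));

        PigServer pig = new PigServer(ExecType.MAPREDUCE, props);
        pig.registerQuery("logs = LOAD '/data/many_small_files' USING PigStorage(',');");
        pig.registerQuery("grp = GROUP logs ALL;");
        pig.registerQuery("cnt = FOREACH grp GENERATE COUNT(logs);");
        pig.store("cnt", "/tmp/line_count");
    }
}
```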
06-16-2016
05:16 PM
That is amazing!
06-16-2016
10:50 AM
What happens in the YARN ResourceManager? Do you see 80 mappers in the YARN application that is kicked off? Have a look at the logs of one of them to see what is going on.
06-15-2016
01:23 PM
4 Kudos
There are pros and cons for both. VMs have a negative impact on performance, so we would normally go for bare metal. MapReduce is good at scaling to lots of disks/processes even on a single data node. However, there are limits on VERY big nodes (there are new Apollo servers with 24 drives): you want to increase the HDFS DataNode memory, and you may have issues with very big block reports being sent around. In that case, logically splitting a node into multiple smaller VMs might solve these issues. But normally I would say go bare metal.
06-14-2016
12:25 PM
1 Kudo
Thanks for the Facebook link; I had never used it before. I like the Brickhouse ones: https://github.com/klout/brickhouse
06-14-2016
09:34 AM
@Daniel Perry Oh, one last idea if you use Sqoop: if you cannot use the Sqoop metastore, you might be able to run Sqoop in a shell/ssh action, grep the output of the incremental import (which displays the last value), save it in HDFS, and use it next time (rough sketch below). Of course a bit ugly, but ...
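A rough sketch of that workaround (my own illustration): run the incremental import, grep its console output for the --last-value hint Sqoop prints for the next run, and stash it in HDFS. The connection string, table, check column, HDFS path, and the exact log wording are assumptions that may vary with your Sqoop version.

```java
// Rough sketch of the workaround above: run a sqoop incremental import, grep its
// output for the reported last value, and store it in HDFS for the next run.
// The sqoop arguments, HDFS path, and "--last-value" log format are assumptions.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SqoopLastValueCapture {
    public static void main(String[] args) throws Exception {
        // args[0] holds the last value captured by the previous run.
        ProcessBuilder pb = new ProcessBuilder(
                "sqoop", "import",
                "--connect", "jdbc:mysql://dbhost/db",
                "--table", "orders",
                "--incremental", "append",
                "--check-column", "id",
                "--last-value", args[0]);
        pb.redirectErrorStream(true);
        Process p = pb.start();

        String lastValue = null;
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(line);
                // Sqoop echoes the value to supply to the next incremental run.
                if (line.contains("--last-value")) {
                    lastValue = line.substring(
                            line.indexOf("--last-value") + "--last-value".length()).trim();
                }
            }
        }
        p.waitFor();

        if (lastValue != null) {
            FileSystem fs = FileSystem.get(new Configuration());
            try (OutputStreamWriter out = new OutputStreamWriter(
                    fs.create(new Path("/tmp/sqoop_last_value"), true))) {
                out.write(lastValue);
            }
        }
    }
}
```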