Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 7358 | 08-12-2016 01:02 PM |
| | 2708 | 08-08-2016 10:00 AM |
| | 3674 | 08-03-2016 04:44 PM |
| | 7214 | 08-03-2016 02:53 PM |
| | 1864 | 08-01-2016 02:38 PM |
03-17-2016
11:29 AM
1 Kudo
Hmmm, I suppose it depends on how you see the question. Pretty sure everybody uses generics, but you are right that I haven't implemented a generic class myself in a MapReduce job before. A lot of libraries do it, but you rarely do it yourself during coding. Knowing the reason for the question might help.
03-17-2016
10:57 AM
3 Kudos
- RDBMS: Sqoop
- Log files: NiFi/Flume, or a manual load (hadoop put plus scripts)
- MQ messages/events: NiFi, Storm, Spark Streaming, ...
- Flat text files: Oozie or cron jobs with scripts, NiFi/Flume
03-17-2016
10:52 AM
3 Kudos
You mean normal Java generics? https://en.wikipedia.org/wiki/Generics_in_Java Like ArrayList&lt;String&gt;() generics? In that case I am pretty sure almost every MapReduce job out there uses them. MapReduce can run any Java code that can be executed by the JVM; currently in HDP that's Java 8, so you could use lambda functions as well. And generics have been around forever, so I would hope any well-written library uses them.
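To make this concrete, here is a minimal sketch of plain Java generics plus a Java 8 lambda — the same language features any MapReduce job compiled for Java 8 can use. The `repeat` helper is hypothetical, just an example of the kind of generic utility method many libraries ship:

```java
import java.util.ArrayList;
import java.util.List;

public class GenericsDemo {

    // A hypothetical generic helper: <T> makes it work for any element type.
    static <T> List<T> repeat(T value, int times) {
        List<T> out = new ArrayList<>();
        for (int i = 0; i < times; i++) {
            out.add(value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = repeat("hadoop", 3); // generic method call
        words.removeIf(w -> w.isEmpty());         // Java 8 lambda
        System.out.println(words.size());         // prints 3
    }
}
```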
03-16-2016
05:04 PM
1 Kudo
YARN does not do this; there is no relationship AFAIK. It is done by the Oozie server. If Oozie doesn't do it (which can happen), the jobs run on until they finish.
03-16-2016
02:41 PM
Basically yes. But it is more complex, since it is often hard to predict the size of one join. Normally you have WHERE conditions, multi-level hierarchy joins, etc., so it's hard to say how big a dataset will be during the query, depending on the filter conditions. That's where statistics come in: the CBO is important, and using the ANALYZE statement to gather statistics is important.
03-16-2016
11:40 AM
1 Kudo
Hi Klaus, Hadoop on 1 GB doesn't make much sense. Also, YARN is not really flexible when it comes to memory allocation. The NodeManagers get a maximum amount of memory they can use to schedule YARN tasks (MapReduce, Spark, ...): yarn.nodemanager.resource.memory-mb. And a minimum, which is also the multiple that defines the task slots available: yarn.scheduler.minimum-allocation-mb. So if you have a maximum of 16 GB and a minimum of 1 GB, the NodeManager can give out up to 16 task slots to applications. But AFAIK there is no way to change that dynamically without restarting the NodeManagers.

YARN makes sure no task exceeds the slot it is given, but it doesn't give a damn about operating-system limits. So if you give your VM 4 GB and set the YARN memory to 32 GB, YARN will happily schedule tasks until your system goes to its knees. You can of course enable swapping, but that will result in bad performance.

So, in summary: flexible memory settings on a NodeManager node are not a good idea.
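The slot arithmetic above can be sketched in a few lines. The property names in the comments are the real YARN keys; the numbers are the example values from the text, not defaults:

```java
public class YarnSlots {

    // Every container is sized in multiples of the minimum allocation,
    // so a node can hand out at most nodeMem / minAlloc slots.
    static long maxSlots(long nodeManagerMemMb, long minAllocationMb) {
        return nodeManagerMemMb / minAllocationMb;
    }

    public static void main(String[] args) {
        long nodeMemMb  = 16 * 1024; // yarn.nodemanager.resource.memory-mb
        long minAllocMb = 1024;      // yarn.scheduler.minimum-allocation-mb
        System.out.println(maxSlots(nodeMemMb, minAllocMb)); // prints 16
    }
}
```

Note that nothing in this calculation consults the physical RAM of the box — which is exactly why setting 32 GB on a 4 GB VM over-schedules.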
03-16-2016
11:24 AM
I assume you mean a map-side join in Hive (i.e. the small dataset is replicated to all map tasks and the join is done on the map side, vs. the standard shuffle or distributed join, which distributes both tables around). It's actually easy.

Assume you have one table with 1 TB and one table with 1 MB, and 50 nodes. (I think the small dataset is only copied to the distributed cache once per node, not once per task; I might be wrong.)
- Map-side join: you have to copy the small table to every node: 50 x 1 MB = 50 MB of data is copied across the network.
- Shuffle join: both tables need to be copied once, i.e. 1 TB + 1 MB = 1.00001 TB will be copied over the network.
The map-side join is much better than the shuffle join in this case.

Now assume you have one table with 1 TB and one table with 500 GB:
- Map-side join: the 500 GB table needs to be copied to 50 nodes, so 50 x 500 GB = 25 TB of data is copied over the network.
- Shuffle join: 1.5 TB of data needs to be copied over the network.
So in this case the map-side join is much worse than the shuffle join.

The CBO tries to figure out which one is better. This gets harder because of WHERE conditions etc. The same is true for join types in Pig/Spark etc. You can do the math yourself 🙂
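The two cases above are just arithmetic, so here is the same back-of-the-envelope cost model as code (a simplification — it ignores compression, replication, and partial reads):

```java
public class JoinCost {

    // Map-side join: the small table is shipped once to every node.
    static long mapSideBytes(long smallTableBytes, int nodes) {
        return smallTableBytes * nodes;
    }

    // Shuffle join: both tables cross the network once.
    static long shuffleBytes(long tableA, long tableB) {
        return tableA + tableB;
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024, GB = 1024 * MB, TB = 1024 * GB;

        // Case 1: 1 TB joined with 1 MB on 50 nodes -> map-side wins
        System.out.println(mapSideBytes(1 * MB, 50)
                < shuffleBytes(1 * TB, 1 * MB));        // prints true

        // Case 2: 1 TB joined with 500 GB on 50 nodes -> shuffle wins
        System.out.println(mapSideBytes(500 * GB, 50)
                < shuffleBytes(1 * TB, 500 * GB));      // prints false
    }
}
```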
03-16-2016
11:08 AM
1 Kudo
Not sure what you mean by data pruning. Hive does partition pruning, which is not this. It currently doesn't do bucket pruning, even though I have seen attempts to implement it. What I mean is predicate pushdown, i.e. the ability of ORC files to skip stripes of data based on min/max values and bloom filters. The difference between pruning (happens in the optimizer) and predicate pushdown (happens in the task) is that pushdown still needs to at least schedule the task.

If you bucket by day, you know that most buckets can pretty much be skipped completely, even if their tasks close immediately. However, that often doesn't help, since you want parallel processing (the slowest task defines the query). So a SORT BY during insert is most of the time better. See the PPT I linked to for details.

Regarding the number of buckets, it again depends on what you want. People always want simple rules, but there aren't any; it depends on your data characteristics. For example, a customer id just distributes more or less equally between buckets. You only get advantages during load, since you can decide the number of reducers: if you have 50 slots in YARN, 50 buckets would result in the fastest load. However, if the data volume is too small, that might be bad. Tradeoff. Or if you want to sample by customer id (i.e. only query some customers), buckets might help, depending on how much you want to sample. Or if you want a bucket join, they might help. But again, there is no real rule for the number of buckets; it depends on how many map tasks you want to do the join. Buckets are something that should be done for a concrete problem, not just because you think you should have them. Normally I would not use them.
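The min/max part of predicate pushdown boils down to a range check per stripe. This is a simplified sketch of the idea (the real ORC reader also consults row-group indexes and bloom filters), for a predicate like `day = 17`:

```java
public class StripeSkip {

    // If the wanted value cannot fall inside [stripeMin, stripeMax],
    // the whole stripe is skipped without reading its rows.
    static boolean canSkip(long stripeMin, long stripeMax, long wanted) {
        return wanted < stripeMin || wanted > stripeMax;
    }

    public static void main(String[] args) {
        // Data sorted by day during insert keeps each stripe's range tight:
        System.out.println(canSkip(1, 10, 17));  // prints true  -> skipped
        System.out.println(canSkip(11, 20, 17)); // prints false -> must read
    }
}
```

This is also why an unsorted insert hurts: if every stripe's min/max spans almost the full value range, no stripe can ever be skipped.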
03-15-2016
03:53 PM
6 Kudos
Depends what you are trying to do.

- Load performance: one thing buckets are used for is to increase load performance. When you load data you often do not want one load per mapper (especially for partitioned loads, because this results in small files); buckets are a good way to define the number of reducers running. So if your cluster has 40 task slots and you want the fastest ORC creation performance possible, you would want 40 buckets (or DISTRIBUTE BY and set the number of reducers to 40). http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data
- SELECT performance (predicate pushdown): buckets can help with predicate pushdown, since every row belonging to one value will end up in one bucket. So if you bucket by 31 days and filter for one day, Hive will be able to more or less disregard 30 buckets. Obviously this doesn't have to be good, since you often WANT parallel execution, e.g. for aggregations. So it depends on your query whether this is good. It might be better to sort by day and bucket by something like customer id if you have to have buckets for some of the other reasons.
- Join performance (bucket join): buckets can lead to efficient joins if both joined tables are bucketed on the join key, since then only bucket needs to be joined with bucket. This was big in the old times, but is not that applicable anymore with cost-based optimization in newer Hive versions (the optimizer is already very good at choosing map-side vs. shuffle join, and a bucket join can actually stop it from using the better one).
- Sampling performance: some sample operations can get faster with buckets.

So to summarize: buckets are a bit of an older concept, and I wouldn't use them unless I have a clear case for them. The join argument is not that applicable anymore, and the increased load performance is also not always relevant, since you normally load single partitions, where a map-only load is often best. SELECT pushdown can be enhanced but also hindered, depending on how you do it, and a SORT BY is normally better during load (see the document). And I think sampling is a bit niche. So all in all: avoid them if you don't know too well what you are doing. And regarding size? 1-10 blocks is a pretty good size for Hadoop files, and buckets should not be much smaller (unless you want very fast answer rates on small amounts of data; no rule without exceptions).
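For intuition on why a filter on the bucketing column can disregard the other buckets: Hive assigns a row to a bucket file by hashing the bucketing column modulo the bucket count. This is a simplified sketch (Hive's actual hash function differs per column type, but the modulo step is the same idea):

```java
public class BucketDemo {

    // Simplified bucket assignment: hash the key, mask the sign bit,
    // take it modulo the number of buckets.
    static int bucketFor(Object key, int numBuckets) {
        return (key.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 31; // e.g. bucketed by day of month

        // Every row with the same key lands in the same bucket, so a
        // filter on the bucketing column only touches one of 31 files:
        System.out.println(bucketFor(17, numBuckets)); // prints 17
        System.out.println(bucketFor(17, numBuckets)
                == bucketFor(17, numBuckets));          // prints true
    }
}
```

The same property is what makes bucket joins possible: matching keys from two tables bucketed the same way are guaranteed to sit in buckets with the same number.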
03-15-2016
10:51 AM
1 Kudo
One little extra comment: you do not need any fencing method for the failover itself. The QJM and ZooKeeper quorums make sure only the active NameNode can write to the shared edit log. However, it is possible that a zombie active NameNode might still serve outdated read-only requests to connected clients. That's where fencing comes in. If configured, the failover will wait for the fencing method to return success, so you need to be sure that your method does not block (by configuring a timeout in your ssh action, for example) and that in the end it returns success. I.e. either use a script that returns success in any case, or have multiple non-blocking methods that end with one that returns true in any case.
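One way to wire that up in hdfs-site.xml is the common pattern of trying sshfence first with a connect timeout and ending the list with a shell method that always succeeds (the 30000 ms value here is just an example, not a recommendation):

```xml
<!-- hdfs-site.xml: fencing methods are tried in order during failover. -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence
shell(/bin/true)</value>
</property>
<!-- Keep the ssh attempt from blocking the failover forever. -->
<property>
  <name>dfs.ha.fencing.ssh.connect-timeout</name>
  <value>30000</value>
</property>
```

Because `shell(/bin/true)` always returns success, the fencing step can never hang the failover even if the old active node is unreachable.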