Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 5395 | 08-12-2016 01:02 PM |
 | 2200 | 08-08-2016 10:00 AM |
 | 2602 | 08-03-2016 04:44 PM |
 | 5496 | 08-03-2016 02:53 PM |
 | 1418 | 08-01-2016 02:38 PM |
07-25-2016
10:42 AM
1 Kudo
If you use a distribution like HDP you cannot individually upgrade a component; the components are tested and supported together. So if you want a newer version of Hive you would need to upgrade the whole distribution. HDP 2.5, for example, will include a technical preview of Hive 2.0. If you don't care about support, good luck: you can try to install Hive manually, but the problem is that you will need to upgrade Tez manually as well. The hive.apache.org website has instructions on getting it to run under "builds".
07-25-2016
09:37 AM
@Sunile Manjee Short answer: theoretically ORC ALWAYS makes sense, just less so when you read all columns than when you only need a subset of them; with a subset it's no question.
- It is stored in a protobuf-based binary format, so parsing the data is much faster than deserializing strings.
- It enables vectorization: if you aggregate on a column, ORC can read 10000 rows at a time and aggregate them all in one go, which is much better than parsing one row at a time.
- It supports features like predicate pushdown if you have WHERE conditions.
Once you read all columns there is no magic anymore and it will take some time. I would focus on analyzing the query and trying to identify bottlenecks, but my guess is that ORC is still your best bet.
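To make those three points concrete, here is a minimal sketch through the Hive JDBC driver; the connection URL, credentials, table and column names are made up for illustration only.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class OrcExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 URL, credentials, table and column names.
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {

            // Storing the table as ORC means a query only reads the columns it touches.
            stmt.execute("CREATE TABLE IF NOT EXISTS sales_orc "
                    + "(id BIGINT, region STRING, amount DOUBLE) STORED AS ORC");

            // Vectorized execution lets Hive process a batch of rows per column at a time.
            stmt.execute("SET hive.vectorized.execution.enabled=true");

            // The WHERE condition can be pushed down into the ORC reader, so stripes and
            // row groups whose min/max statistics exclude 'EMEA' are skipped entirely.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT region, SUM(amount) FROM sales_orc "
                            + "WHERE region = 'EMEA' GROUP BY region")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
                }
            }
        }
    }
}
```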
07-22-2016
01:17 PM
In your other question it looked like there was simply a bug in the Phoenix part of the HBase installation. Sometimes that happens, and a support ticket would be the place to log it. But I am 99% sure that normally Phoenix gets installed without any yum commands in 2.3 and 2.4.
07-22-2016
01:14 PM
3 Kudos
Phoenix is installed by default in HDP 2.3, at least in my version. It's just a set of libraries in HBase that are always installed, unless I am completely mistaken. Or do you mean the Phoenix Query Server, which gets installed as a client on the nodes? (When you install the cluster you can select it in the window where you also select DataNodes, NodeManagers, etc. If you did not do that, you can install the PQS later on a host from the host page in Ambari.) But the Phoenix libraries should be installed with HBase by default.
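As a hedged sanity check that the Phoenix libraries shipped with HBase are actually usable without any extra installation, you can open a thick-driver JDBC connection from a node that has the phoenix-client jar on its classpath; the ZooKeeper hostnames below are placeholders, and /hbase-unsecure is only the typical znode on an unsecured HDP cluster.

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class PhoenixCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper quorum and znode; adjust to your cluster.
        String url = "jdbc:phoenix:zk1.example.com,zk2.example.com,zk3.example.com"
                + ":2181:/hbase-unsecure";
        try (Connection con = DriverManager.getConnection(url)) {
            // If this succeeds, the Phoenix client libraries and the server-side
            // Phoenix jars bundled with HBase are in place.
            System.out.println("Connected to Phoenix: " + !con.isClosed());
        }
    }
}
```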
07-22-2016
01:06 PM
5 Kudos
1+2) It's simply the way Hadoop works. MapReduce guarantees that the input to the reducers is sorted. There are two reasons for this:
a) By definition a reducer is an operation on a key and ALL values that belong to that key, regardless of which mapper they come from. A reducer could simply read the full input set and build a big hashmap in memory, but that would be ridiculously costly, so the other option is to sort the input dataset. The reducer then simply reads all values for key1, and the moment it sees key2 it knows there will be no more values for key1. So you see, we have to sort the reducer input to enable the reducer function.
b) Sorting the keys gives a lot of nice benefits, like the ability to do a global sort more or less for free.
3) Reducers only merge-sort the input from the different mappers so that they have a globally sorted input list; this is low effort since the input sets are already sorted.
4) "I have seen like nearly 3 times we are doing sorting and sorting is too costly operation." No, you only sort once. The output of the mappers is sorted and the reducers merge-sort the inputs from the mappers; it is a single global sort operation. The mappers "locally" sort their output and the reducer merges these parts together. And as explained above, you HAVE to sort the reducer input for the reducer to work.
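A standard word-count style reducer (class and variable names here are mine, for illustration only) shows why point a) matters: the framework's sort and merge guarantees that all values for one key arrive together, so the reducer never has to buffer the whole input.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // Because the map output was sorted and merge-sorted by key, this iterable
        // contains every value for 'key'; no in-memory hashmap of the input is needed.
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        // Once this method returns, the framework moves on to the next key; it will
        // never deliver another value for this one.
        context.write(key, result);
    }
}
```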
07-22-2016
12:14 PM
@Arunkumar Dhanakumar You can simply compress text files before you upload them; common codecs include gzip, Snappy, and LZO. HDFS does not care. All MapReduce/Hive/Pig jobs support these standard codecs and identify them by their file extension. If you use gzip you just need to make sure that each file is not too big, since it is not splittable, i.e. each gzip file will result in one mapper. You can also compress the output of jobs, so you could run a Pig job that reads the text files and writes them out again; I think you simply need to add the .gz extension to the output name, for example. Again, you need to understand that each part file is then gzipped and will run in one mapper later. LZO, on the other hand, can be made splittable (by indexing the files) but does not compress as well, and Snappy is usually used inside container formats such as SequenceFile or ORC. http://stackoverflow.com/questions/4968843/how-do-i-store-gzipped-files-using-pigstorage-in-apache-pig
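As a sketch of the "compress on the way in" option (the local and HDFS paths below are made up), the Hadoop codec API picks the right codec from the .gz extension; the same GzipCodec class is what a job would pass to FileOutputFormat.setOutputCompressorClass to compress its output.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class GzipUpload {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical target path; the .gz suffix is what downstream jobs use to
        // recognize the codec.
        Path target = new Path("/data/raw/events-0001.txt.gz");
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(target); // resolves to GzipCodec

        // Read the local text file and write it gzipped into HDFS in one pass.
        // Remember: this whole file will later be handled by a single mapper.
        try (InputStream in = new FileInputStream("/tmp/events-0001.txt");
             OutputStream out = codec.createOutputStream(fs.create(target))) {
            IOUtils.copyBytes(in, out, conf);
        }
    }
}
```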
07-22-2016
10:10 AM
4 Kudos
It is a Tez application. Tez sessions stay around for a while to wait for new DAGs (execution graphs); otherwise you would need to create a new session for every query, which adds around 20s to your query time. It is configured via tez.session.am.dag.submit.timeout.secs (normally a couple of minutes).
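For illustration only (the value is arbitrary, and in HDP you would normally change this in tez-site.xml through Ambari rather than in client code), this is the property the post refers to:

```java
import org.apache.hadoop.conf.Configuration;

public class TezSessionIdleTimeout {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Seconds an idle Tez session AM waits for the next DAG before shutting down.
        // 300 is just an example; pick something that matches how bursty your queries are.
        conf.setInt("tez.session.am.dag.submit.timeout.secs", 300);
        System.out.println(conf.get("tez.session.am.dag.submit.timeout.secs"));
    }
}
```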
07-22-2016
09:56 AM
5 Kudos
1. On what basis does the Application Master decide that it needs more containers?
It depends completely on the application master. For example, in Pig/Hive it computes the number of mappers based on input splits (blocks), so if you have a 1GB file with 10 blocks it will ask for 10 map containers. If you then specified 5 reducers, the application master will ask for 5 containers for the reducers. This is a different configuration for each type of YARN workload. One example of "dynamic" requests is Slider, a framework you can "flex" up or down from the command line. But again, in the end the user tells Slider to request more. There is no magic scaling inherent to YARN; it depends on your application.

2. Will each mapper have a separate container?
In classic MapReduce, one map = one container. (Tez, on the other hand, has container reuse, so a container it asked for for a "map" task can then be used for a reducer, for example.) And finally we will soon have LLAP, which can run multiple map/reduce tasks in the same container as threads, similar to a Spark executor. So all is possible.

3. Let's say one mapper launched in a container and completed 20% of the work; if it requires more resources to complete the remaining 80% of the task, how will the resources be allocated, and who will allocate them? If distribution happens between containers, how does it happen?
Again, it depends. MapReduce is simple: it knows in advance how much work there is (the number of blocks), asks for the number of mappers/containers it needs, and distributes the work between them. For the reducers, Hive/Tez for example can compute the number of reducers needed based on the output size of the mappers, but once the reducer stage has started it does not change that anymore. So your question is not really correct.

Summary: You assume YARN would automatically scale containers on demand, but that is not what happens. What really happens is that the different workloads in YARN predict how many containers they need based on file sizes, map output, etc., and then ask for the correct number of containers for a stage. There is normally no scaling up or down within a single task. What is dynamic is YARN providing containers: if an application master asks for 1000 containers and there are only 200 slots, some of them occupied by other tasks, YARN can provide containers to the application master piece by piece. Some application masters, like MapReduce, are fine with that; others, like Spark, will not start processing until all the containers they requested are running at the same time. Again, it depends on the application master. Nothing prohibits an application master from scaling dynamically if it wanted to, but that is not what happens in reality for most workloads like Tez/MapReduce/Spark. The only dynamic scaling I am aware of is in Pig/Hive between stages, where the application master predicts how many containers it needs for the reducer stage based on the size of the map output.
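A minimal classic MapReduce driver sketch (paths are hypothetical, identity mapper/reducer) shows where the two container counts come from in practice: the number of map containers the AM requests falls out of the input splits under /data/input, while the number of reduce containers is simply whatever the job asks for.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ContainerCountDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "container-count-demo");
        job.setJarByClass(ContainerCountDemo.class);

        // The AM derives the number of map containers from the input splits,
        // roughly one per HDFS block of the files under this (hypothetical) path.
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));

        // The number of reduce containers is whatever the job requests explicitly.
        job.setNumReduceTasks(5);

        // Identity mapper/reducer; TextInputFormat emits LongWritable/Text pairs.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```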
07-22-2016
09:45 AM
2 Kudos
First, regarding ORC: it is a column-store format, so it only reads the columns you need. So yes, fewer columns are good and more columns are bad; however, it is still better than a flat file, which reads all columns all the time and is not stored as efficiently (protobuf metadata, vectorized access, ...). But it is not magic. So if you see big performance hits, the question is whether the join order is correct. Normally the CBO already does a decent job of figuring that out if you have statistics, as Constantin says, so that is the first step. The second is to analyze the explain plan and see if it makes sense. Worst case, you could break the query up into multiple pieces with temp tables or WITH statements to see if a different order results in better performance. I am also a fan of checking the execution with hive.tez.exec.print.summary to see if there is a stage that takes a long time and doesn't have enough mappers/reducers, i.e. a bottleneck.
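Those first steps can be scripted; here is a hedged sketch via the Hive JDBC driver, where the URL and table names are hypothetical and the join query is only a stand-in for the real one.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JoinTuningChecks {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 URL and table names.
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {

            // Step 1: give the CBO table and column statistics to plan the join with.
            stmt.execute("ANALYZE TABLE orders_orc COMPUTE STATISTICS");
            stmt.execute("ANALYZE TABLE orders_orc COMPUTE STATISTICS FOR COLUMNS");

            // Step 2: check whether the chosen plan (join order, map vs. shuffle join) makes sense.
            try (ResultSet rs = stmt.executeQuery(
                    "EXPLAIN SELECT c.region, SUM(o.amount) "
                            + "FROM orders_orc o JOIN customers_orc c ON o.cust_id = c.id "
                            + "GROUP BY c.region")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }

            // Step 3: for the real run, have Tez print a per-vertex summary; it shows up in
            // the beeline/HiveServer2 operation log rather than in a result set.
            stmt.execute("SET hive.tez.exec.print.summary=true");
        }
    }
}
```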
07-22-2016
09:15 AM
1 Kudo
Auxlib works; it's the only thing that works consistently for me. Are you using the Hive command line or beeline? Depending on this, you need to put the jars into the auxlib directory of the Hive server or of the Hive client. You also need to restart the Hive server.