Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
02-08-2016
12:06 PM
1 Kudo
You cannot change the partitioning scheme of an existing table in Hive. Since partitions are mapped to folders in HDFS, changing the scheme would require rewriting the complete dataset. What you need to do is create a new table with the new partitioning scheme, e.g. CREATE TABLE newpartitioning ( columns ... ) PARTITIONED BY ( month INT, day INT ), and then load the data into it from the old table with an INSERT ... SELECT. Loading a large number of partitions at a time can result in bad loading patterns, so be careful and follow the guidelines in my doc: http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data
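A minimal sketch of that reload, assuming the partition columns month and day already exist as regular columns in the old table and using dynamic partitioning (the column names here are only illustrative):

# Hypothetical example: repartition a table by (month, day) using dynamic partitions
hive -e "
CREATE TABLE newpartitioning ( id INT, value STRING )
PARTITIONED BY ( month INT, day INT );

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- the partition columns must come last in the SELECT
INSERT OVERWRITE TABLE newpartitioning PARTITION ( month, day )
SELECT id, value, month, day FROM tablewitholdpartitioning;
"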
02-08-2016
12:02 PM
Everything is installed under /usr/hdp. For example, with my version it's /usr/hdp/2.3.4.0-3485/hive/bin/. Normally you just need to initialize the metastore schema in a database of your choice (MySQL?) using the schematool init commands, and after that point Hive to that store. If your cluster is managed by Ambari you do not change the configuration in hive-site.xml directly, you do it in Ambari (you use HDP, right?). There you can change the database connection settings under the Hive/Config/Advanced section. You cannot change hive-site.xml manually, Ambari would overwrite it.
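A rough sketch of that schema initialization with schematool, assuming a MySQL metastore database and my HDP 2.3.4 path (adjust the version, dbType and credentials to your setup):

# Hypothetical example: initialize and verify the Hive metastore schema in MySQL
cd /usr/hdp/2.3.4.0-3485/hive/bin
./schematool -dbType mysql -initSchema -userName hive -passWord '<hive_db_password>'

# afterwards check that the schema version is readable
./schematool -dbType mysql -info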
02-08-2016
09:20 AM
There are different options. Ambari tells you the version on its About page, but if you are not sure about the agents you might have mixed installations. So the definitive way is to run the following on the Linux hosts as root:
yum list | grep ambari-agent
yum list | grep ambari-server
02-08-2016
09:13 AM
2 Kudos
This occurs if the database is there but somehow corrupted. "Cannot read the schema version" means it cannot find the table entry that contains the Hive schema version. Can you run any query? I wouldn't think so. How about recreating the metastore database and pointing Ambari to the new, correct one? Guidelines below: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_installing_manually_book/content/validate_installation.html
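If it helps, a quick sanity check is whether schematool can still read the schema version at all (a sketch, assuming a MySQL metastore; the path and credentials depend on your installation):

# Hypothetical example: check whether the metastore schema version is readable
/usr/hdp/2.3.4.0-3485/hive/bin/schematool -dbType mysql -info

# if it really is corrupted: create a fresh, empty database in MySQL,
# initialize it, and point the Hive config in Ambari at the new database
/usr/hdp/2.3.4.0-3485/hive/bin/schematool -dbType mysql -initSchema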
02-07-2016
08:12 PM
May I ask why you care? Is there a specific performance problem, or is it just curiosity?
02-07-2016
08:11 PM
1 Kudo
Ok, that was actually interesting, so I had a look into the code. For open source projects the code is always the definitive source: you can find most of it in the MapTask class.
-> The map phase runs; output goes into the sorting collector or the DirectCollector (the latter if there are no reduce tasks).
-> The write already uses the partitioner, i.e. data is partitioned when going into the sorting collector.
-> The sorting collector is a class called MapOutputBuffer. In here we have a combiner class and a sorting class. Data is sorted in a memory buffer and then spilled.
-> First data is sorted using the sort buffer, then written to disc, either directly OR through the CombinerRunner and then written.
-> The combiner is also used in the merge phase, when spilled files are merged together into one output file.
So the correct answer is: they do not happen "after" each other, they happen together. While output data is written it is partitioned, sorted, spilled and combined, then merged and combined again.
Hope that helps. If you want more information just have a look into the code yourself, it's quite readable.
Ben
PS: "So we had 5 datanodes running map tasks. Which node does the partitioning happen on & how many partitions will be created?" The partitioning happens inside each map task, so on whichever node runs that map task. There is one partitioned output file for each map task. Say you have 10 map tasks and 2 reducers: this means there will be 2 output files for each map task, one for each reducer (*). Number of partitions = number of reducers (for each map task). When a reducer spins up it starts downloading the output file for its partition from every map task as they finish and merge-sorts them into one input set. In your example both reducers will each pull 10 datasets (one from each map task) and merge-sort them into a single valid input set.
(*) Actually each map task only writes one file with offsets, to not create too many small files, if I am not mistaken, but that doesn't change the basic functionality.
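If you want to see the partition count for yourself, here is a quick sketch with the bundled wordcount example (the jar path is from my HDP install and may differ on yours): with 2 reducers you get exactly 2 partition files in the output, no matter how many map tasks ran.

# Hypothetical example: number of reduce partitions = number of reducers
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
    wordcount -D mapreduce.job.reduces=2 /tmp/wc_in /tmp/wc_out

# one part file per reducer partition: part-r-00000 and part-r-00001
hdfs dfs -ls /tmp/wc_out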
02-05-2016
10:46 PM
1 Kudo
I think there is a misunderstanding about what YARN does. It doesn't care at all how much memory is actually available on the Linux machines, or about buffers or caches. It only cares about the settings in the YARN configuration. You can check them in Ambari. It is your responsibility to set them correctly so they fit the system. You can find on the YARN page of Ambari:
- The total amount of RAM available to YARN on any one datanode. This is estimated by Ambari during the installation but in the end is your responsibility.
- The minimum size of a container (this is also the common divisor of container sizes).
- The maximum size of a container (normally the YARN maximum is a good idea).
So let's assume you have a 3-node cluster with 32 GB of RAM on each node and the YARN memory has been set to 24 GB (leaving 8 GB for the OS plus HDFS). Let's also assume your minimum container size is 1 GB. This gives you 24 GB * 3 = 72 GB in total for YARN and at most 72 containers. A couple of important things:
- If you set your map settings to 1.5 GB you have at most 36 containers, since YARN only gives out slots in multiples of the minimum (i.e. 2 GB, 3 GB, 4 GB, ...). This is a common problem, so always set your container sizes as a multiple of the minimum.
- If you have only 16 GB on the nodes and you set the YARN memory to 32 GB, YARN will happily bring your system into out-of-memory. It is your responsibility to configure it correctly so it uses the available RAM but not more.
What YARN does is shoot down any task that uses more than its requested amount of RAM, and schedule tasks so they run local to the data, etc.
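To spell out the arithmetic from the example above (just a sketch; plug in your own values of yarn.nodemanager.resource.memory-mb and yarn.scheduler.minimum-allocation-mb):

# Hypothetical example: estimate the maximum number of containers
NODES=3
YARN_MB_PER_NODE=24576    # yarn.nodemanager.resource.memory-mb (24 GB)
MIN_CONTAINER_MB=1024     # yarn.scheduler.minimum-allocation-mb (1 GB)
REQUEST_MB=1536           # e.g. a 1.5 GB map container request

# requests are rounded up to the next multiple of the minimum allocation
ROUNDED_MB=$(( (REQUEST_MB + MIN_CONTAINER_MB - 1) / MIN_CONTAINER_MB * MIN_CONTAINER_MB ))
echo "max containers: $(( NODES * YARN_MB_PER_NODE / ROUNDED_MB ))"    # prints 36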
02-05-2016
12:08 AM
1 Kudo
<action name="load">
<sqoop>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
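For reference, a complete Sqoop action usually looks roughly like the sketch below; the schema version, import command and transition targets are placeholders and not taken from the original workflow:

# Hypothetical sketch of a full Oozie Sqoop action (all values are placeholders)
cat > sqoop_action_snippet.xml <<'EOF'
<action name="load">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <command>import --connect jdbc:mysql://dbhost/mydb --table mytable --target-dir /data/mytable -m 1</command>
    </sqoop>
    <ok to="end"/>
    <error to="kill"/>
</action>
EOF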
02-04-2016
05:39 PM
Saved that as a PDF. I always wanted to look up how they work but never followed through. Thanks a lot.
02-04-2016
05:37 PM
1 Kudo
I learned some things as well through the tutorial. If you want to verify your MapReduce knowledge, the HDP Java developer certification is actually a good thing to do.