Member since: 04-13-2016
80 Posts
12 Kudos Received
1 Solution
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3175 | 03-17-2017 06:10 PM
08-02-2016
11:41 AM
I am seeing behaviour where I open a notebook in Zeppelin, run a paragraph script, and the script executes and returns the result as expected. Then I leave the notebook open for a few hours, come back, attempt to re-run the same paragraph, and get the error: "object 'hiveContext' not found". The only way to resolve this is by restarting the Spark interpreter. Is this expected behaviour? Is there a timeout on the hiveContext, perhaps? Any insight much appreciated.
Labels:
- Apache Hive
- Apache Spark
- Apache Zeppelin
07-27-2016
04:09 PM
Hi All, I notice that each time I run a Spark script through Zeppelin it utilises the full amount of YARN memory available in my cluster. Is there a way that I can limit/manage memory consumption? Regardless of the job's demands, it seems to always use 100% of the cluster. Thanks, M
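As a hedged sketch of one way to cap that consumption (these are standard Spark configuration property names; the exact place to set them, e.g. Zeppelin's Spark interpreter settings page or spark-defaults.conf, and the sizes themselves, are assumptions for illustration):

```properties
# limit what the Zeppelin Spark interpreter requests from YARN
spark.executor.instances=2
spark.executor.memory=512m
spark.executor.cores=1
spark.driver.memory=512m
```

With fixed executor instances and sizes, the job requests a bounded slice of the cluster instead of whatever YARN will grant.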
Labels:
- Apache Spark
- Apache YARN
- Apache Zeppelin
07-27-2016
11:16 AM
The cluster is fairly small as it's mostly experimental, but 3 out of the 4 nodes in the cluster each have 4 vCores and 1GB of memory, with a global YARN minimum container size of 256MB. So when you say slots, I'm assuming that would translate into 12 slots/containers potentially? i.e. a container representing 1 vCore + 256MB. I had assumed that, for the resources (CPU/RAM) available in my cluster, the query I'm running on the dataset sizes I'm working with, i.e. 30-40k records, would be more than enough?
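To sanity-check that arithmetic, here is a small sketch (pure Python; the node counts and sizes are taken from this thread) of how the container count works out when memory, rather than vCores, is the binding constraint:

```python
# Figures from this thread: 3 NodeManager nodes, each with
# 4 vCores and 1 GB (1024 MB) of memory available to YARN,
# and a minimum container allocation of 256 MB.
nodes = 3
vcores_per_node = 4
memory_per_node_mb = 1024
min_container_mb = 256

# Each node can host at most floor(memory / min_container) containers
# by memory, and at most vcores_per_node containers by CPU; the
# smaller of the two limits wins.
containers_per_node = min(memory_per_node_mb // min_container_mb, vcores_per_node)
total_containers = nodes * containers_per_node
print(total_containers)  # 12, matching the "12 slots" estimate above
```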
07-13-2016
06:45 PM
That all makes sense and I've adjusted the query as suggested. It seems, though, that the delay has now shifted to the reducer: Reducer 4 now shows 700 tasks but has been running for hours, and the Tez job is using 80% of the cluster capacity. I'm also running a simplified query to test performance: SELECT A.name, A.uid FROM (SELECT name, uid, latitude, longitude FROM ATable SORT BY asset_id) A JOIN B WHERE ST_Within(ST_SetSRID(ST_Point(B.longitude, B.latitude), 4326), ST_SetSRID(ST_Buffer(ST_Point(A.longitude, A.latitude), 0.0005), 4326));
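As an aside on what that predicate costs: because the JOIN has no ON/equality condition, Hive must evaluate the spatial test for every pair of rows (a cross join), which with tables in the tens of thousands of rows is on the order of a billion comparisons. The test itself, ST_Within(point, ST_Buffer(point, 0.0005)), is effectively a planar distance check in degrees; a minimal sketch of the equivalent logic (pure Python; the 0.0005-degree radius comes from the query above, the function name is mine):

```python
import math

def within_buffer(lon_a, lat_a, lon_b, lat_b, radius_deg=0.0005):
    """Planar approximation of ST_Within(B, ST_Buffer(A, radius)):
    true when point B falls inside a circle of radius_deg around A."""
    return math.hypot(lon_b - lon_a, lat_b - lat_a) <= radius_deg

print(within_buffer(0.0, 0.0, 0.0002, 0.0003))  # True  (~0.00036 degrees away)
print(within_buffer(0.0, 0.0, 0.0010, 0.0000))  # False (0.001 degrees away)
```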
I'm wondering if this is perhaps a problem with the spatial functions? This is the EXPLAIN output of the query:
07-13-2016
08:55 AM
Thanks both for your insight. I'm a noob in understanding the implications of a Hive query and the way in which mappers and reducers are formed. Is the fundamental problem that the tables are in 1 file of relatively small data size, and therefore Tez does not initialise multiple mappers in parallel? Also, could you provide a little more explanation around the use of SORT BY and how it improves performance? I'm planning on running some benchmark tests today to compare processing times. The job did finally complete in 3 hours 22 minutes 😕
07-12-2016
11:03 AM
1 Kudo
Hi All, I have the following Hive query run on Tez/YARN involving two Hive tables A and B:

select
    PreQuery.name,
    sum(case when PreQuery.Geode < 10.0 then 1 else 0 end) 10mCount,
    sum(case when PreQuery.Geode < 50.0 then 1 else 0 end) 50mCount,
    sum(case when PreQuery.Geode < 1000.0 then 1 else 0 end) 1000mCount
from
    ( select
        a.name,
        ST_GeodesicLengthWGS84( ST_SetSRID( ST_LineString(a.lat, a.lon, b.lat, b.lon), 4326)) as Geode
      from a, b) PreQuery
GROUP BY
    PreQuery.name
ORDER BY
    1000mCount desc

Table A has 45,000 rows, numFiles=1, stored as a managed ORC file with totalDataSize=1423246. Table B has 54,000 rows with totalDataSize=11876624 and numFiles=1, stored as a managed TEXT FILE. My HDP 2.4 cluster has 3 nodes providing a total of 12 vCores, with a minimum allocation of 256MB (1 vCore) and a maximum allocation of 1048MB (4 vCores) per YARN container. There is no bucketing or partitioning on these tables. My main Hive/Tez settings are:

yarn.scheduler.minimum-allocation-mb=256MB
yarn.scheduler.maximum-allocation-mb=1048MB
hive.tez.container.size=512MB
hive.tez.java.opts=-server -Djava.net.preferIPv4Stack=true -XX:NewRatio=8 -XX:+UseNUMA -XX:+UseG1GC -XX:+ResizeTLAB -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps
mapreduce.map.memory.mb=256MB
mapreduce.map.java.opts=-Xmx204m
mapreduce.reduce.memory.mb=512MB
mapreduce.reduce.java.opts=-Xmx409m
tez.runtime.io.sort.mb=68MB
hive.auto.convert.sortmerge.join=true
hive.auto.convert.sortmerge.join.to.mapjoin=false
When I run the query I see the following output in the DAG graphical view: Mapper 4 completes instantly, but Mapper 1 says it's running, and it has been running for hours with nothing progressing. The DAG indicates it is running at 20% with no clear sign of progress. I'm running the query through the beeline command tool; the latest output is: "INFO : Map 1: 0(+1)/1 Map 4: 1/1 Reducer 2: 0/2 Reducer 3: 0/1". Looking at the YARN UI manager, the job requests only 2 containers from the underlying cluster, using 512MB of memory per container. So my general question is: how can I optimise this to run faster? Is it an issue with the query itself, the way the tables are set up, or is more resource required in the platform? I feel like, given the small volume of data I'm processing, the available resource in the cluster should be more than enough, which leads me to believe I'm doing something very wrong or the config is not correct. Any help much appreciated, M
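One consistency check worth sketching: a common rule of thumb (an assumption here, not something stated in this thread) is to set the JVM -Xmx to roughly 80% of the container size so the heap fits inside the YARN allocation with headroom for off-heap use. The settings above follow it:

```python
def heap_for_container(container_mb, fraction=0.8):
    """Suggested -Xmx (in MB) for a given container size, using the
    common ~80%-of-container rule of thumb (an assumption here)."""
    return int(container_mb * fraction)

print(heap_for_container(256))  # 204 -> matches mapreduce.map.java.opts=-Xmx204m
print(heap_for_container(512))  # 409 -> matches mapreduce.reduce.java.opts=-Xmx409m
```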
Labels:
- Apache Hadoop
- Apache Hive
- Apache Tez
05-24-2016
04:12 PM
P.S. - am I correct in assuming that once this is set up I can install Ranger, sync user and group accounts to it, and leverage its management UI to add/remove users on the LDAP server?
05-24-2016
04:10 PM
Thanks for the information - so it seems like the simplest approach is to install OpenLDAP on one of my nodes and configure HiveServer2 to authenticate login requests against it. I'm looking for the most straightforward/quickest approach to secure Hive, so leaving out Kerberos for the time being seems like the best plan?
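A hedged sketch of what that HiveServer2 configuration looks like (the property names are HiveServer2's standard LDAP authentication settings in hive-site.xml; the host, port, and baseDN values are placeholders to substitute for your own):

```properties
hive.server2.authentication=LDAP
hive.server2.authentication.ldap.url=ldap://your-ldap-host:389
hive.server2.authentication.ldap.baseDN=ou=people,dc=example,dc=com
```

With these set, HiveServer2 binds each beeline/JDBC login against the LDAP server instead of accepting any username.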
05-24-2016
12:14 PM
Also, my other question is: do I need to set up an LDAP server in order to manage users/groups from a centralized service like Ranger? Or can I simply manage end-users and service users in Ranger alone?
05-24-2016
11:34 AM
Hi All, I have a 5-node HDP2.4 cluster running several services including Hive (HiveServer2). I now want to set up user authentication in Hive using LDAP. There is a lot of confusing information, and tutorials that mention Ranger/Knox as something that can be used as an LDAP server, but these are often discussed in the context of a sandbox (development) environment. Could anyone offer some clear guidance/steps on how to set up Ranger/Knox so that I can create and manage user access (authentication/authorisation) to Hive in my cluster? Thanks, MPH P.S. - I've looked at this tutorial but it seems to confuse matters (http://hortonworks.com/hadoop-tutorial/manage-security-policy-hive-hbase-knox-ranger/) 😕
Labels:
- Apache Hive
- Apache Knox
- Apache Ranger