Member since: 04-13-2016
80 Posts
12 Kudos Received
1 Solution
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3175 | 03-17-2017 06:10 PM
08-02-2016
11:41 AM
I am seeing behaviour where I open a notebook in Zeppelin, run a paragraph script, and the script executes and returns the result as expected. Then I leave the notebook open for a few hours, come back, attempt to re-run the same paragraph, and get the error: "object 'hiveContext' not found". The only way to resolve this is by restarting the Spark interpreter. Is this expected behaviour? Is there a timeout on the hiveContext, perhaps? Any insight much appreciated.
Labels:
- Apache Hive
- Apache Spark
- Apache Zeppelin
07-27-2016
04:09 PM
Hi All, I notice that each time I run a Spark script through Zeppelin it utilises the full amount of YARN memory available in my cluster. Is there a way that I can limit/manage memory consumption? Regardless of the job's demands, it seems to always use 100% of the cluster. Thanks, M
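As a hedged sketch of one way to cap that consumption (these are standard Spark configuration property names; the exact place to set them, e.g. Zeppelin's Spark interpreter settings page or spark-defaults.conf, and the sizes themselves, are assumptions for illustration):

```properties
# limit what the Zeppelin Spark interpreter requests from YARN
spark.executor.instances=2
spark.executor.memory=512m
spark.executor.cores=1
spark.driver.memory=512m
```

With fixed executor instances and sizes, the job requests a bounded slice of the cluster instead of whatever YARN will grant.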
Labels:
- Apache Spark
- Apache YARN
- Apache Zeppelin
07-27-2016
11:16 AM
The cluster is fairly small as it's mostly experimental, but 3 out of the 4 nodes in the cluster each have 4 vCores and 1GB of memory, with a global YARN minimum container size of 256MB. So when you say slots, I'm assuming that would translate into 12 slots/containers potentially? i.e. a container representing 1 vCore + 256MB. I had assumed that, for the resources (CPU/RAM) available in my cluster, the query I'm running on the dataset sizes I'm working with, i.e. 30-40k records, would be more than enough?
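To sanity-check that arithmetic, here is a small sketch (pure Python; the node counts and sizes are taken from this thread) of how the container count works out when memory, rather than vCores, is the binding constraint:

```python
# Figures from this thread: 3 NodeManager nodes, each with
# 4 vCores and 1 GB (1024 MB) of memory available to YARN,
# and a minimum container allocation of 256 MB.
nodes = 3
vcores_per_node = 4
memory_per_node_mb = 1024
min_container_mb = 256

# Each node can host at most floor(memory / min_container) containers
# by memory, and at most vcores_per_node containers by CPU; the
# smaller of the two limits wins.
containers_per_node = min(memory_per_node_mb // min_container_mb, vcores_per_node)
total_containers = nodes * containers_per_node
print(total_containers)  # 12, matching the "12 slots" estimate above
```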
07-13-2016
06:45 PM
That all makes sense and I've adjusted the query as suggested. It seems, though, that the delay has now shifted to the reducer: Reducer 4 now shows 700 tasks but has been running for hours, and the Tez job is using 80% of the cluster capacity. I'm also running a simplified query to test performance: SELECT A.name, A.uid FROM (SELECT name, uid, latitude, longitude FROM ATable SORT BY asset_id) A JOIN B WHERE ST_Within(ST_SetSRID(ST_Point(B.longitude, B.latitude), 4326), ST_SetSRID(ST_Buffer(ST_Point(A.longitude, A.latitude), 0.0005), 4326));
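As an aside on what that predicate costs: because the JOIN has no ON/equality condition, Hive must evaluate the spatial test for every pair of rows (a cross join), which with tables in the tens of thousands of rows is on the order of a billion comparisons. The test itself, ST_Within(point, ST_Buffer(point, 0.0005)), is effectively a planar distance check in degrees; a minimal sketch of the equivalent logic (pure Python; the 0.0005-degree radius comes from the query above, the function name is mine):

```python
import math

def within_buffer(lon_a, lat_a, lon_b, lat_b, radius_deg=0.0005):
    """Planar approximation of ST_Within(B, ST_Buffer(A, radius)):
    true when point B falls inside a circle of radius_deg around A."""
    return math.hypot(lon_b - lon_a, lat_b - lat_a) <= radius_deg

print(within_buffer(0.0, 0.0, 0.0002, 0.0003))  # True  (~0.00036 degrees away)
print(within_buffer(0.0, 0.0, 0.0010, 0.0000))  # False (0.001 degrees away)
```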
I'm wondering if this is perhaps a problem with the spatial functions? This is the EXPLAIN output of the query:
07-13-2016
08:55 AM
Thanks both for your insight. I'm a noob in understanding the implications of a Hive query and the way in which mappers and reducers are formed. Is the fundamental problem that the tables are in 1 file of relatively small data size, and therefore Tez does not initialise multiple mappers in parallel? Also, could you provide a little more explanation around the use of SORT BY and how it improves performance? I'm planning on running some benchmark tests today to compare processing times. The job did finally complete in 3 hours 22 minutes 😕
07-12-2016
11:03 AM
1 Kudo
Hi All, I have the following Hive query run on Tez/YARN involving two Hive tables A and B:

select
    PreQuery.name,
    sum(case when PreQuery.Geode < 10.0 then 1 else 0 end) 10mCount,
    sum(case when PreQuery.Geode < 50.0 then 1 else 0 end) 50mCount,
    sum(case when PreQuery.Geode < 1000.0 then 1 else 0 end) 1000mCount
from
    ( select
        a.name,
        ST_GeodesicLengthWGS84( ST_SetSRID( ST_LineString(a.lat, a.lon, b.lat, b.lon), 4326)) as Geode
      from a, b) PreQuery
GROUP BY
    PreQuery.name
ORDER BY
    1000mCount desc

Table A has 45,000 rows, numFiles=1, stored as a managed ORC file with totalDataSize=1423246. Table B has 54,000 rows with totalDataSize=11876624 and numFiles=1, stored as a managed TEXT FILE. My HDP 2.4 cluster has 3 nodes providing a total of 12 vCores, with a minimum allocation of 256MB (1 vCore) and a maximum allocation of 1048MB (4 vCores) per YARN container. There is no bucketing or partitioning on these tables. My main Hive/Tez settings are:

yarn.scheduler.minimum-allocation-mb=256MB
yarn.scheduler.maximum-allocation-mb=1048MB
hive.tez.container.size=512MB
hive.tez.java.opts=-server -Djava.net.preferIPv4Stack=true -XX:NewRatio=8 -XX:+UseNUMA -XX:+UseG1GC -XX:+ResizeTLAB -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps
mapreduce.map.memory.mb=256MB
mapreduce.map.java.opts=-Xmx204m
mapreduce.reduce.memory.mb=512MB
mapreduce.reduce.java.opts=-Xmx409m
tez.runtime.io.sort.mb=68MB
hive.auto.convert.sortmerge.join=true
hive.auto.convert.sortmerge.join.to.mapjoin=false
When I run the query I see the following output in the DAG graphical view: Mapper 4 completes instantly, but Mapper 1 says it's running, and it has been running for hours with nothing progressing. The DAG indicates it is running at 20% with no clear sign of progress. I'm running the query through the beeline command tool; the latest output is: "INFO : Map 1: 0(+1)/1 Map 4: 1/1 Reducer 2: 0/2 Reducer 3: 0/1". Looking at the YARN UI manager, the job requests only 2 containers from the underlying cluster, using 512MB of memory per container. So my general question is: how can I optimise this to run faster? Is it an issue with the query itself, the way the tables are set up, or is more resource required in the platform? I feel like, given the small volume of data I'm processing, the available resource in the cluster should be more than enough, which leads me to believe I'm doing something very wrong or the config is not correct. Any help much appreciated, M
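One consistency check worth sketching: a common rule of thumb (an assumption here, not something stated in this thread) is to set the JVM -Xmx to roughly 80% of the container size so the heap fits inside the YARN allocation with headroom for off-heap use. The settings above follow it:

```python
def heap_for_container(container_mb, fraction=0.8):
    """Suggested -Xmx (in MB) for a given container size, using the
    common ~80%-of-container rule of thumb (an assumption here)."""
    return int(container_mb * fraction)

print(heap_for_container(256))  # 204 -> matches mapreduce.map.java.opts=-Xmx204m
print(heap_for_container(512))  # 409 -> matches mapreduce.reduce.java.opts=-Xmx409m
```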
Labels:
- Apache Hadoop
- Apache Hive
- Apache Tez
05-24-2016
04:12 PM
P.S. - am I correct in assuming that once this is set up I can install Ranger, sync user and group accounts to it, and leverage its management UI to add/remove users on the LDAP server?
05-24-2016
04:10 PM
Thanks for the information - so it seems like the simplest approach is to install OpenLDAP on one of my nodes and configure HiveServer2 to authenticate login requests against it. I'm looking for the most straightforward/quickest approach to secure Hive, so leaving out Kerberos for the time being seems like the best plan?
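A hedged sketch of what that HiveServer2 configuration looks like (the property names are HiveServer2's standard LDAP authentication settings in hive-site.xml; the host, port, and baseDN values are placeholders to substitute for your own):

```properties
hive.server2.authentication=LDAP
hive.server2.authentication.ldap.url=ldap://your-ldap-host:389
hive.server2.authentication.ldap.baseDN=ou=people,dc=example,dc=com
```

With these set, HiveServer2 binds each beeline/JDBC login against the LDAP server instead of accepting any username.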
05-24-2016
12:14 PM
Also, my other question is: do I need to set up an LDAP server in order to manage users/groups from a centralized service like Ranger? Or can I simply manage end-users and service users in Ranger alone?
05-24-2016
11:34 AM
Hi All, I have a 5-node HDP2.4 cluster running several services including Hive (HiveServer2). I now want to set up user authentication in Hive using LDAP. There is a lot of confusing information, and tutorials that mention Ranger/Knox as something that can be used as an LDAP server, but these are often discussed in the context of a sandbox (development) environment. Could anyone offer some clear guidance/steps on how to set up Ranger/Knox so that I can create and manage user access (authentication/authorisation) to Hive in my cluster? Thanks, MPH P.S. - I've looked at this tutorial but it seems to confuse matters (http://hortonworks.com/hadoop-tutorial/manage-security-policy-hive-hbase-knox-ranger/) 😕
Labels:
- Apache Hive
- Apache Knox
- Apache Ranger