Member since: 07-11-2019
Posts: 102
Kudos Received: 4
Solutions: 9

My Accepted Solutions
Views | Posted
---|---
18194 | 12-13-2019 12:03 PM
4273 | 12-09-2019 02:42 PM
3126 | 11-26-2019 01:21 PM
1431 | 08-27-2019 03:03 PM
2728 | 08-14-2019 07:33 PM
05-13-2020
02:02 AM
Running a Hortonworks Hadoop cluster (HDP-3.1.0.0) and getting a bunch of "Failed on local exception: java.io.IOException: Too many open files" errors when running Spark jobs that, up until this point, have worked fine. I have seen many other questions like this where the answer is to increase the ulimit settings for open files and processes (this is also in the HDP docs), and I'll note that I believe mine are still at the system default settings. But my question is: why is this only happening now, when the Spark jobs have been running fine for months without incident and I have made no recent code changes? I don't know enough about the internals of Spark to theorize about why things could be going wrong only now (it would seem odd to me if open files just build up in the course of running Spark, but that seems to be what is happening). Just as an example, this code...

sparkSession = SparkSession.builder.appName("GET_TABLE_COUNT").getOrCreate()
sparkSession._jsc.sc().getExecutorMemoryStatus().keySet().size()

...now generates errors like...
[2020-05-12 19:04:45,810] {bash_operator.py:128} INFO - 20/05/12 19:04:45 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:46,813] {bash_operator.py:128} INFO - 20/05/12 19:04:46 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:47,816] {bash_operator.py:128} INFO - 20/05/12 19:04:47 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:48,818] {bash_operator.py:128} INFO - 20/05/12 19:04:48 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:49,820] {bash_operator.py:128} INFO - 20/05/12 19:04:49 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:50,822] {bash_operator.py:128} INFO - 20/05/12 19:04:50 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:51,828] {bash_operator.py:128} INFO - 20/05/12 19:04:51 INFO Client: Application report for application_1579648183118_19918 (state: FAILED)
[2020-05-12 19:04:51,829] {bash_operator.py:128} INFO - 20/05/12 19:04:51 INFO Client:
[2020-05-12 19:04:51,829] {bash_operator.py:128} INFO - client token: N/A
[2020-05-12 19:04:51,829] {bash_operator.py:128} INFO - diagnostics: Application application_1579648183118_19918 failed 2 times due to Error launching appattempt_1579648183118_19918_000002. Got exception: java.io.IOException: DestHost:destPort hw005.co.local:45454 , LocalHost:localPort hw001.co.local/172.18.4.46:0. Failed on local exception: java.io.IOException: Too many open files
[2020-05-12 19:04:51,829] {bash_operator.py:128} INFO - at sun.reflect.GeneratedConstructorAccessor808.newInstance(Unknown Source)
[2020-05-12 19:04:51,829] {bash_operator.py:128} INFO - at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

My RAM and ulimit settings on the cluster look like...

[root@HW001]# clush -ab free -h
---------------HW001---------------
              total        used        free      shared  buff/cache   available
Mem:            31G        9.0G        1.1G        1.7G         21G         19G
Swap:          8.5G         44K        8.5G
---------------HW002---------------
              total        used        free      shared  buff/cache   available
Mem:            31G        7.3G        5.6G        568M         18G         22G
Swap:          8.5G        308K        8.5G
---------------HW003---------------
              total        used        free      shared  buff/cache   available
Mem:            31G        6.1G        4.0G        120M         21G         24G
Swap:          8.5G        200K        8.5G
---------------HW004---------------
              total        used        free      shared  buff/cache   available
Mem:            31G        2.9G        2.8G        120M         25G         27G
Swap:          8.5G         28K        8.5G
---------------HW005---------------
              total        used        free      shared  buff/cache   available
Mem:            31G        2.9G        4.6G        120M         23G         27G
Swap:          8.5G         20K        8.5G
---------------airflowetl---------------
              total        used        free      shared  buff/cache   available
Mem:            46G        5.3G         13G        2.4G         28G         38G
Swap:          8.5G        124K        8.5G
[root@HW001]# clush -ab ulimit -a
---------------HW[001-005] (5)---------------
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 127886
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 127886
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
---------------airflowetl---------------
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 192394
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 192394
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

I don't know much about Hadoop admin, but just looking at the Ambari dashboard, the cluster does not seem to be overly taxed (though I could not actually check the RM web UI, since it just throws a "too many open files" error). Does anyone with more Spark/Hadoop experience know why this would be happening now?
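For anyone looking at this with me, here is the kind of check I can run to see whether a specific daemon is actually accumulating descriptors rather than the limit just being too low. This is only a sketch: the two process names are ones I suspect from the error (destination port 45454 is the NodeManager), and the 16000 value in the commented limits.d example is an illustrative number, not something taken from the HDP docs.

#!/usr/bin/env bash
# Count open file descriptors for each NodeManager/DataNode JVM and show its soft nofile limit.
for proc in NodeManager DataNode; do
  for pid in $(pgrep -f "$proc"); do
    fds=$(ls /proc/"$pid"/fd 2>/dev/null | wc -l)
    limit=$(awk '/Max open files/ {print $4}' /proc/"$pid"/limits)
    echo "$proc pid=$pid open_fds=$fds soft_limit=$limit"
  done
done
# If the counts sit close to 1024, raising the limit for the service users is the usual fix,
# e.g. via a drop-in file (values illustrative only):
#   /etc/security/limits.d/hadoop-nofile.conf:
#     yarn  -  nofile  16000
#     hdfs  -  nofile  16000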
12-23-2019
04:19 PM
[The following question was moved here after originally being posted on 12-23-2019 to this thread, which had been marked 'Solved' on 06-01-2017 12:49 AM —Moderator]
@jsensharma Could you explain a little more about what exactly this
hadoop.proxyuser.root.groups
config is? Any docs describing it more?
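(For context while waiting on an answer, in case it helps others reading: as far as I understand it, hadoop.proxyuser.root.groups is one of the core-site.xml "proxy user" impersonation settings, i.e. it lists the groups whose members the root user is allowed to impersonate when a service submits work on their behalf via doAs, and hadoop.proxyuser.root.hosts similarly restricts the hosts root may proxy from. A quick way to see what is currently set on an Ambari-managed node is to grep the live client config; the path below assumes the usual /etc/hadoop/conf layout.)

# Show the proxyuser entries currently in effect (assumes configs under /etc/hadoop/conf)
grep -A1 "hadoop.proxyuser.root" /etc/hadoop/conf/core-site.xml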
Labels: Apache Hadoop
12-13-2019
12:03 PM
@jsensharma
1. I need to use python3, and would like to continue to do so in the future considering that python2 will stop being maintained in 2020 (I would think others have a similar desire as well). I am currently adding the option export PYSPARK_PYTHON=/path/to/my/virtualenv/bin/python; spark-submit sparksubmit.test.py as a workaround (else, this may be helpful: https://stackoverflow.com/a/51508990/8236733, or using the --py-files option).
2. I don't know where that path reference is coming from, since "../venv/bin/activate" is just activating a virtualenv and the "sparksubmit.test.py" code is just

from os import environ
from pyspark.sql import SparkSession
import time
import pprint
import platform
pp = pprint.PrettyPrinter(indent=4)
sparkSession = SparkSession.builder.appName("TEST").getOrCreate()
sparkSession._jsc.sc().setLogLevel("WARN")
print(platform.python_version())
def testfunc(num: int) -> str:
return "type annotations look ok"
print(testfunc(1))
print("\n\nYou are using %d nodes in this session\n\n" % sparkSession._jsc.sc().getExecutorMemoryStatus().keySet().size())
pp.pprint(sparkSession.sparkContext._conf.getAll())

...but that blank space in "/usr/hdp//hadoop/lib" is interesting to see, especially since I use export HADOOP_CONF_DIR=/etc/hadoop/conf for the HADOOP_CONF_DIR in the terminal when trying to run the command. Furthermore, looking at my (client node) FS, I don't even see that path...

[airflow@airflowetl tests]$ ls -lha /usr/hdp/current/hadoop-
hadoop-client/ hadoop-httpfs
hadoop-hdfs-client/ hadoop-mapreduce-client/
hadoop-hdfs-datanode/ hadoop-mapreduce-historyserver/
hadoop-hdfs-journalnode/ hadoop-yarn-client/
hadoop-hdfs-namenode/ hadoop-yarn-nodemanager/
hadoop-hdfs-nfs3/ hadoop-yarn-registrydns/
hadoop-hdfs-portmap/ hadoop-yarn-resourcemanager/
hadoop-hdfs-secondarynamenode/ hadoop-yarn-timelinereader/
hadoop-hdfs-zkfc/ hadoop-yarn-timelineserver/
[airflow@airflowetl tests]$ ls -lha /usr/hdp/current/hadoop
ls: cannot access /usr/hdp/current/hadoop: No such file or directory

(Note: I am using HDP v3.1.0.)
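For reference, the fuller form of the workaround I am using looks roughly like the below (the interpreter path is the same placeholder as above, and the appMasterEnv line is something I believe is needed to push the same interpreter choice to the application master in yarn cluster mode, though I have only needed client mode so far):

# Point the driver and the executors at the virtualenv's python3 interpreter.
export PYSPARK_PYTHON=/path/to/my/virtualenv/bin/python
export PYSPARK_DRIVER_PYTHON=/path/to/my/virtualenv/bin/python
export HADOOP_CONF_DIR=/etc/hadoop/conf

spark-submit \
  --master yarn \
  --deploy-mode client \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/path/to/my/virtualenv/bin/python \
  sparksubmit.test.py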
12-12-2019
03:35 PM
Is there a way to run spark-submit (Spark v2.3.2 from HDP 3.1.0) while in a virtualenv? I have a situation where a Python file uses python3 (and some specific libs) in a virtualenv (to isolate lib versions from the rest of the system). I would like to run this file with /bin/spark-submit, but attempting to do so I get...
[me@myserver tests]$ source ../venv/bin/activate; /bin/spark-submit sparksubmit.test.py
File "/bin/hdp-select", line 255 print "ERROR: Invalid package - " + name
^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("ERROR: Invalid package - " + name)?ls: cannot access /usr/hdp//hadoop/lib: No such file or directoryException in thread "main" java.lang.IllegalStateException: hdp.version is not set while running Spark under HDP, please set through HDP_VERSION in spark-env.sh or add a java-opts file in conf with -Dhdp.version=xxx
at org.apache.spark.launcher.Main.main(Main.java:118)
# also tried...
(venv) [me@myserver tests]$ export HADOOP_CONF_DIR=/etc/hadoop/conf; spark-submit --master yarn --deploy-mode cluster sparksubmit.test.py
19/12/12 13:50:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/12/12 13:50:20 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
Exception in thread "main" java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig
	at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:55)
	....
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.sun.jersey.api.client.config.ClientConfig
I'm not sure what to make of this or how to proceed further, and I did not totally understand the error message even after googling it.
Anyone with more experience have any further debugging tips for this or fixes?
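In case it helps anyone suggest something, this is the kind of invocation I was planning to try next based on the hdp.version part of the message (the 3.1.0.0-78 build number is a guess at my stack version rather than something I've confirmed, and I'm not sure whether exporting HDP_VERSION inline is equivalent to setting it in spark-env.sh as the message asks, or whether it addresses the later jersey ClassNotFoundException at all):

# Activate the virtualenv, point Spark at its python, and supply hdp.version explicitly.
source ../venv/bin/activate
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HDP_VERSION=3.1.0.0-78          # assumed build number; check with: ls /usr/hdp/
export PYSPARK_PYTHON="$(which python)"
spark-submit --master yarn --deploy-mode client sparksubmit.test.py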
Labels: Apache Spark
12-11-2019
11:30 AM
1 Kudo
From the Ranger email list, this is another bit of information that I found helpful:

----------

I've configured Ranger using the following approach to control who must be synced with AD. Only users belonging to groups inside a specific OU will be synced.

I've created the OU OU=ArthurAmericasGroups,OU=Security Groups,OU=Groups,OU=SHARED,OU=Brazil,DC=domain,DC=com and created a group called R2Users inside that OU. I put all desired sync users in as its members. Also, you can put other groups in as its members, and you can create other groups like R2TEAM as well. Remember to update the property ranger.usersync.ldap.user.searchfilter to include more than one. I've configured Ranger to sync groups before users. Here is the configuration.

COMMON CONFIGS

Label | Property | Value
---|---|---
LDAP/AD URL | ranger.usersync.ldap.url | ldap://myactivedirectoryserver.domain.com:389
Bind User | ranger.usersync.ldap.binddn | CN=LDAP_AD_ACCOUNT,OU=Service Accounts,OU=LCB,OU=Brazil,DC=domain,DC=com
Bind User Password | ranger.usersync.ldap.ldapbindpassword | LDAP_AD_ACCOUNT user's password
Incremental Sync | ranger.usersync.ldap.deltasync | Yes
Enable LDAP STARTTLS | ranger.usersync.ldap.starttls | No

GROUP CONFIGS

Label | Property | Value
---|---|---
Enable Group Sync | ranger.usersync.group.searchenable | Yes
Group Member Attribute | ranger.usersync.group.memberattributename | member
Group Name Attribute | ranger.usersync.group.nameattribute | cn
Group Object Class | ranger.usersync.group.objectclass | group
Group Search Base | ranger.usersync.group.searchbase | OU=ArthurAmericasGroups,OU=Security Groups,OU=Groups,OU=SHARED,OU=Brazil,DC=domain,DC=com
Group Search Filter | ranger.usersync.group.searchfilter |
Enable Group Search First | ranger.usersync.group.search.first.enabled | Yes
Sync Nested Groups | is_nested_groupsync_enabled | Yes
Group Hierarchy Levels | ranger.usersync.ldap.grouphierarchylevels | 5

USER CONFIGS

Label | Property | Value
---|---|---
Username Attribute | ranger.usersync.ldap.user.nameattribute | sAMAccountName
User Object Class | ranger.usersync.ldap.objectclass | user
User Search Base | ranger.usersync.ldap.searchbase | DC=domain,DC=com
User Search Filter | ranger.usersync.ldap.user.searchfilter | (memberOf=CN=R2Users,OU=ArthurAmericasGroups,OU=Security Groups,OU=Groups,OU=SHARED,OU=Brazil,DC=domain,DC=com)
User Search Scope | ranger.usersync.ldap.user.searchscope | sub
User Group Name Attribute | ranger.usersync.ldap.groupnameattribute | sAMAccountName
Group User Map Sync | ranger.usersync.group.usermapsyncenabled | Yes
Enable User Search | ranger.usersync.user.searchenabled | Yes

ADVANCED - Ranger Settings

Label | Property | Value
---|---|---
Authentication method | | ACTIVE_DIRECTORY

ADVANCED - AD Settings

Label | Property | Value
---|---|---
AD Bind Password | ranger.ldap.ad.bind.password | LDAP_AD_ACCOUNT user's password
Domain Name (Only for AD) | ranger.ldap.ad.domain | DC=domain,DC=com
AD Base DN | ranger.ldap.ad.base.dn | DC=domain,DC=com
AD Referral | ranger.ldap.ad.referral | Follow
AD User Search Filter | ranger.ldap.ad.user.search | (sAMAccountName={0})

Advanced ranger-ugsync-site

Label | Property | Value
---|---|---
ranger.usersync.ldap.referral | ranger.usersync.ldap.referral | Follow
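One thing I found useful when adapting this to my own AD was to dry-run the user search filter with ldapsearch before putting it into Ambari, so that directory problems can be separated from Ranger problems. A rough sketch (assumes the openldap-clients package is installed; the DNs are the ones from the email above, so substitute your own):

# Bind as the sync account and list the accounts that the user search filter would return.
ldapsearch -x -H ldap://myactivedirectoryserver.domain.com:389 \
  -D "CN=LDAP_AD_ACCOUNT,OU=Service Accounts,OU=LCB,OU=Brazil,DC=domain,DC=com" -W \
  -b "DC=domain,DC=com" \
  "(memberOf=CN=R2Users,OU=ArthurAmericasGroups,OU=Security Groups,OU=Groups,OU=SHARED,OU=Brazil,DC=domain,DC=com)" \
  sAMAccountName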
12-09-2019
02:43 PM
Saving this for later. https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.0/ambari-authentication-ldap-ad/content/setting_up_ldap_user_authentication.html
12-09-2019
02:42 PM
1 Kudo
I appear to have been able to sync AD users after changing the bind user path to CN=hwldap,OU=Users,OU=HortonworksUsers,DC=ucera,DC=local, as opposed to using the "uid" entry key. I don't know why this would make a difference, but it seems to have worked. Would anyone with more AD experience have an idea why (note that when I look at the attributes for this entry in our AD, both the CN and UID attributes are present)?
12-06-2019
05:02 PM
Attempting to set up LDAP/AD users for Ranger (v1.2.0) following the docs (https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/configuring-ranger-authe-with-unix-ldap-ad/content/configuring_ranger_authentication_with_unix_ldap_or_ad.html) and this older video (https://www.youtube.com/watch?v=2aZ9GBhCOhA), but when looking at the Users tab in the Ranger UI, I am seeing only the original Unix users.

Looking at the usersync logs, near the tail I see:

....
06 Dec 2019 14:21:51 INFO LdapUserGroupBuilder [UnixUserSyncThread] - LdapUserGroupBuilder initialization started
06 Dec 2019 14:21:51 INFO LdapUserGroupBuilder [UnixUserSyncThread] - LdapUserGroupBuilder initialization completed with -- ldapUrl: ldap://172.18.4.42:389, ldapBindDn: UID=hwldap,OU=Users,OU=HortonworksUsers,DC=ucera,DC=local, ldapBindPassword: ***** , ldapAuthenticationMechanism: simple, searchBase: dc=hadoop,dc=apache,dc=org, userSearchBase: [dc=ucera,dc=local], userSearchScope: 2, userObjectClass: user, userSearchFilter: (memberOf=UID=hwusers,OU=groups,OU=HortonworksUsers,DC=ucera,DC=local), extendedUserSearchFilter: (&(objectclass=user)(memberOf=UID=hwusers,OU=groups,OU=HortonworksUsers,DC=ucera,DC=local)), userNameAttribute: sAMAccountName, userSearchAttributes: [sAMAccountName, memberof], userGroupNameAttributeSet: [memberof], pagedResultsEnabled: true, pagedResultsSize: 500, groupSearchEnabled: false, groupSearchBase: [dc=ucera,dc=local], groupSearchScope: 2, groupObjectClass: group, groupSearchFilter: (CN=hwusers), extendedGroupSearchFilter: (&(objectclass=group)(CN=hwusers)(|(cn={0})(cn={1}))), extendedAllGroupsSearchFilter: (&(objectclass=group)(CN=hwusers)), groupMemberAttributeName: cn, groupNameAttribute: UID=hwusers,OU=groups,OU=HortonworksUsers,DC=ucera,DC=local, groupSearchAttributes: [UID=hwusers,OU=groups,OU=HortonworksUsers,DC=ucera,DC=local, cn], groupUserMapSyncEnabled: true, groupSearchFirstEnabled: false, userSearchEnabled: false, ldapReferral: ignore
06 Dec 2019 14:21:51 INFO UserGroupSync [UnixUserSyncThread] - Begin: initial load of user/group from source==>sink
06 Dec 2019 14:21:51 INFO LdapUserGroupBuilder [UnixUserSyncThread] - LDAPUserGroupBuilder updateSink started
06 Dec 2019 14:21:51 INFO LdapUserGroupBuilder [UnixUserSyncThread] - Performing user search first
06 Dec 2019 14:21:51 ERROR LdapUserGroupBuilder [UnixUserSyncThread] - LDAPUserGroupBuilder.getUsers() failed with exception: javax.naming.AuthenticationException: [LDAP: error code 49 - 80090308: LdapErr: DSID-0C09042F, comment: AcceptSecurityContext error, data 52e, v2580]; remaining name 'dc=ucera,dc=local'
06 Dec 2019 14:21:51 INFO LdapUserGroupBuilder [UnixUserSyncThread] - LDAPUserGroupBuilder.getUsers() user count: 0
06 Dec 2019 14:21:51 INFO UserGroupSync [UnixUserSyncThread] - End: initial load of user/group from source==>sink
....

So it seems like Ranger is trying to use AD, encountering an error, and falling back to Unix-based users. I did see this article (https://community.cloudera.com/t5/Community-Articles/Ranger-Ldap-Integration/ta-p/245494), but I already have the cluster nodes linked to AD via SSSD, so I would think the LDAP/AD sync should already be configured on the nodes and that Ranger should be able to use AD once the configs were entered. Any idea what is going on here? Any further debugging tips or information (I'm very unfamiliar with AD/LDAP admin stuff)?
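Adding this in case it helps with debugging suggestions: my (possibly wrong) reading is that LDAP error code 49 with data 52e from AD usually indicates the bind itself failed (bad bind DN or password), which would explain the user count of 0. The check I am planning to run is a manual bind with the exact DN from the usersync config (assumes openldap-clients is installed; the password is prompted interactively):

# Try the same bind that ranger-usersync is attempting; a "data 52e"-style failure here
# would point at the bind DN/password rather than at the Ranger config itself.
ldapsearch -x -H ldap://172.18.4.42:389 \
  -D "UID=hwldap,OU=Users,OU=HortonworksUsers,DC=ucera,DC=local" -W \
  -b "DC=ucera,DC=local" "(sAMAccountName=hwldap)" dn sAMAccountName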
Labels: Apache Ranger
11-26-2019
01:21 PM
After just giving in and trying to manually create the hive user myself, I see...

[root@airflowetl ~]# useradd -g hadoop -s /bin/bash hive
useradd: user 'hive' already exists
[root@airflowetl ~]# cat /etc/passwd | grep hive
[root@airflowetl ~]# id hive
uid=379022825(hive) gid=379000513(domain users) groups=379000513(domain users)

The fact that this existing user's uid looks like this and is not in the /etc/passwd file made me think there is some existing Active Directory user (which this client node syncs with via the installed SSSD) that already has the name hive. Checking our AD users, this turned out to be true. Temporarily stopping the SSSD service to stop the sync with AD (service sssd stop) (since I'm not sure you can get a server to ignore AD syncs on an individual-user basis) before rerunning the client host add in Ambari fixed the problem for me.
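For anyone hitting the same thing, the quick check that would have saved me some time is comparing what NSS resolves versus what is in the local passwd file; if getent returns a user that grep on /etc/passwd does not, the account is almost certainly coming from SSSD/AD (the sssd.conf path below is the default location, which may differ on your install):

# Local file vs. NSS (which includes the sss/AD source on this node)
grep '^hive:' /etc/passwd               # no output here means no local user
getent passwd hive                      # output here despite the above means the user comes from AD/SSSD
grep -n 'domains' /etc/sssd/sssd.conf   # which AD domain(s) this host is joined to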
11-26-2019
12:14 PM
Adding some log-printing lines near the offending final line in the error trace (i.e. File "/usr/lib/ambari-agent/lib/resource_management/libraries/functions/stack_select.py", line 147, in get_supported_packages), I print the return code and stdout:

2
ambari-python-wrap: can't open file '/usr/bin/hdp-select': [Errno 2] No such file or directory

So what the heck? It wants hdp-select to already be there, but the Ambari add-host UI complains if I manually install that binary myself beforehand. When I do manually install it (using the same repo file as on the rest of the existing cluster nodes), all I see is...

0
Packages: accumulo-client
accumulo-gc
accumulo-master
accumulo-monitor
accumulo-tablet
accumulo-tracer
atlas-client
atlas-server
beacon
beacon-client
beacon-server
druid-broker
druid-coordinator
druid-historical
druid-middlemanager
druid-overlord
druid-router
druid-superset
falcon-client
falcon-server
flume-server
hadoop-client
hadoop-hdfs-client
hadoop-hdfs-datanode
hadoop-hdfs-journalnode
hadoop-hdfs-namenode
hadoop-hdfs-nfs3
hadoop-hdfs-portmap
hadoop-hdfs-secondarynamenode
hadoop-hdfs-zkfc
hadoop-httpfs
hadoop-mapreduce-client
hadoop-mapreduce-historyserver
hadoop-yarn-client
hadoop-yarn-nodemanager
hadoop-yarn-registrydns
hadoop-yarn-resourcemanager
hadoop-yarn-timelinereader
hadoop-yarn-timelineserver
hbase-client
hbase-master
hbase-regionserver
hive-client
hive-metastore
hive-server2
hive-server2-hive
hive-server2-hive2
hive-webhcat
hive_warehouse_connector
kafka-broker
knox-server
livy-client
livy-server
livy2-client
livy2-server
mahout-client
oozie-client
oozie-server
phoenix-client
phoenix-server
pig-client
ranger-admin
ranger-kms
ranger-tagsync
ranger-usersync
shc
slider-client
spark-atlas-connector
spark-client
spark-historyserver
spark-schema-registry
spark-thriftserver
spark2-client
spark2-historyserver
spark2-thriftserver
spark_llap
sqoop-client
sqoop-server
storm-client
storm-nimbus
storm-slider-client
storm-supervisor
superset
tez-client
zeppelin-server
zookeeper-client
zookeeper-server
Aliases: accumulo-server
all
client
hadoop-hdfs-server
hadoop-mapreduce-server
hadoop-yarn-server
hive-server
Command failed after 1 tries
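In case it is useful to anyone trying to reproduce this, after the manual install I have been sanity-checking the script and the symlink layout like the below (I'm assuming hdp-select supports the versions/status subcommands the same way it does on my existing cluster nodes):

# Confirm the binary exists, which HDP builds it knows about, and what /usr/hdp currently holds.
ls -l /usr/bin/hdp-select
hdp-select versions
hdp-select status hadoop-client
ls -l /usr/hdp/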