Good news! Long story short: the problem is solved, and it was not CDSW/Jupyter specific 😉

A more verbose explanation. To me it clearly looks like a Cloudera Manager bug, which should be addressed accordingly. In the process of investigation, one of my colleagues suggested checking whether the command-line pyspark shell was working correctly, and apparently it wasn't. Checked from the edge node, pyspark was able to start, but threw an error message:

[email@example.com ~]$ pyspark
  File "/opt/miniconda3/envs/py36/lib/python3.6/site.py", line 177
    file=sys.stderr)
                   ^
SyntaxError: invalid syntax
Python 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31)

This told me that Spark picked up the proper Python 3.6.9 executable from the Miniconda installation as expected, but was unable to process its own libraries/modules, which was quite weird... Googling for the error gave me the advice to check whether Python was mixing the 2.7 and 3.6 environments. In other words, despite reporting that PySpark started under Python 3.6.9, the error message actually came from Python 2.7.5 (the OS "native" one). So I started looking at the spark-conf/spark-env.sh files and found that the bottom part of these files was corrupt and looked like this:

export PYSPARK_DRIVER_PYTHON=/bin/python3
export PYSPARK_PYTHON=/bin/python3export PYSPARK_DRIVER_PYTHON=/bin/python3
export PYSPARK_PYTHON=/bin/python3

Yes, exactly like this: 3 lines instead of 4, with one line concatenated from the two identical blocks, without a newline between them.

So I went to Cloudera Manager -> Spark -> Configuration and found that both "extra fields", Spark Service Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-env.sh and Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-env.sh, contained two lines each:

export PYSPARK_DRIVER_PYTHON=/bin/python3
export PYSPARK_PYTHON=/bin/python3

On the worker nodes, /bin/python3 points to the proper Python 3.6.9 executable in the Miniconda3 installation, so this would work fine if Cloudera Manager were able to glue the two sections together with at least one newline in between. But it looks like both "freetext" fields are simply used "as is": concatenated and appended to the bottom of the spark-env.sh file template. In our particular case this gave pyspark a strange mixture of the system's default Python 2.7 and conda's 3.6.9.

So, at the end of the day, we wiped out those "extras" in the Cloudera Manager controlled Spark configuration, and after a restart everything ran smoothly as expected.
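For anyone hitting something similar, a quick sanity check along these lines would have flagged the corruption right away. This is a sketch only; the config path is an assumption, adjust it for your distribution's layout:

import re

# Path is an assumption -- adjust for your Spark config layout.
conf = "/etc/spark/conf/spark-env.sh"

with open(conf) as f:
    for n, line in enumerate(f, 1):
        # Two (or more) export statements glued onto one physical line
        # is exactly the symptom of the safety-valve concatenation bug above.
        if len(re.findall(r"\bexport\s+\w+=", line)) > 1:
            print(f"line {n} looks concatenated: {line.rstrip()}")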
Hi All,

We're currently running CDH 6.3.2 and trying to get Python 3.6.9 working on the Spark worker nodes. So far without success, though... As our cluster is built on RHEL7, the default Python version is 2.7, which is EOL and doesn't have all the necessary libraries/modules. The original idea was to install Miniconda3 on all the nodes, create an environment py36, and point PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON to the proper python executable. Apparently this doesn't work as expected; any PySpark job breaks with the error message:

Fatal Python error: Py_Initialize: Unable to get the locale encoding
ModuleNotFoundError: No module named 'encodings'

despite the fact that the module "encodings" is there in the py36 environment. If we add

spark.executorEnv.PYTHONPATH=/opt/miniconda3/envs/py36/lib/python3.6:/opt/miniconda3/envs/py36/lib/

to the spark-defaults.conf file in the project root, we get a different error message:

Traceback (most recent call last):
  File "/opt/miniconda3/envs/py36/lib/python3.6/runpy.py", line 183, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/opt/miniconda3/envs/py36/lib/python3.6/runpy.py", line 109, in _get_module_details
zipimport.ZipImportError: can't decompress data; zlib not available

which is also not correct, because zlib is installed. We tried to install and activate the official Anaconda parcel for CDH, but it comes with Python 2.7, so the question at the end of the day is the same: how do we tell a Spark worker node to use a specific Python version or virtual environment from a Jupyter notebook started on CDSW? We've already found some guides on the Net explaining how to tell Spark to use conda, using the set of spark.pyspark.virtualenv.* variables in spark-defaults.conf, but they don't seem to affect anything. This particular Python version is the officially recommended one for CDSW, so there should be a way to use it on the Spark worker nodes as well...

Thanx in advance,
Kirill
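To make the goal concrete, here is a minimal sketch of what we are trying to achieve, using the standard spark.pyspark.python / spark.pyspark.driver.python properties (equivalent to the environment variables above; whether they actually take effect from a CDSW session is exactly the open question). The Miniconda path matches our layout:

from pyspark.sql import SparkSession

py36 = "/opt/miniconda3/envs/py36/bin/python"

spark = (
    SparkSession.builder
    .appName("py36-smoke-test")
    .config("spark.pyspark.python", py36)         # executor-side interpreter
    .config("spark.pyspark.driver.python", py36)  # driver-side interpreter
    .getOrCreate()
)

# Ask an executor which interpreter it is really running.
print(spark.sparkContext.parallelize([0], 1)
      .map(lambda _: __import__("sys").executable)
      .collect())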
Hi all,

Recently we ran into an issue with Kafka running in a non-routable network segment behind a proxy (HAProxy, to be precise). We're using Ambari 2.7.3 (the most recent stable version for the Power platform). Kafka has 3 broker nodes, and they can talk to each other using FQDN hostnames. But to connect from real consumers/producers, HAProxy virtual IPs (in the form of FQDNs) have to be used, one per broker. If the advertised.listeners property is not populated with those external VIPs, consumers/producers cannot work properly, because Kafka advertises only the internal hostnames, which are not visible from outside.

Setting advertised.listeners to the proper values is not a big deal with a standalone Kafka deployment, but with an Ambari-controlled one it doesn't work, because Ambari has no built-in logic to map the detected hostnames of the listeners to their external names and/or IPs. Even worse, there is the possibility to add an advertised.listeners property to the "Custom kafka-broker" configuration section in the Ambari UI, but it is useless in the case of external FQDNs/IPs. Editing server.properties in place on each Kafka broker node helps, but then Kafka has to be (re-)started using its own startup scripts and can no longer be controlled by Ambari (it will be shown "red" in the Ambari UI); otherwise Ambari will overwrite server.properties with its own values.

To add the necessary functionality and keep Kafka under Ambari control, we've modified a couple of Ambari Python files (service_advisor.py and kafka.py) and added a custom property called listeners.mapping, containing a list of <internal_listener>:<external_listener> pairs. Everything is working fine now and we could contribute this upstream, but it looks like the 2.7.3 Ambari branch on GitHub is more or less frozen and nobody takes care of pull requests. So where should we go with our solution? The issue seems to be pretty common, especially in cloud setups where Kafka runs inside a VPC.

Cheers,
Kirill
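For the curious, here is a simplified, hypothetical sketch of the mapping idea, not our actual kafka.py patch. For readability the sketch separates each pair with "=" instead of the ":" described above, since listener URLs themselves contain colons; all hostnames are made up:

def apply_listener_mapping(listeners: str, mapping_spec: str) -> str:
    """Build advertised.listeners by rewriting internal listeners
    to their external (HAProxy VIP) counterparts.

    mapping_spec format used in this sketch:
      "<internal_listener>=<external_listener>,..."
    """
    mapping = dict(pair.split("=", 1) for pair in mapping_spec.split(","))
    # Listeners with no mapping entry are passed through unchanged.
    return ",".join(mapping.get(l, l) for l in listeners.split(","))


print(apply_listener_mapping(
    "PLAINTEXT://broker1.internal.local:6667",
    "PLAINTEXT://broker1.internal.local:6667=PLAINTEXT://kafka-vip1.example.com:6667",
))
# -> PLAINTEXT://kafka-vip1.example.com:6667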
Forgot to update. I solved the issue by implementing a quick-n-dirty workaround: I directly edited index.html in the application directory, getting rid of the external references and copying bootstrap's and jquery's JS/CSS files over to a sub-folder under the same directory. After that, the Knox service has to be *stopped and started*; 'restart' is not enough to pick up the changes.
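Roughly, the edit boils down to something like the following sketch. The application path and the "assets" sub-folder name are assumptions, adjust them for your install:

import re
from pathlib import Path

# Assumed location of the Knox admin UI application -- check your install.
index = Path("/usr/hdp/current/knox-server/data/applications/admin-ui/index.html")

html = index.read_text()
# Rewrite absolute CDN URLs so they point at the locally copied files
# in the assumed "assets" sub-folder instead.
html = re.sub(r"https?://[^\"']*/(bootstrap\.min\.css|jquery\.min\.js)",
              r"assets/\1", html)
index.write_text(html)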
Looks like it was fixed upstream not so long ago; all released versions, including the latest official 1.2.0, still have this issue. Judging by the Knox sources, the fix landed about 16 days ago.
We have an HDP/HDF cluster running in a network segment where neither servers nor clients have access to the public internet. The initial setup (both manually via Ambari and using a Blueprint) worked well (with the extra steps necessary to use local RPM repositories instead of the public ones). The cluster now seems to be operational (with some workarounds due to hostnames in a *.local domain and the absence of proper SSL certificates).

The only major problem at the moment is the Knox UI. Despite having all the necessary files locally on the Knox node (the same machine as Ambari), the Knox UI tries to pull CSS and JS files from external hosts, googleapis.com and bootstrapcdn.com respectively. It is NOT about the other components' UIs; it is about Knox itself. I'm able to log in with the default username/password, but after a timeout I see an empty page with the Knox logo. Switching on the Inspector/Debugger in the browser shows that the browser tries to pull bootstrap.min.css and jquery.min.js from their external hosts and (obviously) fails, because neither the servers nor the client machines have access to the public Net. The other UIs (including Ambari) are working fine; at the moment Knox is the only component that has problems running in the isolated environment.

Upd: Versions used:
Ambari: 2.7.1
Knox: 1.0.0
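For reference, a quick browser-free way to confirm the external references. The gateway URL and credentials below are placeholders, and verify=False is only there because of our missing proper SSL certificates:

import re
import requests

resp = requests.get(
    "https://knox.local:8443/gateway/manager/admin-ui/index.html",  # placeholder URL
    auth=("admin", "admin-password"),  # placeholder credentials
    verify=False,  # self-signed certs in our isolated segment
)
# List every asset reference that points at the two external hosts.
print(re.findall(r"https?://[^\"')\s]*(?:googleapis|bootstrapcdn)\.com[^\"')\s]*",
                 resp.text))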