Member since: 09-17-2015
Posts: 102
Kudos Received: 61
Solutions: 18
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 677 | 06-15-2017 11:58 AM
 | 419 | 06-15-2017 09:18 AM
 | 514 | 06-09-2017 10:45 AM
 | 412 | 06-07-2017 03:52 PM
 | 633 | 01-06-2017 09:41 PM
09-22-2020
02:00 AM
2 Kudos
In Cloudera Machine Learning (or CDSW for the on-premises version), projects are backed by git. You might want to use GitHub for your projects, so here is a simple way to do that.
First things first: there are basically two ways of interacting with git/GitHub, HTTPS or SSH. We'll use the latter to make authentication easy. You might also consider SSO or 2FA to enhance security; here we'll focus on the basics.
To make this authentication happen under the hood, copy your SSH key from CML to GitHub.
Find your SSH key in the Settings of CML:
Copy that key and add it in GitHub, under SSH and GPG keys in your github.com settings: Add SSH key.
Put cdsw in the Title and paste your key content in the Key field:
Let's start by creating a new project on github.com:
The important thing here is the access mode we want to use: SSH.
In CML, start a new project with a template:
Open a Terminal window in a new session:
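Optionally, verify first that GitHub accepts your key (a quick sanity check; GitHub should reply with a short greeting and close the connection): cdsw@qp7h1qllrh9dx1hd:~$ ssh -T git@github.com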
Convert the project to a git project: cdsw@qp7h1qllrh9dx1hd:~$ git init
Initialized empty Git repository in /home/cdsw/.git/
Add all files to git: cdsw@qp7h1qllrh9dx1hd:~$ git add .
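If git doesn't already know your identity in this session, set it before committing (substitute your own name and email): cdsw@qp7h1qllrh9dx1hd:~$ git config --global user.name "Your Name"
cdsw@qp7h1qllrh9dx1hd:~$ git config --global user.email "you@example.com"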
Commit the project: cdsw@qp7h1qllrh9dx1hd:~$ git commit -m "initial commit"
[master (root-commit) 5d75525] initial commit
47 files changed, 14086 insertions(+)
create mode 100755 .gitignore
create mode 100644 LICENSE.txt
create mode 100755 als.py
[...]
Add a remote named origin with the URL of the remote repository your local repository will be pushed to: cdsw@qp7h1qllrh9dx1hd:~$ git remote add origin git@github.com:laurentedel/MyProject.git
Rename the current branch to master: cdsw@qp7h1qllrh9dx1hd:~$ git branch -M master
Finally, push the changes (all the files from the first commit) to master on github.com: cdsw@qp7h1qllrh9dx1hd:~$ git push -u origin master
The authenticity of host 'github.com (140.82.113.4)' can't be established.
RSA key fingerprint is SHA256:nThbg6kXUpJWGl7E1IGOCspRomTxdCARLviKw6E5SY8.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'github.com,140.82.113.4' (RSA) to the list of known hosts.
Counting objects: 56, done.
Delta compression using up to 16 threads.
Compressing objects: 100% (46/46), done.
Writing objects: 100% (56/56), 319.86 KiB | 857.00 KiB/s, done.
Total 56 (delta 1), reused 0 (delta 0)
remote: Resolving deltas: 100% (1/1), done.
To github.com:laurentedel/MyProject.git
* [new branch] master -> master
Branch 'master' set up to track remote branch 'master' from 'origin'.
There you go!
From now on we can use the usual git commands. Modify a file: cdsw@qp7h1qllrh9dx1hd:~$ echo "# MyProject" >> README.md
What's our status? cdsw@qp7h1qllrh9dx1hd:~$ git status
On branch master
Your branch is up to date with 'origin/master'.
Untracked files:
(use "git add <file>..." to include in what will be committed)
README.md
nothing added to commit but untracked files present (use "git add" to track)
Commit/push: cdsw@qp7h1qllrh9dx1hd:~$ git add README.md
cdsw@qp7h1qllrh9dx1hd:~$ git commit -m "adding a README"
[master 7008e88] adding a README
1 file changed, 1 insertion(+)
create mode 100644 README.md
cdsw@qp7h1qllrh9dx1hd:~$ git push -u origin master
Warning: Permanently added the RSA host key for IP address '140.82.114.4' to the list of known hosts.
Counting objects: 3, done.
Delta compression using up to 16 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 290 bytes | 18.00 KiB/s, done.
Total 3 (delta 1), reused 0 (delta 0)
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
To github.com:laurentedel/MyProject.git
5d75525..7008e88 master -> master
Branch 'master' set up to track remote branch 'master' from 'origin'.
Happy commits!
05-10-2019
12:52 PM
Hi @Alireza, I strongly suggest using a separate ZooKeeper cluster for HDF, as it would be heavily loaded; sharing a ZK cluster between HDP and HDF would definitely be a pain point and a bottleneck.
03-13-2019
12:38 PM
The property dfs.client.retry.policy.enabled should be set to false in a cluster with HA enabled. The reason is that in a NameNode High Availability (NN HA) setup, when one of the NameNodes goes down (NN process stopped), attempts to use HDFS can result in repeating errors and apparent hangs, and running or new jobs that depend on HDFS access will also fail because clients keep talking to the failed NN.
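One way to double-check the value the client configuration on a given node actually carries is to query it with hdfs getconf: $ hdfs getconf -confKey dfs.client.retry.policy.enabled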
12-03-2018
04:28 PM
This article was written against HDP 2.5.3; you may need to adjust some parameters to match your actual version. We'll set the Kafka log level through the Logging MBean with jConsole. The first step is to enable JMX access: in the Kafka configs, add to the kafka-env template export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false"
export JMX_PORT="9999"
To avoid JMX port conflicts like the ones described in https://community.hortonworks.com/articles/73750/kafka-jmx-tool-is-failing-with-port-already-in-use.html, let's modify /usr/hdp/current/kafka/bin/kafka-run-class.sh on all broker nodes. Replace
# JMX port to use
if [ $JMX_PORT ]; then
  KAFKA_JMX_OPTS="$KAFKA_JMX_OPTS -Dcom.sun.management.jmxremote.port=$JMX_PORT"
fi
with
# JMX port to use
if [ $ISKAFKASERVER = "true" ]; then
  JMX_REMOTE_PORT=$JMX_PORT
else
  JMX_REMOTE_PORT=$CLIENT_JMX_PORT
fi
if [ $JMX_REMOTE_PORT ]; then
  KAFKA_JMX_OPTS="$KAFKA_JMX_OPTS -Dcom.sun.management.jmxremote.port=$JMX_REMOTE_PORT"
fi
After the brokers have been restarted, let's change the log level with jConsole: $ jconsole <BROKER_FQDN>:<JMX_PORT>
This launches a jConsole window asking whether to retry insecurely; go ahead with that. Go to the MBeans tab, then kafka/kafka.log4jController/Attributes, and double-click on the Value of Loggers to list all Log4j loggers. You can see that the kafka logger (among those listed) is set to INFO. We can check it using the getLogLevel operation, entering kafka as the loggerName. Fortunately, you can also change the value without a restart using the setLogLevel operation, passing DEBUG or TRACE for example.
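As an aside, if you want the change to persist across broker restarts rather than applying it live over JMX, the same kafka logger discussed above can be raised in the kafka-log4j template instead (a sketch assuming the default logger name; this does require a broker restart):
log4j.logger.kafka=DEBUG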
10-17-2017
10:09 AM
2 Kudos
When starting spark-shell, it tries to bind to port 4040 for the Spark UI. If that port is already taken by another active spark-shell session, it then tries to bind to 4041, then 4042, etc. Each time the binding doesn't succeed, a huge WARN stack trace is printed, which can be filtered out: [user@serv hive]$ SPARK_MAJOR_VERSION=2 spark-shell
SPARK_MAJOR_VERSION is set to 2, using Spark2
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/09/20 11:49:43 WARN AbstractLifeCycle: FAILED ServerConnector@2d258eff{HTTP/1.1}
{0.0.0.0:4040}: java.net.BindException: Address already in use
java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.spark_project.jetty.server.ServerConnector.open(ServerConnector.java:321)
at org.spark_project.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
at org.spark_project.jetty.server.ServerConnector.doStart(ServerConnector.java:236)
at org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$newConnector$1(JettyUtils.scala:333)
at org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$httpConnect$1(JettyUtils.scala:365)
at org.apache.spark.ui.JettyUtils$$anonfun$7.apply(JettyUtils.scala:368)
at org.apache.spark.ui.JettyUtils
[...]
To filter that stack trace, let's set that class's log4j verbosity to ERROR in /usr/hdp/current/spark2-client/conf/log4j.properties: # Added to avoid stack traces when binding the Spark UI
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
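Alternatively, if you'd rather avoid the port scan altogether, you can point a session at a free port explicitly (spark.ui.port is a standard Spark setting; the port number below is just an example): [user@serv hive]$ SPARK_MAJOR_VERSION=2 spark-shell --conf spark.ui.port=4050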
06-20-2017
01:28 PM
1 Kudo
I adapted two of the queries for Postgres (and added a vacuum): CREATE TEMPORARY TABLE tmp_request_id AS SELECT MAX(request_id) AS request_id FROM request WHERE create_time <= (SELECT (EXTRACT(epoch FROM NOW()) - 2678400) * 1000 as epoch_1_month_ago_times_1000);
CREATE TEMPORARY TABLE tmp_task_id AS SELECT MAX(task_id) AS task_id FROM host_role_command WHERE request_id <= (SELECT request_id FROM tmp_request_id);
CREATE TEMPORARY TABLE tmp_upgrade_ids AS SELECT upgrade_id FROM upgrade WHERE request_id <= (SELECT request_id FROM tmp_request_id);
DELETE FROM execution_command WHERE task_id <= (SELECT task_id FROM tmp_task_id);
DELETE FROM host_role_command WHERE task_id <= (SELECT task_id FROM tmp_task_id);
DELETE FROM role_success_criteria WHERE request_id <= (SELECT request_id FROM tmp_request_id);
DELETE FROM stage WHERE request_id <= (SELECT request_id FROM tmp_request_id);
DELETE FROM topology_logical_task;
DELETE FROM requestresourcefilter WHERE request_id <= (SELECT request_id FROM tmp_request_id);
DELETE FROM requestoperationlevel WHERE request_id <= (SELECT request_id FROM tmp_request_id);
DELETE FROM upgrade_item WHERE upgrade_group_id IN (SELECT upgrade_group_id FROM upgrade_group WHERE upgrade_id IN (SELECT upgrade_id FROM tmp_upgrade_ids));
DELETE FROM upgrade_group WHERE upgrade_id IN (SELECT upgrade_id FROM tmp_upgrade_ids);
DELETE FROM upgrade WHERE request_id <= (SELECT request_id FROM tmp_request_id);
DELETE FROM request WHERE request_id <= (SELECT request_id FROM tmp_request_id);
DELETE FROM topology_host_task;
DELETE FROM topology_host_request;
DELETE FROM topology_logical_request;
DELETE FROM topology_host_info;
DELETE FROM topology_hostgroup;
DELETE FROM topology_request;
DROP TABLE tmp_upgrade_ids;
DROP TABLE tmp_task_id;
DROP TABLE tmp_request_id;
VACUUM FULL VERBOSE ANALYZE;
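Before running a purge like this, it's worth taking a backup of the Ambari database first (a sketch assuming the default ambari database and user names of the embedded Postgres; adjust to your setup): pg_dump -U ambari ambari > /tmp/ambari_backup.sql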
06-15-2017
11:58 AM
1 Kudo
@Guillaume Roger you can set doAs to true, but it's advised not to, for several reasons. With doAs set to false, HS2 is able to share resources, and you're encouraged to use Ranger instead of file-based permissions (SQL permissions are available and are translated to Ranger rules when activated); it's more secure and you can even set permissions at the column level.
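For reference, the setting in question is the following hive-site property (managed through Ambari): hive.server2.enable.doAs=false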
06-15-2017
09:18 AM
@Guillaume Roger a partition is viewed as a new column in the table definition, hence you can't partition by an already existing field. As a side note, a primary key doesn't work in Hive as it does in a standard RDBMS; it's only there for compliance (i.e. you can't deduplicate rows just by adding a PK).
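To illustrate the first point with a hypothetical table, the partition column is declared in the PARTITIONED BY clause and not in the regular column list: CREATE TABLE sales (id INT, amount DOUBLE) PARTITIONED BY (sale_date STRING);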
06-14-2017
03:50 PM
Check the Red Hat article here: https://access.redhat.com/solutions/46111
06-13-2017
03:41 PM
You can launch Hive with console logging enabled to get more info: hive --hiveconf hive.root.logger=INFO,console
06-13-2017
03:39 PM
Did you change your fs.defaultFS in the HDFS configuration, by any chance?
06-13-2017
03:09 PM
@Nikkie Thomas you can specify the number of reducers for a query: hive> set mapreduce.job.reduces=1;
06-12-2017
12:55 PM
Ken, you can use the Java action; unfortunately the Sqoop 1.x API is very limited (Sqoop.runTool() is about all there is), but in my experience this works perfectly.
06-12-2017
11:37 AM
Hi @srinivas s, if you're using rack awareness you should be able to get rid of 50 datanodes by decommissioning them without losing any blocks; otherwise you probably will lose some. Rebalance time depends on your network and cluster utilization; you can adjust some parameters to make it faster if necessary, basically hdfs dfsadmin -setBalancerBandwidth <bandwidth (bytes/s)> or within your HDFS params (example): dfs.balance.bandwidthPerSec=100000000
dfs.datanode.max.transfer.threads=16384
dfs.datanode.balance.max.concurrent.moves=500
Please click Accept if you're satisfied with my answer.
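Once decommissioning is done, you can kick off the rebalance itself with the balancer (the threshold here is just an example, meaning each DataNode should end up within 10% of the cluster average utilization): hdfs balancer -threshold 10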
06-12-2017
10:32 AM
1 Kudo
Hi, Sqoop2 isn't supported on HDP; you should use the bundled Sqoop 1.4.
06-12-2017
08:50 AM
no it won't, it's acting like a regular HDFS client (well, it is)
06-12-2017
07:31 AM
Hello Shashi, NiFi writes to its 3 repositories (flow file, content, and provenance) on local nodes only; you can certainly export provenance to Atlas, for example (there's some work around that), but it's not built in.
06-09-2017
11:05 AM
You may be experiencing some network or HDFS issues; could you run an HDFS report to make sure everything's clean on that side? $ hdfs fsck /
06-09-2017
10:50 AM
1 Kudo
Hi, if your labels are tagged as non-exclusive then yes, your queues will be able to access those resources. The documentation is pretty clear about labeling: take a look at https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_yarn-resource-management/content/configuring_node_labels.html
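For the record, exclusivity is chosen when the label is created; a sketch with a hypothetical label name: yarn rmadmin -addToClusterNodeLabels "gpu(exclusive=false)"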
06-09-2017
10:45 AM
1 Kudo
Hi, you have a "Hortonworks Sandbox Archive" just below "Hortonworks Sandbox in the Cloud"
06-07-2017
03:52 PM
1 Kudo
job.properties only contains properties to be propagated to your workflows, so the short answer is yes, you can have a single job.properties for your 20 workflows. Bundles and coordinators require other parameters (like start/end dates), so it's better to have a specific job.properties for them.
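For example, a minimal job.properties shared by several workflows could look like the following sketch (hostnames are placeholders to adapt; the workflow path itself can still be passed per job):
nameNode=hdfs://<nameservice>
jobTracker=<resourcemanager_host>:8050
queueName=default
oozie.use.system.libpath=true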
01-18-2017
02:17 PM
On RHEL/CentOS you might encounter an exception when trying to stop or restart Oozie: resource_management.core.exceptions.Fail: Execution of 'cd /var/tmp/oozie && /usr/hdp/current/oozie-server/bin/oozie-stop.sh' returned 1. -bash: line 0: cd: /var/tmp/oozie: No such file or directory
This is most likely caused by the /etc/cron.daily/tmpwatch cron job, which deletes files and directories that have been unmodified for more than 30 days: [root@local ~]# cat /etc/cron.daily/tmpwatch
#! /bin/sh
flags=-umc
/usr/sbin/tmpwatch "$flags" -x /tmp/.X11-unix -x /tmp/.XIM-unix \
-x /tmp/.font-unix -x /tmp/.ICE-unix -x /tmp/.Test-unix \
-X '/tmp/hsperfdata_*' 10d /tmp
/usr/sbin/tmpwatch "$flags" 30d /var/tmp
for d in /var/{cache/man,catman}/{cat?,X11R6/cat?,local/cat?}; do
if [ -d "$d" ]; then
/usr/sbin/tmpwatch "$flags" -f 30d "$d"
fi
done
Just recreate the directory and you're good to go: [root@local ~]# mkdir /var/tmp/oozie
[root@local ~]# chown oozie:hadoop /var/tmp/oozie
[root@local ~]# chmod 755 /var/tmp/oozie
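To keep tmpwatch from removing the directory again, you could also exclude it in /etc/cron.daily/tmpwatch (an untested sketch using tmpwatch's -x flag, the same mechanism the script already uses for the /tmp exclusions): /usr/sbin/tmpwatch "$flags" -x /var/tmp/oozie 30d /var/tmp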
01-06-2017
10:27 PM
You should consider running Hadoop Streaming with your Python mapper and reducer. Take a look at https://oozie.apache.org/docs/4.2.0/WorkflowFunctionalSpec.html#a3.2.2.3_Streaming for an example of such a workflow. First try to execute your streaming job directly, with something like yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /user/theuser/input.csv -output /user/theuser/out
Then it'll be easier to schedule it with Oozie; worst case scenario, you can do a shell action with that command. Please accept the answer if it answered your question.
01-06-2017
09:41 PM
@justlearning Oozie can't do MapReduce by itself; it's a Hadoop scheduler which launches workflows composed of jobs, which can be MapReduce. Here you want to run a job defined by workflow.xml with parameters in job.properties, so the syntax is oozie job --oozie http://sandbox.hortonworks.com:11000/oozie -config job.properties -run
11-25-2016
08:00 AM
OK, you're talking about preparing the environment. That's probably kept brief simply to not overload the docs, since the repositories are filled with Ubuntu packages (same with SuSE, for example) and the commands are very basic (essentially replacing yum with apt). Hortonworks delivering Ubuntu packages means they explicitly support it (from my field perspective, the large majority of deployments are CentOS/RHEL/SuSE), but they could certainly do better in the documentation.
11-24-2016
05:52 PM
1 Kudo
@rama did you try increasing your mappers' memory? And is your query failing with hive.execution.engine=mr as well?
11-24-2016
05:50 PM
"Checksum failed" can be caused by certain Java versions. Is a manual kinit OK with your keytab? You may also want to check whether your keytabs have expired.
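A quick way to test, with a hypothetical keytab path and principal to adapt: kinit -kt /etc/security/keytabs/myservice.keytab myservice/host.example.com@EXAMPLE.COM
You can also list the keytab entries and their timestamps with: klist -kt /etc/security/keytabs/myservice.keytab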
11-24-2016
05:43 PM
@Anirudh K The docs mention Ubuntu explicitly; take a look at the Ambari installation guide: http://docs.hortonworks.com/HDPDocuments/Ambari-2.4.1.0/bk_ambari-installation/content/download_the_ambari_repo.html Even for a manual install, are you talking about something other than Ambari or HDP?
11-03-2016
10:28 PM
More specifically, it's set in your job.properties: namenode=hdfs://<nameservice>
10-23-2016
10:39 AM
@Pierre Villard I got it working with -D mapred.job.name=mySqoopTest