Member since 09-23-2015 · 12 Posts · 1 Kudos Received · 0 Solutions
02-09-2021
09:32 PM
1 Kudo
Cloudera Machine Learning provides a number of ways to connect to other CDP services and experiences, such as Cloudera Data Warehouse. In this post, we will connect using Python and the Impyla library, as well as through the embedded Cloudera Data Visualization.
Using Impyla
Within Cloudera Machine Learning, create a new project and set the language to Python 3.6. The connection details are available from the Data Warehouse console by copying the JDBC connection details, which will look like the following:
jdbc:impala://coordinator-aws-2-impala-prod.env-j2ln9x.dw.ylcu-atmi.cloudera.site:443/default;AuthMech=3;transportMode=http;httpPath=cliservice;ssl=1;UID=<workload username>;PWD=<workload password>
Use the following Python code to install Impyla and configure a connection:

!pip3 install impyla==0.16a3

import os
from impala.dbapi import connect

USERNAME='<workload username>'
IMPALA_HOST='coordinator-aws-2-impala-prod.env-j2ln9x.dw.ylcu-atmi.cloudera.site'
IMPALA_PORT=443

conn = connect(host=IMPALA_HOST,
               port=IMPALA_PORT,
               auth_mechanism='LDAP',
               user=USERNAME,
               password=os.environ['PASS'],
               use_http_transport=True,
               http_path='/cliservice',
               use_ssl=True)

cursor = conn.cursor()
cursor.execute('show databases')
for row in cursor:
    print(row)
Note: The PASS variable is an environment variable set in the project settings under the Advanced tab. This does not protect your password, but it does reduce the risk of it being copied into a version control service.
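If you want to work with the query results as a DataFrame, Impyla ships a small helper for this. The following is a minimal sketch, assuming the connection above succeeded and that pandas is installed in the session; the table name is a placeholder.

from impala.util import as_pandas

# run a query and load the result set into a pandas DataFrame
cursor.execute('SELECT * FROM default.sample_table LIMIT 100')  # sample_table is a placeholder
df = as_pandas(cursor)
print(df.head())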
Using Visual Applications
Create a Cloudera Data Visualization App by following the instructions at Accessing Data Visualization in CML.
Log out as your default user and log back into Cloudera Data Visualization using the local admin user account. Note: You can raise a support request if you don't have access to this.
Add a new connection under Basic settings using the following parameters.
Connection Name: Name your Connection
Hostname or IP Address: Use the hostname from the JDBC string
Port #: Use the SSL port of 443
Username: CDP Workload Username
Password: CDP Workload Password
Under Advanced Settings, set the following parameters.
Connection Type: HTTP
HTTP path: /cliservice
Socket Type: SSL
Test the connection.
02-09-2021
01:58 AM
Cloudera Machine Learning provides support for Python 3, and it is very straightforward to connect a session with an operational database.
Provision an Operational Database
Log into a CDP instance
Select Operational Database
Select Create Database
Choose the Cloud environment
Provide a unique name for the database
Click Create Database
Once the database has started, make a copy of the Phoenix (Thin) JDBC URL. This will be used as the connection string.
Create a Machine Learning Project
Within your Cloudera Machine Learning (CML) workspace, create a new project.
Provide a name, and choose a blank initial setup. Create a session, and install phoenixdb using the following command:

!pip3 install phoenixdb

Create a new Python file and paste the following code into the notebook. Import the required dependencies:

import phoenixdb
import io
import json
Set up the parameters required to establish the connection with ODB, referring to the Thin client details:
opts = {}
opts['authentication'] = 'BASIC'
opts['serialization'] = 'PROTOBUF'
opts['avatica_user'] = 'xxxxxxxx'
opts['avatica_password'] = 'xxxxxxxx'
database_url = 'https://<the jdbc url copied from the ODB console>/'
TABLENAME = "us_population"
conn = phoenixdb.connect(database_url, autocommit=True,**opts)
For the URL, remove everything before https:// and remove the parameters at the end, while retaining any path details.
Example:
https://<server>/<instance name>/cdp-proxy-api/avatica/
Create the table into which to insert the data:

curs = conn.cursor()
query = """
CREATE TABLE IF NOT EXISTS """+TABLENAME+""" (
    state CHAR(2) NOT NULL,
    city VARCHAR NOT NULL,
    population BIGINT
    CONSTRAINT my_pk PRIMARY KEY (state, city))
"""
curs.execute(query)
Bulk insert a set of data, using a nested list for each record, and execute multiple upserts:

sql = "upsert into " + TABLENAME + \
      " (state, city, population) values (?,?,?)"
data = [['NY','New York',8143197],
        ['CA','Los Angeles',3844829],
        ['IL','Chicago',2842518],
        ['TX','Houston',2016582],
        ['PA','Philadelphia',1463281],
        ['AZ','Phoenix',1461575],
        ['TX','San Antonio',1256509],
        ['CA','San Diego',1255540],
        ['TX','Dallas',1213825],
        ['CA','San Jose',912332]]
results = curs.executemany(sql, data)
Finally, run an aggregated group-by query and return the results as dictionary objects:

curs = conn.cursor(cursor_factory=phoenixdb.cursor.DictCursor)
query = """SELECT state as "State", count(city) as "City Count", sum(population) as "Population Sum"
FROM us_population
GROUP BY state
ORDER BY sum(population) DESC"""
curs.execute(query)
print(curs.fetchall())
When the above is run in a session, it will return the following results:
[{'State': 'NY', 'City Count': 1, 'Population Sum': 8143197}, {'State': 'CA', 'City Count': 3, 'Population Sum': 6012701}, {'State': 'TX', 'City Count': 3, 'Population Sum': 4486916}, {'State': 'IL', 'City Count': 1, 'Population Sum': 2842518}, {'State': 'PA', 'City Count': 1, 'Population Sum': 1463281}, {'State': 'AZ', 'City Count': 1, 'Population Sum': 1461575}]
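The same upsert statement can also be used to update a single record at a time, and the connection should be closed once you are done. A minimal sketch (the figures are illustrative only):

curs = conn.cursor()
curs.execute("upsert into " + TABLENAME + " (state, city, population) values (?,?,?)",
             ['TX', 'Austin', 961855])  # illustrative values only
curs.close()
conn.close()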
This example is based on the post: Phoenix in 15 minutes or less
01-18-2021
09:29 PM
2 Kudos
For a recent project, I was tasked with configuring DBeaver to connect to Phoenix running in an instance of the CDP Datahub. CDP provides a means of creating a Datahub for running an Operational Database (HBase) and querying it over JDBC via Phoenix. Let's start a Datahub from an Operational Database template.

Provision an Operational Database
Log into a CDP instance
Select Data Hub Clusters
Select Create Data Hub
Choose the Cloud environment
Choose the template 7.2.2 - Operational Database with SQL
Provide a unique name for the cluster
Click Provision Cluster

This will start a Datahub cluster running HBase and Phoenix, as well as all of the security dependencies provided by SDX, for example Knox, which will be important when connecting to our instance.

Once the cluster has started, we need to collect some configuration details. This is best done on the Datahub information page and in the Cloudera Manager console. We will use the Phoenix thin driver, which requires a JDBC string of the form:
jdbc:phoenix:thin:url=https://<knox endpoint>:443/<cluster name>/cdp-proxy-api/avatica/;serialization=PROTOBUF;authentication=BASIC;avatica_user=<workload username>;avatica_password=<workload password>

Once the cluster has started, select Endpoints and make a note of the Phoenix Query Server URI. It should look like this:
https://<server>/opdbtest/cdp-proxy-api/avatica/
The path details here are important, as they provide the proxy and the cluster name format that we need.

The next piece of information we need is the Knox server endpoint. This can be found in the Cloudera Manager console, under Knox > Instances. This will replace the <server> part above.

The final component we need is the JAR file containing the Phoenix Thin Client, which can be sourced from the Cloudera repository at https://repository.cloudera.com/. Search for phoenix-queryserver-client and download the latest release.

Configuring DBeaver
To install DBeaver, you can download a version from https://dbeaver.io/. In this example, we are using the OSX version; configuration fields and terms may vary by installation type.
Create a new Apache Phoenix connection to provide a baseline
For the host, use the machine that Knox is running on
For the port, use 443 (the default https port)
Provide your workload username and password
Edit the driver configuration and set:
Class name: org.apache.phoenix.queryserver.client.Driver
URL Template: jdbc:phoenix:thin:url={host}[:{port}]/opdbtest/cdp-proxy-api/avatica/;serialization=PROTOBUF;authentication=BASIC;avatica_user={user};avatica_password={password}
Add the driver JAR using Add File and select the JAR downloaded from the Cloudera repository

Note: Don't use the driver class search, as it may automatically discover an invalid driver class. The Class name configuration will override this. You may need to restart DBeaver if the class is set incorrectly.

Close the configuration and test the connection. The URL uses Knox so that access control can be managed centrally. Knox takes the https:// messages and proxies them through to the backend Phoenix services automatically.

A note on the Operational Database Experience
Shortly we will be providing an Operational Database Experience. We have significantly streamlined the provisioning of the Datahubs and publish a lot more metadata to help with configuring external clients. For example, the Maven links to the correct clients are provided directly, and examples of the JDBC links are presented right in the user interface.
All of these improvements have been made to make provisioning new instances easy, and to make connecting to those instances from applications and tools very quick. Our objective is to help you integrate CDP with your applications quickly and efficiently. We welcome your feedback on areas of our platform and documentation that can be improved to help us with this goal.

Tips
If you receive 404 or 401 errors, check that you are connecting to Knox and that the full https:// URL is correct.
If you receive errors related to serialization, make sure you have serialization=PROTOBUF set.

Documentation references
Setting up connections with a CDP Datahub
Connect to PQS through Apache Knox
Connecting to Apache Phoenix Query Server using the JDBC client
Connect to PQS directly
Setting up connections with CDP Operational Database Experience
Cloudera Operational Database JDBC support
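If you want to sanity-check the endpoint and credentials outside of DBeaver, the same Phoenix Query Server URL can be exercised from Python with the phoenixdb package. This is a minimal sketch only; the hostname, cluster name, and credentials are placeholders to substitute with your own values.

import phoenixdb

# the Phoenix Query Server URI noted from the Datahub Endpoints page
url = 'https://<knox endpoint>/opdbtest/cdp-proxy-api/avatica/'

opts = {
    'authentication': 'BASIC',
    'serialization': 'PROTOBUF',
    'avatica_user': '<workload username>',
    'avatica_password': '<workload password>',
}

conn = phoenixdb.connect(url, autocommit=True, **opts)
curs = conn.cursor()
curs.execute('SELECT TABLE_NAME FROM SYSTEM.CATALOG LIMIT 5')  # simple round trip to confirm connectivity
print(curs.fetchall())
conn.close()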
06-12-2020
06:00 AM
As noted, node labels are not supported in CDH 6.3.3: https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_633_unsupported_features.html#yarn_600_unsupported

This is in part because GPU support varies between the Capacity Scheduler and the Fair Scheduler, and CDH implements the Fair Scheduler. You may want to review the YARN configuration steps to enable GPUs, although this may not meet your needs: https://docs.cloudera.com/documentation/enterprise/6/properties/6.3/topics/cm_props_cdh630_yarn_mr2included_.html#concept_6.3.x_nodemanager_props__section_gpu_management_props

Currently we have a couple of techniques for using GPU resources.

CDSW can be deployed alongside CDH, and it uses containerisation to target workloads toward a GPU resource. This is specifically designed to run machine learning workloads that can benefit from a GPU, for example TensorFlow. It can be installed via Cloudera Manager directly alongside your cluster and used to target workloads to the GPU resources: https://docs.cloudera.com/documentation/data-science-workbench/1-7-x/topics/cdsw_gpu.html

The latest CML release on CDP Public Cloud has developed this further and enables Spark and other frameworks to run in containers and target GPU resources: https://docs.cloudera.com/machine-learning/cloud/gpu/topics/ml-gpu.html

You may also want to explore Spark 3, which has additional features for running Spark ML workloads on GPUs. It uses the rapids.ai plug-in to offload Spark processing onto the GPU: https://docs.cloudera.com/runtime/7.0.3/cds-3/topics/spark-install-spark-3-parcel.html
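As an illustration of the Spark 3 route, enabling the RAPIDS accelerator is largely a matter of setting the plugin configuration when the session is created. The sketch below is indicative only; it assumes the RAPIDS Accelerator and cuDF jars are already available on the cluster, and the resource amounts will differ in your environment.

from pyspark.sql import SparkSession

# assumes the rapids-4-spark and cudf jars are on the driver/executor classpath
spark = (SparkSession.builder
         .appName("gpu-example")
         .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
         .config("spark.rapids.sql.enabled", "true")
         .config("spark.executor.resource.gpu.amount", "1")
         .config("spark.task.resource.gpu.amount", "0.25")
         .getOrCreate())

# a trivial aggregation the plugin can execute on the GPU
spark.range(0, 1000000).selectExpr("sum(id)").show()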
06-12-2020
05:43 AM
Cloudera Navigator is included in our CDH release. It provides powerful search and audit features as well as data policy lifecycle management: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cn_iu_introduce_navigator.html

Apache Atlas is included in our HDP releases and, going forward, in our latest CDP release: https://docs.cloudera.com/runtime/7.1.0/concepts-governance.html

In CDP we have brought Navigator's powerful search features across into Apache Atlas, and Apache Atlas will be the Cloudera solution for data governance going forward. As part of CDP, Apache Atlas provides broader support for services including Kafka, Spark, and events from CML Machine Learning models. The metadata options are further extended to provide a hierarchical model and relationships between entities. Apache Atlas also has deep integration with the Cloudera SDX services, so that labels applied to data objects can be assigned attribute-based permissions in Ranger.

The easiest way to experience the latest features in Cloudera data governance is via CDP. If you would like more information or a demonstration, please reach out to your account team.
06-12-2020
05:28 AM
Can I check the logic here? Are you using HiveQL to create the table and to add the new columns, and is it the Spark read that is then giving inconsistent results? There are known issues in how Spark 2.2 handles Hive schemas, for example https://issues.apache.org/jira/browse/SPARK-21841. If you can share the example Spark code, that may help. It looks like being explicit in how Spark reads the Hive table may help in this case.
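As a rough illustration of being explicit on the read side, you can refresh any cached table metadata and select the columns by name rather than relying on a previously inferred schema. This is a sketch only; the database, table, and column names are placeholders.

# refresh cached metadata so Spark picks up the newly added columns
spark.catalog.refreshTable("mydb.mytable")

# select the columns explicitly rather than relying on a cached schema
df = spark.sql("SELECT existing_col, new_col FROM mydb.mytable")
df.printSchema()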
05-01-2020
04:50 AM
For a comprehensive list of the 7.0.3 CDP-DC release components, you can reference the documentation here:
https://docs.cloudera.com/runtime/7.0.3/release-notes/topics/rt-runtime-component-versions.html
When 7.1 is available, this page will be updated to include the release components.
03-24-2020
03:07 AM
2 Kudos
Apache Airflow is a popular environment for scheduling and monitoring workflows. Running workflows is useful in machine learning for automation of data acquisition or the building and monitoring of machine learning models.
This tutorial will take you through the process of setting up Airflow as an Application in CML. This can be very useful in building out testing and prototype pipelines. It also provides a mechanism for calling chains of Models or Jobs.
Create a new Python project in CML using a blank template: New Project
Open up a workbench session and install Airflow; this can be scripted using the install instructions here.
The following shell and Python scripts can be used to automate this process. The installation needs to be performed only once.

install.py:

import os
os.system("./install.sh")

install.sh:

#!/bin/bash -x
export AIRFLOW_HOME=~/airflow
# install from pypi using pip
pip3 install apache-airflow
# initialize the database
airflow initdb
This will install the Airflow components into your project and will persist with the lifecycle of the project.
Now that the software is installed, the next step is starting the process as a long-lived application. This requires a Python file and a shell script to start up the services.

start.py:

import os
os.system("./start.sh")

start.sh:

#!/bin/bash -x
PORT=${CDSW_APP_PORT:-8090}
export AIRFLOW_HOME=~/airflow
# start the scheduler in the background
airflow scheduler &
# start the web server on the application port (defaults to 8090);
# in CML the UI is reached via the Application URL rather than localhost
airflow webserver -p $PORT -hn 127.0.0.1
The PORT variable is set to the application port provided by the environment, and the server is started on localhost. These scripts can be tested in a workbench session, or they can be added to an Application with the following settings. Create Application
Validate that the application starts up cleanly by checking the Application's logs: Monitor Logs
Opening the application will load directly into the Airflow management UI. Access Airflow UX
Using the Airflow HTTP Operator, it is possible to call CML Jobs directly via the CML Jobs API, as sketched below.
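The following is a minimal sketch of such a DAG, assuming the Airflow 1.10.x installed above. The HTTP connection ID and the jobs endpoint path are placeholders; check the CML/CDSW Jobs API documentation for the exact endpoint and authentication details for your version (the API key is typically supplied via the Airflow connection's credentials).

from datetime import datetime
from airflow import DAG
from airflow.operators.http_operator import SimpleHttpOperator

# 'cml_http' is a placeholder Airflow connection pointing at your CML workspace URL
dag = DAG('trigger_cml_job',
          schedule_interval='@daily',
          start_date=datetime(2020, 3, 1),
          catchup=False)

start_job = SimpleHttpOperator(
    task_id='start_cml_job',
    http_conn_id='cml_http',
    endpoint='api/v1/projects/<user>/<project>/jobs/<job-id>/start',  # placeholder path
    method='POST',
    headers={'Content-Type': 'application/json'},
    dag=dag)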
This approach hosts Airflow within a CML container. The deployment is isolated to the project and will continue to run and use resources associated with the project. It provides a good prototyping environment for building out more complete end-to-end data engineering pipelines that combine scheduling and flow with Python, shell, Scala, and Spark processes.
We would be interested to hear from you if you apply this to a project.
02-07-2020
01:20 AM
Can we confirm a detail? The DSSD service relates to using EMC DSSD-based storage on the data nodes. Are you using DSSD mode in Cloudera Manager?
02-07-2020
01:03 AM
1 Kudo
We would suggest using the Jobs function. Jobs have an API, so they can be triggered externally to CDSW: https://docs.cloudera.com/documentation/data-science-workbench/1-6-x/topics/cdsw_rest_apis.html You can then wrap a shell or Python script in a Job to perform data file operations, for example using wget, s3 sync, etc. Would this meet your requirements? If not, can you expand on your requirements further, please?
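As an illustration, a job script can be as simple as the following Python sketch, which downloads a file into the project storage; the source URL and target path are placeholders.

import os
import requests

url = 'https://example.com/data/daily_export.csv'    # placeholder source
target = '/home/cdsw/data/daily_export.csv'          # project-local destination (placeholder)

os.makedirs(os.path.dirname(target), exist_ok=True)

resp = requests.get(url, timeout=60)
resp.raise_for_status()
with open(target, 'wb') as f:
    f.write(resp.content)

print('Downloaded %d bytes to %s' % (len(resp.content), target))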
09-23-2015
08:54 AM
I think (speculation) this is the reason the fault occurs: if Navigator is added to the QuickStart VM, it will update the tables, and this results in conflicts. For some reason, my Activity Monitor in Cloudera Manager became configured with the Navigator database (nav). This meant that when the Activity Monitor restarted, it was not able to configure its schema correctly. By creating a separate database (amon) and pointing the Activity Monitor at it, things appear to be resolved.
09-23-2015
08:52 AM
I managed to resolve this as follows. In a console:

su root
mysql -u root -p
show databases;
create database amon DEFAULT CHARACTER SET utf8;
grant all on amon.* TO 'amon'@'%' IDENTIFIED BY '{password of your choice}';

Then in Cloudera Manager, click Cloudera Manager Service (bottom) > Activity Monitor > Configuration. Update the "Activity Monitor Database Name" and the "Activity Monitor Database Username" / "Password" to match the configuration above (amon), and restart the Activity Monitor.