Member since 09-23-2015 · 19 Posts · 5 Kudos Received · 0 Solutions
10-17-2023
10:16 PM
Install
Install Rust by following the official Install Rust instructions.
The installer is interactive, so it needs to run in a terminal:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Hello World
Rust Hello World https://doc.rust-lang.org/book/ch01-02-hello-world.html
Reload your session and terminal so the PATH environment variable is updated.
Compile with rustc main.rs, then run the resulting binary (./main).
main.rs
…
Why Rust?
Original post from the Stack Overflow blog: What is Rust and why is it so popular?
Rust has been Stack Overflow’s most loved language for four years in a row, indicating that many of those who have had the opportunity to use Rust have fallen in love with it. However, the roughly 97% of survey respondents who haven’t used Rust may wonder, “What’s the deal with Rust?”
Rust’s static typing does its best to get out of the programmer’s way while encouraging long-term maintainability.
Rust gives you the choice of storing data on the stack or the heap and determines at compile time when memory is no longer needed and can be cleaned up.
The biggest benefit Rust can provide compared to these languages is the borrow checker.
The Rust experience is larger than a language specification and a compiler; many aspects of creating and maintaining production-quality software are treated as first-class citizens.
10-04-2023
02:08 PM
Possibly. Adding an additional CPU may have license cost implications. A node (such as your CM host) can have up to 16 CCUs. A CCU is the total number of CPU cores divided by 6, so if the cores reported across all installed CPUs in a node total 96 or fewer, no core-based variable cost would be incurred. Note: Generally, the total cores and memory across the whole cluster are taken as part of the calculation, which provides some flexibility when adding a CPU to a single server.
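To make the arithmetic concrete, here is a minimal sketch (an illustration only, not a pricing tool) based on the figures above: 1 CCU per 6 cores and a cap of 16 CCUs per node. Rounding up to whole CCUs is an assumption on my part; check the licensing documents for the authoritative rules.
# Illustration only: the CCU arithmetic described above (1 CCU = 6 cores, 16 CCUs per node).
# Rounding up to whole CCUs is an assumption.
import math
CORES_PER_CCU = 6
NODE_CCU_CAP = 16
def node_ccus(total_cores):
    return math.ceil(total_cores / CORES_PER_CCU)
for cores in (96, 104):
    ccus = node_ccus(cores)
    status = "within the node cap" if ccus <= NODE_CCU_CAP else "exceeds the node cap"
    print(cores, "cores ->", ccus, "CCUs,", status)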
03-28-2022
09:34 PM
1 Kudo
In August 2021, we released CDP Private Cloud Base 7.1.7 LTS. This article outlines what that means. For each software release, we provide a version, a General Availability (GA) date, and an End of Service (EoS) date; these are all published and described here. CDP Private Cloud Base 7.1.7 LTS has a GA date of August 2021 and an EoS date of August 2025, providing a 4-year support window. We then offer two paths.
Service Pack Train
The Service Pack train is for customers who value a stable platform, serve a lot of users or use cases, and want low-risk, low-cost maintenance and operations. Rather than taking the new feature release, customers can instead take regular Cumulative Hotfixes (CHFs)*. These hotfixes are released proactively to address break-fix issues and CVEs. At certain pre-planned release dates, the hotfixes are rolled up into a Service Pack (e.g. SP1)**.
* Cumulative Hotfix means each hotfix includes those that came before it, up to the initial release or prior service pack, if any.
** It is important to note that service packs need to be installed in order (for example, SP1 + SP2). Service pack releases will not change configuration, deprecate APIs, change schemas, or break backward compatibility.
Feature Release Train
The Feature Release train is for customers who want the latest features and capabilities available from Cloudera. There may be some additional overhead in operations and systems testing. From 7.1.7, you will have the option of upgrading to 7.1.8 when it is released. 7.1.8 will include new features, and these features may require application re-testing. However, it does mean you get the latest features and innovations in the platform sooner.
Moving between trains
If you would like to move between the trains, you can do so when a release is marked LTS. Both the Support and Feature trains can be upgraded via hotfixes. Generally, upgrading from the Support train to the Feature train will require an upgrade-compatible base and associated hotfixes to be in place. Detailed information on supported upgrade paths and versions is maintained online at CDP Private Cloud Base Upgrade and CDP In-Place Upgrade Paths.
The main objective is to offer flexibility in managing upgrades and reduce overall customer costs in operations and system testing, while reducing the number of platform variations we maintain going forward.
Update: CDP Base 7.1.7 Service Pack 1 was released on 31st March 2022. It is described as: "The LTS release provides a stable version of the platform to be supported for 4 years. There will be periodic cumulative hotfixes released to address any vulnerabilities or bug fixes to ensure this version remains secure, however, there won't be any major changes or new features that require significant testing. Customers looking to utilize the latest feature set and innovations that Cloudera releases on CDP should consider upgrading to our regular Feature Release cycle. Feature Releases are released more frequently and are typically supported for 18 months. Service Pack 1 contains all the hotfixes from previously released cumulative hotfixes as well as additional CVE, security updates, critical bug fixes, and minor certifications. This addresses 45 CVE, Security and Critical Bug Fixes."
03-27-2022
08:48 PM
I often get asked how Cloudera counts a node, core, or CCU, and I wanted to bring together some resources to help answer these questions.
The first resource to consider is our pricing and licensing FAQ / updates. Please review this, as it is regularly maintained as we receive inbound suggestions and provide clarifications: Pricing update.
We also provide a formal policy on metrics, which is captured here and represents the formal definitions of CCUs, nodes, and cores: Cloudera Licensed Metrics.
Cloudera Compute Unit (CCU)
The link above defines in some detail what variable compute and storage are and the use of the CCU term; I will not duplicate those details here. A CCU is simply a unit of measure, and I will defer to the updated document, as the unit-of-measure definition can change as pricing policies change.
Nodes
Generally, we determine the number of nodes based on the count returned in Cloudera Manager. Under Administration > License, the number of nodes is listed, and all nodes counted need to be covered by the license.
03-11-2022
03:58 AM
There are a few posts on this topic. The nearest to your need with a solution is https://community.cloudera.com/t5/Support-Questions/NiFi-JSON-Array-split/m-p/158842#M121233. The key is using the SplitJson processor; the updated doc reference is https://nifi.apache.org/docs/nifi-docs/components/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.9.0/org.apache.nifi.processors.standard.SplitJson/index.html. This will need to be used in combination with a flow of the form Reader > JsonPathReader > SplitJson > Writer.
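To illustrate what the split produces, here is a conceptual Python sketch only (not NiFi configuration; the sample array is hypothetical): splitting a flowfile that contains a top-level JSON array with a JsonPath expression such as $.* yields one flowfile per array element.
# Conceptual sketch: mimics SplitJson splitting a top-level JSON array into
# one flowfile per element. The sample content is hypothetical.
import json
flowfile_content = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'
elements = json.loads(flowfile_content)
splits = [json.dumps(e) for e in elements]
for s in splits:
    print(s)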
03-10-2022
06:41 PM
1 Kudo
This short example outlines how to configure Cloudera Machine Learning to integrate with the Data lake, Hive, and Iceberg tables. Apache Iceberg has a number of qualities that make it very suitable for Machine Learning. It maintains the metadata for the data layout alongside the data, which means there are no heavy network demands on the main catalog. The main catalog, in this case the Hive Metastore, maintains a lightweight set of references to the data on the Data lake.
Iceberg also supports snapshots and time-travel so that data can be versioned and queried at a specific point in time.
Enable parallel processing in Spark
Set up user credentials and default paths to the Data lake in spark-defaults.conf:
spark.executor.instances 2
spark.executor.memory 1g
spark.executor.cores 2
spark.hadoop.yarn.resourcemanager.principal christopherroyles
spark.yarn.access.hadoopFileSystems s3a://demo-aws-go02/
Set up a Spark session
Load the appropriate Iceberg JAR
Add the required SQL extensions
Configure the pluggable catalog
Define it as Hive
Set the paths for both the raw data on the Data lake and the table we will write to in Iceberg.
example.py:
#
## 1.1 Datalake to Data Warehouse
# Load the Raw data in CSV format from the Datalake
# into the Data Warehouse, apply a suitable schema and snapshot
#.config("spark.jars.packages","org.apache.iceberg:iceberg-spark3-runtime:0.12.1")
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("1.1 - Ingest") \
.config("spark.jars","/home/cdsw/libs/iceberg-spark3-runtime-0.9.1.1.13.317211.0-9.jar") \
.config("spark.sql.extensions","org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
.config("spark.sql.catalog.spark_catalog","org.apache.iceberg.spark.SparkSessionCatalog") \
.config("spark.sql.catalog.spark_catalog.type","hive") \
.getOrCreate()
dl_path = "s3a://demo-aws-go02/user/royles/flightdata.csv"
dw_db = "spark_catalog.bronzeDB"
dw_table = "flightdata"
Set up the database and make sure it is placed into the correct catalog to benefit from the Iceberg features.
# Use the Iceberg Catalog
spark.sql('SHOW CURRENT NAMESPACE').show()
spark.sql('SHOW DATABASES').show()
spark.sql("CREATE DATABASE IF NOT EXISTS "+dw_db)
spark.sql("USE "+dw_db)
spark.sql('SHOW TABLES').show()
Load some unstructured CSV data from the Data lake as files and infer a schema from the raw data.
Make sure the column names are compatible with the data warehouse SQL syntax.
"""Read in raw data without a schema"""
rawDF = spark.read \
.format("csv") \
.option("inferSchema","true") \
.option("header", "true") \
.option("delimiter",",") \
.option("quote", "\"") \
.option("escape", "\"") \
.load(dl_path)
rawDF.printSchema()
rawDF.show(2)
"""Need to normalise the schema"""
for name in rawDF.schema.names:
    rawDF = rawDF.withColumnRenamed(name, name.replace(' ', '_'))
Write the data out to an Iceberg table, overwriting anything already in the table. New data inserted will create new point-in-time versions.
"""Write the table out in iceberg+parquet format"""
rawDF.write \
.mode("overwrite") \
.format("iceberg") \
.saveAsTable(dw_table)
spark.sql('show tables').show()
spark.sql('SELECT * FROM %s LIMIT 10'%(dw_table)).show()
spark.read.format("iceberg").load('%s.%s.history'%(dw_db,dw_table)).show(20, False)
spark.read.format("iceberg").load('%s.%s.snapshots'%(dw_db,dw_table)).show(20, False)
spark.read.format("iceberg").load('%s.%s.files'%(dw_db,dw_table)).show(20, False)
spark.stop()
Some simple examples of selecting data at a point in time:
from datetime import datetime
# current date and time
now = datetime.now()
timestamp = datetime.timestamp(now)
print("timestamp =", timestamp)
# Timestamps can be tricky. Please make sure to round your timestamp as shown below.
# Query using a point in time
df = spark.read.option("as-of-timestamp", int(timestamp*1000)).format("iceberg").load('%s.%s'%(dw_db,dw_table))
df.show(100)
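Iceberg can also read a specific snapshot by ID rather than by timestamp. A minimal sketch, assuming the snapshot_id value is taken from the .snapshots metadata table queried earlier:
# Read a specific snapshot by ID (snapshot_id comes from the .snapshots metadata table above).
snapshots = spark.read.format("iceberg").load('%s.%s.snapshots'%(dw_db,dw_table))
snapshot_id = snapshots.select("snapshot_id").first()[0]
df = spark.read.option("snapshot-id", snapshot_id) \
    .format("iceberg") \
    .load('%s.%s'%(dw_db,dw_table))
df.show(10)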
This is a simple hello-world example to get you started using Iceberg table formats with Spark3. For further reading, refer to:
Introducing Apache Iceberg in Cloudera Data Platform
Apache Iceberg
Cloudera's Distribution of Spark3
05-18-2021
08:44 PM
1 Kudo
Cloudera Data Platform provides two complementary services for data processing: Cloudera Data Engineering (CDE) for preparing and organising data, which is then consumed by data scientists in Cloudera Machine Learning (CML). This article provides a high-level example of how to call CDE from CML in Python.
## Part 0: Imports
# Requests will be used to make REST requests to the CDE service
# JSON is used to manage payloads and responses
# REST calls are made over HTTP and use BasicAuth
# os is used to read the workload credentials from environment variables
# cde_endpoint is set to the high-level URI of the CDE cluster endpoint
import os
import requests
import json
from requests.auth import HTTPBasicAuth
cde_endpoint='cde-<cluster>.cloudera.site'
The first step is establishing the credentials required to call the API. This is done by calling the Knox proxy and requesting a token. The token is then passed using the "Authorization" header.
## Part 1: Connect to Knox service.
# Retrieve the JWT token and parse it for the 'access_token' part.
# Need to pass the workload username and password, which is set as a project ENV variable.
JWT = requests.get('https://service.'+cde_endpoint+'/gateway/authtkn/knoxtoken/api/v1/token', auth=HTTPBasicAuth(os.getenv('HADOOP_USER_NAME'), os.getenv('WLPASS')))
JWT_TOKEN = json.loads(JWT.content)['access_token']
auth_header={"Authorization": "Bearer %s" %JWT_TOKEN}
The second step of the process is creating a resource to upload the .py file into.
## Part 2: Create a resource.
# Resources provide a location to store files and dependencies.
# Files are typically staged at /app/mount
data={"name":"my-resource","type":"files"}
resource = requests.post('https://<virtual_prefix>.'+cde_endpoint+'/dex/api/v1/resources', json=data, headers=auth_header)
resource
We can then upload a file by opening a file stream locally in the CML project and passing the file over HTTP PUT into the named resource.
## Part 3: File upload to the named resource.
# Once the resource is created, it can host several files.
# In this example we are opening a simple Python file in the project folder.
files = {'file':open('/home/cdsw/helloworld.py', 'rb')}
upload = requests.put('https://<virtual_prefix>.'+cde_endpoint+'/dex/api/v1/resources/my-resource/helloworld.py', files=files, headers=auth_header)
upload
The job then needs to be defined; this sets the job type to "spark" and links it to the relevant resources.
## Part 4: Define the Job and dependencies.
# Create a job definition: give it a name, reference the resource we created, and load the file as part of the Spark process.
data={"name":"my-job","type":"spark","mounts":[{"resourceName":"my-resource"}],"spark":{"file":"helloworld.py"}}
job = requests.post('https://<virtual_prefix>.'+cde_endpoint+'/dex/api/v1/jobs', json=data, headers=auth_header)
job
Finally, we can run or schedule the job.
## Part 5: Run the named Job
run = requests.post('https://<virtual_prefix>.'+cde_endpoint+'/dex/api/v1/jobs/my-job/run', headers=auth_header)
run.content
To validate the job status, we can list the jobs running on the given cluster.
## Part 6: List all Jobs
joblist = requests.get('https://<virtual_prefix>.'+cde_endpoint+'/dex/api/v1/jobs', headers=auth_header)
joblist.content
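To make the job list easier to read, the JSON response can be decoded and printed. This is a minimal sketch only; the "jobs", "name", and "type" keys are assumptions about the payload shape, so it falls back to printing the raw payload if they are absent.
## Part 7 (optional): Pretty-print the job list.
# Minimal sketch: the "jobs", "name" and "type" keys are assumptions about the response payload.
payload = json.loads(joblist.content)
jobs = payload.get("jobs", []) if isinstance(payload, dict) else payload
for job_entry in jobs:
    if isinstance(job_entry, dict):
        print(job_entry.get("name"), job_entry.get("type"))
    else:
        print(job_entry)
if not jobs:
    print(payload)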
02-09-2021
09:32 PM
1 Kudo
Cloudera Machine Learning provides a number of methods of connecting to other CDP services and experiences such as a Cloudera Data Warehouse. In this post, we will connect using Python and the Impyla library, as well as using the embedded Cloudera Data Visualization.
Using Impyla
Within Cloudera Machine Learning, create a new project and set the language to Python 3.6. The connection details are available from the Data Warehouse console by copying the JDBC connection details, which will look like:
jdbc:impala://coordinator-aws-2-impala-prod.env-j2ln9x.dw.ylcu-atmi.cloudera.site:443/default;AuthMech=3;transportMode=http;httpPath=cliservice;ssl=1;UID=<workload username>;PWD=<workload password>
Use the following Python code to install Impyla and configure a connection:
!pip3 install impyla==0.16a3
USERNAME='<workload username>'
IMPALA_HOST='coordinator-aws-2-impala-prod.env-j2ln9x.dw.ylcu-atmi.cloudera.site'
IMPALA_PORT='443'
import os
from impala.dbapi import connect
conn = connect(host=IMPALA_HOST,
port=IMPALA_PORT,
auth_mechanism='LDAP',
user=USERNAME,
password=os.environ['PASS'],
use_http_transport=True,
http_path='/cliservice',
use_ssl=True)
cursor = conn.cursor()
cursor.execute('show databases')
for row in cursor:
    print(row)
Note: The PASS variable is an Environment variable set in the Project settings under the Advanced tab. This does not protect your password but will mitigate the risk of it being copied into a version control service.
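As an optional follow-on, impyla can hand query results to pandas via its as_pandas helper. A minimal sketch, assuming pandas is installed in the session (for example via !pip3 install pandas):
# Optional: pull query results into a pandas DataFrame.
# Assumes pandas is installed in the session.
from impala.util import as_pandas
cursor.execute('show databases')
df = as_pandas(cursor)
print(df.head())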
Using Visual Applications
Create a Cloudera Data Visualization App by following the instructions at Accessing Data Visualization in CML.
Log out as your default user and log back into Cloudera Data Visualization using the local admin user account. Note: You can raise a support request if you don't have access to this.
Add a new connection under Basic settings using the following parameters.
Connection Name: Name your Connection
Hostname or IP Address: Use the hostname from the JDBC string
Port #: Use the SSL port of 443
Username: CDP Workload Username
Password: CDP Workload Password
Under Advanced Settings, set the following parameters.
Connection Type: HTTP
HTTP path: /cliservice
Socket Type: SSL
Test the connection.
02-09-2021
01:58 AM
Cloudera Machine Learning provides support for Python3. It is very straightforward to connect a session with an operational database.
Provision an Operational Database
Log into a CDP instance
Select Operational Database
Select Create Database
Choose the Cloud environment
Provide a unique name for the database
Click Create Database
Once the database has started, make a copy of the Phoenix (Thin) JDBC URL. This will be used as the connection string.
Create a Machine Learning Project
Within your Cloudera Machine Learning (CML) workspace, create a new project.
Provide a name, and choose a blank initial setup. Create a session, and install phoenixdb using the command:
!pip3 install phoenixdb
Create a new Python file and paste the following code into the notebook. Import the required dependencies:
import phoenixdb
import io
import json
Set up the parameters required to establish the connection with ODB. Refer to the Thin client details.
opts = {}
opts['authentication'] = 'BASIC'
opts['serialization'] = 'PROTOBUF'
opts['avatica_user'] = 'xxxxxxxx'
opts['avatica_password'] = 'xxxxxxxx'
database_url = 'https://<the jdbc url copied from the ODB console>/'
TABLENAME = "us_population"
conn = phoenixdb.connect(database_url, autocommit=True,**opts)
For the URL, remove everything before the https and remove the parameters at the end, while retaining any path details.
Example:
https://<server>/<instance name>/cdp-proxy-api/avatica/
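As an illustration only, here is a small sketch of trimming a copied Phoenix (Thin) JDBC string down to the database_url format described above; the sample string is a placeholder, not a real endpoint.
# Illustration only: trim a copied JDBC string down to the plain https URL
# that phoenixdb expects. The sample string below is a placeholder.
jdbc_url = 'jdbc:phoenix:thin:url=https://<server>/<instance name>/cdp-proxy-api/avatica/;serialization=PROTOBUF;authentication=BASIC'
database_url = jdbc_url.split('url=', 1)[1].split(';', 1)[0]
print(database_url)  # https://<server>/<instance name>/cdp-proxy-api/avatica/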
Create the table into which to insert the data:
curs = conn.cursor()
query = """
CREATE TABLE IF NOT EXISTS """+TABLENAME+""" (
state CHAR(2) NOT NULL,
city VARCHAR NOT NULL,
population BIGINT
CONSTRAINT my_pk PRIMARY KEY (state, city))
"""
curs.execute(query)
Bulk insert a set of data, using nested arrays for each record and executing multiple upserts:
sql = "upsert into " + TABLENAME + \
" (state ,city, population) \
values (?,?,?)"
data =[['NY','New York',8143197],
['CA','Los Angeles',3844829],
['IL','Chicago',2842518],
['TX','Houston',2016582],
['PA','Philadelphia',1463281],
['AZ','Phoenix',1461575],
['TX','San Antonio',1256509],
['CA','San Diego',1255540],
['TX','Dallas',1213825],
['CA','San Jose',912332]]
results = curs.executemany(sql,data)
Finally, run a query that returns an aggregated group-by as Dictionary objects:
curs = conn.cursor(cursor_factory=phoenixdb.cursor.DictCursor)
query = """SELECT state as "State",count(city) as "City Count",sum(population) as "Population Sum"
FROM us_population
GROUP BY state
ORDER BY sum(population) DESC"""
curs.execute(query)
print(curs.fetchall())
When the above is run in a session, it will return the following results.
[{'State': 'NY', 'City Count': 1, 'Population Sum': 8143197}, {'State': 'CA', 'City Count': 3, 'Population Sum': 6012701}, {'State': 'TX', 'City Count': 3, 'Population Sum': 4486916}, {'State': 'IL', 'City Count': 1, 'Population Sum': 2842518}, {'State': 'PA', 'City Count': 1, 'Population Sum': 1463281}, {'State': 'AZ', 'City Count': 1, 'Population Sum': 1461575}]
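As a small follow-on, parameterised queries use the same ? placeholder style as the upsert above; this is a sketch using the existing connection.
# Sketch: parameterised select using the same '?' placeholders as the upsert.
curs = conn.cursor()
curs.execute("SELECT city, population FROM " + TABLENAME + " WHERE state = ?", ['CA'])
for row in curs.fetchall():
    print(row)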
This example is based on the post: Phoenix in 15 minutes or less
01-18-2021
09:29 PM
3 Kudos
For a recent project, I was tasked with configuring DBeaver to connect to Phoenix running in an instance of the CDP Datahub. CDP provides a means of creating a Datahub for running an Operational Database (HBase) and using JDBC via Phoenix to query it. Let's start a Datahub from an Operational Database template.
Provision an Operational Database
Log into a CDP instance
Select Data Hub Clusters
Select Create Data Hub
Choose the Cloud environment
Choose the template 7.2.2 - Operational Database with SQL
Provide a unique name for the cluster
Click Provision Cluster
This will start a Datahub cluster running HBase and Phoenix, as well as all of the security dependencies provided by SDX, for example Knox, which will be important when connecting to our instance. Once the cluster has started, we need to collect some configuration details. This is best done on the Datahub information page and in the Cloudera Manager Console.
We will use the Phoenix thin driver, and this requires a JDBC string of the form:
jdbc:phoenix:thin:url=https://<knox endpoint>:443/<cluster name>/cdp-proxy-api/avatica/;serialization=PROTOBUF;authentication=BASIC;avatica_user=<workload username>;avatica_password=<workload password>
Once the cluster has started, select Endpoints and make a note of the Phoenix Query Server URI. It should look like this:
https://<server>/opdbtest/cdp-proxy-api/avatica/
The path details here are important, as they provide the proxy and the cluster name format that we need.
The next piece of information we need is the Knox server endpoint. This can be found in the Cloudera Manager Console, under Knox > Instances. This will replace the <server> part above.
The final component we need is the JAR file containing the Phoenix Thin Client, which can be sourced from the Cloudera repository at https://repository.cloudera.com/. Search for phoenix-queryserver-client and download the latest release.
Configuring DBeaver
To install DBeaver, you can download a version from https://dbeaver.io/. In this example, we are using the OSX version; configuration fields and terms may vary by installation type.
Create a new Apache Phoenix connection to provide a baseline
For the host, use the machine that Knox is running on
For the port, use 443 (the default https port)
Provide your workload username and password
Edit the driver configuration and set:
Class name: org.apache.phoenix.queryserver.client.Driver
URL Template: jdbc:phoenix:thin:url={host}[:{port}]/opdbtest/cdp-proxy-api/avatica/;serialization=PROTOBUF;authentication=BASIC;avatica_user={user};avatica_password={password}
Add the driver JAR using Add File and select the JAR downloaded from the Cloudera repository
Note: Don't search for the class, as it may automatically discover an invalid driver class; the Class name configuration above will override this. You may need to restart DBeaver if the class is set incorrectly.
Close the configuration and test the connection.
The URL uses Knox so that access control can be managed centrally. Knox takes the https:// requests and proxies them through to the backend Phoenix services automatically.
A note on the Operational Database Experience
Shortly, we will be providing an Operational Database Experience. We have significantly streamlined the provisioning of the Datahubs and publish a lot more metadata to help with configuring external clients. For example, the Maven links to the correct clients are provided directly, and examples of the JDBC links are presented right in the user interface.
All of these improvements have been made to help make provisioning new instances easy, and to make connecting to those instances from applications and tools very quick. Our objective is to help you integrate CDP with your applications quickly and efficiently. We welcome your feedback on areas of our platform and documentation that can be improved to help us with this goal.
Tips
If you receive 404 or 401 errors, please check that you are connecting to Knox and that the full https:// URL is correct.
If you receive errors related to serialisation, make sure you have serialization=PROTOBUF set.
Documentation references
Setting up connections with a CDP Datahub:
Connect to PQS through Apache Knox
Connecting to Apache Phoenix Query Server using the JDBC client
Connect to PQS directly
Setting up connections with CDP Operational Database Experience:
Cloudera Operational Database JDBC support