About JosiahGoodson

JosiahGoodson · ‎08-31-2020

The Cloudera Operational Database (COD) experience is a managed dbPaaS solution. It can auto-scale based on the workload utilization of the cluster and will be adding the ability to auto-tune (better performance within the existing infrastructure footprint) and auto-heal (resolve operational problems automatically) later this year. COD is backed by the proven, scalable technology in Apache HBase and Apache Phoenix. There is an excellent blog post showing the basics of setting up a COD database and connecting to it using Python. COD offers options to connect using both HBase APIs and ANSI SQL through Apache Phoenix. This article shows how to connect to a COD database through Phoenix using Golang. All of the code for this article can be found in this Github repository. Getting started with COD is incredibly easy - no clusters to deploy and secure, simply log in and click “Create Database”: Select your CDP Environment and enter a name for the new database and click “Create Database”: Once the database is deployed and you will see the connection details for the database - for Golang and the calcite-avatica-go driver, we need the connection URL from the Phoenix (Thin) tab: Set your CDP workload password. From here, connecting to COD is just like any other Go sql.DB interface - provide a DSN including the database URL and credentials, and interact using ANSI SQL. Below we see a simple example of connecting, creating a table, inserting a row, and reading back that row. package main import ( "database/sql" "log" _ "github.com/apache/calcite-avatica-go" ) func main() { // Connections are defined by a DSN // The format is http://address:port[/schema][?parameter1=value&...parameterN=value] // For COD, BASIC authentication is used. // The workload username and password are passed as parameters avaticaUser and avaticaPassword // // For example: // COD URL: 'https://gateway.env.cloudera.site/cdp-proxy-api/avatica/' // Workload username: jgoodson // Workload password: Secret1! // Would result in this DSN: dsn := "https://gateway.env.cloudera.site/cdp-proxy-api/avatica/?&authentication=BASIC&avaticaUser=jgoodson&avaticaPassword=Secret1!" log.Println("Connecting...") db, err := sql.Open("avatica", dsn) if err != nil { log.Fatal("Connection: ", err) } defer db.Close() log.Println("Create table if not exists...") _, err = db.Exec("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, username VARCHAR)") if err != nil { log.Fatal("Create: ", err) } log.Println("Insert a row...") _, err = db.Exec("UPSERT INTO users VALUES (?, ?)", 1, "admin") if err != nil { log.Println("Insert: ", err) } log.Println("Reading and printing rows...") var ( id int username string ) rows, err := db.Query("SELECT id, username from users") if err != nil { log.Fatal("Query: ", err) } defer rows.Close() for rows.Next() { err := rows.Scan(&id, &username) if err != nil { log.Fatal(err) } log.Println(id, username) } } Have fun building your own applications in Golang to interact with Cloudera Operational Database!

JosiahGoodson · ‎09-06-2019

Why Share a Hive Metastore? Many organizations are moving workloads to the cloud to take advantage of the flexibility it offers. A key to this flexibility is a shared, persistent repository of metadata and services. Ephemeral compute clusters scale up and down and connect to this shared service layer. All clusters and metadata services share a unified storage layer. These capabilities are at the core of Cloudera’s next-generation product, the Cloudera Data Platform. Can we deploy this architecture today, with Hortonworks Data Platform 3? A key piece to this architecture is sharing a single Hive Metastore between all clusters. Hive is HDP’s SQL engine. The Hive metastore contains the metadata which allows services on each cluster to know where and how Hive tables are stored, and access those tables. Let’s look at our options. Standalone Hive Metastore Service A standalone Hive Metastore Service could be installed on a node outside of the HDP cluster. This configuration is not supported by Cloudera. To be supported, HMS must be installed on an Ambari-managed node within the HDP cluster. Shared Hive Metastore Service In this configuration, a single cluster is configured with a Hive Metastore Service. Any additional clusters are configured to use the HMS of the first cluster, rather than their own HMS. There are performance trade-offs. The load on the shared HMS in this configuration can reduce performance. Additionally, the fact that the HMS is not local to each cluster can lead to network and other latency. No Hive Metastore Service, Shared RDBMS Recall that the Hive Metastore Service sits on top of an RDBMS which contains the actual metadata. It is possible to configure all clusters to use their local metastore (configure Hive with hive.metastore.uris=<blank>) and share a common RDBMS. Bypassing the Hive Metastore service in this way gives significant performance gains, with some tradeoffs. First, all clusters connecting to the RDBMS must be fully trusted, as they will have un-restricted access to the metastore DB. Second, all clusters must be on the exact same version at all times. Many versions (and even patches) of HDP make changes to the metastore DB schema. Any cluster which connects with a different version can cause significant changes which will impact all other clusters. Ensuring this does not happen is usually done by the Hive Metastore Service, which is not present in this configuration. For further detail, please see this presentation by Yahoo! Japan in which they discuss the performance gains they saw using this architecture. Summary You can share a Hive Metastore between multiple clusters with HDP 3! There are management, performance, and security trade-offs that must be considered. Please contact Cloudera if you are interested in this type of deployment and would like to discuss further!

EricL · ‎09-01-2019

@VijayM , Couple of questions: 1. Are CDH6 and CDH5 managed by same Cloudera Manager, or you manage yourself? 2. From the setting you applied below: spark.yarn.jars=local:/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/jars/*,local:/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/hive/*,local:/app/bds/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/spark/lib/* It looks like you have both CDH6.2 and CDH5.16 running on the same host, is that right? Any reason you want to do so? As @JosiahGoodson mentioned, spark2 and spark1 are not compatible, you should either have Spark1, or Spark2 jars in the classpath, not both, otherwise they will conflict. Cheers Eric

JosiahGoodson · ‎08-29-2019

Hello @Teej The short answer is that FetchX (FetchFTP for example) is Nifi cluster friendly, while GetX processors are not. There is a common pattern ("List-Fetch") of using a single node to ListX then pass that List to all nodes in the cluster to do parallelized FetchX - the Fetch will be aware that there are multiple nodes and only Fetch each file once. If you have a NiFi cluster and you are using the GetSFTP processor, you would have to configure that processor to run on the primary node only so the other nodes in the cluster wouldn't try to pull the same files. You can read more about it here.

JosiahGoodson · ‎08-27-2019

Hello @kal It looks like some of the DataNodes are not in sync, first try restarting your DataNodes. If you continue to receive the errors, please check the following: 1. Please check /etc/hosts file, it should be in sync on all the DataNodes and NameNodes if you are not using DNS. 2. Please check if iptables is running on few DataNodes, apply a for loop and quickly check on all the DataNodes. 3. Please check if time is in sync on all the DataNodes. Time on NameNode and Datanodes should be in sync.

JosiahGoodson · ‎08-27-2019

Hello @prathamesh_h There is great documentation on rowkey design here: https://hbase.apache.org/book.html#rowkey.design At a high level, you want to ensure that your rowkeys are as evenly distributed as possible. If you have very few sites and many articles for each site, you may not see great performance. You can consider ways to break your articles into smaller buckets within each site, and including this in your rowkey.

Online	Offline
Last Visited	‎10-14-2020 05:10 PM

Member Since	‎03-12-2019 07:25 AM
Last Visited	‎10-14-2020 05:10 PM
Posts	11
Kudos received	3

Cloudera Community

Re: Fetch Vs Get vs List processors in NiFi

How to connect Go Applications to Cloudera Operati...

Shared Hive Metastore in HDP 3

Re: Connecting to remote spark cluster fails

Re: Fetch Vs Get vs List processors in NiFi

Re: DataXceiver error processing REQUEST_SHORT_CIR...

Re: Best Apache Hbase Rowkey Design For Inventory ...