Member since: 06-20-2019 · Posts: 8 · Kudos Received: 11 · Solutions: 0
03-25-2020 06:58 AM
Hi Andre,
For detailed disk performance checks you will find appropriate testing procedures here: https://docs.cloudera.com/documentation/other/reference-architecture/PDF/cloudera_ref_arch_stg_dev_accept_criteria.pdf
Further good benchmarks are the TPC-DS tests for Impala (https://github.com/cloudera/impala-tpcds-kit) and, for Hive, TPC-DS and TPC-H (https://github.com/hortonworks/hive-testbench).
I also recommend checking network performance via Cloudera Manager using the Inspect Cluster Network Performance function.
Hope this helps,
Regards, Friedel
05-27-2019 08:06 AM
Hello, and thanks for your article. I did not quite get what the role of HDP is here. I already have the HDP sandbox, and it has YARN on it. Where and when should I install Docker? Thanks,
01-10-2019 12:00 PM
3 Kudos
Abstract

Machine learning and supervised learning at scale are common tasks that can be addressed with Apache Hive by leveraging Hivemall, a collection of features and user-defined functions (UDFs) for HiveQL that is strongly optimized for machine learning (ML) and data science. For example, logistic regression models, one of the basic tenets of ML, can be built easily with Hivemall. The following example explains the basic usage of Hivemall through supervised learning of a simple regressor.

Setup the environment

HDP 3.0.1 Sandbox with Hivemall 0.5.2 -> see Machine Learning with Apache Hive SQL using Hivemall

Explore the Training Data

Database: foodmart - Tables: sales_fact_dec_1998 and customer

create table customer_training as
select s1.customer_id, c1.country, c1.gender, c1.member_card,c1.num_cars_owned, count(s1.product_id) as purchase_cnt
from sales_fact_dec_1998 s1
join customer c1
on s1.customer_id = c1.customer_id
group by s1.customer_id, c1.country, c1.gender, c1.member_card, c1.num_cars_owned;
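A quick preview query (an optional sanity check; the column list matches the sample output below) could look like this:

select customer_id, country, gender, member_card, purchase_cnt
from customer_training
limit 13;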
customer_id  country  gender  member_card  purchase_cnt
14           USA      F       Bronze       3
34           USA      F       Silver       14
39           USA      F       Bronze       6
48           Mexico   M       Bronze       8
55           USA      M       Normal       7
57           Mexico   M       Golden       8
74           USA      M       Bronze       6
76           USA      F       Normal       10
80           USA      F       Bronze       7
84           Canada   M       Bronze       1
87           Canada   F       Bronze       7
93           USA      M       Golden       8
98           Canada   F       Bronze       2
Customer Feature Representation

Create the feature vector from the customer attributes that are relevant, i.e. country, gender, member_card and num_cars_owned. Hivemall can handle quantitative (numeric) features like num_cars_owned as well as categorical (char/string) features like gender or country.

create table customer_features (customer_id int, features array<string>, purchase_cnt int);
insert overwrite table customer_features
select
customer_id,
hivemall.array_concat(
hivemall.quantitative_features( array("num_cars_owned"),num_cars_owned ) ,
hivemall.categorical_features( array("country", "gender","member_card"), country, gender, member_card )
) as features,
sum(purchase_cnt)
from
customer_training
group by customer_id,
hivemall.array_concat(
hivemall.quantitative_features( array("num_cars_owned"),num_cars_owned ) ,
hivemall.categorical_features( array("country", "gender","member_card"), country, gender, member_card )
)
;
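To see what these two helper UDFs emit, you can run them on literal values (an illustrative one-off query; the output format matches the features column shown below):

select
  hivemall.quantitative_features( array("num_cars_owned"), 3.0 ) as quantitative,
  hivemall.categorical_features( array("country","gender","member_card"), "USA", "F", "Bronze" ) as categorical;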
The customer_features should look like the following:

customer_id  features                                                                 purchase_cnt
6            ["num_cars_owned:4.0","country#USA","gender#F","member_card#Bronze"]     4
10           ["num_cars_owned:5.0","country#USA","gender#M","member_card#Golden"]     5
12           ["num_cars_owned:2.0","country#USA","gender#F","member_card#Bronze"]     2
14           ["num_cars_owned:3.0","country#USA","gender#F","member_card#Bronze"]     3
24           ["num_cars_owned:3.0","country#Mexico","gender#F","member_card#Bronze"]  3
34           ["num_cars_owned:14.0","country#USA","gender#F","member_card#Silver"]    14
36           ["num_cars_owned:6.0","country#USA","gender#M","member_card#Normal"]     6
39           ["num_cars_owned:6.0","country#USA","gender#F","member_card#Bronze"]     6
42           ["num_cars_owned:7.0","country#Mexico","gender#F","member_card#Golden"]  7
48           ["num_cars_owned:8.0","country#Mexico","gender#M","member_card#Bronze"]  8
52           ["num_cars_owned:5.0","country#Mexico","gender#F","member_card#Silver"]  5
55           ["num_cars_owned:7.0","country#USA","gender#M","member_card#Normal"]     7
Training

Now that we have the customers with a defined set of features, the next step is to create the regressor. This requires us to specify an appropriate -loss_function; since it is set to squared here, train_regressor builds a simple linear regressor with the squared loss.

create table if not exists customer_regressor as
select
hivemall.train_regressor(
features, -- feature vector
purchase_cnt, -- target value
'-loss_function squared -optimizer AdaGrad -regularization l2' -- hyper-parameters
) as (feature, weight)
from
customer_features
;

The resulting table customer_regressor contains the learned weight per feature:

feature             weight
country#Canada      0.036762696
country#Mexico      -0.07290672
country#USA         0.009568159
gender#F            -0.035828743
gender#M            -0.052134056
member_card#Bronze  -0.015939986
member_card#Golden  0.0020267756
member_card#Normal  -0.04439145
member_card#Silver  -0.059466142
num_cars_owned      1.0083996
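An optional way to interpret the model is to rank the learned features by the magnitude of their weights (a small illustrative query; in this table num_cars_owned clearly dominates the prediction):

select feature, weight
from customer_regressor
order by abs(weight) desc;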
Prediction

Now we can compute the prediction for the number of purchases of a new customer.

create table if not exists customer_new as
select 1 as id, array("num_cars_owned:3.0","country#USA","gender#F","member_card#Silver") as features
;
with features_exploded as (
select
id,
hivemall.extract_feature( fv ) as feature,
hivemall.extract_weight( fv ) as value
from customer_new t1 LATERAL VIEW explode(features) t2 as fv
)
select
t1.id,
sum(p1.weight * t1.value) as predicted_num_purchases
from
features_exploded t1
LEFT OUTER JOIN customer_regressor p1 ON (t1.feature = p1.feature)
group by
t1.id
;

predicted_num_purchases
2.9394720904529095
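As a sanity check, the prediction is simply the dot product of the new customer's feature vector with the learned weights from the table above: 3.0 * 1.0083996 (num_cars_owned) + 0.009568159 (country#USA) - 0.035828743 (gender#F) - 0.059466142 (member_card#Silver) ≈ 2.93947, which matches the query result.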
01-11-2019 07:44 PM
Very nice! I did two articles a few years ago, but things have advanced greatly. https://community.hortonworks.com/articles/88404/adding-and-using-hplsql-and-hivemall-with-hive-mac.html https://community.hortonworks.com/articles/67983/apache-hive-with-apache-hivemall.html
12-21-2018 01:57 PM
3 Kudos
With stochastic streaming algorithms it is possible to analyze massive amounts of data highly efficiently. The main characteristics of sketches are their very small size (sub-linear space consumption); they are single-pass, mergeable, and deliver approximate results within proven error bounds. If an approximate result is acceptable, data sketches can be orders of magnitude faster than classic indexing or aggregation techniques. (I don't go into the mathematical discussion; more details here: DataSketches)

The following example shows how to use Data Sketches within Hive (also possible with Druid):

Download and install the Hortonworks Sandbox (it comes with the demo database foodmart). Download the Java artifact sketches-hive-0.11.0-with-shaded-core.jar (provided by Yahoo) from the jar repo and copy the jar file into the HDFS /apps/hive/warehouse/lib directory.

# Login as user hive (hive@sandbox-hdp)
wget https://search.maven.org/remotecontent?filepath=com/yahoo/datasketches/sketches-hive/0.11.0/sketches-hive-0.11.0-with-shaded-core.jar -O sketches-hive-0.11.0-with-shaded-core.jar
hdfs dfs -mkdir /apps/hive/warehouse/lib
hdfs dfs -put sketches-hive-0.11.0-with-shaded-core.jar /apps/hive/warehouse/lib/

Create the functions (here permanently) to create and manage the sketches, using the beeline CLI or Data Analytics Studio (DAS):

beeline -u "jdbc:hive2://sandbox-hdp.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" -n hive --force=true

set hivevar:hivesketch_jar=hdfs:///apps/hive/warehouse/lib/sketches-hive-0.11.0-with-shaded-core.jar;
create database sketches;
use sketches;
create function qtl_data2sketch as 'com.yahoo.sketches.hive.quantiles.DataToDoublesSketchUDAF' USING JAR '${hivesketch_jar}' ;
create function qtl_getQuantile as 'com.yahoo.sketches.hive.quantiles.GetQuantileFromDoublesSketchUDF' USING JAR '${hivesketch_jar}';

Create a data sketch for product and price with the function qtl_data2sketch. The table product_sketch has two columns: product_id, which identifies the product sold, and a second binary field for the data sketch itself.

use foodmart;
create table product_sketch (product_id double, qtl_sketch binary);
insert into product_sketch select cast(product_id as double), sketches.qtl_data2sketch(cast(store_sales as double)) from sales_fact_1997 group by product_id;

The table product_sketch stores, for every product, the price sketch of the year (1997). The data sketch of a product can now be queried, e.g. to get the 25% quantile of the data:

select product_id, sketches.qtl_getQuantile(qtl_sketch, 0.25) from product_sketch limit 3;

product_id  25% quantile
1           8.55
2           1.48
3           2.49

Use it in a more advanced example, e.g. to find sales where the product price was below the 25% quantile:

select sales.product_id, sum(store_sales * unit_sales ) sum_sales, sum( unit_sales) units, store_sales, sketches.qtl_getQuantile(qtl_sketch, 0.25) 25pct
from sales_fact_1997 sales
join product_sketch
on (product_sketch.product_id = sales.product_id
and store_sales < sketches.qtl_getQuantile(qtl_sketch, 0.25))
group by sales.product_id , store_sales, sketches.qtl_getQuantile(qtl_sketch, 0.25)
order by sum_sales desc
limit 10;

sales.product_id  sum_sales  units  store_sales  25pct
11                252.72     36     7.02         10.53
390               239.36     32     7.48         11.22
362               222.32     28     7.94         11.91

The table shows the top 3 products by revenue where the price is lower than the 25% quantile of the price span (per product).
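The same sketch answers other quantile queries without rescanning the raw data; for example (an illustrative query using only the functions defined above), the median price of a single product:

select product_id, sketches.qtl_getQuantile(qtl_sketch, 0.5) as median_price
from product_sketch
where product_id = 11;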
04-09-2019 02:44 PM
Hi guys, I followed the above steps and was able to execute commands like show databases and show tables successfully. I also created a database from spark-shell, created a table, and inserted some data into it, but I am not able to query the data, either from the newly created table from Spark or from the tables that already exist in Hive, and I am getting this error:

java.lang.AbstractMethodError: Method com/hortonworks/spark/sql/hive/llap/HiveWarehouseDataSourceReader.createBatchDataReaderFactories()Ljava/util/List; is abstract at com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataSourceReader.createBatchDataReaderFactories(HiveWarehouseDataSourceReader.java)

The commands are as below:

import com.hortonworks.hwc.HiveWarehouseSession
val hive = HiveWarehouseSession.session(spark).build()
hive.createTable("hwx_table").column("value", "string").create()
hive.executeUpdate("insert into hwx_table values('1')")
hive.executeQuery("select * from hwx_table").show

Then the error appears. I am using the below command to start spark-shell:

spark-shell --master yarn --jars /usr/hdp/current/hive-warehouse-connector/hive-warehouse-connector_2.11-1.0.0.3.1.2.0-4.jar --conf spark.security.credentials.hiveserver2.enabled=false
04-20-2018 09:41 AM
Using DPS services, e.g. DLM, for replication between HDP clusters over wire encryption leverages the Knox gateway functionality. For that, two services must be configured in Knox and one Ranger policy created. This is the quick setup of the required Ambari services in Knox and Ranger.

Add two new services (AMBARI and AMBARIUI) to the Knox advanced topology configuration:

<service>
<role>AMBARI</role>
<url>http://knox-gw:8080</url>
</service>
<service>
<role>AMBARIUI</role>
<url>http://knox-gw:8080</url>
</service>

In Ranger, define a policy that allows access to the Ambari UI. Then restart all marked services in the cluster. Test that it works by logging in to Ambari proxied through Knox (default port: 8443). Now you can join the cluster in DPS.
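As an illustration, a minimal sketch of what that Ranger policy could contain (all values here are assumptions; adjust the topology name and users/groups to your environment):

Service type: KNOX
Knox Topology: default
Knox Service: AMBARI, AMBARIUI
Allow Conditions: the users/groups that should reach Ambari through Knox, Permission: Allow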
01-27-2018 10:11 AM
A small correction: it was introduced in Ranger 0.7, and the policies should look like this:

//HDFS
resource: path=/home/{USER}
user: {USER}
//Hive
resource: database=db_{USER}; table=*; column=*
user: {USER}

where {USER} substitutes the user ID of the currently logged-in user.