Member since: 06-20-2019 · Posts: 8 · Kudos Received: 11 · Solutions: 0
03-25-2020 06:58 AM
Hi Andre,
For detailed disk performance checks you will find appropriate testing procedures here: https://docs.cloudera.com/documentation/other/reference-architecture/PDF/cloudera_ref_arch_stg_dev_accept_criteria.pdf
Further good benchmarks are the TPC-DS tests for Impala (https://github.com/cloudera/impala-tpcds-kit) and, for Hive, TPC-DS and TPC-H (https://github.com/hortonworks/hive-testbench).
I also recommend checking network performance via Cloudera Manager using the Inspect Cluster Network Performance function.
Hope this helps,
Regards, Friedel
05-27-2019 08:06 AM
Hello, and thanks for your article. I did not quite get what the role of HDP is here. I already have the HDP sandbox, and it has YARN on it. Where and when should I install Docker? Thanks,
01-10-2019 12:00 PM
3 Kudos
Abstract

Machine learning and supervised learning at scale are common tasks that can be addressed with Apache Hive by leveraging Hivemall, a collection of features and user-defined functions (UDFs) for HiveQL that is strongly optimized for machine learning (ML) and data science. For example, logistic regression models, one of the basic tenets of ML, can be built easily with Hivemall. The following example explains the basic usage of Hivemall through supervised learning of a simple regressor.

Setup the environment

HDP 3.0.1 Sandbox with Hivemall 0.5.2 -> see Machine Learning with Apache Hive SQL using Hivemall

Explore the Training Data

Database: foodmart - Tables: sales_fact_dec_1998 and customer

create table customer_training as
select s1.customer_id, c1.country, c1.gender, c1.member_card,c1.num_cars_owned, count(s1.product_id) as purchase_cnt
from sales_fact_dec_1998 s1
join customer c1
on s1.customer_id = c1.customer_id
group by s1.customer_id, c1.country, c1.gender, c1.member_card, c1.num_cars_owned;
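A quick preview query (an optional sanity check; the column list matches the sample output below) could look like this:

select customer_id, country, gender, member_card, purchase_cnt
from customer_training
limit 13;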
customer_id  country  gender  member_card  purchase_cnt
14           USA      F       Bronze       3
34           USA      F       Silver       14
39           USA      F       Bronze       6
48           Mexico   M       Bronze       8
55           USA      M       Normal       7
57           Mexico   M       Golden       8
74           USA      M       Bronze       6
76           USA      F       Normal       10
80           USA      F       Bronze       7
84           Canada   M       Bronze       1
87           Canada   F       Bronze       7
93           USA      M       Golden       8
98           Canada   F       Bronze       2
Customer Feature Representation

Create the feature vector from the customer attributes that are relevant, i.e. country, gender, member_card and num_cars_owned. Hivemall can handle quantitative (numeric) features like num_cars_owned as well as categorical (char/string) features like gender or country.

create table customer_features (customer_id int, features array<string>, purchase_cnt int);
insert overwrite table customer_features
select
customer_id,
hivemall.array_concat(
hivemall.quantitative_features( array("num_cars_owned"),num_cars_owned ) ,
hivemall.categorical_features( array("country", "gender","member_card"), country, gender, member_card )
) as features,
sum(purchase_cnt)
from
customer_training
group by customer_id,
hivemall.array_concat(
hivemall.quantitative_features( array("num_cars_owned"),num_cars_owned ) ,
hivemall.categorical_features( array("country", "gender","member_card"), country, gender, member_card )
)
;
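To see what these two helper UDFs emit, you can run them on literal values (an illustrative one-off query; the output format matches the features column shown below):

select
  hivemall.quantitative_features( array("num_cars_owned"), 3.0 ) as quantitative,
  hivemall.categorical_features( array("country","gender","member_card"), "USA", "F", "Bronze" ) as categorical;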
The customer_features should look like the following:

customer_id  features                                                                 purchase_cnt
6            ["num_cars_owned:4.0","country#USA","gender#F","member_card#Bronze"]     4
10           ["num_cars_owned:5.0","country#USA","gender#M","member_card#Golden"]     5
12           ["num_cars_owned:2.0","country#USA","gender#F","member_card#Bronze"]     2
14           ["num_cars_owned:3.0","country#USA","gender#F","member_card#Bronze"]     3
24           ["num_cars_owned:3.0","country#Mexico","gender#F","member_card#Bronze"]  3
34           ["num_cars_owned:14.0","country#USA","gender#F","member_card#Silver"]    14
36           ["num_cars_owned:6.0","country#USA","gender#M","member_card#Normal"]     6
39           ["num_cars_owned:6.0","country#USA","gender#F","member_card#Bronze"]     6
42           ["num_cars_owned:7.0","country#Mexico","gender#F","member_card#Golden"]  7
48           ["num_cars_owned:8.0","country#Mexico","gender#M","member_card#Bronze"]  8
52           ["num_cars_owned:5.0","country#Mexico","gender#F","member_card#Silver"]  5
55           ["num_cars_owned:7.0","country#USA","gender#M","member_card#Normal"]     7
Training

Now that we have the customers with a defined set of features, the next step is to create the regressor. This requires us to specify an appropriate -loss_function; since it is set to squared here, train_regressor builds a simple linear regressor with the squared loss.

create table if not exists customer_regressor as
select
hivemall.train_regressor(
features, -- feature vector
purchase_cnt, -- target value
'-loss_function squared -optimizer AdaGrad -regularization l2' -- hyper-parameters
) as (feature, weight)
from
customer_features
;

The resulting table customer_regressor contains the learned weight per feature:

feature             weight
country#Canada      0.036762696
country#Mexico      -0.07290672
country#USA         0.009568159
gender#F            -0.035828743
gender#M            -0.052134056
member_card#Bronze  -0.015939986
member_card#Golden  0.0020267756
member_card#Normal  -0.04439145
member_card#Silver  -0.059466142
num_cars_owned      1.0083996
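An optional way to interpret the model is to rank the learned features by the magnitude of their weights (a small illustrative query; in this table num_cars_owned clearly dominates the prediction):

select feature, weight
from customer_regressor
order by abs(weight) desc;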
Prediction

Now we can compute the prediction for the number of purchases of a new customer.

create table if not exists customer_new as
select 1 as id, array("num_cars_owned:3.0","country#USA","gender#F","member_card#Silver") as features
;
with features_exploded as (
select
id,
hivemall.extract_feature( fv ) as feature,
hivemall.extract_weight( fv ) as value
from customer_new t1 LATERAL VIEW explode(features) t2 as fv
)
select
t1.id,
sum(p1.weight * t1.value) as predicted_num_purchases
from
features_exploded t1
LEFT OUTER JOIN customer_regressor p1 ON (t1.feature = p1.feature)
group by
t1.id
;

predicted_num_purchases
2.9394720904529095
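As a sanity check, the prediction is simply the dot product of the new customer's feature vector with the learned weights from the table above: 3.0 * 1.0083996 (num_cars_owned) + 0.009568159 (country#USA) - 0.035828743 (gender#F) - 0.059466142 (member_card#Silver) ≈ 2.93947, which matches the query result.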
01-11-2019 07:44 PM
Very nice! I did two articles a few years ago, but things have advanced greatly. https://community.hortonworks.com/articles/88404/adding-and-using-hplsql-and-hivemall-with-hive-mac.html https://community.hortonworks.com/articles/67983/apache-hive-with-apache-hivemall.html
12-21-2018 01:57 PM
3 Kudos
With stochastic streaming algorithms it is possible to analyze massive amounts of data highly efficiently. The main characteristics of sketches are their very small size (sub-linear space consumption); they are single-pass, mergeable, and deliver approximate results within proven error bounds. If an approximate result is acceptable, data sketches can be orders of magnitude faster than classic indexing or aggregation techniques. (I don't go into the mathematical discussion; more details here: DataSketches)

The following example shows how to use Data Sketches within Hive (also possible with Druid):

Download and install the Hortonworks Sandbox (it comes with the demo database foodmart). Download the Java artifact sketches-hive-0.11.0-with-shaded-core.jar (provided by Yahoo) from the jar repo and copy the jar file into the HDFS /apps/hive/warehouse/lib directory.

# Login as user hive (hive@sandbox-hdp)
wget https://search.maven.org/remotecontent?filepath=com/yahoo/datasketches/sketches-hive/0.11.0/sketches-hive-0.11.0-with-shaded-core.jar -O sketches-hive-0.11.0-with-shaded-core.jar
hdfs dfs -mkdir /apps/hive/warehouse/lib
hdfs dfs -put sketches-hive-0.11.0-with-shaded-core.jar /apps/hive/warehouse/lib/

Create the functions (here permanently) to create and manage the sketches, using the beeline CLI or Data Analytics Studio (DAS):

beeline -u "jdbc:hive2://sandbox-hdp.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" -n hive --force=true

set hivevar:hivesketch_jar=hdfs:///apps/hive/warehouse/lib/sketches-hive-0.11.0-with-shaded-core.jar;
create database sketches;
use sketches;
create function qtl_data2sketch as 'com.yahoo.sketches.hive.quantiles.DataToDoublesSketchUDAF' USING JAR '${hivesketch_jar}' ;
create function qtl_getQuantile as 'com.yahoo.sketches.hive.quantiles.GetQuantileFromDoublesSketchUDF' USING JAR '${hivesketch_jar}';

Create a data sketch for product and price with the function qtl_data2sketch. The table product_sketch has two columns: product_id, which identifies the product sold, and a second binary field for the data sketch itself.

use foodmart;
create table product_sketch (product_id double, qtl_sketch binary);
insert into product_sketch select cast(product_id as double), sketches.qtl_data2sketch(cast(store_sales as double)) from sales_fact_1997 group by product_id;

The table product_sketch stores, for every product, the price sketch of the year (1997). The data sketch of a product can now be queried, e.g. to get the 25% quantile of the data:

select product_id, sketches.qtl_getQuantile(qtl_sketch, 0.25) from product_sketch limit 3;

product_id  25% quantile
1           8.55
2           1.48
3           2.49

Use it in a more advanced example, e.g. to find sales where the product price was below the 25% quantile:

select sales.product_id, sum(store_sales * unit_sales ) sum_sales, sum( unit_sales) units, store_sales, sketches.qtl_getQuantile(qtl_sketch, 0.25) 25pct
from sales_fact_1997 sales
join product_sketch
on (product_sketch.product_id = sales.product_id
and store_sales < sketches.qtl_getQuantile(qtl_sketch, 0.25))
group by sales.product_id , store_sales, sketches.qtl_getQuantile(qtl_sketch, 0.25)
order by sum_sales desc
limit 10;

sales.product_id  sum_sales  units  store_sales  25pct
11                252.72     36     7.02         10.53
390               239.36     32     7.48         11.22
362               222.32     28     7.94         11.91

The table shows the top 3 products by revenue where the price is lower than the 25% quantile of the price span (per product).
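The same sketch answers other quantile queries without rescanning the raw data; for example (an illustrative query using only the functions defined above), the median price of a single product:

select product_id, sketches.qtl_getQuantile(qtl_sketch, 0.5) as median_price
from product_sketch
where product_id = 11;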
04-09-2019 02:44 PM
Hi guys, I followed the above steps and was able to execute commands like show databases and show tables successfully. I also created a database from spark-shell, created a table, and inserted some data into it, but I am not able to query the data, either from the newly created table from Spark or from the tables that already exist in Hive, and I am getting this error:

java.lang.AbstractMethodError: Method com/hortonworks/spark/sql/hive/llap/HiveWarehouseDataSourceReader.createBatchDataReaderFactories()Ljava/util/List; is abstract at com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataSourceReader.createBatchDataReaderFactories(HiveWarehouseDataSourceReader.java)

The commands are as below:

import com.hortonworks.hwc.HiveWarehouseSession
val hive = HiveWarehouseSession.session(spark).build()
hive.createTable("hwx_table").column("value", "string").create()
hive.executeUpdate("insert into hwx_table values('1')")
hive.executeQuery("select * from hwx_table").show

Then the error appears. I am using the below command to start spark-shell:

spark-shell --master yarn --jars /usr/hdp/current/hive-warehouse-connector/hive-warehouse-connector_2.11-1.0.0.3.1.2.0-4.jar --conf spark.security.credentials.hiveserver2.enabled=false
04-20-2018 09:41 AM
Using DPS services, e.g. DLM, for replication between HDP clusters over wire encryption leverages the Knox gateway functionality. For that, two services must be configured in Knox and one Ranger policy created. This is the quick setup of the required Ambari services in Knox and Ranger.

Add two new services (AMBARI and AMBARIUI) to the Knox advanced topology configuration:

<service>
<role>AMBARI</role>
<url>http://knox-gw:8080</url>
</service>
<service>
<role>AMBARIUI</role>
<url>http://knox-gw:8080</url>
</service>

In Ranger, define a policy that allows access to the Ambari UI. Then restart all marked services in the cluster. Test that it works by logging in to Ambari proxied through Knox (default port: 8443). Now you can join the cluster in DPS.
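As an illustration, a minimal sketch of what that Ranger policy could contain (all values here are assumptions; adjust the topology name and users/groups to your environment):

Service type: KNOX
Knox Topology: default
Knox Service: AMBARI, AMBARIUI
Allow Conditions: the users/groups that should reach Ambari through Knox, Permission: Allow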
01-27-2018 10:11 AM
A small correction: it was introduced in Ranger 0.7, and the policies should look like this:

//HDFS
resource: path=/home/{USER}
user: {USER}
//Hive
resource: database=db_{USER}; table=*; column=*
user: {USER}

where {USER} substitutes the user ID of the currently logged-in user.