Member since: 10-06-2015
Posts: 32
Kudos Received: 62
Solutions: 3
My Accepted Solutions
Title | Views | Posted |
---|---|---|
| | 8083 | 06-12-2016 11:22 AM |
| | 1177 | 05-11-2016 11:59 AM |
| | 1243 | 10-14-2015 05:47 AM |
07-11-2017
03:11 PM
Hello @Rajesh, following a customer request here, there is a Hortonworks task to support it in HDP 3.0. If there is some way to backport it, I will try to update this thread. With kind regards
06-29-2017
02:56 PM
25 Kudos
Hadoop is a marvelous ecosystem: so many improvements in the last few years and so many ways to extract insight from Big Data now. On the other hand, Hadoop can be complex to manage, especially with new releases, projects, versions, features and security. This is one of the reasons why it is highly recommended to adopt DevOps. DevOps helps enforce testing, ensures smooth transitions between environments and so improves the overall quality of any project running on the platform. In this presentation, I tried to explain the basic concepts of DevOps, how Hadoop integrates nicely with tools like Jenkins, Ansible, Git and Maven, and how useful it is to leverage the Big Data platform itself to monitor each component, project and quality aspect (delivery, performance & SLA, logs).
09-15-2016
11:28 AM
9 Kudos
Why not try a Kaggle challenge (Titanic)! (This is a work in progress; I will update this article as soon as I get more free time. Hope you will enjoy it!)
Let's first create the corresponding database:
CREATE DATABASE IF NOT EXISTS kaggle_titanic;
And then load the data:
USE kaggle_titanic;
DROP TABLE IF EXISTS train_tmp;
ADD JAR csv-serde-1.1.2-0.11.0-all.jar;
CREATE TABLE train_tmp (
PassengerId DOUBLE COMMENT 'regex : 999',
Survived DOUBLE COMMENT 'regex : 9',
Pclass INT COMMENT 'regex : 9',
Name STRING COMMENT 'regex : _Zzz!_Zzzzz_!Zzzzzzzzz_!!Zzzzzz!!_Zzzzzz_Zzzzz!!',
Sex STRING COMMENT 'regex : zzzzzz',
Age DOUBLE COMMENT 'regex : 99!9',
SibSp DOUBLE COMMENT 'regex : 9',
Parch DOUBLE COMMENT 'regex : 9',
Ticket STRING COMMENT 'regex : ZZZZZ!Z!Z!_9999999',
Fare DOUBLE COMMENT 'regex : 999!9999',
Cabin STRING COMMENT 'Z99_Z99_Z99_Z99',
Embarked STRING COMMENT 'regex : Z'
)
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties (
"separatorChar" = ",",
"quoteChar" = "\"",
"escapeChar" = "\"
)
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH 'train.csv' INTO TABLE train_tmp;
DROP TABLE IF EXISTS train;
CREATE TABLE train STORED AS ORC AS SELECT * FROM train_tmp;
USE kaggle_titanic;
DROP TABLE IF EXISTS test_tmp;
ADD JAR csv-serde-1.1.2-0.11.0-all.jar;
CREATE TABLE test_tmp (
PassengerId DOUBLE COMMENT 'regex : 999',
Pclass INT COMMENT 'regex : 9',
Name STRING COMMENT 'regex : _Zzz!_Zzzzz_!Zzzzzzzzz_!!Zzzzzz!!_Zzzzzz_Zzzzz!!',
Sex STRING COMMENT 'regex : zzzzzz',
Age DOUBLE COMMENT 'regex : 99!9',
SibSp DOUBLE COMMENT 'regex : 9',
Parch DOUBLE COMMENT 'regex : 9',
Ticket STRING COMMENT 'regex : ZZZZZ!Z!Z!_9999999',
Fare DOUBLE COMMENT 'regex : 999!9999',
Cabin STRING COMMENT 'Z99_Z99_Z99_Z99',
Embarked STRING COMMENT 'regex : Z'
)
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties (
"separatorChar" = ",",
"quoteChar" = "\"",
"escapeChar" = "\"
)
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH 'test.csv' INTO TABLE test_tmp;
DROP TABLE IF EXISTS test;
CREATE TABLE test STORED AS ORC AS SELECT * FROM test_tmp;
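As a quick sanity check (a minimal sketch; the Kaggle Titanic files normally contain 891 training rows and 418 test rows):
USE kaggle_titanic;
-- row counts after the ORC conversion
SELECT 'train' AS dataset, COUNT(*) AS nbr_rows FROM train
UNION ALL
SELECT 'test' AS dataset, COUNT(*) AS nbr_rows FROM test;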
USE kaggle_titanic;
DROP TABLE IF EXISTS train_and_test;
CREATE TABLE train_and_test STORED AS ORC AS
SELECT CAST(PassengerId AS INT) AS PassengerId, Survived, CAST(Pclass AS INT) AS Pclass, Name, Sex, CAST(Age AS DOUBLE) AS Age, CAST(SibSp AS INT) AS SibSp, CAST(Parch AS INT) AS Parch, Ticket, CAST(Fare AS DOUBLE) AS Fare, Cabin, Embarked
FROM train
UNION ALL
SELECT CAST(PassengerId AS INT) AS PassengerId, CAST(NULL AS DOUBLE) AS Survived, CAST(Pclass AS INT) AS Pclass, Name, Sex, CAST(Age AS DOUBLE) AS Age, CAST(SibSp AS INT) AS SibSp, CAST(Parch AS INT) AS Parch, Ticket, CAST(Fare AS DOUBLE) AS Fare, Cabin, Embarked
FROM test
;
Alternatively, if you prefer to keep the original column types:
USE kaggle_titanic;
DROP TABLE IF EXISTS train_and_test;
CREATE TABLE train_and_test STORED AS ORC AS
SELECT PassengerId , Survived , Pclass , Name, Sex, Age , SibSp , Parch , Ticket , Fare , Cabin , Embarked
FROM train
UNION ALL
SELECT PassengerId , CAST(NULL AS Double) AS Survived , Pclass , Name, Sex, Age , SibSp , Parch , Ticket , Fare , Cabin , Embarked
FROM test
;
With a few quick SQL queries we can already get a good overview of the data (make sure your Zeppelin is configured, as well as security).
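For example (just a sketch, using the train_and_test table created above), the survival rate by sex and passenger class:
USE kaggle_titanic;
-- survival rate by sex and class, on the training rows only (Survived IS NOT NULL)
SELECT Sex, Pclass, COUNT(*) AS nbr_passengers, AVG(Survived) AS survival_rate
FROM train_and_test
WHERE Survived IS NOT NULL
GROUP BY Sex, Pclass
ORDER BY Sex, Pclass;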
Let's now clean the dataset:
USE kaggle_titanic;
DROP TABLE IF EXISTS train_and_test_transform;
CREATE TABLE train_and_test_transform STORED AS ORC AS
SELECT PassengerId , Survived , Pclass , Name, regexp_extract(name, '([^,]*), ([^ ]*)(.*)', 2) AS title, Sex, CASE WHEN age IS NULL THEN 30 ELSE age END AS age ,
SibSp , Parch , Ticket , Fare , Cabin , COALESCE(embarked, 'embarked_is_NULL') AS embarked, substring(cabin,1,1) AS cabin_letter, LENGTH(regexp_replace(cabin, '[^ ]', '')) AS nbr_of_space_cabin,
CASE WHEN age IS NULL THEN true ELSE false END AS c_flag_age_null,
CASE WHEN Cabin IS NULL THEN true ELSE false END AS c_flag_cabin_null,
CASE WHEN embarked IS NULL THEN true ELSE false END AS c_flag_embarked_null
FROM train_and_test
;
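A quick check on the transformation (again, just a sketch) confirms the titles extracted from the Name column look reasonable:
USE kaggle_titanic;
-- most frequent titles extracted by the regexp above
SELECT title, COUNT(*) AS nbr_passengers
FROM train_and_test_transform
GROUP BY title
ORDER BY nbr_passengers DESC
LIMIT 10;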
---------------------------------------------------------------------------------------------------- (I will update this very soon)
And switch to spark-shell!
import org.apache.spark.ml.feature.{StringIndexer, IndexToString, VectorIndexer, VectorAssembler}
import org.apache.spark.ml.classification.GBTClassificationModel;
import org.apache.spark.ml.classification.GBTClassifier;
val data_prep_1 = sqlContext.sql("SELECT * FROM kaggle_titanic.train_and_test_transform");
data_prep_1.show(10);
Let's do the data transformation and prepare the feature column
val indexerSex = new StringIndexer().setInputCol("sex").setOutputCol("sexIndexed")
val indexerTitle = new StringIndexer().setInputCol("title").setOutputCol("titleIndexed")
val indexerEmbarked = new StringIndexer().setInputCol("embarked").setOutputCol("embarkedIndexed")
val data_prep_2 = indexerSex.fit(data_prep_1).transform(data_prep_1)
val data_prep_3 = indexerTitle.fit(data_prep_2).transform(data_prep_2)
val data_prep_4 = indexerEmbarked.fit(data_prep_3).transform(data_prep_3)
val vectorAssembler = new VectorAssembler().setInputCols(Array("sexIndexed", "titleIndexed", "embarkedIndexed", "age", "fare", "pclass", "parch", "sibsp", "c_flag_cabin_null", "c_flag_embarked_null")).setOutputCol("features")
val data_prep_5 = vectorAssembler.transform(data_prep_4)
data_prep_5.show()
val data_prep_6 = data_prep_5.filter($"survived".isNotNull)
val indexerSurvived = new StringIndexer().setInputCol("survived").setOutputCol("survivedIndexed")
val data_prep_7 = indexerSurvived.fit(data_prep_6).transform(data_prep_6)
And now let's build the model
val gbt = new GBTClassifier().setLabelCol("survivedIndexed").setFeaturesCol("features").setMaxIter(50)
var model = gbt.fit(data_prep_7)
Finally, we can gather the data to predict and use the model:
val data_test_1 = data_prep_5.filter($"survived".isNull)
var result = model.transform(data_test_1)
result.select("passengerid", "prediction").show(1000)
06-20-2016
03:50 AM
Dear Fabricio, I successfully made a workflow run from my local VM against my remote Hadoop cluster by changing the SSH connection property. Hope that helps. Kind regards.
06-16-2016
03:23 PM
And don't forget to check these best practices: https://wiki.jenkins-ci.org/display/JENKINS/Jenkins+Best+Practices
06-16-2016
02:11 PM
17 Kudos
First of all, you need a Hadoop user with the correct Ranger permissions (I will use mine: mlanciau), some HDFS quota (if configured), and a YARN queue (I hope it is configured). For this example, I am using the default Jenkins configuration with the SSH plugin and the TeraSort benchmark suite.
1. Configure SSH remote hosts in Jenkins by entering your credentials (of course you can use a different config)
2. Create a new item (on the main portal), I called it Hadoop performance, then click on it and configure
3. Click on "This project is parameterised" and add these 2 parameters (you may want to change the parameters ;-))
4. Go to build, click on add a step to the build and choose Execute shell script on remote host using ssh
5. Then use the information below to configure the different scripts (you may want to configure your own user)
hdfs dfs -rm -r -skipTrash /user/mlanciau/teragen ; echo 'deleting directory'
hdfs dfs -rm -r -skipTrash /user/mlanciau/terasort ; echo 'deleting directory'
hdfs dfs -rm -r -skipTrash /user/mlanciau/teravalidate ; echo 'deleting directory'
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar teragen -Dmapred.map.tasks=${mapperNbr} 10000 /user/mlanciau/teragen
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar terasort -Dmapred.reduce.tasks=${reducerNbr} /user/mlanciau/teragen /user/mlanciau/terasort
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar teravalidate /user/mlanciau/terasort /user/mlanciau/teravalidate
6. Click on save
7. Click on Build with Parameters, check the parameter values and then Build
8. Follow the progress by clicking on the step in progress and Console Output! If you want, you can also check execution time and success / failure thanks to the Jenkins UI. Cheers 😉
06-16-2016
01:15 PM
There is a simple example here: https://community.hortonworks.com/articles/40171/simple-example-of-jenkins-hdp-integration.html. I will add more later.
06-13-2016
07:58 PM
I think the key point is to configure the different Jenkins instances to be able to use the edge node via the SSH plugin (or to install Jenkins there). The rest is a matter of configuring security and backups, and choosing the right parameters to fit your usage and switch easily from one environment to the other (dev, test, prod).
06-12-2016
11:22 AM
Sure, so basically regarding the cluster, you may find it useful to:
- Configure queues with the Capacity Scheduler (production, dev, integration, test), and use elasticity and preemption
- Map users to queues
- Use a naming convention for queues and users, e.g. a -dev or -test suffix
- Depending on the tool you are using, use different database names with Hive, different directories with HDFS (+ quotas), and namespaces for HBase (see the sketch below)
- Ranger will help you configure the permissions for each user / group to access the right resources
- Each user will have different environment settings
- Use Jenkins and Maven (if needed) to build, push the code (with the SSH plugin) and run the tests
- Use templates to provide tools to the users with logging features and the correct parameters and options
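As a minimal sketch of the Hive naming convention (myproject is just a hypothetical project name):
-- one database per environment for the same (hypothetical) project
CREATE DATABASE IF NOT EXISTS myproject_dev;
CREATE DATABASE IF NOT EXISTS myproject_test;
CREATE DATABASE IF NOT EXISTS myproject_prod;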
06-09-2016
11:01 AM
Dear Fabricio, Yes, we have several customers working on this topic. It is an interesting one. From what I have seen, the architecture was based on 2 real clusters, one PROD and one DR + TEST + INTEGRATION, with YARN queues and HDFS quotas configured accordingly, plus Jenkins + SVN to take care of versioning, build and test. Some great teams have also built their own project to validate the development and follow deployments across the different environments. I don't know much about Docker, Mesos or Marathon, so I can't answer for that part. Can you perhaps give me more details about what you are looking for? What did you try? Kind regards.