Member since: 10-06-2015
Posts: 32
Kudos Received: 62
Solutions: 3

My Accepted Solutions

Title | Views | Posted
---|---|---
 | 8181 | 06-12-2016 11:22 AM
 | 1196 | 05-11-2016 11:59 AM
 | 1257 | 10-14-2015 05:47 AM
06-29-2017
02:56 PM
25 Kudos
Hadoop is a marvelous ecosystem: so many improvements in the last few years, and so many ways to extract insight from Big Data now. On the other side, Hadoop can be complex to manage, especially with new releases, projects, versions, features, and security. This is one of the reasons why it is highly recommended to adopt DevOps practices. DevOps helps enforce testing, ensures smooth transitions between environments, and thus improves the overall quality of any project running on the platform. In this presentation, I tried to explain the basic concepts of DevOps, how Hadoop integrates nicely with tools like Jenkins, Ansible, Git, and Maven, and how rewarding it is to use the advantages of a Big Data platform to monitor each component, project, and quality aspect (delivery, performance & SLA, logs).
09-15-2016
11:28 AM
9 Kudos
Why not try a Kaggle challenge (Titanic)! (This is a work in progress; I will update this article as soon as I get more free time. Hope you will enjoy it!)
Let's create the corresponding database first:
CREATE DATABASE IF NOT EXISTS kaggle_titanic;
And then load the data:
USE kaggle_titanic;
DROP TABLE IF EXISTS train_tmp;
ADD JAR csv-serde-1.1.2-0.11.0-all.jar;
CREATE TABLE train_tmp (
PassengerId DOUBLE COMMENT 'regex : 999',
Survived DOUBLE COMMENT 'regex : 9',
Pclass INT COMMENT 'regex : 9',
Name STRING COMMENT 'regex : _Zzz!_Zzzzz_!Zzzzzzzzz_!!Zzzzzz!!_Zzzzzz_Zzzzz!!',
Sex STRING COMMENT 'regex : zzzzzz',
Age DOUBLE COMMENT 'regex : 99!9',
SibSp DOUBLE COMMENT 'regex : 9',
Parch DOUBLE COMMENT 'regex : 9',
Ticket STRING COMMENT 'regex : ZZZZZ!Z!Z!_9999999',
Fare DOUBLE COMMENT 'regex : 999!9999',
Cabin STRING COMMENT 'regex : Z99_Z99_Z99_Z99',
Embarked STRING COMMENT 'regex : Z'
)
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"",
"escapeChar" = "\\"
)
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH 'train.csv' INTO TABLE train_tmp;
DROP TABLE IF EXISTS train;
CREATE TABLE train STORED AS ORC AS SELECT * FROM train_tmp;
USE kaggle_titanic;
DROP TABLE IF EXISTS test_tmp;
ADD JAR csv-serde-1.1.2-0.11.0-all.jar;
CREATE TABLE test_tmp (
PassengerId DOUBLE COMMENT 'regex : 999',
Pclass INT COMMENT 'regex : 9',
Name STRING COMMENT 'regex : _Zzz!_Zzzzz_!Zzzzzzzzz_!!Zzzzzz!!_Zzzzzz_Zzzzz!!',
Sex STRING COMMENT 'regex : zzzzzz',
Age DOUBLE COMMENT 'regex : 99!9',
SibSp DOUBLE COMMENT 'regex : 9',
Parch DOUBLE COMMENT 'regex : 9',
Ticket STRING COMMENT 'regex : ZZZZZ!Z!Z!_9999999',
Fare DOUBLE COMMENT 'regex : 999!9999',
Cabin STRING COMMENT 'regex : Z99_Z99_Z99_Z99',
Embarked STRING COMMENT 'regex : Z'
)
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"",
"escapeChar" = "\\"
)
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH 'test.csv' INTO TABLE test_tmp;
DROP TABLE IF EXISTS test;
CREATE TABLE test STORED AS ORC AS SELECT * FROM test_tmp;
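To verify the loads, a quick row count helps (a minimal sketch using sqlContext from spark-shell, which we switch to later in this article; the standard Kaggle files contain 891 training rows and 418 test rows):
// Row counts after loading the ORC tables
sqlContext.sql("SELECT COUNT(*) AS n FROM kaggle_titanic.train").show()
sqlContext.sql("SELECT COUNT(*) AS n FROM kaggle_titanic.test").show()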
USE kaggle_titanic;
DROP TABLE IF EXISTS train_and_test;
CREATE TABLE train_and_test STORED AS ORC AS
SELECT CAST(PassengerId AS INT) AS PassengerId, Survived, CAST(Pclass AS INT) AS Pclass, Name, Sex, CAST(Age AS DOUBLE) AS Age, CAST(SibSp AS INT) AS SibSp, CAST(Parch AS INT) AS Parch, Ticket, CAST(Fare AS DOUBLE) AS Fare, Cabin, Embarked
FROM train
UNION ALL
SELECT CAST(PassengerId AS INT) AS PassengerId, CAST(NULL AS DOUBLE) AS Survived, CAST(Pclass AS INT) AS Pclass, Name, Sex, CAST(Age AS DOUBLE) AS Age, CAST(SibSp AS INT) AS SibSp, CAST(Parch AS INT) AS Parch, Ticket, CAST(Fare AS DOUBLE) AS Fare, Cabin, Embarked
FROM test
;
Alternatively, since the staging tables already declare the proper column types, the same table can be built without the explicit casts:
USE kaggle_titanic;
DROP TABLE IF EXISTS train_and_test;
CREATE TABLE train_and_test STORED AS ORC AS
SELECT PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked
FROM train
UNION ALL
SELECT PassengerId, CAST(NULL AS DOUBLE) AS Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked
FROM test
;
With some quick SQL queries we can already get a good overview of the data; just make sure your Zeppelin notebook (as well as its security) is configured.
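For example, two exploratory queries run from spark-shell or a Zeppelin paragraph (a minimal sketch, assuming the train_and_test table built above; survival rate by sex and by class is a classic first look):
// Survival rate by sex, computed on the labelled (train) rows only
sqlContext.sql("""
SELECT sex, COUNT(*) AS n, AVG(survived) AS survival_rate
FROM kaggle_titanic.train_and_test
WHERE survived IS NOT NULL
GROUP BY sex
""").show()
// Survival rate by passenger class
sqlContext.sql("""
SELECT pclass, COUNT(*) AS n, AVG(survived) AS survival_rate
FROM kaggle_titanic.train_and_test
WHERE survived IS NOT NULL
GROUP BY pclass
ORDER BY pclass
""").show()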
Let's now clean the dataset (note that one passenger in the test set has a NULL fare, which would later break the Spark vector assembly, so we impute it as well):
USE kaggle_titanic;
DROP TABLE IF EXISTS train_and_test_transform;
CREATE TABLE train_and_test_transform STORED AS ORC AS
SELECT PassengerId, Survived, Pclass, Name, regexp_extract(name, '([^,]*), ([^ ]*)(.*)', 2) AS title, Sex, CASE WHEN age IS NULL THEN 30 ELSE age END AS age,
SibSp, Parch, Ticket, COALESCE(fare, 0) AS fare, Cabin, COALESCE(embarked, 'embarked_is_NULL') AS embarked, substring(cabin,1,1) AS cabin_letter, LENGTH(regexp_replace(cabin, '[^ ]', '')) AS nbr_of_space_cabin,
CASE WHEN age IS NULL THEN true ELSE false END AS c_flag_age_null,
CASE WHEN Cabin IS NULL THEN true ELSE false END AS c_flag_cabin_null,
CASE WHEN embarked IS NULL THEN true ELSE false END AS c_flag_embarked_null
FROM train_and_test
;
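To sanity-check the extracted title column, its distribution can be inspected the same way (a sketch; 'Mr.', 'Miss.' and 'Mrs.' should dominate):
// Distribution of the title extracted from the name column
sqlContext.sql("""
SELECT title, COUNT(*) AS n
FROM kaggle_titanic.train_and_test_transform
GROUP BY title
ORDER BY n DESC
""").show()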
(I will update this part very soon.)
And switch to spark-shell!
import org.apache.spark.ml.feature.{StringIndexer, IndexToString, VectorIndexer, VectorAssembler}
import org.apache.spark.ml.classification.GBTClassificationModel
import org.apache.spark.ml.classification.GBTClassifier
val data_prep_1 = sqlContext.sql("SELECT * FROM kaggle_titanic.train_and_test_transform")
data_prep_1.show(10)
Let's do the data transformations and prepare the features column:
val indexerSex = new StringIndexer().setInputCol("sex").setOutputCol("sexIndexed")
val indexerTitle = new StringIndexer().setInputCol("title").setOutputCol("titleIndexed")
val indexerEmbarked = new StringIndexer().setInputCol("embarked").setOutputCol("embarkedIndexed")
val data_prep_2 = indexerSex.fit(data_prep_1).transform(data_prep_1)
val data_prep_3 = indexerTitle.fit(data_prep_2).transform(data_prep_2)
val data_prep_4 = indexerEmbarked.fit(data_prep_3).transform(data_prep_3)
val vectorAssembler = new VectorAssembler().setInputCols(Array("sexIndexed", "titleIndexed", "embarkedIndexed", "age", "fare", "pclass", "parch", "sibsp", "c_flag_cabin_null", "c_flag_embarked_null")).setOutputCol("features")
val data_prep_5 = vectorAssembler.transform(data_prep_4)
data_prep_5.show()
val data_prep_6 = data_prep_5.filter($"survived".isNotNull)
val indexerSurvived = new StringIndexer().setInputCol("survived").setOutputCol("survivedIndexed")
val data_prep_7 = indexerSurvived.fit(data_prep_6).transform(data_prep_6)
And now let's build the model:
val gbt = new GBTClassifier().setLabelCol("survivedIndexed").setFeaturesCol("features").setMaxIter(50)
var model = gbt.fit(data_prep_7)
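Before using the model, it is worth holding out part of the labelled data to sanity-check it. A minimal sketch (the 80/20 split and the seed are arbitrary choices, not part of the original walkthrough):
// Hold out 20% of the labelled rows for validation
val Array(trainSet, validSet) = data_prep_7.randomSplit(Array(0.8, 0.2), seed = 42)
val evalModel = gbt.fit(trainSet)
val validPredictions = evalModel.transform(validSet)
// Simple accuracy: fraction of rows where the prediction matches the indexed label
val accuracy = validPredictions.filter($"prediction" === $"survivedIndexed").count().toDouble / validSet.count()
println(s"Validation accuracy: $accuracy")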
Finally we can gather the data to predict and use the model:
val data_test_1 = data_prep_5.filter($"survived".isNull)
var result = model.transform(data_test_1)
result.select("passengerid", "prediction").show(1000)
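To turn this into a Kaggle-style submission file (PassengerId,Survived), the indexed prediction can be mapped back to the original label with IndexToString, which was imported earlier. A sketch with an arbitrary example output path; note that StringIndexer orders labels by frequency, so mapping back through the fitted indexer is the safe route:
// Recover the label array from the fitted survived indexer and map predictions back
val survivedLabels = indexerSurvived.fit(data_prep_6).labels
val converter = new IndexToString().setInputCol("prediction").setOutputCol("survivedPredicted").setLabels(survivedLabels)
val submission = converter.transform(result).select("passengerid", "survivedPredicted")
// The labels are the string form of the original DOUBLE values ("0.0"/"1.0"),
// so cast back to an integer for the submission; write one part file (no header)
submission.rdd.map(r => s"${r.get(0)},${r.getString(1).toDouble.toInt}").coalesce(1).saveAsTextFile("/tmp/titanic_submission")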
06-16-2016
05:16 PM
That is amazing!
02-08-2019
02:58 PM
Hi, were you able to find a solution to this? We have a similar setup and I can't seem to find any examples of it.
06-09-2016
10:34 AM
Can you give me the top 50, the min, the max, and the average? Also, did you try the query? What was the behaviour? I am asking because if your query runs very long while using only a few reducers, for example, that may indicate skew; in that case, one way to maximize usage of the cluster is to look at surrogate key creation.
04-23-2016
04:08 PM
In a multi-application environment (MapReduce, Tez, others), some applications may appear to get stuck (blocked, deadlocked); when that happens, it is likely your YARN / MapReduce settings need to be reviewed. First of all, set up YARN queues, check the value of mapreduce.job.reduce.slowstart.completedmaps (it can be interesting to raise it to 0.9), and enable preemption.
01-04-2016
02:07 PM
Dear Grace,
We can start with this template and improve it:
#!/bin/bash
kinit ......
hdfs dfs -rm -r hdfs://....
sqoop import --connect "jdbc:sqlserver://....:1433;username=.....;password=....;database=....DB" --table ..... \
  -m 1 --where "...... > 0"
CR=$?
if [ $CR -ne 0 ]; then
  echo 'Sqoop job failed'
  exit 1
fi
hdfs dfs -cat hdfs://...../* > export_fs_table.txt
CR=$?
if [ $CR -ne 0 ]; then
  echo 'hdfs cat failed'
  exit 1
fi
while IFS=',' read -r id tablename nbr flag; do
  sqoop import --connect "jdbc:sqlserver://......:1433;username=......;password=......;database=.......DB" --table $tablename
  CR=$?
  if [ $CR -ne 0 ]; then
    echo 'sqoop import failed for '$tablename
    exit 1
  fi
done < export_fs_table.txt
Kind regards
12-08-2015
12:43 PM
@mlanciaux Please see this thread.
07-21-2019
11:35 PM
Hi, I shared Toad for Hadoop at http://acnalert.eastus.cloudapp.azure.com/world/ . I recommend using version 1.3, since 1.5 has a licence limitation.