Member since
10-06-2015
32
Posts
62
Kudos Received
3
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
| 4468 | 06-12-2016 11:22 AM
| 602 | 05-11-2016 11:59 AM
| 653 | 10-14-2015 05:47 AM
07-11-2017
03:11 PM
Hello @Rajesh, following the lovely request of a customer here, there is a Hortonworks task to support it in HDP 3.0. If there is some way to backport it, I will try to update this thread. With kind regards
... View more
06-29-2017
02:56 PM
25 Kudos
Hadoop is a marvelous ecosystem: so many improvements in the last few years, and so many ways to extract insight from Big Data now. On the other hand, Hadoop can be complex to manage, especially with new releases, projects, versions, features and security. This is one of the reasons why DevOps is highly recommended. DevOps helps to enforce testing, to ensure smooth transitions between environments, and thus to improve the overall quality of any project running on the platform. In this presentation, I tried to explain the basic concepts of DevOps, how Hadoop integrates nicely with tools like Jenkins, Ansible, Git and Maven, and how amazing it is to use the advantages of a Big Data platform to monitor each component, project and quality metric (delivery, performance & SLA, logs).
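As a small illustration (my own sketch, not from the presentation; host names, paths and the project artefact are placeholders), a Jenkins "Execute shell" step for such a pipeline could look like this:

#!/bin/bash
# Hypothetical Jenkins build step: build, deploy and smoke-test a Hive project
set -e
# Build and unit-test the project with Maven
mvn clean package
# Push the artefact to an edge node (host and path are placeholders)
scp target/my-hive-project.tar.gz hadoop@edge-node.example.com:/opt/deploy/
# Simple smoke test against HiveServer2
ssh hadoop@edge-node.example.com "beeline -u 'jdbc:hive2://hiveserver2.example.com:10000/default' -e 'SHOW DATABASES;'"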
... View more
- Find more articles tagged with:
- ansible
- devops
- git
- hadoop
- Hadoop Core
- How-ToTutorial
- jenkins
09-15-2016
11:28 AM
9 Kudos
Why not try a Kaggle challenge (Titanic)! (This is a work in progress; I will update this article as soon as I get more free time. I hope you will enjoy it!)
Let's create the corresponding database first :
CREATE DATABASE IF NOT EXISTS kaggle_titanic;
And then load the data :
USE kaggle_titanic;
DROP TABLE IF EXISTS train_tmp;
ADD JAR csv-serde-1.1.2-0.11.0-all.jar;
CREATE TABLE train_tmp (
PassengerId DOUBLE COMMENT 'regex : 999',
Survived DOUBLE COMMENT 'regex : 9',
Pclass INT COMMENT 'regex : 9',
Name STRING COMMENT 'regex : _Zzz!_Zzzzz_!Zzzzzzzzz_!!Zzzzzz!!_Zzzzzz_Zzzzz!!',
Sex STRING COMMENT 'regex : zzzzzz',
Age DOUBLE COMMENT 'regex : 99!9',
SibSp DOUBLE COMMENT 'regex : 9',
Parch DOUBLE COMMENT 'regex : 9',
Ticket STRING COMMENT 'regex : ZZZZZ!Z!Z!_9999999',
Fare DOUBLE COMMENT 'regex : 999!9999',
Cabin STRING COMMENT 'Z99_Z99_Z99_Z99',
Embarked STRING COMMENT 'regex : Z'
)
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties (
"separatorChar" = ",",
"quoteChar" = "\"",
"escapeChar" = "\"
)
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH 'train.csv' INTO TABLE train_tmp;
DROP TABLE IF EXISTS train;
CREATE TABLE train STORED AS ORC AS SELECT * FROM train_tmp;
USE kaggle_titanic;
DROP TABLE IF EXISTS test_tmp;
ADD JAR csv-serde-1.1.2-0.11.0-all.jar;
CREATE TABLE test_tmp (
PassengerId DOUBLE COMMENT 'regex : 999',
Pclass INT COMMENT 'regex : 9',
Name STRING COMMENT 'regex : _Zzz!_Zzzzz_!Zzzzzzzzz_!!Zzzzzz!!_Zzzzzz_Zzzzz!!',
Sex STRING COMMENT 'regex : zzzzzz',
Age DOUBLE COMMENT 'regex : 99!9',
SibSp DOUBLE COMMENT 'regex : 9',
Parch DOUBLE COMMENT 'regex : 9',
Ticket STRING COMMENT 'regex : ZZZZZ!Z!Z!_9999999',
Fare DOUBLE COMMENT 'regex : 999!9999',
Cabin STRING COMMENT 'Z99_Z99_Z99_Z99',
Embarked STRING COMMENT 'regex : Z'
)
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties (
"separatorChar" = ",",
"quoteChar" = "\"",
"escapeChar" = "\"
)
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH 'test.csv' INTO TABLE test_tmp;
DROP TABLE IF EXISTS test;
CREATE TABLE test STORED AS ORC AS SELECT * FROM test_tmp;
USE kaggle_titanic;
DROP TABLE IF EXISTS train_and_test;
CREATE TABLE train_and_test STORED AS ORC AS
SELECT CAST(PassengerId AS INT) AS PassengerId, Survived, CAST(Pclass AS INT) AS Pclass, Name, Sex, CAST(Age AS DOUBLE) AS Age, CAST(SibSp AS INT) AS SibSp, CAST(Parch AS INT) AS Parch, Ticket, CAST(Fare AS DOUBLE) AS Fare, Cabin, Embarked
FROM train
UNION ALL
SELECT CAST(PassengerId AS INT) AS PassengerId, CAST(NULL AS DOUBLE) AS Survived, CAST(Pclass AS INT) AS Pclass, Name, Sex, CAST(Age AS DOUBLE) AS Age, CAST(SibSp AS INT) AS SibSp, CAST(Parch AS INT) AS Parch, Ticket, CAST(Fare AS DOUBLE) AS Fare, Cabin, Embarked
FROM test
;
USE kaggle_titanic;
DROP TABLE IF EXISTS train_and_test;
CREATE TABLE train_and_test STORED AS ORC AS
SELECT PassengerId , Survived , Pclass , Name, Sex, Age , SibSp , Parch , Ticket , Fare , Cabin , Embarked
FROM train
UNION ALL
SELECT PassengerId , CAST(NULL AS Double) AS Survived , Pclass , Name, Sex, Age , SibSp , Parch , Ticket , Fare , Cabin , Embarked
FROM test
;
With some quick SQL queries we can already get a good overview of the data (make sure your Zeppelin notebook is configured, as well as its security).
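For example, here is a minimal exploratory query (my own sketch, not part of the original article) against the train_and_test table built above:

-- Survival rate by sex and passenger class (training rows only)
SELECT sex, pclass,
       COUNT(*) AS passengers,
       AVG(survived) AS survival_rate
FROM kaggle_titanic.train_and_test
WHERE survived IS NOT NULL
GROUP BY sex, pclass
ORDER BY sex, pclass;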
Let's now clean the dataset :
USE kaggle_titanic;
DROP TABLE IF EXISTS train_and_test_transform;
CREATE TABLE train_and_test_transform STORED AS ORC AS
SELECT PassengerId , Survived , Pclass , Name, regexp_extract(name, '([^,]*), ([^ ]*)(.*)', 2) AS title, Sex, CASE WHEN age IS NULL THEN 30 ELSE age END AS age ,
SibSp , Parch , Ticket , Fare , Cabin , COALESCE(embarked, 'embarked_is_NULL') AS embarked, substring(cabin,1,1) AS cabin_letter, LENGTH(regexp_replace(cabin, '[^ ]', '')) AS nbr_of_space_cabin,
CASE WHEN age IS NULL THEN true ELSE false END AS c_flag_age_null,
CASE WHEN Cabin IS NULL THEN true ELSE false END AS c_flag_cabin_null,
CASE WHEN embarked IS NULL THEN true ELSE false END AS c_flag_embarked_null
FROM train_and_test
;
(I will update this section very soon.)
And switch to spark-shell !
import org.apache.spark.ml.feature.{StringIndexer, IndexToString, VectorIndexer, VectorAssembler}
import org.apache.spark.ml.classification.GBTClassificationModel;
import org.apache.spark.ml.classification.GBTClassifier;
val data_prep_1 = sqlContext.sql("SELECT * FROM kaggle_titanic.train_and_test_transform");
data_prep_1.show(10);
Let's do the data transformation and prepare the feature column
val indexerSex = new StringIndexer().setInputCol("sex").setOutputCol("sexIndexed")
val indexerTitle = new StringIndexer().setInputCol("title").setOutputCol("titleIndexed")
val indexerEmbarked = new StringIndexer().setInputCol("embarked").setOutputCol("embarkedIndexed")
val data_prep_2 = indexerSex.fit(data_prep_1).transform(data_prep_1)
val data_prep_3 = indexerTitle.fit(data_prep_2).transform(data_prep_2)
val data_prep_4 = indexerEmbarked.fit(data_prep_3).transform(data_prep_3)
val vectorAssembler = new VectorAssembler().setInputCols(Array("sexIndexed", "titleIndexed", "embarkedIndexed", "age", "fare", "pclass", "parch", "sibsp", "c_flag_cabin_null", "c_flag_embarked_null")).setOutputCol("features")
val data_prep_5 = vectorAssembler.transform(data_prep_4)
data_prep_5.show()
val data_prep_6 = data_prep_5.filter($"survived".isNotNull)
val indexerSurvived = new StringIndexer().setInputCol("survived").setOutputCol("survivedIndexed")
val data_prep_7 = indexerSurvived.fit(data_prep_6).transform(data_prep_6)
And now let's build the model
val gbt = new GBTClassifier().setLabelCol("survivedIndexed").setFeaturesCol("features").setMaxIter(50)
var model = gbt.fit(data_prep_7)
Finally we can gather the data to predict and use the model :
val data_test_1 = data_prep_5.filter($"survived".isNull)
var result = model.transform(data_test_1)
result.select("passengerid", "prediction").show(1000)
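If you want to produce a Kaggle submission file from these predictions, here is a possible sketch (my own addition, assuming Spark 2.x; on Spark 1.6 the com.databricks.spark.csv package would be needed instead, and the output path is a placeholder):

// Hypothetical: export a Kaggle-style submission file (PassengerId,Survived)
val submission = result.selectExpr(
  "CAST(passengerid AS INT) AS PassengerId",
  "CAST(prediction AS INT) AS Survived")
submission.coalesce(1)
  .write
  .format("csv")                   // built-in CSV source in Spark 2.x
  .option("header", "true")
  .save("/tmp/titanic_submission") // placeholder output path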
... View more
- Find more articles tagged with:
- Data Science & Advanced Analytics
- FAQ
- Hive
- How-ToTutorial
- Spark
06-20-2016
03:50 AM
Dear Fabricio, I successfully made a workflow run from my local VM to my remote Hadoop cluster by changing the SSH connection property. Hope that helps. Kind regards.
... View more
06-16-2016
03:23 PM
And don't forget to check these best practices: https://wiki.jenkins-ci.org/display/JENKINS/Jenkins+Best+Practices
... View more
06-16-2016
02:11 PM
17 Kudos
First of all, you need a Hadoop user with the correct Ranger permissions (I will use mine: mlanciau), some HDFS quota (if configured), and a YARN queue (I hope it is configured). For this example, I am using the default Jenkins configuration with the SSH plugin and the TeraSort benchmark suite.

1. Configure SSH remote hosts in Jenkins by entering your credentials (of course you can use a different config)
2. Create a new item (on the main portal), I called it "Hadoop performance", then click on it and configure
3. Click on "This project is parameterised" and add these 2 parameters (you may want to change the parameters ;-))
4. Go to build, click on "Add a step to the build" and choose "Execute shell script on remote host using ssh"
5. Then use the information below to configure the different scripts (you may want to configure your own user)

hdfs dfs -rm -r -skipTrash /user/mlanciau/teragen ; echo 'deleting directory'
hdfs dfs -rm -r -skipTrash /user/mlanciau/terasort ; echo 'deleting directory'
hdfs dfs -rm -r -skipTrash /user/mlanciau/teravalidate ; echo 'deleting directory'
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar teragen -Dmapred.map.tasks=${mapperNbr} 10000 /user/mlanciau/teragen
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar terasort -Dmapred.reduce.tasks=${reducerNbr} /user/mlanciau/teragen /user/mlanciau/terasort
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar teravalidate /user/mlanciau/terasort /user/mlanciau/teravalidate

6. Click on save
7. Click on "Build with Parameters", check the parameter values and then "Build"
8. Follow the progress by clicking on the step in progress and "Console Output"!

If you want, you can also check execution time and success / failure thanks to the Jenkins UI. Cheers 😉
... View more
- Find more articles tagged with:
- automate
- automation
- benchmark
- devops
- Governance & Lifecycle
- How-ToTutorial
06-16-2016
01:15 PM
There is a simple example here: https://community.hortonworks.com/articles/40171/simple-example-of-jenkins-hdp-integration.html. I will add more later.
... View more
06-13-2016
07:58 PM
I think the key point is to configure Jenkins so it can use the edge node via the SSH plugin (or to install Jenkins there). The rest is a matter of configuring security and backup, and choosing the right number of parameters to fit your usage, so you can switch easily from one environment to the other (dev, test, prod).
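As a rough illustration (the parameter name, host pattern and path below are hypothetical, not from this thread), a parameterised Jenkins shell step could pick the edge node per environment like this:

#!/bin/bash
# Hypothetical Jenkins shell step; TARGET_ENV is assumed to be a Jenkins build parameter
ENV="${TARGET_ENV:-dev}"            # dev, test or prod
EDGE_NODE="edge-${ENV}.example.com" # placeholder host naming convention
# Run the same job against the environment-specific edge node and HDFS path
ssh "hadoop@${EDGE_NODE}" "hdfs dfs -ls /data/${ENV}/"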
... View more
06-12-2016
11:22 AM
Sure, so basically regarding the cluster, you may find it useful to:
- Configure queues with the Capacity Scheduler (production, dev, integration, test) and use elasticity and preemption (see the capacity-scheduler sketch below)
- Map users to queues
- Use a naming convention for queues and users by specifying -dev or -test
- Depending on the tool you are using: different database names with Hive, different directories with HDFS + quotas, namespaces for HBase
- Ranger will help you configure the permissions for each user / group so they access the right resources
- Each user will have different environment settings
- Use Jenkins and Maven (if needed) to build, push the code (with the SSH plugin) and run the tests
- Use templates to provide tools to the users with logging features / correct parameters and options
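For example, a minimal capacity-scheduler sketch for per-environment queues could look like this (queue names and capacity values are hypothetical, to be adapted to your cluster):

<!-- Hypothetical excerpt of capacity-scheduler.xml: one queue per environment -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>production,dev,integration,test</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.production.capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>20</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.integration.capacity</name>
  <value>10</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.test.capacity</name>
  <value>10</value>
</property>
<!-- Elasticity: let the dev queue borrow idle capacity up to 40% -->
<property>
  <name>yarn.scheduler.capacity.root.dev.maximum-capacity</name>
  <value>40</value>
</property>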
... View more
06-09-2016
11:01 AM
Dear Fabricio, yes, we have several customers working on this topic. It is an interesting one. From what I have seen last time, the architecture was based on 2 real clusters, one PROD, one DR + TEST + INTEGRATION, with YARN queues and HDFS quotas configured accordingly. Jenkins + SVN take care of the versioning + build + test. Some great teams have also built their own projects to validate the dev work and follow the deployment across the different environments. I don't know much about Docker, Mesos or Marathon, so I can't answer for that part. Can you perhaps give me more details about what you are looking for? What did you try? Kind regards.
... View more
06-09-2016
10:34 AM
Can you give me the top 50, the min, the max and the average? Also, did you try the query? What was the behaviour? The reason I am asking is that if your query takes a very long time while using only a few reducers, for example, it may indicate a skew; in that case, one way to maximize usage of the cluster is to look at surrogate key creation.
... View more
05-27-2016
03:31 PM
Please run these two queries:

SELECT snapshot_id, COUNT(*)
FROM factsamplevalue
GROUP BY snapshot_id;

SELECT snapshot_id, COUNT(*)
FROM DimSnapshot
GROUP BY snapshot_id;

and, if you can, get a histogram. Thanks.
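To summarise the skew directly in Hive (my own sketch, not part of the original reply), you could aggregate the per-key counts like this:

-- Min / max / average number of rows per snapshot_id on the fact table
SELECT MIN(cnt) AS min_per_key,
       MAX(cnt) AS max_per_key,
       AVG(cnt) AS avg_per_key
FROM (
  SELECT snapshot_id, COUNT(*) AS cnt
  FROM factsamplevalue
  GROUP BY snapshot_id
) t;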
... View more
05-24-2016
06:03 AM
I understand, but I would like to know if there is a data skew; telling me the max, the min and the avg (perhaps a histogram) would help me.
... View more
05-23-2016
10:32 AM
Can you try to INSERT values into one table in Oracle (as a test)? I guess you have the GRANTs since you have created the tables, but just to be sure.
... View more
05-23-2016
10:28 AM
Hello, I can help you make this query work on your infrastructure. Can you give me the number of elements per key (for t.snapshot_id = f.snapshot_id), i.e. how many times the same key will be found in both tables? I am not sure indexes will help here at first. There are several ways to optimize a Hive query (like creating a surrogate key), but we will check all the options / parameters going forward. Kind regards.
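One possible way to get that number (a sketch, assuming the two tables are named factsamplevalue and dimsnapshot as elsewhere in this thread):

-- Top 50 keys by number of matching rows across the join
SELECT f.snapshot_id, COUNT(*) AS matches
FROM factsamplevalue f
JOIN dimsnapshot t ON t.snapshot_id = f.snapshot_id
GROUP BY f.snapshot_id
ORDER BY matches DESC
LIMIT 50;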
... View more
05-11-2016
11:59 AM
1 Kudo
A case has been opened (#00075378) and a solution was provided, so I think this question can be marked as solved:
- manual deletion in the Ambari DB
- ambari-server restart
... View more
05-11-2016
11:46 AM
Can you check that the tables have been created in the Oracle instance?
... View more
05-11-2016
11:35 AM
Hello @Joshua Adeleke, I have seen some issues with Oracle and some of our components depending on the version. I will try to gather more information about this case.
... View more
04-23-2016
04:08 PM
In a multi-application environment (MapReduce, Tez, others), it can happen that some applications get stuck (blocked, deadlocked); it is likely that your YARN / MapReduce settings need to be reviewed. First of all, set up YARN queues, check the value of mapreduce.job.reduce.slowstart.completedmaps (it can be interesting to raise it to 0.9) and enable preemption, as in the sketch below.
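A minimal sketch of the relevant properties (values are illustrative and should be tuned for your cluster):

<!-- mapred-site.xml: start reducers only once 90% of the maps are done -->
<property>
  <name>mapreduce.job.reduce.slowstart.completedmaps</name>
  <value>0.9</value>
</property>
<!-- yarn-site.xml: enable Capacity Scheduler preemption -->
<property>
  <name>yarn.resourcemanager.scheduler.monitor.enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.monitor.policies</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy</value>
</property>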
... View more
- Find more articles tagged with:
- Data Processing
- Issue Resolution
- issue-resolution
- Mapreduce
- YARN
- yarn-scheduler
01-04-2016
02:07 PM
Dear Grace, we can start with this template and improve it:

#!/bin/bash

kinit ......

hdfs dfs -rm -r hdfs://....

sqoop import --connect "jdbc:sqlserver://....:1433;username=.....;password=.....;database=....DB" --table ..... \
  -m 1 --where "...... > 0"
CR=$?
if [ $CR -ne 0 ]; then
  echo 'Sqoop job failed'
  exit 1
fi

hdfs dfs -cat hdfs://...../* > export_fs_table.txt
CR=$?
if [ $CR -ne 0 ]; then
  echo 'hdfs cat failed'
  exit 1
fi

while IFS=',' read -r id tablename nbr flag; do
  sqoop import --connect "jdbc:sqlserver://......:1433;username=......;password=......;database=.......DB" --table $tablename
  CR=$?
  if [ $CR -ne 0 ]; then
    echo 'sqoop import failed for '$tablename
    exit 1
  fi
done < export_fs_table.txt

Kind regards
... View more
11-18-2015
01:47 PM
4 Kudos
Hello, I have a customer wondering what they can use as a Hive GUI (but not based on web technology, so not Ambari views or Hue). I was thinking of / found http://squirrel-sql.sourceforge.net/ and https://www.toadworld.com/products/toad-for-hadoop, and probably Eclipse with JDBC? Any feedback is welcome. Thanks and kind regards.
... View more
Labels:
- Apache Hive
10-14-2015
05:47 AM
Here are more details: it was not a rolling upgrade, and we found a Jira: https://issues.apache.org/jira/browse/AMBARI-13358
... View more
10-13-2015
12:05 PM
Dear Jonas, thanks for your reply. Yes, we wanted to enable HDFS HA and WebHDFS HA within our Knox instance. We did follow those steps and it works like a charm for WebHDFS. I was wondering if there is something else to do, following this documentation: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_Knox_Gateway_Admin_Guide/content/service_configuration_example.html and this comment: "Both WEBHDFS and NAMENODE require a tag (ha-alias) in order to work in High Availability mode." Maxime
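For reference, a minimal sketch of the Knox topology pieces involved (the NameNode hosts below are placeholders, to be adapted to your cluster):

<!-- Hypothetical excerpt of a Knox topology file: HaProvider + WEBHDFS service -->
<provider>
  <role>ha</role>
  <name>HaProvider</name>
  <enabled>true</enabled>
  <param>
    <name>WEBHDFS</name>
    <value>maxFailoverAttempts=3;failoverSleep=1000;maxRetryAttempts=300;retrySleep=1000;enabled=true</value>
  </param>
</provider>
<service>
  <role>WEBHDFS</role>
  <url>http://namenode1.example.com:50070/webhdfs</url>
  <url>http://namenode2.example.com:50070/webhdfs</url>
</service>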
... View more