Member since
02-29-2016
108
Posts
213
Kudos Received
14
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
2053 | 08-18-2017 02:09 PM | |
3590 | 06-16-2017 08:04 PM | |
3383 | 01-20-2017 03:36 AM | |
9073 | 01-04-2017 03:06 AM | |
4558 | 12-09-2016 08:27 PM |
11-03-2016
04:58 PM
1 Kudo
Yes and check my answer on another thread https://community.hortonworks.com/questions/46772/how-to-save-dataframe-as-text-file.html#answer-46773
... View more
11-03-2016
04:09 AM
2 Kudos
Following the security lab and reach the following step https://github.com/HortonworksUniversity/Security_Labs#refresh-hdfs-user-group-mappings Run into problem refresh the user-group mapping from AD [root@qwang-hdp0 ~]# sudo sudo -u hdfs kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs-qi
[root@qwang-hdp0 ~]# sudo sudo -u hdfs hdfs dfsadmin -refreshUserToGroupsMappings
Refresh user to groups mapping successful
Then kinit to hr1 user and check the user-group mapping, it doesn't seems to sync correctly for hdfs, hdfs group command not returning the rigth group, where yarn rmadmin is fine. [root@qwang-hdp0 ~]# kinit hr1
Password for hr1@EXAMPLE.COM:
[root@qwang-hdp0 ~]# hdfs groups
hr1@EXAMPLE.COM :
[root@qwang-hdp0 ~]# yarn rmadmin -getGroups hr1
16/11/03 01:30:36 INFO client.RMProxy: Connecting to ResourceManager at hdp1.example.com/172.xx.xxx.xxx:8141
hr1 : domain_users hadoop-users hr
[root@qwang-hdp0 ~]# id hr1
uid=1960401170(hr1) gid=1960400513(domain_users) groups=1960400513(domain_users),1960401154(hr),1960401151(hadoop-users)
The hdfs group is not matching to the AD settings. and ldapsearch confirm the AD setting is there [root@qwang-hdp0 ~]# ldapsearch -h ad01.field.hortonworks.com -p 389 -D "binduser@example.com" -W -b "DC=field,DC=my_org,DC=com" "(sAMAccountName=hr1)"
Enter LDAP Password:
...
memberOf: CN=hr,OU=CorpUsers,DC=field,DC=my_org,DC=com
memberOf: CN=hadoop-users,OU=CorpUsers,DC=field,DC= my_org,DC=com ...
Could you suggest what is going wrong and what to do to trouble shoot/correct the issue
... View more
Labels:
- Labels:
-
Apache Hadoop
11-02-2016
02:36 PM
2 Kudos
this is addressed in the latest sandbox, no an issue any more
... View more
11-01-2016
01:42 AM
the exam is setup on ubuntu with centOS VM as HDP. gedit is on ubuntu. I guess you could install gedit on your own environment but it is very easy to use, so no worry. If you really want to try the test environment, try use HDPCD practice exam, very similar. http://hortonworks.com/wp-content/uploads/2015/02/HDPCD-PracticeExamGuide1.pdf The document will be accessible during exam. it is the link you used for apache site http://spark.apache.org/docs For Hortonworks document, it is under http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/index.html Stick with Apache document as the exam is not really anything Hortonworks specific. There is no way to change the exam environment. You have very limited permissions.
... View more
10-31-2016
06:33 PM
5 Kudos
the test environment is on AMS virtual. When I took the test, it was HDP2.3 and I am not sure what version is used now. you could use the current sandbox for your exercise. Spark is something later than 1.4, probably 1.5. But the knowledge covered are all basic RDD and dataframe that are not very much linked to newer versions. test environment has no IDE. You use either gedit or vi base on you preference. debug with spark-shell or pyspark couple notes on the exam 1. know RDD and dataframe api well. Go through all the docs in the test web page. 2. know how to import and export RDD/dataframe from/to csv files. 3. there is no limit on how you finish the task, so choose the technical you are most familiar with either the API or Spark SQL 4. test environment is quite slow in response, so be patient with it and leave enough time for tasks. Good luck taking the exam.
... View more
10-03-2016
05:14 PM
8 Kudos
Most user access control is accomplished on the group level. One of the example mentioned in Ranger release is to control each doctor only able to view the patients of his/her own. However most example provided up-to-date are mostly static filter condition that will not change for individual user. In this tutorial, we would demonstrate how to create a row level filter that applies to a group but takes different effect on the user level by leveraging UDF. The use case is as following: a patient table contains patient information as well as the id of the doctor treating the patient (this is a simplified example, in really world the relation may be a many-to-many relation, but it could a handled by the same approach). We also have a doctor reference table where the doctor login credential could be found. We will use the login credential to identify the doctor and only show the patient that associated with the doctor. To setup the environment, first download the HDP2.5 Sandbox form Hortonworks website, then follow the instruction to setup the sandbox. Then login to Ambari through 127.0.0.1:8080 using admin/admin and setup the data using the following script in Hive View. create table patients (ssn string, name string, doctor_id int) stored as orc;
create table doctorRef (id int, username string) stored as orc;
insert into doctorRef values (1, 'raj_ops'), (2, 'holger_gov');
insert into patients values
('111-11-1111', 'John Doo', 1),
('222-22-2222', 'Amy Long', 1),
('333-33-3333', 'Muni White', 2),
('444-44-4444', 'Kerry Chang', 2),
('555-55-5555', 'Holy Ng', 1); Then we will setup the dynamic row level filter inside Ranger. Login to Ranger at 127.0.0.1:6080 using admin/admin, then go to Access Manger -> Resource Based Policies, under Hive, click Sandbox_Hive. Then click Row Level Filter Add new policy like the following Policy Type: row level filter Policy Name: dynamic row level filter Hive database: default Hive table: patients Audit logging: yes Description: Select Group: public Select User: Access Type: select Row level filter: doctor_id in (select id from doctorRef where username = current_user())
Then click save.
Now we could test the row level filter policy by login to Ambari 127.0.0.1:8080 as raj_ops/raj_ops, and only the patients with the doctor_id 1 is showing up. When we use UDF current_user() that identify the user in current hive session, we could apply row level filter in group level that could have different effect on each user. This is particularly useful in the environment where large number of users are managed through group policy and users move between groups quite frequently. So no access control need to be done on user level, rather policy could be defined on group level and then through the dynamic row level filter, it will be applied to user level in a dynamic fashion.
... View more
Labels:
10-03-2016
03:09 PM
3 Kudos
After some help from Ranger PM, I was told this is possible by using the Hive UDF which could get the current user that's running Hive context. The function is current_user(). The solution would be to include the user information in the data or have a lookup table associate the user id with the user name. then the role level filter would look like something like UserName = current_user() or UserID in (select uid from userLookup where uname = current_user()) And I just recently punlished a HCC article detailing how to do it https://community.hortonworks.com/content/kbentry/59582/create-dynamic-row-level-filter-in-ranger.html
... View more
09-26-2016
06:33 PM
@slachterman, since you mentioned "Filter conditions can reference other objects", could you please give a couple examples. I could think of using time like LastUpdateTime >= DATEADD(DAY, -30, GetDate()) What other kinds of object could be used?
... View more
09-26-2016
06:06 PM
2 Kudos
Could the row level filter setup on a group be dynamic for each member of the group? Use case is like the following: data analysts are all in a group calledDA. If a RLF policy is setup for the DA group but the goal is to limit each member of the group to only access data assigned to him/her. this could be done easily on per-user level policy, but if the member of the group is huge, it could be lots policies to manage. Is there a way to set it on the group level something like AssigneeID = @userID where @userID is associated with each member.
... View more
Labels:
- Labels:
-
Apache Ranger
08-31-2016
01:50 AM
8 Kudos
As the release of Spark 2.0 finally came, the machine learning library of Spark has been changed from the mllib to ml. One of the biggest change in the new ml library is the introduction of so-called machine learning pipeline. It provides a high level abstraction of the machine learning flow and greatly simplified the creation of machine learning process. In this tutorial, we will walk through the steps on how to create a machine learning pipeline and also explain what is under the hood in the pipeline. (Note: the code in this article is based off the technical preview ml library from Spark 1.6 and with minor change, they should run on Spark 2.0 as well) In this tutorial, we will demonstrate the process to create a pipeline in Spark to predict airline flight delay. The dataset we use contain the airline flights information from 2007 and 2008. We use 2007 data as the training dataset and 2008 data as the testing dataset. We will predict flight delay using 2 classification models, logistic regression and decision tree. The tutorial will walk step-by-step how to use the new ml library transformers to do data munging and then chain them together with machine learning algorithm to create a pipeline to complete the whole process in one step. The tutorial is not meant to be a data science guideline on how to process data and choose machine learning model. It only meant to guide you on how to build Spark ml pipeline in Scala. The choices of data munging decisions as well as the models are not optimal from data science aspect. Environment
We will use the Hortonworks HDP 2.4 sandbox for this tutorial. Please prepare the Sandbox following this instruction and go to the Zeppelin UI at http://127.0.0.1:9995 All the code example are executed inside Zeppelin notebook Get the dataset First, create a new Zeppelin notebook and name it whatever you like Next, use the code below to download the dataset from internet and upload them to HDFS %sh
wget http://stat-computing.org/dataexpo/2009/2007.csv.bz2 -O /tmp/flights_2007.csv.bz2
wget http://stat-computing.org/dataexpo/2009/2008.csv.bz2 -O /tmp/flights_2008.csv.bz2
hdfs dfs -mkdir /tmp/airflightsdelays
hdfs dfs -put /tmp/flights_2007.csv.bz2 /tmp/flights_2008.csv.bz2 /tmp/airflightsdelays/
Load data to DataFrame Next, we will load the data from HDFS into Spark dataframe so it could be processed by ml pipeline. The new ml pipeline only process data inside dataframe, not in RDD like the old mllib. There are 2 dataframe being created, one for training data and one for testing data. A helper function is created to convert the military format time into a integer which is the number of minutes from midnight so we could use it as numeric feature. The detail about the dataset could be found here. %spark
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType,StructField,StringType}
//calculate minuted from midnight, input is military time format
def getMinuteOfDay(depTime: String) : Int = (depTime.toInt / 100).toInt * 60 + (depTime.toInt % 100)
//define schema for raw data
case class Flight(Month: String, DayofMonth: String, DayOfWeek: String, DepTime: Int, CRSDepTime: Int, ArrTime: Int, CRSArrTime: Int, UniqueCarrier: String, ActualElapsedTime: Int, CRSElapsedTime: Int, AirTime: Int, ArrDelay: Double, DepDelay: Int, Origin: String, Distance: Int)
val flight2007 = sc.textFile("/tmp/airflightsdelays/flights_2007.csv.bz2")
val header = flight2007.first
val trainingData = flight2007
.filter(x => x != header)
.map(x => x.split(","))
.filter(x => x(21) == "0")
.filter(x => x(17) == "ORD")
.filter(x => x(14) != "NA")
.map(p => Flight(p(1), p(2), p(3), getMinuteOfDay(p(4)), getMinuteOfDay(p(5)), getMinuteOfDay(p(6)), getMinuteOfDay(p(7)), p(8), p(11).toInt, p(12).toInt, p(13).toInt, p(14).toDouble, p(15).toInt, p(16), p(18).toInt))
.toDF
trainingData.cache
val flight2008 = sc.textFile("/tmp/airflightsdelays/flights_2008.csv.bz2")
val testingData = flight2008
.filter(x => x != header)
.map(x => x.split(","))
.filter(x => x(21) == "0")
.filter(x => x(17) == "ORD")
.filter(x => x(14) != "NA")
.map(p => Flight(p(1), p(2), p(3), getMinuteOfDay(p(4)), getMinuteOfDay(p(5)), getMinuteOfDay(p(6)), getMinuteOfDay(p(7)), p(8), p(11).toInt, p(12).toInt, p(13).toInt, p(14).toDouble, p(15).toInt, p(16), p(18).toInt))
.toDF
testingData.cache
Create feature transformers The building blocks of ml pipeline are transformers and estimators. Transformers Transform one dataframe to another dataframe. Estimators Fit one dataframe and produce a model, which is a transformer. Most Transforms are under org.apache.spark.ml.feature package and the code below shows 2 of them, StringIndexer and VectorAssembler. StringIndexer converts String values that are part of a look-up into categorical indices, which could be used by machine learning algorithms in ml library. In the dataset we used, we will transfer month, day of month, day of week, carrier and origin airport code using this transformer. Notice we provide the input column name and the output column name as parameters at the time of initialization of the StringIndexer. VectorAssembler constructs Vector from raw feature columns. Most ml machine learning algorithms take features in the form of vector. import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.VectorAssembler
//tranformor to convert string to category values
val monthIndexer = new StringIndexer().setInputCol("Month").setOutputCol("MonthCat")
val dayofMonthIndexer = new StringIndexer().setInputCol("DayofMonth").setOutputCol("DayofMonthCat")
val dayOfWeekIndexer = new StringIndexer().setInputCol("DayOfWeek").setOutputCol("DayOfWeekCat")
val uniqueCarrierIndexer = new StringIndexer().setInputCol("UniqueCarrier").setOutputCol("UniqueCarrierCat")
val originIndexer = new StringIndexer().setInputCol("Origin").setOutputCol("OriginCat")
//assemble raw feature
val assembler = new VectorAssembler()
.setInputCols(Array("MonthCat", "DayofMonthCat", "DayOfWeekCat", "UniqueCarrierCat", "OriginCat", "DepTime", "CRSDepTime", "ArrTime", "CRSArrTime", "ActualElapsedTime", "CRSElapsedTime", "AirTime","DepDelay", "Distance"))
.setOutputCol("rawFeatures")
Create machine learning pipeline In the last section, we created a few more transformers and defined the parameters for each of them so they operate on the input dataframe and produce desired output dataframe. VectorSlicer takes a vector as input column and create a new vector which contain only part of the attributes of the original vector. Binarizer create a binary value, 0 or 1, from input column of double value with a threshold. StandardScaler is used to scale vector to a new vector which values are in similar scale. We will initialize an estimator, the Logistic Regression Classifier, and chain them together in a machine learning pipeline. When the pipeline is used to fit training data, the transformers and estimator will apply to the dataframe in the sequence defined in the pipeline. The pipeline model produced by fitting the training data contains almost exactly the same process like the pipeline estimator. The testing data will go through the same data munging process except for the last step, where the original estimator in the pipeline estimator is replaced with a model, which is a transformer. import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StandardScaler
//vestor slicer
val slicer = new VectorSlicer().setInputCol("rawFeatures").setOutputCol("slicedfeatures").setNames(Array("MonthCat", "DayofMonthCat", "DayOfWeekCat", "UniqueCarrierCat", "DepTime", "ArrTime", "ActualElapsedTime", "AirTime", "DepDelay", "Distance"))
//scale the features
val scaler = new StandardScaler().setInputCol("slicedfeatures").setOutputCol("features").setWithStd(true).setWithMean(true)
//labels for binary classifier
val binarizerClassifier = new Binarizer().setInputCol("ArrDelay").setOutputCol("binaryLabel").setThreshold(15.0)
//logistic regression
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8).setLabelCol("binaryLabel").setFeaturesCol("features")
// Chain indexers and tree in a Pipeline
val lrPipeline = new Pipeline().setStages(Array(monthIndexer, dayofMonthIndexer, dayOfWeekIndexer, uniqueCarrierIndexer, originIndexer, assembler, slicer, scaler, binarizerClassifier, lr))
// Train model.
val lrModel = lrPipeline.fit(trainingData)
// Make predictions.
val lrPredictions = lrModel.transform(testingData)
// Select example rows to display.
lrPredictions.select("prediction", "binaryLabel", "features").show(20) The code below uses another estimator, Decision Tree Classifier. The pipeline is made up with different chain of transformers. But the concept is the same as the previous pipeline. You can see how easy it is to create a different pipeline using existing transformer. PCA is a dimension reduction transformer. VectorIndexer convert values inside vector that could be categorical indices to indices. Bucktizer convert continuous value into categorical indices based on provided threshold. import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.feature.PCA
//index category index in raw feature
val indexer = new VectorIndexer().setInputCol("rawFeatures").setOutputCol("rawFeaturesIndexed").setMaxCategories(10)
//PCA
val pca = new PCA().setInputCol("rawFeaturesIndexed").setOutputCol("features").setK(10)
//label for multi class classifier
val bucketizer = new Bucketizer().setInputCol("ArrDelay").setOutputCol("multiClassLabel").setSplits(Array(Double.NegativeInfinity, 0.0, 15.0, Double.PositiveInfinity))
// Train a DecisionTree model.
val dt = new DecisionTreeClassifier().setLabelCol("multiClassLabel").setFeaturesCol("features")
// Chain all into a Pipeline
val dtPipeline = new Pipeline().setStages(Array(monthIndexer, dayofMonthIndexer, dayOfWeekIndexer, uniqueCarrierIndexer, originIndexer, assembler, indexer, pca, bucketizer, dt))
// Train model.
val dtModel = dtPipeline.fit(trainingData)
// Make predictions.
val dtPredictions = dtModel.transform(testingData)
// Select example rows to display.
dtPredictions.select("prediction", "multiClassLabel", "features").show(20)
Conclusion Spark machine learning pipeline is a very efficient way of creating machine learning flow. It eliminates the needs to write a lot of boiler-plate code during the data munging process. It also guarantee the training data and testing data go through exactly the same data processing without any additional effort. The same pipeline concept has been implemented in many other popular machine learning library and glad to see it finally available in Spark. How to run this tutorial with Spark 2.0 Since the tutorial was first published, Spark 2.x has gain lots of popularity and many users asked how to run this tutorial in Spark 2.x. With HDP 2.5 Sandbox, Spark 2.0 is included as Tech Preview. Because the version of Zeppelin in HDP 2.5 is not compatible with Spark 2.0 yet, the only to run the sample code is to use spark-shell in console First you need to switch the version of Spark and launch spark-shell export SPARK_MAJOR_VERSION=2
spark-shell
Once inside spark-shell, confirm Spark 2.x is the version used, you should see followings with the HDP 2.5 Sandbox scala> sc.version
res5: String = 2.0.0.2.5.0.0-1245
The only code that need to be modified for the above tutorial code to work in side Spark 2.0 is a import line. spark.milib is the lib for Spark 1.x, just change it to spark.ml, all code would work in Spark2.0 import org.apache.spark.ml.linalg.Vectors For reference, here are the code that works in Spark2.0 spark-shell import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType,StructField,StringType}
def getMinuteOfDay(depTime: String) : Int = (depTime.toInt / 100).toInt * 60 + (depTime.toInt % 100)
case class Flight(Month: String, DayofMonth: String, DayOfWeek: String, DepTime: Int, CRSDepTime: Int, ArrTime: Int, CRSArrTime: Int, UniqueCarrier: String, ActualElapsedTime: Int, CRSElapsedTime: Int, AirTime: Int, ArrDelay: Double, DepDelay: Int, Origin: String, Distance: Int)
val flight2007 = sc.textFile("/tmp/airflightsdelays/flights_2007.csv.bz2")
val header = flight2007.first
val trainingData = flight2007.filter(x => x != header).map(x => x.split(",")).filter(x => x(21) == "0").filter(x => x(17) == "ORD").filter(x => x(14) != "NA").map(p => Flight(p(1), p(2), p(3), getMinuteOfDay(p(4)), getMinuteOfDay(p(5)), getMinuteOfDay(p(6)), getMinuteOfDay(p(7)), p(8), p(11).toInt, p(12).toInt, p(13).toInt, p(14).toDouble, p(15).toInt, p(16), p(18).toInt)).toDF
trainingData.cache
val flight2008 = sc.textFile("/tmp/airflightsdelays/flights_2008.csv.bz2")
val testingData = flight2008.filter(x => x != header).map(x => x.split(",")).filter(x => x(21) == "0").filter(x => x(17) == "ORD").filter(x => x(14) != "NA").map(p => Flight(p(1), p(2), p(3), getMinuteOfDay(p(4)), getMinuteOfDay(p(5)), getMinuteOfDay(p(6)), getMinuteOfDay(p(7)), p(8), p(11).toInt, p(12).toInt, p(13).toInt, p(14).toDouble, p(15).toInt, p(16), p(18).toInt)).toDF
testingData.cache
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.VectorAssembler
val monthIndexer = new StringIndexer().setInputCol("Month").setOutputCol("MonthCat")
val dayofMonthIndexer = new StringIndexer().setInputCol("DayofMonth").setOutputCol("DayofMonthCat")
val dayOfWeekIndexer = new StringIndexer().setInputCol("DayOfWeek").setOutputCol("DayOfWeekCat")
val uniqueCarrierIndexer = new StringIndexer().setInputCol("UniqueCarrier").setOutputCol("UniqueCarrierCat")
val originIndexer = new StringIndexer().setInputCol("Origin").setOutputCol("OriginCat")
val assembler = new VectorAssembler().setInputCols(Array("MonthCat", "DayofMonthCat", "DayOfWeekCat", "UniqueCarrierCat", "OriginCat", "DepTime", "CRSDepTime", "ArrTime", "CRSArrTime", "ActualElapsedTime", "CRSElapsedTime", "AirTime","DepDelay", "Distance")).setOutputCol("rawFeatures")
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StandardScaler
val slicer = new VectorSlicer().setInputCol("rawFeatures").setOutputCol("slicedfeatures").setNames(Array("MonthCat", "DayofMonthCat", "DayOfWeekCat", "UniqueCarrierCat", "DepTime", "ArrTime", "ActualElapsedTime", "AirTime", "DepDelay", "Distance"))
val scaler = new StandardScaler().setInputCol("slicedfeatures").setOutputCol("features").setWithStd(true).setWithMean(true)
val binarizerClassifier = new Binarizer().setInputCol("ArrDelay").setOutputCol("binaryLabel").setThreshold(15.0)
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8).setLabelCol("binaryLabel").setFeaturesCol("features")
val lrPipeline = new Pipeline().setStages(Array(monthIndexer, dayofMonthIndexer, dayOfWeekIndexer, uniqueCarrierIndexer, originIndexer, assembler, slicer, scaler, binarizerClassifier, lr))
val lrModel = lrPipeline.fit(trainingData)
val lrPredictions = lrModel.transform(testingData)
lrPredictions.select("prediction", "binaryLabel", "features").show(20)
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.feature.PCA
val indexer = new VectorIndexer().setInputCol("rawFeatures").setOutputCol("rawFeaturesIndexed").setMaxCategories(10)
val pca = new PCA().setInputCol("rawFeaturesIndexed").setOutputCol("features").setK(10)
val bucketizer = new Bucketizer().setInputCol("ArrDelay").setOutputCol("multiClassLabel").setSplits(Array(Double.NegativeInfinity, 0.0, 15.0, Double.PositiveInfinity))
val dt = new DecisionTreeClassifier().setLabelCol("multiClassLabel").setFeaturesCol("features")
val dtPipeline = new Pipeline().setStages(Array(monthIndexer, dayofMonthIndexer, dayOfWeekIndexer, uniqueCarrierIndexer, originIndexer, assembler, indexer, pca, bucketizer, dt))
val dtModel = dtPipeline.fit(trainingData)
val dtPredictions = dtModel.transform(testingData)
dtPredictions.select("prediction", "multiClassLabel", "features").show(20)
... View more
Labels: