
Spark MLlib for detecting outliers from CSV data



@Gayathri Devi, there is no direct method available for detecting outliers, but you can use a quantile-based approach to determine lower and upper bounds and filter the data against them. After the data is filtered, you can create an ML Pipeline with all the transformations required to run machine learning models (regression, classification, etc.).
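A minimal PySpark sketch of the quantile filter, under stated assumptions: the answer does not say which quantiles or bounds to use, so the interquartile-range fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR) are my choice, and the helper names `iqr_bounds` and `filter_outliers` are hypothetical. The only Spark calls used are `DataFrame.approxQuantile` and `Column.between`.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return (lower, upper) fences computed from the first and third quartiles.

    Using k * IQR fences is an assumption; any other quantile rule would fit here.
    """
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def filter_outliers(df, column, k=1.5, rel_error=0.01):
    """Keep only the rows of a Spark DataFrame whose `column` lies within the fences."""
    # Imported lazily so the pure helper above works without pyspark installed.
    from pyspark.sql import functions as F

    # approxQuantile returns one value per requested probability.
    q1, q3 = df.approxQuantile(column, [0.25, 0.75], rel_error)
    lower, upper = iqr_bounds(q1, q3, k)
    return df.filter(F.col(column).between(lower, upper))
```

Usage might look like `filter_outliers(spark.read.csv("data.csv", header=True, inferSchema=True), "amount")`, where the file name and the `amount` column are placeholders. Rows with a null value in that column are dropped as well, because `between` evaluates to null for them.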

Here is an example approach:

  • Convert string fields to a numeric representation using StringIndexer
  • Assemble the indexed string fields and the numeric fields into a single feature vector using VectorAssembler
  • Create a linear or logistic regression model
  • Create an ML Pipeline with the StringIndexer stages, the VectorAssembler and the model, and fit it on the training data
  • Use the trained model to make predictions on the test data
  • Create an evaluator and evaluate the predictions made on the test data

Please note that the above approach is based on the DataFrame-based ML library (spark.ml) rather than the RDD-based MLlib API (spark.mllib).
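The pipeline steps listed above could be sketched in PySpark roughly as follows. This is an illustration, not a definitive implementation: the helper names, the `_idx` suffix, the 80/20 split, the choice of `LinearRegression` with an RMSE evaluator, and `handleInvalid="keep"` are all assumptions of mine. For a classification task, `LogisticRegression` and a classification evaluator would take their place.

```python
def feature_columns(categorical_cols, numeric_cols, suffix="_idx"):
    """Names of the columns that VectorAssembler should combine."""
    return [c + suffix for c in categorical_cols] + list(numeric_cols)


def build_pipeline(categorical_cols, numeric_cols, label_col, suffix="_idx"):
    """Assemble StringIndexer stages, a VectorAssembler and a model into one Pipeline."""
    # Imported lazily so the pure helper above works without pyspark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.regression import LinearRegression

    # One indexer per string column; "keep" puts unseen labels in an extra bucket.
    indexers = [
        StringIndexer(inputCol=c, outputCol=c + suffix, handleInvalid="keep")
        for c in categorical_cols
    ]
    assembler = VectorAssembler(
        inputCols=feature_columns(categorical_cols, numeric_cols, suffix),
        outputCol="features",
    )
    model = LinearRegression(featuresCol="features", labelCol=label_col)
    return Pipeline(stages=indexers + [assembler, model])


def train_and_evaluate(df, pipeline, label_col, seed=42):
    """Fit on a training split, predict on the test split, and return the RMSE."""
    from pyspark.ml.evaluation import RegressionEvaluator

    train, test = df.randomSplit([0.8, 0.2], seed=seed)
    fitted = pipeline.fit(train)
    predictions = fitted.transform(test)
    evaluator = RegressionEvaluator(
        labelCol=label_col, predictionCol="prediction", metricName="rmse"
    )
    return evaluator.evaluate(predictions)
```

With a filtered DataFrame `clean`, a call such as `train_and_evaluate(clean, build_pipeline(["city"], ["age", "income"], "price"), "price")` would run the whole flow; the column names here are placeholders.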


