
spark MLIB for detecting outliers from CSV data



Re: spark MLIB for detecting outliers from CSV data


@Gayathri Devi, there is no direct method available for detecting outliers, but you can use a quantile-based approach to determine lower and upper bounds and filter the data against them. Once the data is filtered, you can build an ML Pipeline with all the transformations required to train and run machine learning models (regression, classification, etc.).

Here is an example approach:

  • Convert string fields to a numeric representation using StringIndexer
  • Assemble the indexed string fields and the numeric fields into a single feature vector using VectorAssembler
  • Create a linear or logistic regression model
  • Create an ML Pipeline whose stages are the StringIndexer stages, the VectorAssembler, and the model, and fit it on the training data
  • Use the trained model to make predictions on the test data
  • Create an evaluator and evaluate the predictions made on the test data

Please note that the above approach is based on the DataFrame-based ML library (spark.ml) rather than the RDD-based MLlib (spark.mllib).


