Data Mining - Data Pre-Processing Use Case

Hi,

Do you know of any good tutorial or use case on Hadoop that shows a sound approach to cleaning data (especially the outlier detection phase)?

Thanks!

2 REPLIES

Contributor

@Pedro Rodgers

We have a few examples at Hortonworks of outlier or anomaly detection.

For data cleansing, you could start with HDF (Hortonworks DataFlow) or Apache NiFi. NiFi can handle most simple event-processing use cases, such as trimming or modifying the incoming data. It also has a DetectDuplicate processor, which may be of use if deduplication is part of your cleansing process. Once you start looking at aggregations, windowing, or more complex cleaning and transformations, Apache Storm (part of HDF) or Spark may be your best bet.
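To make the steps above a bit more concrete, here is a minimal sketch in plain Python of trimming, deduplication, and a simple outlier check. It is not NiFi, Storm, or Spark code, and it is not taken from any of the Hortonworks examples. The function names `clean_records` and `iqr_outliers` are made up for illustration, and the interquartile-range (IQR) rule with a factor of 1.5 is just one common convention that I am assuming here; your data may call for a different method.

```python
from statistics import quantiles


def clean_records(rows):
    """Trim whitespace from every field and drop exact duplicate rows.

    Hypothetical helper: keeps the first occurrence of each row.
    """
    seen = set()
    cleaned = []
    for row in rows:
        trimmed = tuple(field.strip() for field in row)
        if trimmed in seen:
            continue  # duplicate after trimming, skip it
        seen.add(trimmed)
        cleaned.append(trimmed)
    return cleaned


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    Assumption: the IQR rule is a reasonable first pass for this data.
    """
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


rows = [(" a ", "1"), ("a", "1"), ("b ", "2")]
print(clean_records(rows))

readings = [10, 12, 11, 13, 12, 11, 10, 95]
print(iqr_outliers(readings))
```

On a cluster you would normally express the same logic in the tool you pick (for example as a Spark job), but a small script like this is handy for checking the rules on a sample first.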

This webinar demo shows HDF in action, cleaning and transforming data for anomaly detection:

https://hortonworks.com/webinar/solving-credit-card-fraud-challenges-hadoop/

This tutorial shows cleaned data with HDF and Hadoop: http://hortonworks.com/hadoop-tutorial/how-to-refine-and-visualize-sentiment-data/

If you have more specific questions, I'm sure we can narrow things down and provide more detailed help.

Super Guru

With HDF, Spark, Sqoop, Flume, and some Python scripts, you can clean pretty much any messy data.

I like to keep the raw data though, just in case.
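As a rough idea of what one of those Python scripts might look like while still keeping the raw data, here is a small sketch. It only reads from the raw input and writes the cleaned rows to a separate output. The function name `write_cleaned_copy` and the sample data are made up for illustration, and the cleaning rules (trim fields, skip fully blank rows) are assumptions; real rules depend on your data.

```python
import csv
import io


def write_cleaned_copy(raw_file, clean_file):
    """Copy CSV rows from raw_file to clean_file, trimming each field.

    raw_file is only read, never written, so the raw data stays intact.
    Returns the number of rows written.
    """
    reader = csv.reader(raw_file)
    writer = csv.writer(clean_file, lineterminator="\n")
    kept = 0
    for row in reader:
        trimmed = [field.strip() for field in row]
        if not any(trimmed):
            continue  # skip rows whose fields are all blank
        writer.writerow(trimmed)
        kept += 1
    return kept


# In-memory stand-ins for a raw file and a cleaned file.
raw = io.StringIO(" id , temp \n1, 20.5 \n , \n2,21.0\n")
clean = io.StringIO()
kept = write_cleaned_copy(raw, clean)
print(kept)
print(clean.getvalue())
```

With real files you would open the raw file read-only and write the result to a different location, so you can always rerun the cleaning with new rules later.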

https://community.hortonworks.com/articles/64069/converting-a-large-json-file-into-csv.html

https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.h...

https://community.hortonworks.com/articles/72420/ingesting-remote-sensor-feeds-into-apache-phoenix.h...

See previous answers: https://community.hortonworks.com/questions/41874/data-cleaning-before-storing-in-hdfs.html

https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.DetectDuplicat...

http://hortonworks.com/hadoop-tutorial/defining-processing-data-end-end-data-pipeline-apache-falcon/

https://community.hortonworks.com/articles/58265/analyzing-images-in-hdf-20-using-tensorflow.html

https://community.hortonworks.com/articles/61180/streaming-ingest-of-google-sheets-into-a-connected....

https://community.hortonworks.com/articles/61717/ingesting-jms-messages-to-hdfs-via-hdf-20.html

https://community.hortonworks.com/articles/59394/csv-to-avro-conversion-with-nifi.html

https://community.hortonworks.com/articles/59349/hdf-20-flow-for-ingesting-real-time-tweets-from-st....

https://community.hortonworks.com/articles/59975/ingesting-edi-into-hdfs-using-hdf-20.html

https://community.hortonworks.com/content/kbentry/59975/ingesting-edi-into-hdfs-using-hdf-20.html

https://community.hortonworks.com/content/kbentry/59394/csv-to-avro-conversion-with-nifi.html

https://community.hortonworks.com/articles/34362/parsing-apache-log-files-with-spark.html

https://community.hortonworks.com/content/kbentry/54947/reading-opendata-json-and-storing-into-phoen...