Data Mining - Data Pre-Processing Use Case


Hi,

Do you know of any good tutorial or use case using Hadoop that shows a good approach to cleaning data (especially the outlier detection phase)?

Thanks!

2 REPLIES

Re: Data Mining - Data Pre-Processing Use Case

Contributor

@Pedro Rodgers

We have a few examples at Hortonworks of outlier and anomaly detection.

For data cleansing, you could start with HDF (Hortonworks DataFlow) or Apache NiFi. NiFi can handle most simple event-processing use cases, such as trimming or modifying incoming data. There is even a DetectDuplicate processor in NiFi, which may be useful if deduplication is part of your cleansing process. Once you need aggregations, windowing, or more complex cleaning and transformations, Apache Storm (part of HDF) or Spark may be your best bet.
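As a rough illustration of the kind of cleaning described above (trimming, deduplication, then flagging outliers), here is a minimal Python sketch. It is not NiFi's DetectDuplicate processor or any HDF component. The function names `clean_records` and `flag_outliers` are hypothetical, and the interquartile-range (IQR) rule with `k=1.5` is an assumption: it is one common heuristic for outliers, not the method used in the demos linked in this thread.

```python
from statistics import quantiles


def clean_records(rows):
    """Trim whitespace from every field and drop exact duplicate rows.

    Rows are compared after trimming; the first occurrence is kept,
    so the original order is preserved.
    """
    seen = set()
    cleaned = []
    for row in rows:
        trimmed = tuple(field.strip() for field in row)
        if trimmed not in seen:
            seen.add(trimmed)
            cleaned.append(trimmed)
    return cleaned


def flag_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    k=1.5 is the conventional default for the IQR rule (an assumption
    here); tune it for your own data.
    """
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]
```

On a cluster the same logic would normally be expressed in NiFi, Storm, or Spark, as mentioned above. The sketch is only meant to make the steps concrete on a small in-memory sample.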

This webinar demo shows HDF in action, cleaning and transforming data for anomaly detection:

https://hortonworks.com/webinar/solving-credit-card-fraud-challenges-hadoop/

This tutorial shows cleaned data with HDF and Hadoop: http://hortonworks.com/hadoop-tutorial/how-to-refine-and-visualize-sentiment-data/

If you have more specific questions, I'm sure we can narrow things down and provide more detailed help.


Re: Data Mining - Data Pre-Processing Use Case

Super Guru

With HDF, Spark, Sqoop, Flume, and some Python scripts, you can clean pretty much any messy data.

I like to keep the raw data though, just in case.
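To make the "keep the raw data" habit concrete, here is a minimal Python sketch that writes a cleaned copy of a CSV file to a separate path and never modifies the raw file. The function name `write_cleaned_copy`, the file paths, and the specific cleaning rules (trim each field, skip rows that are entirely empty) are all assumptions for illustration, not something taken from the linked articles.

```python
import csv
from pathlib import Path


def write_cleaned_copy(raw_path, cleaned_path):
    """Write a cleaned copy of a CSV file, leaving the raw file untouched.

    Returns the number of rows written to the cleaned file.
    """
    raw_path, cleaned_path = Path(raw_path), Path(cleaned_path)
    # Guard against accidentally overwriting the raw data.
    if raw_path.resolve() == cleaned_path.resolve():
        raise ValueError("refusing to overwrite the raw file")

    kept = 0
    with raw_path.open(newline="") as src, cleaned_path.open("w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            trimmed = [field.strip() for field in row]
            if any(trimmed):  # skip rows whose fields are all empty
                writer.writerow(trimmed)
                kept += 1
    return kept
```

If a cleaning rule later turns out to be wrong, the cleaned copy can be regenerated from the raw file.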

https://community.hortonworks.com/articles/64069/converting-a-large-json-file-into-csv.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.h...

https://community.hortonworks.com/articles/72420/ingesting-remote-sensor-feeds-into-apache-phoenix.h...

See previous answers: https://community.hortonworks.com/questions/41874/data-cleaning-before-storing-in-hdfs.html

https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.DetectDuplicat...

http://hortonworks.com/hadoop-tutorial/defining-processing-data-end-end-data-pipeline-apache-falcon/

https://community.hortonworks.com/articles/58265/analyzing-images-in-hdf-20-using-tensorflow.html
https://community.hortonworks.com/articles/61180/streaming-ingest-of-google-sheets-into-a-connected....
https://community.hortonworks.com/articles/61717/ingesting-jms-messages-to-hdfs-via-hdf-20.html

https://community.hortonworks.com/articles/59394/csv-to-avro-conversion-with-nifi.html
https://community.hortonworks.com/articles/59349/hdf-20-flow-for-ingesting-real-time-tweets-from-st....
https://community.hortonworks.com/articles/59975/ingesting-edi-into-hdfs-using-hdf-20.html
https://community.hortonworks.com/content/kbentry/59975/ingesting-edi-into-hdfs-using-hdf-20.html
https://community.hortonworks.com/content/kbentry/59394/csv-to-avro-conversion-with-nifi.html

https://community.hortonworks.com/articles/34362/parsing-apache-log-files-with-spark.html
https://community.hortonworks.com/content/kbentry/54947/reading-opendata-json-and-storing-into-phoen...
