
Do we have any in-built machine learning algorithm support in hortonworks for automatic data quality detection?



Hi everyone,

I don't have an end-to-end example or script for my use case, but I have an idea for "automatic tag detection", similar to what the Waterline Data tool does.

I am looking for any existing Hortonworks tool or library that will analyze my data and suggest tags (i.e. learn from the data and recommend a tag for each column).

For example:

If I have an employee dataset with two columns, "SSN" and "date_of_birth", the tool or library should learn from this dataset and suggest that these columns be tagged as PII (Personally Identifiable Information).

So, is this possible in Hortonworks, or with any other tool or library?

I think we can achieve the same thing using the Python scikit-learn library, but can we do it using a built-in Hortonworks algorithm?

Thanks in advance.

1 REPLY

Re: Do we have any in-built machine learning algorithm support in hortonworks for automatic data quality detection?

@Manoj Dhake

Machine learning can be done using Spark and Zeppelin in the Hortonworks distribution.

Here is a Hortonworks tutorial on this; you can probably achieve your goal with it, albeit with some modifications.
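
As a rough illustration of the column-tagging idea from the question, here is a minimal Python sketch. It is a rule-based heuristic, not machine learning and not a built-in Hortonworks feature. The patterns, the name hints, the 0.8 threshold, and the `suggest_tags` function are all assumptions made up for this example; real data would need broader, locale-aware rules or a trained classifier.

```python
import re

# Hypothetical patterns for this sketch only (US-style SSN, ISO-style date).
SSN_RE = re.compile(r"^\d{3}-\d{2}-\d{4}$")
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
# Hypothetical column-name hints suggesting a date is a birth date.
BIRTH_HINTS = ("birth", "dob")


def match_ratio(pattern, values):
    """Return the fraction of non-empty sample values that match the pattern."""
    values = [v.strip() for v in values if v and v.strip()]
    if not values:
        return 0.0
    return sum(1 for v in values if pattern.match(v)) / len(values)


def suggest_tags(columns, threshold=0.8):
    """Suggest a "PII" tag (or None) for each column.

    columns maps a column name to a list of sample string values.
    """
    suggestions = {}
    for name, values in columns.items():
        looks_like_ssn = match_ratio(SSN_RE, values) >= threshold
        # A date alone is not treated as PII; the column name must hint at a birth date.
        looks_like_birth_date = (
            match_ratio(DATE_RE, values) >= threshold
            and any(hint in name.lower() for hint in BIRTH_HINTS)
        )
        suggestions[name] = "PII" if (looks_like_ssn or looks_like_birth_date) else None
    return suggestions


# Made-up sample rows, for illustration only.
employees = {
    "SSN": ["123-45-6789", "987-65-4321", "078-05-1120"],
    "date_of_birth": ["1985-02-14", "1990-11-30", "1979-07-04"],
    "department": ["sales", "hr", "it"],
}
print(suggest_tags(employees))
```

The same function could be applied to a small sample of each column pulled from a Spark DataFrame in a Zeppelin notebook; a learned model (for example with scikit-learn or Spark MLlib) would replace the hard-coded patterns with features derived from labeled columns.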