
Identify Outliers using Hive

Contributor

Hi all, I have a dataset with the following schema:

ID_Employee - INT
Employee_Birth - DATE
Employee_Salary - DOUBLE
Quantity_Products - INT

I want to find out whether these fields contain outliers. I have read that a good practice is the standard deviation method, but in your opinion, what is the best way to identify outliers or missing values? Thanks!
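For reference, the standard deviation method mentioned above can be sketched in plain Python. The cutoff k is a convention (3 is common) and the sample salaries are invented; in Hive itself the same quantities are available through the avg() and stddev_samp() aggregates.

```python
# Sketch of the standard-deviation rule: flag values lying more than
# k sample standard deviations from the mean (k = 3 is a common choice).
from statistics import mean, stdev

def stdev_outliers(values, k=3.0):
    """Return the values more than k sample standard deviations from the mean."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation
    return [x for x in values if abs(x - m) > k * s]

# Invented example salaries; with k=2 only the extreme value is flagged.
salaries = [1200.0, 1350.0, 1280.0, 1400.0, 1310.0, 9000.0]
print(stdev_outliers(salaries, k=2.0))  # → [9000.0]
```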

1 ACCEPTED SOLUTION

Super Guru
@Johnny Fugers

First, find out whether your data is normally distributed. If it is not, what is its distribution? That will largely determine which test you should use. If the data is not normally distributed, you can often transform it to an approximately normal form. So, first, know your distribution.

Are you familiar with Grubbs' test? You would have to write your own UDF to do that in Hive.
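Grubbs' test examines the single most extreme value. A UDF would essentially compute the statistic below and compare it against a critical value derived from the t-distribution for the chosen significance level; that table lookup is omitted in this minimal sketch.

```python
# Sketch of the Grubbs test statistic: G = max|x_i - mean| / s.
# A full test compares G against a critical value from the t-distribution;
# here only the statistic itself is computed.
from statistics import mean, stdev

def grubbs_statistic(values):
    """Largest absolute deviation from the mean, in units of the sample stdev."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation
    return max(abs(x - m) for x in values) / s

print(round(grubbs_statistic([1.0, 2.0, 3.0, 4.0, 100.0]), 3))  # → 1.788
```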

But why limit yourself to Hive? You can read Hive data using Spark, and Spark MLlib provides several out-of-the-box tools for exactly this. Your data can still be read through Hive for everything else you do with it, while Spark works on the same data at the same time. Check this link.
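A hypothetical sketch of that approach with PySpark (the table name `employees` is invented to match the schema in the question, and running this requires a Spark installation with Hive support):

```python
# Hypothetical sketch: read a Hive table from Spark and flag outliers with
# the standard-deviation rule. The table and column names are invented.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("outlier-check")
         .enableHiveSupport()      # lets Spark read existing Hive tables
         .getOrCreate())

df = spark.table("employees")     # hypothetical Hive table from the question
stats = df.select(
    F.avg("Employee_Salary").alias("mean"),
    F.stddev_samp("Employee_Salary").alias("std"),
).first()

# Rows more than 3 sample standard deviations from the mean.
outliers = df.filter(
    F.abs(F.col("Employee_Salary") - stats["mean"]) > 3 * stats["std"]
)
outliers.show()
```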


3 REPLIES


Contributor

Is it possible to use Hadoop to identify my dataset's distribution?

Super Guru

@Johnny Fugers

That question is a little ambiguous. Do you mean: does Hadoop provide an out-of-the-box tool where you push in data and it tells you what distribution you have? The answer is no. But, just as you would outside of Hadoop, you can assume a distribution for your data and then verify that the data agrees with your assumption. That you can certainly do. Use Spark for it; check this link. Or use Python, and check PySpark as well. Or even R.
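As one illustration of "assume, then verify": a quick screen for normality is sample skewness, which is near zero for symmetric (normal-like) data and large in magnitude for heavily skewed data. This is a heuristic, not a formal test such as Shapiro-Wilk; the sketch below is plain Python, while a real pipeline would use Spark MLlib, SciPy, or R.

```python
# Rough normality screen via sample skewness: the average cubed z-score.
# Near 0 suggests symmetry; a large magnitude suggests non-normal data.
from statistics import mean, pstdev

def skewness(values):
    """Sample skewness: mean of the standardized values cubed."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation
    n = len(values)
    return sum(((x - m) / s) ** 3 for x in values) / n

print(skewness([1, 2, 3, 4, 5]))      # symmetric data: essentially 0
print(skewness([1, 1, 1, 1, 100]))    # heavily right-skewed: large positive
```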