
Identify Outliers using Hive

Contributor

Hi all, I have a dataset with the following schema:

ID_Employee - INT
Employee_Birth - DATE
Employee_Salary - DOUBLE
Quantity_Products - INT

I want to find out whether these fields contain outliers. I have read that a good practice is the standard deviation method, but in your opinion, what is the best way to identify outliers or missing values? Thanks!
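For reference, the standard deviation method mentioned above can be sketched in plain Python. The cutoff k is a convention (3 is common) and the sample salaries are invented; in Hive itself the same quantities are available through the avg() and stddev_samp() aggregates.

```python
# Sketch of the standard-deviation rule: flag values lying more than
# k sample standard deviations from the mean (k = 3 is a common choice).
from statistics import mean, stdev

def stdev_outliers(values, k=3.0):
    """Return the values more than k sample standard deviations from the mean."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation
    return [x for x in values if abs(x - m) > k * s]

# Invented example salaries; with k=2 only the extreme value is flagged.
salaries = [1200.0, 1350.0, 1280.0, 1400.0, 1310.0, 9000.0]
print(stdev_outliers(salaries, k=2.0))  # → [9000.0]
```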

1 ACCEPTED SOLUTION

Super Guru
@Johnny Fugers

First, find out whether your data is normally distributed. If it is not, what is its distribution? That will largely determine which test you should use. If the data is not normally distributed, you can often transform it to an approximately normal form. So, first, know your distribution.

Are you familiar with Grubbs' test? You would have to write your own UDF to do that in Hive.
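Grubbs' test examines the single most extreme value. A UDF would essentially compute the statistic below and compare it against a critical value derived from the t-distribution for the chosen significance level; that table lookup is omitted in this minimal sketch.

```python
# Sketch of the Grubbs test statistic: G = max|x_i - mean| / s.
# A full test compares G against a critical value from the t-distribution;
# here only the statistic itself is computed.
from statistics import mean, stdev

def grubbs_statistic(values):
    """Largest absolute deviation from the mean, in units of the sample stdev."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation
    return max(abs(x - m) for x in values) / s

print(round(grubbs_statistic([1.0, 2.0, 3.0, 4.0, 100.0]), 3))  # → 1.788
```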

But why limit yourself to Hive? You can read Hive data using Spark, and Spark MLlib provides several out-of-the-box tools for exactly this. Your data can still be read through Hive for everything else you do with it, while Spark works on the same data at the same time. Check this link.
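A hypothetical sketch of that approach with PySpark (the table name `employees` is invented to match the schema in the question, and running this requires a Spark installation with Hive support):

```python
# Hypothetical sketch: read a Hive table from Spark and flag outliers with
# the standard-deviation rule. The table and column names are invented.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("outlier-check")
         .enableHiveSupport()      # lets Spark read existing Hive tables
         .getOrCreate())

df = spark.table("employees")     # hypothetical Hive table from the question
stats = df.select(
    F.avg("Employee_Salary").alias("mean"),
    F.stddev_samp("Employee_Salary").alias("std"),
).first()

# Rows more than 3 sample standard deviations from the mean.
outliers = df.filter(
    F.abs(F.col("Employee_Salary") - stats["mean"]) > 3 * stats["std"]
)
outliers.show()
```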


3 REPLIES


Contributor

Is it possible to use Hadoop to identify my dataset's distribution?

Super Guru

@Johnny Fugers

That question is a little ambiguous. Do you mean: does Hadoop provide an out-of-the-box tool where you push in data and it tells you what distribution you have? The answer is no. But, just as you would outside of Hadoop, you can assume a distribution for your data and then verify that the data agrees with your assumption. That you can certainly do. Use Spark for it; check this link. Or use Python, and check PySpark as well. Or even R.
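As one illustration of "assume, then verify": a quick screen for normality is sample skewness, which is near zero for symmetric (normal-like) data and large in magnitude for heavily skewed data. This is a heuristic, not a formal test such as Shapiro-Wilk; the sketch below is plain Python, while a real pipeline would use Spark MLlib, SciPy, or R.

```python
# Rough normality screen via sample skewness: the average cubed z-score.
# Near 0 suggests symmetry; a large magnitude suggests non-normal data.
from statistics import mean, pstdev

def skewness(values):
    """Sample skewness: mean of the standardized values cubed."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation
    n = len(values)
    return sum(((x - m) / s) ** 3 for x in values) / n

print(skewness([1, 2, 3, 4, 5]))      # symmetric data: essentially 0
print(skewness([1, 1, 1, 1, 100]))    # heavily right-skewed: large positive
```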