Created 08-25-2016 09:29 PM
Hi people, I have a dataset with the following schema:

ID_Employee - INT
Employee_Birth - DATE
Employee_Salary - DOUBLE
Quantity_Products - INT

I want to find out whether my fields contain outliers. I have read that a good practice is to use the standard deviation method, but in your opinion, what is the best way to identify outliers or missing values? Thanks!
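For reference, the standard deviation method mentioned above could be sketched like this (a minimal Python sketch; the k=3 threshold, the sample salaries, and the use of `None` for missing values are all assumptions, not part of the original dataset):

```python
import statistics

def find_outliers(values, k=3.0):
    """Flag values more than k standard deviations from the mean.

    A common first-pass rule; k=3 is a conventional default, but it
    is an assumption here and should be tuned to the data.
    """
    clean = [v for v in values if v is not None]  # skip missing values
    mean = statistics.mean(clean)
    stdev = statistics.stdev(clean)
    return [v for v in clean if abs(v - mean) > k * stdev]

def count_missing(values):
    """Count missing entries (represented as None here)."""
    return sum(1 for v in values if v is None)

salaries = [1200.0, 1250.0, 1300.0, 1350.0, 1280.0, 1310.0, 1295.0,
            None, 1260.0, 1330.0, 1290.0, 1305.0, 1275.0, 95000.0]
print(find_outliers(salaries))   # flags the 95000.0 salary
print(count_missing(salaries))   # 1
```

Note that with very few data points the z-score of any single value is bounded by (n-1)/sqrt(n), so a 3-sigma rule can never fire on tiny samples; make sure you have enough rows before trusting it.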
Created 08-25-2016 09:45 PM
First, find out whether your data is normally distributed. If not, what is its distribution? That is largely what determines which test you should use. If the data is not normally distributed, you can often transform it toward a normal form. So, first know your distribution.
Are you familiar with Grubbs' test? You would have to write your own UDF to do that in Hive.
But why limit yourself to Hive? You can read Hive data with Spark, and Spark MLlib provides several out-of-the-box tools to do just that. Your data can still be read through Hive for everything else you do with it, while at the same time you use Spark on the same data. Check this link.
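To illustrate the Grubbs' test mentioned above, here is the core of what such a UDF would compute (a minimal Python sketch; the sample data is made up, and the lookup of the critical value G_crit, which depends on sample size and significance level and comes from a table or the t-distribution, is deliberately omitted):

```python
import statistics

def grubbs_statistic(values):
    """Grubbs' test statistic G = max|x_i - mean| / s.

    G is compared against a critical value that depends on the
    sample size and significance level (tabulated, or derived from
    the t-distribution, e.g. via scipy.stats); that comparison is
    omitted in this sketch.
    """
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return max(abs(v - mean) for v in values) / s

data = [1.1, 1.3, 0.9, 1.2, 1.0, 1.4, 9.8]  # 9.8 is the suspect point
print(grubbs_statistic(data))
```

Remember that Grubbs' test assumes the underlying data is approximately normal, which is why checking the distribution first matters.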
Created 08-25-2016 09:51 PM
Is it possible to use Hadoop to identify my dataset's distribution?
Created 08-25-2016 10:51 PM
That question is a little ambiguous. Do you mean whether Hadoop provides an out-of-the-box tool where you push in data and it tells you which distribution you have? The answer is no. But, just as you would outside of Hadoop, you can assume a distribution for your data and then verify that the data agrees with your assumption. That you can certainly do. Use Spark for that. Check this link. Or use Python; check PySpark as well. Or even R.
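The assume-then-verify idea could look like this (a minimal pure-Python sketch of a Kolmogorov-Smirnov-style distance against a fitted normal; the sample values are assumptions, and for a proper test with p-values you would use scipy.stats.kstest, Spark, or R instead):

```python
import statistics

def ks_distance_from_normal(values):
    """Largest gap between the empirical CDF of the sample and the
    CDF of a normal distribution fitted to it.

    Small distances suggest the normality assumption is plausible;
    this sketch returns only the distance, not a p-value.
    """
    n = len(values)
    fitted = statistics.NormalDist(statistics.mean(values),
                                   statistics.stdev(values))
    xs = sorted(values)
    return max(
        max(abs((i + 1) / n - fitted.cdf(x)),  # ECDF step above x
            abs(i / n - fitted.cdf(x)))        # ECDF step below x
        for i, x in enumerate(xs)
    )

print(ks_distance_from_normal([-2, -1, 0, 1, 2]))
```

In Spark you would compute the sample mean and standard deviation with a distributed aggregation and apply the same comparison, so the approach scales to Hadoop-sized data.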