<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Identify Outliers using Hive in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Identify-Outliers-using-Hive/m-p/120511#M83279</link>
    <description>&lt;A rel="user" href="https://community.cloudera.com/users/11031/m2014227.html" nodeid="11031"&gt;@Johnny Fugers&lt;/A&gt;&lt;P&gt;First, find out whether your data is normally distributed. If not, what is its distribution? That largely determines which test you should use. If the data is not normally distributed, you can often transform it toward normality. So, first know your distribution.&lt;/P&gt;&lt;P&gt;Are you familiar with &lt;A href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h1.htm"&gt;Grubbs' test&lt;/A&gt;? You would have to write your own UDF to run it in Hive.&lt;/P&gt;&lt;P&gt;But why limit yourself to Hive? You can read Hive data from Spark, and Spark MLlib provides several out-of-the-box tools for exactly this. Your data can still be read through Hive for everything else you do with it, while Spark works on the same data at the same time. Check this &lt;A href="https://spark-summit.org/eu-2015/events/real-time-anomaly-detection-with-spark-ml-and-akka/"&gt;link&lt;/A&gt;.&lt;/P&gt;</description>
    <pubDate>Fri, 26 Aug 2016 04:45:02 GMT</pubDate>
    <dc:creator>mqureshi</dc:creator>
    <dc:date>2016-08-26T04:45:02Z</dc:date>
    <item>
      <title>Identify Outliers using Hive</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Identify-Outliers-using-Hive/m-p/120510#M83278</link>
      <description>&lt;P&gt;Hi people,

I have a dataset with the following schema:
ID_Employee - INT
Employee_Birth - DATE
Employee_Salary - DOUBLE
Quantity_Products - INT

I want to find out whether these fields contain outliers. I have read that using the standard-deviation method is good practice, but in your opinion, what is the best way to identify outliers or missing values?

Thanks!&lt;/P&gt;</description>
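The standard-deviation method the question mentions is usually applied as the three-sigma rule: flag any value lying more than k standard deviations from the column mean. A minimal Python sketch of that rule (the salary figures below are made-up sample data; the same logic maps directly onto Hive's AVG() and STDDEV() aggregates):

```python
import statistics

def sigma_outliers(values, k=3.0):
    """Return values lying more than k standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation, n - 1 divisor
    return [v for v in values if abs(v - mean) > k * sd]

# Hypothetical Employee_Salary values with one obvious outlier.
salaries = [52000, 48500, 51200, 49800, 50300, 47900, 500000]

# With only 7 rows the outlier itself inflates the standard deviation,
# so a cutoff of k=3 can never trigger here; k=2 is used for this tiny sample.
print(sigma_outliers(salaries, k=2.0))  # flags 500000
```

Note that on very small samples a single extreme value inflates the standard deviation enough to mask itself, which is one reason the reply below points to Grubbs' test instead.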
      <pubDate>Fri, 26 Aug 2016 04:29:07 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Identify-Outliers-using-Hive/m-p/120510#M83278</guid>
      <dc:creator>m2014227</dc:creator>
      <dc:date>2016-08-26T04:29:07Z</dc:date>
    </item>
    <item>
      <title>Re: Identify Outliers using Hive</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Identify-Outliers-using-Hive/m-p/120511#M83279</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/11031/m2014227.html" nodeid="11031"&gt;@Johnny Fugers&lt;/A&gt;&lt;P&gt;First, find out whether your data is normally distributed. If not, what is its distribution? That largely determines which test you should use. If the data is not normally distributed, you can often transform it toward normality. So, first know your distribution.&lt;/P&gt;&lt;P&gt;Are you familiar with &lt;A href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h1.htm"&gt;Grubbs' test&lt;/A&gt;? You would have to write your own UDF to run it in Hive.&lt;/P&gt;&lt;P&gt;But why limit yourself to Hive? You can read Hive data from Spark, and Spark MLlib provides several out-of-the-box tools for exactly this. Your data can still be read through Hive for everything else you do with it, while Spark works on the same data at the same time. Check this &lt;A href="https://spark-summit.org/eu-2015/events/real-time-anomaly-detection-with-spark-ml-and-akka/"&gt;link&lt;/A&gt;.&lt;/P&gt;</description>
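Grubbs' test, described at the NIST link above, tests whether the single most extreme value in a sample is an outlier. A stdlib-only Python sketch of the test statistic (a Hive UDF would compute the same quantity); the critical value that G must exceed depends on your sample size and significance level and should be looked up in the NIST table, so it is deliberately not hardcoded here:

```python
import statistics

def grubbs_statistic(xs):
    """G = max |x - mean| / s, the Grubbs test statistic.

    Compare G against the critical value for your sample size and
    significance level from the NIST/SEMATECH handbook table; if G
    exceeds it, the most extreme point is declared an outlier.
    """
    mean = statistics.mean(xs)
    sd = statistics.stdev(xs)
    suspect = max(xs, key=lambda x: abs(x - mean))
    g = abs(suspect - mean) / sd
    return suspect, g

# Same hypothetical salary sample as above.
suspect, g = grubbs_statistic([52000, 48500, 51200, 49800, 50300, 47900, 500000])
print(suspect, round(g, 3))
```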
      <pubDate>Fri, 26 Aug 2016 04:45:02 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Identify-Outliers-using-Hive/m-p/120511#M83279</guid>
      <dc:creator>mqureshi</dc:creator>
      <dc:date>2016-08-26T04:45:02Z</dc:date>
    </item>
    <item>
      <title>Re: Identify Outliers using Hive</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Identify-Outliers-using-Hive/m-p/120512#M83280</link>
      <description>&lt;P&gt;Is it possible to use Hadoop to identify my dataset's distribution?&lt;/P&gt;</description>
      <pubDate>Fri, 26 Aug 2016 04:51:55 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Identify-Outliers-using-Hive/m-p/120512#M83280</guid>
      <dc:creator>m2014227</dc:creator>
      <dc:date>2016-08-26T04:51:55Z</dc:date>
    </item>
    <item>
      <title>Re: Identify Outliers using Hive</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Identify-Outliers-using-Hive/m-p/120513#M83281</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/11031/m2014227.html" nodeid="11031"&gt;@Johnny Fugers&lt;/A&gt; &lt;/P&gt;&lt;P&gt;That question is a little ambiguous. Do you mean whether Hadoop provides out-of-the-box tools where you push in data and it tells you what distribution you have? The answer is no. But, just as outside Hadoop, you would assume a distribution for your data and then verify that the data agrees with your assumption. That you can certainly do. Use Spark for it. Check this &lt;A href="http://spark.apache.org/docs/latest/mllib-statistics.html#summary-statistics"&gt;link&lt;/A&gt;. Or use Python; check &lt;A href="https://spark.apache.org/docs/0.9.0/python-programming-guide.html"&gt;PySpark&lt;/A&gt; as well. Or even &lt;A href="https://github.com/RevolutionAnalytics"&gt;R&lt;/A&gt;.&lt;/P&gt;</description>
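Verifying a distributional assumption typically starts with summary statistics like those in the Spark MLlib link above. A stdlib-only Python sketch of the same idea using sample moments; skewness and excess kurtosis both near zero are consistent with normality (though they do not prove it), while large departures suggest the data is not normal:

```python
import statistics

def shape_summary(xs):
    """Mean, standard deviation, skewness and excess kurtosis of a sample.

    For a normal distribution, skewness and excess kurtosis are both 0.
    """
    n = len(xs)
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)  # population std dev, matching the moment formulas
    skew = sum((x - mean) ** 3 for x in xs) / (n * sd ** 3)
    kurt = sum((x - mean) ** 4 for x in xs) / (n * sd ** 4) - 3.0
    return mean, sd, skew, kurt

# A symmetric toy sample: skewness is exactly 0.
print(shape_summary([1, 2, 3, 4, 5]))
```

On a real Hive table you would compute the same moments with aggregate queries, or let Spark's summary-statistics API do it over the Hive data directly.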
      <pubDate>Fri, 26 Aug 2016 05:51:05 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Identify-Outliers-using-Hive/m-p/120513#M83281</guid>
      <dc:creator>mqureshi</dc:creator>
      <dc:date>2016-08-26T05:51:05Z</dc:date>
    </item>
  </channel>
</rss>

