<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: calculating median on grouped data in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/calculating-median-on-grouped-data/m-p/40137#M26113</link>
    <description>&lt;P&gt;Calculating a median or other quantiles is, in general, much harder than computing a moment like a mean. Look for functions in Spark that compute quantiles rather than for a median function: the median is the 0.5 quantile. There is an efficient approximate implementation for DataFrames in Spark.&lt;/P&gt;</description>
    <pubDate>Mon, 25 Apr 2016 23:10:22 GMT</pubDate>
    <dc:creator>srowen</dc:creator>
    <dc:date>2016-04-25T23:10:22Z</dc:date>
    <item>
      <title>calculating median on grouped data</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/calculating-median-on-grouped-data/m-p/40131#M26112</link>
      <description>&lt;P&gt;Hello! I was trying to use Spark to calculate the median of grouped values in a DataFrame, but have not had much success. I tried agg(), but median() is not available. I tried applying rank() over a window function, but the rank was not computed per group. I also tried pivoting the table to avoid the grouping step, but the DataFrame is huge (8 million rows) and the job fails repeatedly. Calculating a median should be straightforward, since data analysts use it a lot. Maybe I'm missing something obvious?&lt;/P&gt;&lt;P&gt;Thanks!!&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 10:15:36 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/calculating-median-on-grouped-data/m-p/40131#M26112</guid>
      <dc:creator>ffmm</dc:creator>
      <dc:date>2022-09-16T10:15:36Z</dc:date>
    </item>
    <item>
      <title>Re: calculating median on grouped data</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/calculating-median-on-grouped-data/m-p/40137#M26113</link>
      <description>&lt;P&gt;Calculating a median or other quantiles is, in general, much harder than computing a moment like a mean. Look for functions in Spark that compute quantiles rather than for a median function: the median is the 0.5 quantile. There is an efficient approximate implementation for DataFrames in Spark.&lt;/P&gt;</description>
      <pubDate>Mon, 25 Apr 2016 23:10:22 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/calculating-median-on-grouped-data/m-p/40137#M26113</guid>
      <dc:creator>srowen</dc:creator>
      <dc:date>2016-04-25T23:10:22Z</dc:date>
    </item>
    <item>
      <title>Re: calculating median on grouped data</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/calculating-median-on-grouped-data/m-p/40311#M26114</link>
      <description>&lt;P&gt;Thanks! Yes, percent_rank() together with a window function did the trick. A different way is to sort the column within each group and pick the value in the middle. The results are close.&lt;/P&gt;</description>
      <pubDate>Fri, 29 Apr 2016 13:32:09 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/calculating-median-on-grouped-data/m-p/40311#M26114</guid>
      <dc:creator>ffmm</dc:creator>
      <dc:date>2016-04-29T13:32:09Z</dc:date>
    </item>
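    <!--
    The two ideas in this thread (the median is the 0.5 quantile of each group,
    and "sort the values and pick the one in the middle" gives a close result)
    can be sketched in plain Python. This is only an illustration on an
    in-memory list, not Spark code and not the posters' actual solution; the
    function names `grouped_median` and `grouped_middle_value` and the sample
    rows are hypothetical. In Spark itself, a quantile aggregate such as
    `percentile_approx` used with a grouped aggregation would be the natural
    counterpart, assuming it is available in the Spark version in use.

```python
from collections import defaultdict
from statistics import median


def _collect_groups(rows):
    """Gather values per group key from an iterable of (key, value) pairs."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return groups


def grouped_median(rows):
    """Exact median (the 0.5 quantile) per group.

    For an even-sized group, statistics.median averages the two middle values.
    """
    return {key: median(values) for key, values in _collect_groups(rows).items()}


def grouped_middle_value(rows):
    """Sort each group and pick the lower-middle element.

    This mirrors the 'sort the column and pick the one in the middle' idea:
    it always returns a value that occurs in the data, so for even-sized
    groups it is close to, but not always equal to, the exact median.
    """
    result = {}
    for key, values in _collect_groups(rows).items():
        ordered = sorted(values)
        result[key] = ordered[(len(ordered) - 1) // 2]
    return result


# Hypothetical sample data: (group, value) pairs.
rows = [
    ("a", 1.0), ("a", 3.0), ("a", 2.0),
    ("b", 10.0), ("b", 20.0), ("b", 30.0), ("b", 40.0),
]

exact = grouped_median(rows)
middle = grouped_middle_value(rows)
print(exact)
print(middle)
```

    For the odd-sized group both approaches agree; for the even-sized group
    they differ slightly, which matches the "results are close" remark above.
    -->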
  </channel>
</rss>

