<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Which storage format is optimum for training machine learning models and running iterative processes? in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Which-storage-format-is-optimum-for-training-machine/m-p/219241#M82039</link>
    <description>&lt;P&gt;Assume a data pipeline loads Hive tables as Spark DataFrames. Which storage format is optimum for training machine learning models and running iterative processes: row-based (text, Avro) or column-based (ORC, Parquet) files?&lt;/P&gt;</description>
    <pubDate>Mon, 13 Aug 2018 21:15:29 GMT</pubDate>
    <dc:creator>marshall_felder</dc:creator>
    <dc:date>2018-08-13T21:15:29Z</dc:date>
    <item>
      <title>Which storage format is optimum for training machine learning models and running iterative processes?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Which-storage-format-is-optimum-for-training-machine/m-p/219241#M82039</link>
      <description>&lt;P&gt;Assume a data pipeline loads Hive tables as Spark DataFrames. Which storage format is optimum for training machine learning models and running iterative processes: row-based (text, Avro) or column-based (ORC, Parquet) files?&lt;/P&gt;</description>
      <pubDate>Mon, 13 Aug 2018 21:15:29 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Which-storage-format-is-optimum-for-training-machine/m-p/219241#M82039</guid>
      <dc:creator>marshall_felder</dc:creator>
      <dc:date>2018-08-13T21:15:29Z</dc:date>
    </item>
    <item>
      <title>Re: Which storage format is optimum for training machine learning models and running iterative processes?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Which-storage-format-is-optimum-for-training-machine/m-p/219242#M82040</link>
      <description>&lt;P&gt;ORC and Parquet are optimized for OLAP-style queries, which typically read only a subset of the columns in the source tables. Avro and other row-based formats perform better when you have to read entire records. Converting the data from one format to another (a multi-Hive-table approach) is a common way to determine which format performs best for your use case. My recommendation is to performance test all three formats (ORC, Parquet, and Avro). There is no one-size-fits-all answer.&lt;/P&gt;</description>
      <pubDate>Mon, 13 Aug 2018 21:20:53 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Which-storage-format-is-optimum-for-training-machine/m-p/219242#M82040</guid>
      <dc:creator>sunile_manjee</dc:creator>
      <dc:date>2018-08-13T21:20:53Z</dc:date>
    </item>
  </channel>
</rss>

