<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Parquet data duplication in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103242#M66159</link>
    <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1227/gtmehdi.html" nodeid="1227"&gt;@Mehdi TAZI&lt;/A&gt;&lt;/P&gt;&lt;P&gt;1 - You are using HBase for very fast lookups / near-real-time data access - yes, that is fine.&lt;/P&gt;&lt;P&gt;2 - You want to store data in HDFS - yes, that is fine too, and it can serve many use cases down the road. You can keep this data for a long time, and create Hive tables on top of it for analytics or reporting.&lt;/P&gt;</description>
    <pubDate>Tue, 19 Jan 2016 21:22:37 GMT</pubDate>
    <dc:creator>nsabharwal</dc:creator>
    <dc:date>2016-01-19T21:22:37Z</dc:date>
    <item>
      <title>Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103235#M66152</link>
      <description>&lt;P&gt;Hello All,&lt;/P&gt;&lt;P&gt;It's clear that Parquet files make OLAP queries faster because of their columnar format, but on the other hand the data lake is duplicated (raw data + Parquet data). Even if Parquet can be compressed, don't you think that duplicating all the data can cost a lot?&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2016 18:31:24 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103235#M66152</guid>
      <dc:creator>TAZIMehdi</dc:creator>
      <dc:date>2016-01-19T18:31:24Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103236#M66153</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1227/gtmehdi.html" nodeid="1227" target="_blank"&gt;@Mehdi TAZI&lt;/A&gt;
&lt;/P&gt;&lt;P&gt;I am a big fan of ORC:&lt;/P&gt;&lt;P&gt;&lt;A href="http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/" target="_blank" rel="nofollow noopener noreferrer"&gt;http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="1456-orcfile.png" style="width: 1323px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/23651i2252C6724BA4F56A/image-size/medium?v=v2&amp;amp;px=400" role="button" title="1456-orcfile.png" alt="1456-orcfile.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 19 Aug 2019 12:08:22 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103236#M66153</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2019-08-19T12:08:22Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103237#M66154</link>
      <description>&lt;P&gt;Thanks for your answer, but my question wasn't about comparing compression rates; we actually need both the original and the columnar files. So is it normal to duplicate the whole data lake to get better performance?&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2016 19:06:05 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103237#M66154</guid>
      <dc:creator>TAZIMehdi</dc:creator>
      <dc:date>2016-01-19T19:06:05Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103238#M66155</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1227/gtmehdi.html" nodeid="1227"&gt;@Mehdi TAZI&lt;/A&gt; No, and I have never heard of duplicating data with Parquet. I hope you are not referring to the HDFS replication factor; if you are, please see &lt;A target="_blank" href="https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Data+Replication"&gt;this&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2016 19:14:20 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103238#M66155</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2016-01-19T19:14:20Z</dc:date>
    </item>
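The thread above distinguishes two different things that both get called "duplication": the HDFS replication factor (which multiplies every stored copy) and keeping a second, columnar copy of the data. A rough back-of-envelope sketch of how the two interact, with all sizes and ratios purely hypothetical:

```python
# Back-of-envelope storage estimate (all numbers hypothetical).
# HDFS replication multiplies EVERY stored copy; keeping a raw copy
# plus a columnar copy is "format duplication", but the columnar copy
# is typically much smaller than the raw one thanks to compression.

RAW_TB = 100          # raw data landed in HDFS (hypothetical size)
PARQUET_RATIO = 0.25  # assumed compressed-columnar size vs raw
REPLICATION = 3       # default HDFS replication factor

raw_stored = RAW_TB * REPLICATION
parquet_stored = RAW_TB * PARQUET_RATIO * REPLICATION
total = raw_stored + parquet_stored

print(f"raw copies on disk:     {raw_stored:.0f} TB")
print(f"parquet copies on disk: {parquet_stored:.0f} TB")
print(f"total footprint:        {total:.0f} TB")
```

Under these assumed numbers the columnar copy adds 25% to the footprint, not 100% - which is the point made later in the thread about compression reducing the cost of keeping both copies.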
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103239#M66156</link>
      <description>&lt;P&gt;I think I didn't explain my point well. Let's assume a system that receives data from outside sources; normally we store the raw data in HDFS/HBase in order to keep it in its original format.&lt;/P&gt;&lt;P&gt;Now let's assume that we want to make ad-hoc queries faster, so we convert all the data to Parquet format and of course keep the raw copy (this is the duplication I'm talking about).&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2016 19:26:16 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103239#M66156</guid>
      <dc:creator>TAZIMehdi</dc:creator>
      <dc:date>2016-01-19T19:26:16Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103240#M66157</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1227/gtmehdi.html" nodeid="1227"&gt;@Mehdi TAZI&lt;/A&gt; Very good point. It goes back to ELT: the source of truth ("raw data") lands in HDFS, we run transformations on that data, and we load it into Hive or HBase based on the use case. There is a significant cost difference between storing the source of truth in Hadoop and keeping it on an expensive SAN or in an EDW.&lt;/P&gt;&lt;P&gt;You don't have to store it in HDFS first; you can load data directly into Hive or HBase tables. Take the most basic use case, i.e. data archival: you can "move" data from an EDW into Hive using Sqoop, and the data goes directly into Hive tables.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2016 19:33:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103240#M66157</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2016-01-19T19:33:41Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103241#M66158</link>
      <description>&lt;P&gt;Excuse me, I didn't understand your answer.&lt;/P&gt;&lt;P&gt;Here is a typical case: I have a job that reads raw data from a source (e.g. Kafka) and stores it in the data lake (HBase over HDFS) for archival, and at the same time this same job creates Parquet files stored on HDFS for analytics. Here we are saving the same data in different formats for two different purposes, so the same data is duplicated.&lt;/P&gt;&lt;P&gt;1 - Is it right to do this?&lt;/P&gt;&lt;P&gt;2 - If yes, is it normal that the data is duplicated?&lt;/P&gt;&lt;P&gt;Thanks a lot!&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2016 21:09:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103241#M66158</guid>
      <dc:creator>TAZIMehdi</dc:creator>
      <dc:date>2016-01-19T21:09:51Z</dc:date>
    </item>
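The dual-write job described in the post above can be sketched in miniature: the same batch is written once row-wise (cheap point lookups by key, the role HBase plays here) and once pivoted column-wise (contiguous values per field, the layout that makes Parquet scans fast). The sinks below are plain in-memory dicts standing in for the real stores, and all record names are illustrative:

```python
# Toy sketch of a dual-sink ingest job: one batch is kept row-wise
# (archival / point lookups, as with HBase) and also pivoted into
# column form (analytic scans, as with Parquet). Stub sinks only;
# record fields and ids are made up for illustration.

def ingest(batch, row_store, column_store):
    for record in batch:
        # Row sink: keyed by record id, each record kept whole.
        row_store[record["id"]] = record
    # Columnar sink: one list per field, values stored contiguously.
    for field in batch[0]:
        column_store.setdefault(field, []).extend(r[field] for r in batch)

batch = [
    {"id": "a1", "event": "click", "ms": 120},
    {"id": "b2", "event": "view",  "ms": 45},
]
rows, columns = {}, {}
ingest(batch, rows, columns)

print(rows["a1"]["event"])  # point lookup by key, row stays together
print(columns["ms"])        # whole column ready for an analytic scan
```

The duplication the poster asks about is visible directly: every value exists in both `rows` and `columns`, because each layout serves a different access pattern.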
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103242#M66159</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1227/gtmehdi.html" nodeid="1227"&gt;@Mehdi TAZI&lt;/A&gt;&lt;/P&gt;&lt;P&gt;1 - You are using HBase for very fast lookups / near-real-time data access - yes, that is fine.&lt;/P&gt;&lt;P&gt;2 - You want to store data in HDFS - yes, that is fine too, and it can serve many use cases down the road. You can keep this data for a long time, and create Hive tables on top of it for analytics or reporting.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2016 21:22:37 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103242#M66159</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2016-01-19T21:22:37Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103243#M66160</link>
      <description>&lt;P&gt;This is exactly what I was doing; the current architecture includes generated Parquet files to improve performance, and it works! The only side effect is data duplication, so I was wondering whether there is another technology that would let me improve performance without this side effect.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2016 21:45:35 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103243#M66160</guid>
      <dc:creator>TAZIMehdi</dc:creator>
      <dc:date>2016-01-19T21:45:35Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103244#M66161</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1227/gtmehdi.html" nodeid="1227"&gt;@Mehdi TAZI&lt;/A&gt; HBase and HDFS are a really good combination. You don't have to store everything in HBase; you can store only the fields required by your application. Enabling compression for HBase and using ORC for your Hive tables will help you reduce your storage footprint.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2016 22:11:25 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103244#M66161</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2016-01-19T22:11:25Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103245#M66162</link>
      <description>&lt;P&gt;I agree, but the ORC part will still be duplicated, no?&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2016 22:24:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103245#M66162</guid>
      <dc:creator>TAZIMehdi</dc:creator>
      <dc:date>2016-01-19T22:24:51Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103246#M66163</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1227/gtmehdi.html" nodeid="1227"&gt;@Mehdi TAZI&lt;/A&gt; Better compression means less storage cost. My suggestion is not to confuse HBase or NoSQL with HDFS. There are customers who use HDFS and Hive without HBase; HBase is designed for special use cases where you have to access data in real time (you have mentioned this already) &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2016 22:36:12 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103246#M66163</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2016-01-19T22:36:12Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103247#M66164</link>
      <description>&lt;P&gt;Yes, thanks ^^. In my case I'm using HBase because I'm handling a large number of small files.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2016 23:18:37 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103247#M66164</guid>
      <dc:creator>TAZIMehdi</dc:creator>
      <dc:date>2016-01-19T23:18:37Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103248#M66165</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1227/gtmehdi.html" nodeid="1227"&gt;@Mehdi TAZI&lt;/A&gt; That sounds correct. I did connect with you on Twitter; feel free to connect back and we can discuss in detail. I do believe you are on the right track.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jan 2016 23:27:33 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103248#M66165</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2016-01-19T23:27:33Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103249#M66166</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/1227/gtmehdi.html" nodeid="1227"&gt;@Mehdi TAZI&lt;/A&gt;&lt;P&gt;In one of your deleted responses you mentioned that you duplicate data for Hive queries and use HBase for the small-files issue. You can actually map Hive to HBase and run analytics queries on top of HBase. That may not be the most efficient way, but you can also map HBase snapshots to Hive, which is a lot better as far as HBase is concerned.&lt;/P&gt;</description>
      <pubDate>Wed, 20 Jan 2016 10:40:05 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103249#M66166</guid>
      <dc:creator>aervits</dc:creator>
      <dc:date>2016-01-20T10:40:05Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet data duplication</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103250#M66167</link>
      <description>&lt;P&gt;First of all, thanks for your answer. The duplication I meant is about the same data being stored in both Parquet and HBase; besides, using Hive over HBase is not really as good as having a true columnar format... Have a nice day &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 20 Jan 2016 17:44:22 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Parquet-data-duplication/m-p/103250#M66167</guid>
      <dc:creator>TAZIMehdi</dc:creator>
      <dc:date>2016-01-20T17:44:22Z</dc:date>
    </item>
  </channel>
</rss>

