<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question S3 loading into HDFS in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/S3-loading-into-HDFS/m-p/90678#M35262</link>
    <description>&lt;P&gt;Hi All&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Cloudera suggests, as a best practice, using S3 only for initial and final storage. The intermediate files will need to be stored in HDFS... In that case, we are still using HDFS, but the cluster will only run during the batch ETL and then be torn down daily.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;How can we pull S3 data into HDFS for each batch ETL job, and then store the final results back to S3?&lt;/P&gt;&lt;P&gt;If Cloudera means to use distcp, how would that work for each batch ETL run? Using distcp did not make sense to me...&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;CK&lt;/P&gt;</description>
    <pubDate>Sun, 19 May 2019 13:17:27 GMT</pubDate>
    <dc:creator>CK71</dc:creator>
    <dc:date>2019-05-19T13:17:27Z</dc:date>
    <item>
      <title>S3 loading into HDFS</title>
      <link>https://community.cloudera.com/t5/Support-Questions/S3-loading-into-HDFS/m-p/90678#M35262</link>
      <description>&lt;P&gt;Hi All&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Cloudera suggests, as a best practice, using S3 only for initial and final storage. The intermediate files will need to be stored in HDFS... In that case, we are still using HDFS, but the cluster will only run during the batch ETL and then be torn down daily.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;How can we pull S3 data into HDFS for each batch ETL job, and then store the final results back to S3?&lt;/P&gt;&lt;P&gt;If Cloudera means to use distcp, how would that work for each batch ETL run? Using distcp did not make sense to me...&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;CK&lt;/P&gt;</description>
      <pubDate>Sun, 19 May 2019 13:17:27 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/S3-loading-into-HDFS/m-p/90678#M35262</guid>
      <dc:creator>CK71</dc:creator>
      <dc:date>2019-05-19T13:17:27Z</dc:date>
    </item>
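    <!--
      A minimal sketch of the staging flow the question asks about, assuming
      distcp is the chosen mechanism. The bucket name and paths (my-bucket,
      /etl/staging) are hypothetical.

        # Stage the day's input from S3 into HDFS before the batch run
        hadoop distcp -update s3a://my-bucket/input/2019-05-19/ hdfs:///etl/staging/input/

        # ... run the batch ETL against hdfs:///etl/staging/ ...

        # Push the final results back to S3; the cluster can then be torn down
        hadoop distcp -update hdfs:///etl/staging/output/ s3a://my-bucket/output/2019-05-19/
    -->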
    <item>
      <title>Re: S3 loading into HDFS</title>
      <link>https://community.cloudera.com/t5/Support-Questions/S3-loading-into-HDFS/m-p/90685#M35263</link>
      <description>You do not need to pull files into HDFS as a step in your processing, as CDH provides built-in connectors to read input from and write output to S3 storage directly (s3a:// URIs, backed by configuration that provides credentials and targets).&lt;BR /&gt;&lt;BR /&gt;This page is a good starting reference for setting up S3 access on cloud installations:&lt;BR /&gt;&lt;A href="https://www.cloudera.com/documentation/director/latest/topics/director_s3_object_storage.html" target="_blank"&gt;https://www.cloudera.com/documentation/director/latest/topics/director_s3_object_storage.html&lt;/A&gt;&lt;BR /&gt;Make sure to check out the page links from the opening paragraph too.&lt;BR /&gt;</description>
      <pubDate>Mon, 20 May 2019 01:04:15 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/S3-loading-into-HDFS/m-p/90685#M35263</guid>
      <dc:creator>Harsh J</dc:creator>
      <dc:date>2019-05-20T01:04:15Z</dc:date>
    </item>
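    <!--
      A minimal sketch of the direct s3a:// access the reply describes. The
      fs.s3a.* property names are the standard Hadoop S3A ones; the key values
      and bucket are hypothetical, and in practice the credentials would
      normally live in core-site.xml rather than on the command line.

        # List S3 input directly, credentials passed as Hadoop properties
        hadoop fs -D fs.s3a.access.key=AKIAEXAMPLE -D fs.s3a.secret.key=SECRETEXAMPLE -ls s3a://my-bucket/input/
    -->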
    <item>
      <title>Re: S3 loading into HDFS</title>
      <link>https://community.cloudera.com/t5/Support-Questions/S3-loading-into-HDFS/m-p/90695#M35264</link>
      <description>Hi&lt;BR /&gt;&lt;BR /&gt;Thanks for that. So I assume I will have to create an external Hive table pointing to S3 and copy the data from there into another internal Hive table on HDFS to start the ETL?&lt;BR /&gt;&lt;BR /&gt;Thanks&lt;BR /&gt;CK&lt;BR /&gt;</description>
      <pubDate>Mon, 20 May 2019 07:20:15 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/S3-loading-into-HDFS/m-p/90695#M35264</guid>
      <dc:creator>CK71</dc:creator>
      <dc:date>2019-05-20T07:20:15Z</dc:date>
    </item>
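    <!--
      A minimal sketch of the external-plus-internal table approach asked about
      here, assuming hypothetical table and column names. The external table
      points at S3; the CTAS copies the data into a Hive-managed table on HDFS.

        hive -e "
        CREATE EXTERNAL TABLE IF NOT EXISTS raw_events (id STRING, ts STRING, payload STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION 's3a://my-bucket/input/';

        CREATE TABLE IF NOT EXISTS staged_events STORED AS ORC
        AS SELECT * FROM raw_events;
        "
    -->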
    <item>
      <title>Re: S3 loading into HDFS</title>
      <link>https://community.cloudera.com/t5/Support-Questions/S3-loading-into-HDFS/m-p/90696#M35265</link>
      <description>You can apply the queries directly on that external table. Hive will use HDFS for any transient storage it requires as part of the query stages.&lt;BR /&gt;&lt;BR /&gt;Of course, if it is a set of queries overall, you can also store all the intermediate temporary tables on HDFS in the way you describe, but the point I am trying to make is that you do not need to copy the original data as-is; just allow Hive to read from S3 and write into S3 at the points that matter.&lt;BR /&gt;</description>
      <pubDate>Mon, 20 May 2019 08:14:15 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/S3-loading-into-HDFS/m-p/90696#M35265</guid>
      <dc:creator>Harsh J</dc:creator>
      <dc:date>2019-05-20T08:14:15Z</dc:date>
    </item>
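    <!--
      A minimal sketch of querying the S3-backed external table directly, as
      the reply suggests, reusing the hypothetical raw_events table from the
      sketch above. Hive keeps transient query stages in its HDFS scratch
      directory (hive.exec.scratchdir); only the final INSERT lands on S3.

        hive -e "
        CREATE EXTERNAL TABLE IF NOT EXISTS results (id STRING, total BIGINT)
        STORED AS PARQUET
        LOCATION 's3a://my-bucket/output/';

        INSERT OVERWRITE TABLE results
        SELECT id, COUNT(*) FROM raw_events GROUP BY id;
        "
    -->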
    <item>
      <title>Re: S3 loading into HDFS</title>
      <link>https://community.cloudera.com/t5/Support-Questions/S3-loading-into-HDFS/m-p/90701#M35266</link>
      <description>Thanks for this. I think we can summarize this as follows:&lt;BR /&gt;&lt;BR /&gt;* If only an external Hive table is used to process S3 data, the technical issues regarding consistency and scalable metadata handling would be resolved.&lt;BR /&gt;* If external and internal Hive tables are used in combination to process S3 data, the technical issues regarding consistency, scalable metadata handling, and data locality would be resolved.&lt;BR /&gt;* If Spark alone is used on top of S3, the technical issues regarding consistency (with in-memory processing) and scalable metadata handling would be resolved, as Spark will keep transient storage in memory and only read the initial data from S3 and write the result back.&lt;BR /&gt;</description>
      <pubDate>Mon, 20 May 2019 11:18:15 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/S3-loading-into-HDFS/m-p/90701#M35266</guid>
      <dc:creator>CK71</dc:creator>
      <dc:date>2019-05-20T11:18:15Z</dc:date>
    </item>
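    <!--
      A minimal sketch of the Spark-only variant in the summary, assuming the
      Apache Spark spark-sql CLI is available and shares the cluster's Hive
      metastore, and reusing the hypothetical tables from the sketches above.
      Spark keeps intermediate state in memory and local shuffle files, so only
      the initial read and the final write touch S3.

        spark-sql -e "
        INSERT OVERWRITE TABLE results
        SELECT id, COUNT(*) FROM raw_events GROUP BY id;
        "
    -->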
  </channel>
</rss>