<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: spark 2.1.0 Reading *.gz files from an s3 bucket or dir as a Dataframe or Dataset.. in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/spark-2-1-0-Reading-gz-files-from-an-s3-bucket-or-dir-as-a/m-p/158220#M53240</link>
    <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1952/suri-1415.html" nodeid="1952"&gt;@BigDataRocks&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I believe you need to escape the wildcard: &lt;STRONG&gt;val df = spark.sparkContext.textFile("s3n://..../\*.gz).&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Additionally, the S3N filesystem client, while widely used, is no longer undergoing active maintenance except for emergency security issues. The S3A filesystem client can read all files created by S3N. Accordingly it should be used wherever possible.&lt;/P&gt;&lt;P&gt;Please see: &lt;A href="https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md"&gt;https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md&lt;/A&gt; for the s3a classpath dependencies and authentication properties you need to be aware of.&lt;/P&gt;&lt;P&gt;A nice tutorial on this subject can be found here: &lt;A href="https://community.hortonworks.com/articles/36339/spark-s3a-filesystem-client-from-hdp-to-access-s3.html"&gt;https://community.hortonworks.com/articles/36339/spark-s3a-filesystem-client-from-hdp-to-access-s3.html&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Sun, 05 Feb 2017 01:38:39 GMT</pubDate>
    <dc:creator>tmccuch</dc:creator>
    <dc:date>2017-02-05T01:38:39Z</dc:date>
    <item>
      <title>spark 2.1.0 Reading *.gz files from an s3 bucket or dir as a Dataframe or Dataset..</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/spark-2-1-0-Reading-gz-files-from-an-s3-bucket-or-dir-as-a/m-p/158219#M53239</link>
      <description>&lt;P&gt;Just wondering if Spark supports reading *.gz files from an S3 bucket or directory as a DataFrame or Dataset. I think we can read them as an RDD, but it's still not working for me. Any help would be appreciated. Thank you.&lt;/P&gt;&lt;P&gt;I am using s3n://.., but Spark throws an invalid input path exception.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;val df = spark.sparkContext.textFile("s3n://..../*.gz")&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;doesn't work for me &lt;span class="lia-unicode-emoji" title=":disappointed_face:"&gt;😞&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I would prefer to read the S3 directory of .gz files as a DataFrame or Dataset if possible, or at least as an RDD. Thank you.&lt;/P&gt;</description>
      <pubDate>Fri, 03 Feb 2017 01:59:03 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/spark-2-1-0-Reading-gz-files-from-an-s3-bucket-or-dir-as-a/m-p/158219#M53239</guid>
      <dc:creator>bigspark</dc:creator>
      <dc:date>2017-02-03T01:59:03Z</dc:date>
    </item>
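The question above asks how to read gzipped text from S3 as a DataFrame or Dataset rather than only as an RDD. A minimal sketch of both approaches follows; the bucket name and path are placeholders (not from the thread), and it assumes a Spark 2.x session with an S3 filesystem client on the classpath:

```scala
// Sketch only: "my-bucket" and "logs/" are hypothetical placeholders.
// Spark decompresses .gz text files transparently, and glob patterns
// such as *.gz are accepted by both spark.read.text and textFile.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("read-gz-from-s3")
  .getOrCreate()

// As a DataFrame: one row per line, in a single column named "value"
val df = spark.read.text("s3a://my-bucket/logs/*.gz")

// As a Dataset[String]
import spark.implicits._
val ds = spark.read.textFile("s3a://my-bucket/logs/*.gz")

// As an RDD[String], matching the snippet in the question
val rdd = spark.sparkContext.textFile("s3a://my-bucket/logs/*.gz")
```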
    <item>
      <title>Re: spark 2.1.0 Reading *.gz files from an s3 bucket or dir as a Dataframe or Dataset..</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/spark-2-1-0-Reading-gz-files-from-an-s3-bucket-or-dir-as-a/m-p/158220#M53240</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1952/suri-1415.html" nodeid="1952"&gt;@BigDataRocks&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I believe you need to escape the wildcard: &lt;STRONG&gt;val df = spark.sparkContext.textFile("s3n://..../\*.gz).&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Additionally, the S3N filesystem client, while widely used, is no longer undergoing active maintenance except for emergency security issues. The S3A filesystem client can read all files created by S3N. Accordingly it should be used wherever possible.&lt;/P&gt;&lt;P&gt;Please see: &lt;A href="https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md"&gt;https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md&lt;/A&gt; for the s3a classpath dependencies and authentication properties you need to be aware of.&lt;/P&gt;&lt;P&gt;A nice tutorial on this subject can be found here: &lt;A href="https://community.hortonworks.com/articles/36339/spark-s3a-filesystem-client-from-hdp-to-access-s3.html"&gt;https://community.hortonworks.com/articles/36339/spark-s3a-filesystem-client-from-hdp-to-access-s3.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 05 Feb 2017 01:38:39 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/spark-2-1-0-Reading-gz-files-from-an-s3-bucket-or-dir-as-a/m-p/158220#M53240</guid>
      <dc:creator>tmccuch</dc:creator>
      <dc:date>2017-02-05T01:38:39Z</dc:date>
    </item>
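The reply above recommends switching from S3N to S3A and points at the hadoop-aws documentation for the required authentication properties. A sketch of what that wiring might look like from Spark follows; the credential values and bucket are placeholders, and the `fs.s3a.*` property names are the ones documented for the hadoop-aws module (in practice, credential providers or instance profiles are preferable to hard-coding keys):

```scala
// Sketch only: <ACCESS_KEY>, <SECRET_KEY>, and "my-bucket" are placeholders.
// Requires hadoop-aws (and its aws-sdk dependency) on the classpath.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-config")
  .getOrCreate()

// fs.s3a.access.key / fs.s3a.secret.key are the hadoop-aws
// authentication properties referenced in the linked index.md.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", "<ACCESS_KEY>")
hadoopConf.set("fs.s3a.secret.key", "<SECRET_KEY>")

// With S3A configured, the same glob read works against s3a:// URLs.
val df = spark.read.text("s3a://my-bucket/data/*.gz")
```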
    <item>
      <title>Re: spark 2.1.0 Reading *.gz files from an s3 bucket or dir as a Dataframe or Dataset..</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/spark-2-1-0-Reading-gz-files-from-an-s3-bucket-or-dir-as-a/m-p/158221#M53241</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1952/suri-1415.html" nodeid="1952"&gt;@BigDataRocks&lt;/A&gt; &lt;/P&gt;&lt;P&gt;Please let me know if this helped answer your question.&lt;/P&gt;&lt;P&gt;Thanks. Tom&lt;/P&gt;</description>
      <pubDate>Wed, 08 Feb 2017 22:00:27 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/spark-2-1-0-Reading-gz-files-from-an-s3-bucket-or-dir-as-a/m-p/158221#M53241</guid>
      <dc:creator>tmccuch</dc:creator>
      <dc:date>2017-02-08T22:00:27Z</dc:date>
    </item>
    <item>
      <title>Re: spark 2.1.0 Reading *.gz files from an s3 bucket or dir as a Dataframe or Dataset..</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/spark-2-1-0-Reading-gz-files-from-an-s3-bucket-or-dir-as-a/m-p/158222#M53242</link>
      <description>&lt;P&gt;There's also the documentation here: &lt;A href="https://hortonworks.github.io/hdp-aws/s3-spark/"&gt;https://hortonworks.github.io/hdp-aws/s3-spark/&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 14 Feb 2017 03:21:32 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/spark-2-1-0-Reading-gz-files-from-an-s3-bucket-or-dir-as-a/m-p/158222#M53242</guid>
      <dc:creator>stevel</dc:creator>
      <dc:date>2017-02-14T03:21:32Z</dc:date>
    </item>
  </channel>
</rss>

