<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Reading/analysing a JSON file of about 1 TB in a Spark/HDInsight Kafka cluster in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Reading-analysing-Json-file-with-about-1TB-size-in-Spark/m-p/284059#M210988</link>
    <description>&lt;P&gt;I would like to analyse a large JSON dataset (0.9 TB after unzipping) on a cluster with 14 nodes and 39 cores (Azure HDInsight/Kafka), but it is very slow. Here is what I do:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;1. The data is downloaded from &lt;A href="https://dumps.wikimedia.org/wikidatawiki/entities/" target="_self"&gt;here&lt;/A&gt; and stored in HDFS.&lt;/P&gt;
&lt;P&gt;2. val data = spark.read.json(path) crashes.&lt;/P&gt;
&lt;P&gt;3. val rdd = sc.textFile(path) followed by rdd.count() also crashes.&lt;/P&gt;
&lt;P&gt;4. rdd.take(10) and similar actions work fine.&lt;/P&gt;
&lt;P&gt;5. It was not possible to unzip the file, so I read data.json.gz directly.&lt;/P&gt;
&lt;P&gt;Any suggestions? How can I read it with the JSON reader?&lt;/P&gt;
&lt;P&gt;Thanks&lt;/P&gt;</description>
    <pubDate>Tue, 26 Nov 2019 13:55:59 GMT</pubDate>
    <dc:creator>Maryam</dc:creator>
    <dc:date>2019-11-26T13:55:59Z</dc:date>
    <item>
      <title>Reading/analysing a JSON file of about 1 TB in a Spark/HDInsight Kafka cluster</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Reading-analysing-Json-file-with-about-1TB-size-in-Spark/m-p/284059#M210988</link>
      <description>&lt;P&gt;I would like to analyse a large JSON dataset (0.9 TB after unzipping) on a cluster with 14 nodes and 39 cores (Azure HDInsight/Kafka), but it is very slow. Here is what I do:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;1. The data is downloaded from &lt;A href="https://dumps.wikimedia.org/wikidatawiki/entities/" target="_self"&gt;here&lt;/A&gt; and stored in HDFS.&lt;/P&gt;
&lt;P&gt;2. val data = spark.read.json(path) crashes.&lt;/P&gt;
&lt;P&gt;3. val rdd = sc.textFile(path) followed by rdd.count() also crashes.&lt;/P&gt;
&lt;P&gt;4. rdd.take(10) and similar actions work fine.&lt;/P&gt;
&lt;P&gt;5. It was not possible to unzip the file, so I read data.json.gz directly.&lt;/P&gt;
&lt;P&gt;Any suggestions? How can I read it with the JSON reader?&lt;/P&gt;
&lt;P&gt;Thanks&lt;/P&gt;</description>
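One possible explanation, sketched in the same Scala/Spark style as the question (the HDFS paths and the partition count below are illustrative assumptions, not from the post): a single .json.gz file is not splittable, so Spark must stream the entire compressed file through one task. That is consistent with take(10) succeeding (it only reads the first few records) while count() and spark.read.json, which must traverse all 0.9 TB on one executor, appear to crash or hang. One common workaround is to read the gzipped file once as plain text, repartition, and write an uncompressed, splittable copy that later jobs can parse in parallel:

```scala
// Sketch only: paths and the partition count are assumed for illustration.
// A single gzip file is not splittable, so this first read runs as ONE task.
// It streams line by line, so it is slow but should not exhaust memory.
val lines = spark.sparkContext.textFile("hdfs:///data/latest-all.json.gz")

// Rewrite the dump uncompressed across many part-files so that future
// jobs can read it with many parallel tasks.
lines.repartition(512).saveAsTextFile("hdfs:///data/latest-all.json")

// Subsequent runs parse the splittable copy in parallel.
val data = spark.read.json("hdfs:///data/latest-all.json")
```

An alternative, if storing the dump uncompressed is too costly, is to recompress with a splittable codec such as bzip2, which Hadoop input formats can split across tasks.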
      <pubDate>Tue, 26 Nov 2019 13:55:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Reading-analysing-Json-file-with-about-1TB-size-in-Spark/m-p/284059#M210988</guid>
      <dc:creator>Maryam</dc:creator>
      <dc:date>2019-11-26T13:55:59Z</dc:date>
    </item>
    <item>
      <title>Re: Reading/analysing a JSON file of about 1 TB in a Spark/HDInsight Kafka cluster</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Reading-analysing-Json-file-with-about-1TB-size-in-Spark/m-p/284076#M210995</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/70154"&gt;@Maryam&lt;/a&gt;&amp;nbsp;&lt;SPAN&gt;While we welcome your question, you would be much more likely to obtain a useful answer if you posted it to &lt;A href="https://social.msdn.microsoft.com/forums/azure/en-us/home?forum=hdinsight" target="_blank" rel="noopener nofollow"&gt;the appropriate forum for Microsoft Azure HDInsight&lt;/A&gt;.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 26 Nov 2019 15:02:50 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Reading-analysing-Json-file-with-about-1TB-size-in-Spark/m-p/284076#M210995</guid>
      <dc:creator>ask_bill_brooks</dc:creator>
      <dc:date>2019-11-26T15:02:50Z</dc:date>
    </item>
  </channel>
</rss>

