<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: How spark works to analyze huge databases in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/How-spark-works-to-analyze-huge-databases/m-p/162249#M124628</link>
    <description>&lt;A rel="user" href="https://community.cloudera.com/users/3093/oscaricardo4.html" nodeid="3093"&gt;@Jan J&lt;/A&gt;&lt;P&gt;I wouldn't start with a 2-node cluster; use a minimum of 3 to 5 nodes --&amp;gt; this is just a lab environment.&lt;/P&gt;&lt;P&gt;2 masters, 3 DataNodes.&lt;/P&gt;&lt;P&gt;You need to deploy a cluster - &lt;A target="_blank" href="http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.0.0/bk_Installing_HDP_AMB/content/ch_Getting_Ready.html"&gt;Link&lt;/A&gt;. Use Ambari to deploy HDP.&lt;/P&gt;&lt;P&gt;&lt;A href="https://github.com/cartershanklin/hive-testbench" target="_blank"&gt;https://github.com/cartershanklin/hive-testbench&lt;/A&gt; - you can generate Hive data using the testbench,&lt;/P&gt;&lt;P&gt;then you can test Spark SQL,&lt;/P&gt;&lt;P&gt;and&lt;/P&gt;&lt;P&gt;&lt;A href="https://github.com/databricks/spark-perf" target="_blank"&gt;https://github.com/databricks/spark-perf&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Yes, you should start with Hadoop and take advantage of the distributed computing framework.&lt;/P&gt;</description>
    <pubDate>Sun, 28 Feb 2016 21:24:25 GMT</pubDate>
    <dc:creator>nsabharwal</dc:creator>
    <dc:date>2016-02-28T21:24:25Z</dc:date>
    <item>
      <title>How spark works to analyze huge databases</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-spark-works-to-analyze-huge-databases/m-p/162248#M124627</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I'm studying Spark, because I read some studies about it and it seems amazing for processing large volumes of data. So I was thinking of experimenting with this: generating 100 GB of data with a benchmark like TPC and executing the queries with Spark using 2 nodes, but I have some doubts about how to do this.&lt;/P&gt;&lt;P&gt;Do I need to install two Hadoop nodes to store the TPC tables? And then execute the queries with Spark against HDFS? But how can we create the TPC schema and store the tables in HDFS? Is it possible? Or is it not necessary to install Hadoop, and we need to use Hive instead? I've been reading some articles about this but I'm getting a bit confused. Thanks for your attention!&lt;/P&gt;</description>
      <pubDate>Sun, 28 Feb 2016 21:19:42 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-spark-works-to-analyze-huge-databases/m-p/162248#M124627</guid>
      <dc:creator>oscaricardo4</dc:creator>
      <dc:date>2016-02-28T21:19:42Z</dc:date>
    </item>
    <item>
      <title>Re: How spark works to analyze huge databases</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-spark-works-to-analyze-huge-databases/m-p/162249#M124628</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/3093/oscaricardo4.html" nodeid="3093"&gt;@Jan J&lt;/A&gt;&lt;P&gt;I wouldn't start with a 2-node cluster; use a minimum of 3 to 5 nodes --&amp;gt; this is just a lab environment.&lt;/P&gt;&lt;P&gt;2 masters, 3 DataNodes.&lt;/P&gt;&lt;P&gt;You need to deploy a cluster - &lt;A target="_blank" href="http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.0.0/bk_Installing_HDP_AMB/content/ch_Getting_Ready.html"&gt;Link&lt;/A&gt;. Use Ambari to deploy HDP.&lt;/P&gt;&lt;P&gt;&lt;A href="https://github.com/cartershanklin/hive-testbench" target="_blank"&gt;https://github.com/cartershanklin/hive-testbench&lt;/A&gt; - you can generate Hive data using the testbench,&lt;/P&gt;&lt;P&gt;then you can test Spark SQL,&lt;/P&gt;&lt;P&gt;and&lt;/P&gt;&lt;P&gt;&lt;A href="https://github.com/databricks/spark-perf" target="_blank"&gt;https://github.com/databricks/spark-perf&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Yes, you should start with Hadoop and take advantage of the distributed computing framework.&lt;/P&gt;</description>
      <pubDate>Sun, 28 Feb 2016 21:24:25 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-spark-works-to-analyze-huge-databases/m-p/162249#M124628</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2016-02-28T21:24:25Z</dc:date>
    </item>
    <item>
      <title>Re: How spark works to analyze huge databases</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-spark-works-to-analyze-huge-databases/m-p/162250#M124629</link>
      <description>&lt;P&gt;Thanks for your help. So first I need to install a Hadoop cluster and upload the tables (.tbl) into Hadoop? And then also create the schema and store the tables in Hive?&lt;/P&gt;</description>
      <pubDate>Sun, 28 Feb 2016 21:30:44 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-spark-works-to-analyze-huge-databases/m-p/162250#M124629</guid>
      <dc:creator>oscaricardo4</dc:creator>
      <dc:date>2016-02-28T21:30:44Z</dc:date>
    </item>
    <item>
      <title>Re: How spark works to analyze huge databases</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-spark-works-to-analyze-huge-databases/m-p/162251#M124630</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/3093/oscaricardo4.html" nodeid="3093"&gt;@Jan J&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Yes:&lt;/P&gt;&lt;P&gt;1) Set up the cluster&lt;/P&gt;&lt;P&gt;2) Load the data - see the link shared above&lt;/P&gt;&lt;P&gt;3) You can also use the Spark repo &lt;A href="https://github.com/databricks/spark-perf"&gt;https://github.com/databricks/spark-perf&lt;/A&gt;&lt;/P&gt;&lt;P&gt;The main step is to build a cluster; then the options are unlimited.&lt;/P&gt;</description>
      <pubDate>Sun, 28 Feb 2016 21:50:39 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-spark-works-to-analyze-huge-databases/m-p/162251#M124630</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2016-02-28T21:50:39Z</dc:date>
    </item>
    <item>
      <title>Re: How spark works to analyze huge databases</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-spark-works-to-analyze-huge-databases/m-p/162252#M124631</link>
      <description>&lt;P&gt;Thanks again for your help. So OK, the first step is to set up a Hadoop cluster. But the link that you shared, "&lt;A href="https://github.com/databricks/spark-perf"&gt;https://github.com/databricks/spark-perf&lt;/A&gt;", has a step titled "Running on an existing Spark cluster". So if we want to execute some queries with Spark, isn't it possible to create a Spark cluster with 4 nodes and store the tables there instead of creating a Hadoop cluster?&lt;/P&gt;</description>
      <pubDate>Mon, 29 Feb 2016 05:35:23 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-spark-works-to-analyze-huge-databases/m-p/162252#M124631</guid>
      <dc:creator>oscaricardo4</dc:creator>
      <dc:date>2016-02-29T05:35:23Z</dc:date>
    </item>
    <item>
      <title>Re: How spark works to analyze huge databases</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-spark-works-to-analyze-huge-databases/m-p/162253#M124632</link>
      <description>&lt;P&gt;And also, is Hive really necessary? Can't we have only the Hadoop cluster with the table data and execute queries with Spark against Hadoop, without Hive?&lt;/P&gt;</description>
      <pubDate>Mon, 29 Feb 2016 07:18:44 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-spark-works-to-analyze-huge-databases/m-p/162253#M124632</guid>
      <dc:creator>oscaricardo4</dc:creator>
      <dc:date>2016-02-29T07:18:44Z</dc:date>
    </item>
    <item>
      <title>Re: How spark works to analyze huge databases</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-spark-works-to-analyze-huge-databases/m-p/162254#M124633</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/3093/oscaricardo4.html" nodeid="3093"&gt;@Jan J&lt;/A&gt; See this: &lt;A href="http://spark.apache.org/sql/"&gt;http://spark.apache.org/sql/&lt;/A&gt;. You have various options to access structured data.&lt;/P&gt;</description>
      <pubDate>Mon, 29 Feb 2016 08:49:22 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-spark-works-to-analyze-huge-databases/m-p/162254#M124633</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2016-02-29T08:49:22Z</dc:date>
    </item>
    <item>
      <title>Re: How spark works to analyze huge databases</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-spark-works-to-analyze-huge-databases/m-p/162255#M124634</link>
      <description>&lt;P&gt;Thanks again. I read the link but I'm still in doubt. Is it really necessary to install Hadoop, then Hive, then create the database schema in Hive and load the database data into Hive, and then use Spark to query the Hive database? Isn't it possible to install Hadoop, then load the TPC-H schema and data into Hadoop, and query the Hadoop data with Spark? I'm reading a lot of documentation but I really don't understand the best solution for this.&lt;/P&gt;</description>
      <pubDate>Sun, 06 Mar 2016 01:19:46 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-spark-works-to-analyze-huge-databases/m-p/162255#M124634</guid>
      <dc:creator>oscaricardo4</dc:creator>
      <dc:date>2016-03-06T01:19:46Z</dc:date>
    </item>
    <item>
      <title>Re: How spark works to analyze huge databases</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-spark-works-to-analyze-huge-databases/m-p/162256#M124635</link>
      <description>&lt;P&gt;Because I want to test the TPC-H queries with Spark, not with Hive. But is it necessary to use Hive as an intermediary to execute queries with Spark?&lt;/P&gt;</description>
      <pubDate>Sun, 06 Mar 2016 03:39:33 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-spark-works-to-analyze-huge-databases/m-p/162256#M124635</guid>
      <dc:creator>oscaricardo4</dc:creator>
      <dc:date>2016-03-06T03:39:33Z</dc:date>
    </item>
    <item>
      <title>Re: How spark works to analyze huge databases</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-spark-works-to-analyze-huge-databases/m-p/162257#M124636</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/3093/oscaricardo4.html" nodeid="3093"&gt;@Jan J&lt;/A&gt; You have options to access the data. As Hive/HQL is the industry standard for interacting with Hadoop, users are leveraging Spark SQL + Hive.&lt;/P&gt;&lt;P&gt;Please read the overview: &lt;A href="http://spark.apache.org/docs/latest/sql-programming-guide.html#overview"&gt;http://spark.apache.org/docs/latest/sql-programming-guide.html#overview&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 06 Mar 2016 09:30:44 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-spark-works-to-analyze-huge-databases/m-p/162257#M124636</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2016-03-06T09:30:44Z</dc:date>
    </item>
    <item>
      <title>Re: How spark works to analyze huge databases</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-spark-works-to-analyze-huge-databases/m-p/162258#M124637</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/3093/oscaricardo4.html" nodeid="3093"&gt;@Jan J&lt;/A&gt; Please help me close the thread if this was useful.&lt;/P&gt;</description>
      <pubDate>Sun, 06 Mar 2016 11:03:52 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-spark-works-to-analyze-huge-databases/m-p/162258#M124637</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2016-03-06T11:03:52Z</dc:date>
    </item>
    <item>
      <title>Re: How spark works to analyze huge databases</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-spark-works-to-analyze-huge-databases/m-p/162259#M124638</link>
      <description>&lt;P&gt;Thanks for your help. I just don't understand why this link that you shared is needed: &lt;A href="https://github.com/databricks/spark-perf"&gt;https://github.com/databricks/spark-perf&lt;/A&gt;. Can you explain? In the first step I install Hadoop, then install Hive and create the schema. Then I can use Spark SQL to execute queries against the Hive schema, right? So why is that link necessary? Thanks again!&lt;/P&gt;</description>
      <pubDate>Sun, 13 Mar 2016 00:33:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-spark-works-to-analyze-huge-databases/m-p/162259#M124638</guid>
      <dc:creator>oscaricardo4</dc:creator>
      <dc:date>2016-03-13T00:33:06Z</dc:date>
    </item>
  </channel>
</rss>