<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Storage data in HDFS - What's next? in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storage-data-in-HDFS-What-s-next/m-p/119073#M26394</link>
    <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/2985/antonio-scp125.html" nodeid="2985"&gt;@Pedro Alves&lt;/A&gt;&lt;/P&gt;&lt;P&gt;You can also use Spark for data cleansing and transformation. The advantage is that you use the same tool for data preparation, discovery and analysis/ML.&lt;/P&gt;</description>
    <pubDate>Thu, 28 Apr 2016 04:01:36 GMT</pubDate>
    <dc:creator>ahadjidj</dc:creator>
    <dc:date>2016-04-28T04:01:36Z</dc:date>
    <item>
      <title>Storage data in HDFS - What's next?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storage-data-in-HDFS-What-s-next/m-p/119070#M26391</link>
      <description>&lt;P&gt;Hi experts,&lt;/P&gt;&lt;P&gt;I was used to the usual data warehousing process:
Source Data - ETL
Now I'm using Hadoop and I'm a bit confused...
I have inserted the data into HDFS, but now I would like to understand the data better and apply some segmentations (by profile, for example). I'd like to use Flume, Spark, Impala and Hive, but I am not able to work out the role of each tool or when I should apply each of them.

Does anyone have an idea what the usual Big Data process is before applying any kind of analytics?

Many thanks!!!&lt;/P&gt;</description>
      <pubDate>Thu, 28 Apr 2016 00:28:30 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storage-data-in-HDFS-What-s-next/m-p/119070#M26391</guid>
      <dc:creator>prodgers125</dc:creator>
      <dc:date>2016-04-28T00:28:30Z</dc:date>
    </item>
    <item>
      <title>Re: Storage data in HDFS - What's next?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storage-data-in-HDFS-What-s-next/m-p/119071#M26392</link>
      <description>&lt;P&gt;This is a common process many go through, and there are many ways to skin the cat here. I prefer the methodology below.&lt;/P&gt;&lt;P&gt;1. Bring in the data with minimal transformation (the "E" and "L"). Depending on the workload this could be Sqoop for simple batch loads, or NiFi for a more modern streaming approach with better control over flow, bi-directional transfer and back pressure.&lt;/P&gt;&lt;P&gt;2. Decide on a transformation strategy and store a higher-level or "enriched" data set, typically in Hive or HBase. Between Atlas and NiFi you should now have some data lineage. Other formatting might take place here, such as converting to native datatypes (dates vs. timestamps). A partitioning strategy would likely be applied at this stage too. Running a data cleansing pass at this phase is also a good idea, as is computing feature vectors.&lt;/P&gt;&lt;P&gt;3. Use Zeppelin + Spark to analyze the data.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Apr 2016 00:40:55 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storage-data-in-HDFS-What-s-next/m-p/119071#M26392</guid>
      <dc:creator>khaslbeck</dc:creator>
      <dc:date>2016-04-28T00:40:55Z</dc:date>
    </item>
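The three phases described above (land raw data, enrich and partition, then analyze) can be sketched in plain Python. This is a toy stand-in for the Sqoop/NiFi, Hive, and Spark tooling the reply names; the record fields, the dedup key, and the month-based partition key are illustrative assumptions, not anything from the thread.

```python
from datetime import datetime

# Phase 1: the "E" and "L" -- land raw records with minimal transformation.
raw = [
    {"id": "1", "event_date": "2016-04-28", "amount": "12.50"},
    {"id": "2", "event_date": "2016-04-27", "amount": "3.10"},
    {"id": "2", "event_date": "2016-04-27", "amount": "3.10"},  # duplicate row
]

# Phase 2: enrich -- convert to native datatypes (date vs. string),
# drop duplicates, and derive a partition key (here: year-month).
def enrich(records):
    seen, out = set(), []
    for r in records:
        key = (r["id"], r["event_date"])
        if key in seen:
            continue  # cleansing pass: skip exact duplicates
        seen.add(key)
        d = datetime.strptime(r["event_date"], "%Y-%m-%d").date()
        out.append({
            "id": r["id"],
            "event_date": d,           # native date, not a string
            "amount": float(r["amount"]),
            "partition": d.strftime("%Y-%m"),
        })
    return out

# Phase 3: analyze -- a simple aggregate over the enriched set.
enriched = enrich(raw)
total = sum(r["amount"] for r in enriched)
```

In a real cluster each phase would write to its own storage layer (raw zone, enriched Hive/HBase tables partitioned by the derived key, then notebooks on top); the sketch only shows the shape of the data at each step.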
    <item>
      <title>Re: Storage data in HDFS - What's next?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storage-data-in-HDFS-What-s-next/m-p/119072#M26393</link>
      <description>&lt;P&gt;Hi Kirk, thank you for your brilliant response. So the data cleansing strategy happens with Hive and Impala, and only then do we use Spark for analysis.

Thanks! &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 28 Apr 2016 03:30:35 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storage-data-in-HDFS-What-s-next/m-p/119072#M26393</guid>
      <dc:creator>prodgers125</dc:creator>
      <dc:date>2016-04-28T03:30:35Z</dc:date>
    </item>
    <item>
      <title>Re: Storage data in HDFS - What's next?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storage-data-in-HDFS-What-s-next/m-p/119073#M26394</link>
      <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/2985/antonio-scp125.html" nodeid="2985"&gt;@Pedro Alves&lt;/A&gt;&lt;/P&gt;&lt;P&gt;You can also use Spark for data cleansing and transformation. The pro is to use the same tool for data preparation, discovery and analysis/ML.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Apr 2016 04:01:36 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storage-data-in-HDFS-What-s-next/m-p/119073#M26394</guid>
      <dc:creator>ahadjidj</dc:creator>
      <dc:date>2016-04-28T04:01:36Z</dc:date>
    </item>
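The point above is that cleansing and transformation can live in the same tool you later analyze with. A minimal pure-Python sketch of such a combined cleansing pass (standing in for a Spark job; the column names and cleansing rules are assumptions for illustration):

```python
# Cleansing + transformation in a single pass, mirroring the
# "one tool for preparation, discovery and analysis" idea.
rows = [
    {"name": " Alice ", "age": "34"},
    {"name": "Bob", "age": None},      # incomplete record
    {"name": " Alice ", "age": "34"},  # duplicate of the first row
]

def clean(rows):
    out, seen = [], set()
    for r in rows:
        if r["age"] is None:
            continue                   # drop records with missing values
        name = r["name"].strip()       # normalize whitespace
        key = (name, r["age"])
        if key in seen:
            continue                   # drop exact duplicates
        seen.add(key)
        out.append({"name": name, "age": int(r["age"])})  # cast to native type
    return out

cleaned = clean(rows)
```

With Spark the same rules would be DataFrame operations (filters, `dropDuplicates`, casts), so the prepared data stays in the engine you then explore and model with.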
    <item>
      <title>Re: Storage data in HDFS - What's next?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storage-data-in-HDFS-What-s-next/m-p/119074#M26395</link>
      <description>&lt;P&gt;Hi Abdelkrim, thanks for your response.

In this case I don't have much knowledge about the source data, so what I'm thinking is:
-&amp;gt; Put the data in HDFS
-&amp;gt; Get to know the data with Hive and Impala (simple queries, and create some new tables for segmentation)
-&amp;gt; Apply some analysis with Spark to identify patterns in the data

In your opinion, is this a good plan? :)

Thanks!&lt;/P&gt;</description>
      <pubDate>Thu, 28 Apr 2016 15:52:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storage-data-in-HDFS-What-s-next/m-p/119074#M26395</guid>
      <dc:creator>prodgers125</dc:creator>
      <dc:date>2016-04-28T15:52:10Z</dc:date>
    </item>
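The segmentation step in the plan above (group records by a profile attribute, then look for patterns per segment) can be sketched in plain Python. The `profile` and `spend` fields are invented for illustration; a query engine would express the same thing as a GROUP BY.

```python
from collections import defaultdict

# Segmentation: bucket customers by an assumed "profile" field.
customers = [
    {"id": 1, "profile": "premium", "spend": 120.0},
    {"id": 2, "profile": "basic",   "spend": 20.0},
    {"id": 3, "profile": "premium", "spend": 95.0},
]

segments = defaultdict(list)
for c in customers:
    segments[c["profile"]].append(c)

# A tiny "pattern" per segment: average spend.
avg_spend = {p: sum(c["spend"] for c in cs) / len(cs)
             for p, cs in segments.items()}
```

In the proposed workflow the segment tables would be created with Hive/Impala queries, and Spark would then compute richer statistics or models over each segment.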
  </channel>
</rss>

