<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Best way to analyze and transform big data in Hadoop in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-analyze-and-transform-big-data-in-Hadoop/m-p/121113#M30793</link>
    <description>&lt;P&gt;NiFi will do that very easily, then you can trigger some Spark jobs to do final processing.&lt;/P&gt;</description>
    <pubDate>Mon, 06 Jun 2016 23:05:33 GMT</pubDate>
    <dc:creator>TimothySpann</dc:creator>
    <dc:date>2016-06-06T23:05:33Z</dc:date>
    <item>
      <title>Best way to analyze and transform big data in Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-analyze-and-transform-big-data-in-Hadoop/m-p/121109#M30789</link>
      <description>I'm planning to analyse some data using Hadoop. I have 200 text files to analyze.
 I'm thinking of:
&lt;UL&gt;&lt;LI&gt;Using Spark to load the data into HDFS (would Pig or Sqoop be better?)&lt;/LI&gt;&lt;LI&gt;Creating the structure in Hive by creating the tables (this first data model will have 200 tables, one table per text file)&lt;/LI&gt;&lt;LI&gt;Loading all the files into Hive&lt;/LI&gt;&lt;LI&gt;Doing some data cleansing with Spark (I will need Spark to read from Hive) to try to reduce the amount of data&lt;/LI&gt;&lt;LI&gt;Creating the new data model in Hive (now with a smaller amount of data after the cleansing in the previous step)&lt;/LI&gt;&lt;LI&gt;Using an analytical tool (like SAS, Tableau, etc.) to do some analytical operations (loading into this tool all the data returned by the previous step)&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;I suspect this will not be the best way to analyze big data. My goal is to end the Hadoop process with a smaller data set that I can successfully integrate into SAS, for example.

What is your opinion?

Many thanks!&lt;/P&gt;</description>
      <pubDate>Sun, 05 Jun 2016 23:21:22 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-analyze-and-transform-big-data-in-Hadoop/m-p/121109#M30789</guid>
      <dc:creator>prodgers125</dc:creator>
      <dc:date>2016-06-05T23:21:22Z</dc:date>
    </item>
    <item>
      <title>Re: Best way to analyze and transform big data in Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-analyze-and-transform-big-data-in-Hadoop/m-p/121110#M30790</link>
      <description>&lt;P&gt;Can you say a little bit more about the text files?  Are they all the same kind of data and format, or different?  How big are the text files in terms of GB and number of rows/columns?&lt;/P&gt;</description>
      <pubDate>Mon, 06 Jun 2016 11:01:45 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-analyze-and-transform-big-data-in-Hadoop/m-p/121110#M30790</guid>
      <dc:creator>paul_boal</dc:creator>
      <dc:date>2016-06-06T11:01:45Z</dc:date>
    </item>
    <item>
      <title>Re: Best way to analyze and transform big data in Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-analyze-and-transform-big-data-in-Hadoop/m-p/121111#M30791</link>
      <description>Hi Paul, thanks for your attention. My goal is to do some social analysis (finding patterns, etc.); that's why I want SAS too.

The subject is the relationships within a company. I have the emails, telephone calls, etc.
What I have:

Five months of data collection (Aug, Sep, Oct, Nov and Dec)
&lt;UL&gt;&lt;LI&gt;Each text file corresponds to a day&lt;/LI&gt;&lt;LI&gt;Each type of communication has a specific ID (for example, email has ID 1, phone has ID 2, etc.)&lt;/LI&gt;&lt;LI&gt;Each line corresponds to an aggregation of multiple communications (grouped by department and by 30-minute interval)&lt;/LI&gt;&lt;LI&gt;The attributes are:
&lt;UL&gt;&lt;LI&gt;Communication ID&lt;/LI&gt;&lt;LI&gt;Time&lt;/LI&gt;&lt;LI&gt;Department&lt;/LI&gt;&lt;LI&gt;Email Code&lt;/LI&gt;&lt;LI&gt;Phone Code&lt;/LI&gt;&lt;LI&gt;Phone Duration&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;One possible line of the text file would be:
1  10:30:87  3  12  1  10:30:22  1  10:45:21 3  12  2  10:30:22 2  12  2  10:30:22 1  12     10:30:22

So as you can see, I can have multiple Communication IDs per line (that's one of my doubts about creating the Hive tables).

The text files are 6 GB in size.

Many thanks for your help, Paul &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt; I hope you can understand the problem. Thanks!&lt;/P&gt;</description>
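The multi-record lines described above are the modelling difficulty. As a minimal sketch (not from the thread), assuming each record is a fixed run of whitespace-separated fields matching the six attributes listed (Communication ID, Time, Department, Email Code, Phone Code, Phone Duration), each line could be pre-split into one row per communication before loading into Hive; the field count and the synthetic sample line are assumptions for illustration.

```python
# Hypothetical sketch: split one aggregated line into per-communication
# records. FIELDS_PER_RECORD is an assumption based on the six attributes
# listed in the post; adjust it to the real file layout.
FIELDS_PER_RECORD = 6

def split_line(line):
    """Split a whitespace-separated line into fixed-width records,
    dropping any trailing partial record."""
    tokens = line.split()
    records = []
    for i in range(0, len(tokens), FIELDS_PER_RECORD):
        chunk = tokens[i:i + FIELDS_PER_RECORD]
        if len(chunk) == FIELDS_PER_RECORD:
            records.append(chunk)
    return records

# Synthetic example with two 6-field records:
for rec in split_line("1 10:30:22 3 12 2 120 2 10:45:21 3 12 1 60"):
    print(rec)
```

Each emitted record then maps cleanly onto one Hive table row, which avoids the variable-length-line problem when defining the table schema.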
      <pubDate>Mon, 06 Jun 2016 19:48:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-analyze-and-transform-big-data-in-Hadoop/m-p/121111#M30791</guid>
      <dc:creator>prodgers125</dc:creator>
      <dc:date>2016-06-06T19:48:01Z</dc:date>
    </item>
    <item>
      <title>Re: Best way to analyze and transform big data in Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-analyze-and-transform-big-data-in-Hadoop/m-p/121112#M30792</link>
      <description>&lt;P&gt;I see different issues here:&lt;/P&gt;&lt;P&gt;a) Splitting up a line with multiple records&lt;/P&gt;&lt;P&gt;If you have multiple communications per line you will need some preprocessing. Hive provides maps and arrays, but they are hard to use in normal SQL.&lt;/P&gt;&lt;P&gt;There are tons of different ways, but my suggestion would be to write a Pig UDF that splits one line into multiple records, potentially adding a column with the line information if you need to group them back together somehow.&lt;/P&gt;&lt;P&gt;&lt;A href="http://stackoverflow.com/questions/11287362/splitting-a-tuple-into-multiple-tuples-in-pig" target="_blank"&gt;http://stackoverflow.com/questions/11287362/splitting-a-tuple-into-multiple-tuples-in-pig&lt;/A&gt;&lt;/P&gt;&lt;P&gt;b) Getting the date from the filename&lt;/P&gt;&lt;P&gt;There are some ways to get at the filename in MapReduce, but it's difficult; MapReduce by definition abstracts filenames away. You have three options there:&lt;/P&gt;&lt;P&gt;1) Use a little Python/Java/shell preprocessing script OUTSIDE Hadoop that adds a field with the date, taken from the filename, to each row of each file. Easy, but not that scalable.&lt;/P&gt;&lt;P&gt;2) Write your own RecordReader.&lt;/P&gt;&lt;P&gt;3) Pig seems to provide a 'tagsource' option that can do the same.&lt;/P&gt;&lt;P&gt;&lt;A href="http://stackoverflow.com/questions/9751480/how-can-i-incorporate-the-current-input-filename-into-my-pig-latin-script" target="_blank"&gt;http://stackoverflow.com/questions/9751480/how-can-i-incorporate-the-current-input-filename-into-my-pig-latin-script&lt;/A&gt;&lt;/P&gt;&lt;P&gt;c) Doing graph analysis&lt;/P&gt;&lt;P&gt;You can use Hive/Pig/Spark for preprocessing, and Spark provides a nice graph API. There are tons of examples out there.&lt;/P&gt;&lt;P&gt;&lt;A href="http://spark.apache.org/graphx/" target="_blank"&gt;http://spark.apache.org/graphx/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Good luck.&lt;/P&gt;</description>
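Option (1), the small preprocessing script run outside Hadoop, could look roughly like the sketch below. The filename pattern (the date as the file's base name, e.g. "2016-08-01.txt") and the tab separator are assumptions for illustration, not from the thread.

```python
import os

# Hedged sketch of option (1): prepend the date, taken from the filename,
# to every row of a daily file before loading it into HDFS/Hive.
# Assumes files are named "YYYY-MM-DD.txt"; adjust the parsing of
# in_path to match the real naming scheme.
def add_date_column(in_path, out_path):
    date = os.path.splitext(os.path.basename(in_path))[0]
    with open(in_path) as src, open(out_path, "w") as dst:
        for row in src:
            dst.write(date + "\t" + row)
```

Running this once per daily file yields rows that carry their own date, so the 200 files can be loaded into a single partitioned Hive table instead of 200 separate ones.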
      <pubDate>Mon, 06 Jun 2016 22:53:44 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-analyze-and-transform-big-data-in-Hadoop/m-p/121112#M30792</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-06-06T22:53:44Z</dc:date>
    </item>
    <item>
      <title>Re: Best way to analyze and transform big data in Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-analyze-and-transform-big-data-in-Hadoop/m-p/121113#M30793</link>
      <description>&lt;P&gt;NiFi will do that very easily, then you can trigger some Spark jobs to do final processing.&lt;/P&gt;</description>
      <pubDate>Mon, 06 Jun 2016 23:05:33 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-analyze-and-transform-big-data-in-Hadoop/m-p/121113#M30793</guid>
      <dc:creator>TimothySpann</dc:creator>
      <dc:date>2016-06-06T23:05:33Z</dc:date>
    </item>
  </channel>
</rss>

