<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Load MillionSongsSubset data in Pig in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-MillionSongsSubset-data-in-Pig/m-p/180056#M58526</link>
    <description>&lt;P&gt;Hi All,&lt;/P&gt;&lt;P&gt;I have downloaded the MillionSongSubset data from &lt;A href="http://static.echonest.com/millionsongsubset_full.tar.gz"&gt;http://static.echonest.com/millionsongsubset_full.tar.gz&lt;/A&gt; and tried to load it and print a sample:&lt;/P&gt;&lt;PRE&gt;songs = LOAD '/user/root/datasets/millionsongsubset_full.tar.gz';
songs_limit = LIMIT songs 10;
DUMP songs_limit;&lt;/PRE&gt;&lt;P&gt;The records are displayed as below. Please suggest how to load the downloaded data in the right format.&lt;/P&gt;&lt;PRE&gt;grunt&amp;gt; DUMP songs_limit;
2017-03-30 05:03:24,383 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
2017-03-30 05:03:24,458 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2017-03-30 05:03:24,459 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2017-03-30 05:03:24,474 [main] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 1
2017-03-30 05:03:24,479 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2017-03-30 05:03:24,507 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2017-03-30 05:03:24,507 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2017-03-30 05:03:24,524 [main] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor [.gz]
2017-03-30 05:03:24,609 [main] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved output of task 'attempt__0001_m_000001_1' to hdfs://sandbox.technocrafty:8020/tmp/temp-607255022/tmp-1815565156/_temporary/0/task__0001_m_000001
2017-03-30 05:03:24,646 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2017-03-30 05:03:24,655 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2017-03-30 05:03:24,656 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(MillionSongSubset/0000755000175000017500000000000011516357374014450 5ustar  thierrythierryMillionSongSubset/AdditionalFiles/0000755000175000017500000000000011516366075017501 5ustar  thierrythierryMillionSongSubset/AdditionalFiles/subset_unique_tracks.txt0000644000175000017500000317201311516365717024515 0ustar  thierrythierryTRAAAAW128F429D538&amp;lt;SEP&amp;gt;SOMZWCG12A8C13C480&amp;lt;SEP&amp;gt;Casual&amp;lt;SEP&amp;gt;I Didn't Mean To)
(TRAAABD128F429CF47&amp;lt;SEP&amp;gt;SOCIWDW12A8C13D406&amp;lt;SEP&amp;gt;The Box Tops&amp;lt;SEP&amp;gt;Soul Deep)
(TRAAADZ128F9348C2E&amp;lt;SEP&amp;gt;SOXVLOJ12AB0189215&amp;lt;SEP&amp;gt;Sonora Santanera&amp;lt;SEP&amp;gt;Amor De Cabaret)
(TRAAAEF128F4273421&amp;lt;SEP&amp;gt;SONHOTT12A8C13493C&amp;lt;SEP&amp;gt;Adam Ant&amp;lt;SEP&amp;gt;Something Girls)
(TRAAAFD128F92F423A&amp;lt;SEP&amp;gt;SOFSOCN12A8C143F5D&amp;lt;SEP&amp;gt;Gob&amp;lt;SEP&amp;gt;Face the Ashes)
(TRAAAMO128F1481E7F&amp;lt;SEP&amp;gt;SOYMRWW12A6D4FAB14&amp;lt;SEP&amp;gt;Jeff And Sheri Easter&amp;lt;SEP&amp;gt;The Moon And I (Ordinary Day Album Version))
(TRAAAMQ128F1460CD3&amp;lt;SEP&amp;gt;SOMJBYD12A6D4F8557&amp;lt;SEP&amp;gt;Rated R&amp;lt;SEP&amp;gt;Keepin It Real (Skit))
(TRAAAPK128E0786D96&amp;lt;SEP&amp;gt;SOHKNRJ12A6701D1F8&amp;lt;SEP&amp;gt;Tweeterfriendly Music&amp;lt;SEP&amp;gt;Drop of Rain)
(TRAAARJ128F9320760&amp;lt;SEP&amp;gt;SOIAZJW12AB01853F1&amp;lt;SEP&amp;gt;Planet P Project&amp;lt;SEP&amp;gt;Pink World)
(TRAAAVG12903CFA543&amp;lt;SEP&amp;gt;SOUDSGM12AC9618304&amp;lt;SEP&amp;gt;Clp&amp;lt;SEP&amp;gt;Insatiable (Instrumental Version))&lt;/PRE&gt;&lt;P&gt;Thanks !!&lt;/P&gt;</description>
    <pubDate>Thu, 30 Mar 2017 12:17:45 GMT</pubDate>
    <dc:creator>shalini_goel</dc:creator>
    <dc:date>2017-03-30T12:17:45Z</dc:date>
    <item>
      <title>Load MillionSongsSubset data in Pig</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-MillionSongsSubset-data-in-Pig/m-p/180056#M58526</link>
      <description>&lt;P&gt;Hi All,&lt;/P&gt;&lt;P&gt;I have downloaded the MillionSongSubset data from &lt;A href="http://static.echonest.com/millionsongsubset_full.tar.gz"&gt;http://static.echonest.com/millionsongsubset_full.tar.gz&lt;/A&gt; and tried to load it and print a sample:&lt;/P&gt;&lt;PRE&gt;songs = LOAD '/user/root/datasets/millionsongsubset_full.tar.gz';
songs_limit = LIMIT songs 10;
DUMP songs_limit;&lt;/PRE&gt;&lt;P&gt;The records are displayed as below. Please suggest how to load the downloaded data in the right format.&lt;/P&gt;&lt;PRE&gt;grunt&amp;gt; DUMP songs_limit;
2017-03-30 05:03:24,383 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
2017-03-30 05:03:24,458 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2017-03-30 05:03:24,459 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2017-03-30 05:03:24,474 [main] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 1
2017-03-30 05:03:24,479 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2017-03-30 05:03:24,507 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2017-03-30 05:03:24,507 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2017-03-30 05:03:24,524 [main] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor [.gz]
2017-03-30 05:03:24,609 [main] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved output of task 'attempt__0001_m_000001_1' to hdfs://sandbox.technocrafty:8020/tmp/temp-607255022/tmp-1815565156/_temporary/0/task__0001_m_000001
2017-03-30 05:03:24,646 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2017-03-30 05:03:24,655 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2017-03-30 05:03:24,656 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(MillionSongSubset/0000755000175000017500000000000011516357374014450 5ustar  thierrythierryMillionSongSubset/AdditionalFiles/0000755000175000017500000000000011516366075017501 5ustar  thierrythierryMillionSongSubset/AdditionalFiles/subset_unique_tracks.txt0000644000175000017500000317201311516365717024515 0ustar  thierrythierryTRAAAAW128F429D538&amp;lt;SEP&amp;gt;SOMZWCG12A8C13C480&amp;lt;SEP&amp;gt;Casual&amp;lt;SEP&amp;gt;I Didn't Mean To)
(TRAAABD128F429CF47&amp;lt;SEP&amp;gt;SOCIWDW12A8C13D406&amp;lt;SEP&amp;gt;The Box Tops&amp;lt;SEP&amp;gt;Soul Deep)
(TRAAADZ128F9348C2E&amp;lt;SEP&amp;gt;SOXVLOJ12AB0189215&amp;lt;SEP&amp;gt;Sonora Santanera&amp;lt;SEP&amp;gt;Amor De Cabaret)
(TRAAAEF128F4273421&amp;lt;SEP&amp;gt;SONHOTT12A8C13493C&amp;lt;SEP&amp;gt;Adam Ant&amp;lt;SEP&amp;gt;Something Girls)
(TRAAAFD128F92F423A&amp;lt;SEP&amp;gt;SOFSOCN12A8C143F5D&amp;lt;SEP&amp;gt;Gob&amp;lt;SEP&amp;gt;Face the Ashes)
(TRAAAMO128F1481E7F&amp;lt;SEP&amp;gt;SOYMRWW12A6D4FAB14&amp;lt;SEP&amp;gt;Jeff And Sheri Easter&amp;lt;SEP&amp;gt;The Moon And I (Ordinary Day Album Version))
(TRAAAMQ128F1460CD3&amp;lt;SEP&amp;gt;SOMJBYD12A6D4F8557&amp;lt;SEP&amp;gt;Rated R&amp;lt;SEP&amp;gt;Keepin It Real (Skit))
(TRAAAPK128E0786D96&amp;lt;SEP&amp;gt;SOHKNRJ12A6701D1F8&amp;lt;SEP&amp;gt;Tweeterfriendly Music&amp;lt;SEP&amp;gt;Drop of Rain)
(TRAAARJ128F9320760&amp;lt;SEP&amp;gt;SOIAZJW12AB01853F1&amp;lt;SEP&amp;gt;Planet P Project&amp;lt;SEP&amp;gt;Pink World)
(TRAAAVG12903CFA543&amp;lt;SEP&amp;gt;SOUDSGM12AC9618304&amp;lt;SEP&amp;gt;Clp&amp;lt;SEP&amp;gt;Insatiable (Instrumental Version))&lt;/PRE&gt;&lt;P&gt;Thanks !!&lt;/P&gt;</description>
      <pubDate>Thu, 30 Mar 2017 12:17:45 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-MillionSongsSubset-data-in-Pig/m-p/180056#M58526</guid>
      <dc:creator>shalini_goel</dc:creator>
      <dc:date>2017-03-30T12:17:45Z</dc:date>
    </item>
    <item>
      <title>Re: Load MillionSongsSubset data in Pig</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-MillionSongsSubset-data-in-Pig/m-p/180057#M58527</link>
      <description>&lt;P&gt;Unfortunately, the dataset is not in a simple field-delimited format, i.e., one where each line is a record consisting of fields separated by a delimiter such as a comma, pipe, or tab. If it were, you could define the delimiter on LOAD with USING PigStorage('&lt;EM&gt;delim&lt;/EM&gt;'), where &lt;EM&gt;delim&lt;/EM&gt; would be an actual delimiter like , or | or \t.&lt;/P&gt;&lt;P&gt;The Million Song data is stored in the HDF5 format, a complex hierarchical structure containing both metadata and field data. See &lt;A href="https://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/FileSchema.pdf" target="_blank"&gt;https://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/FileSchema.pdf&lt;/A&gt;&lt;/P&gt;&lt;P&gt;You need to use a wrapper API to work with it:&lt;/P&gt;&lt;UL&gt;
&lt;LI&gt;&lt;A href="https://labrosa.ee.columbia.edu/millionsong/pages/hdf-what" target="_blank"&gt;https://labrosa.ee.columbia.edu/millionsong/pages/hdf-what&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;&lt;A href="https://support.hdfgroup.org/downloads/" target="_blank"&gt;https://support.hdfgroup.org/downloads/&lt;/A&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;In your case, you would need to use the wrapper API to iterate over the data and write it out in a delimited format. Then you could load it into Pig as described above.&lt;/P&gt;&lt;P&gt;In addition to the links above, this page is generally useful for your dataset: &lt;A href="https://labrosa.ee.columbia.edu/millionsong/faq" target="_blank"&gt;https://labrosa.ee.columbia.edu/millionsong/faq&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 30 Mar 2017 19:28:38 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-MillionSongsSubset-data-in-Pig/m-p/180057#M58527</guid>
      <dc:creator>gkeys</dc:creator>
      <dc:date>2017-03-30T19:28:38Z</dc:date>
    </item>
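As a follow-up to the advice above about writing the data out in a delimited format: the text files in the archive, such as subset_unique_tracks.txt, separate fields with a multi-character SEP token, while PigStorage() accepts only a single-character delimiter. A minimal, hypothetical Python sketch (the field layout is assumed from the DUMP output in the question) that rewrites one such record as a tab-separated line:

```python
# Hypothetical preprocessing step: convert one SEP-delimited record
# into a tab-separated line that PigStorage('\t') can load.
# The delimiter is the literal string "SEP" wrapped in angle
# brackets; chr(60) is the opening angle bracket character.
SEP = chr(60) + "SEP>"

def to_tsv_line(line):
    """Split one record on the SEP token and re-join its fields with tabs."""
    return "\t".join(line.rstrip("\n").split(SEP))

# First record from the DUMP output in the question:
record = SEP.join(["TRAAAAW128F429D538", "SOMZWCG12A8C13C480",
                   "Casual", "I Didn't Mean To"])
print(to_tsv_line(record))
```

After a pass like this over the whole file, the result could be loaded along the lines of: songs = LOAD 'subset_unique_tracks.tsv' USING PigStorage('\t'); (path and alias are illustrative only).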
  </channel>
</rss>