<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Load MillionSongsSubset data in Pig in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-MillionSongsSubset-data-in-Pig/m-p/180056#M58526</link>
    <description>&lt;P&gt;Hi All,&lt;/P&gt;&lt;P&gt;I have downloaded the MillionSongSubset data from &lt;A href="http://static.echonest.com/millionsongsubset_full.tar.gz"&gt;http://static.echonest.com/millionsongsubset_full.tar.gz&lt;/A&gt; and tried to load it and print a sample:&lt;/P&gt;&lt;PRE&gt;songs = LOAD '/user/root/datasets/millionsongsubset_full.tar.gz';
songs_limit = LIMIT songs 10;
DUMP songs_limit;&lt;/PRE&gt;&lt;P&gt;The records are displayed as below. Please suggest how to load the downloaded data in the right format.&lt;/P&gt;&lt;PRE&gt;grunt&amp;gt; DUMP songs_limit;
2017-03-30 05:03:24,383 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
2017-03-30 05:03:24,458 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2017-03-30 05:03:24,459 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2017-03-30 05:03:24,474 [main] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 1
2017-03-30 05:03:24,479 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2017-03-30 05:03:24,507 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2017-03-30 05:03:24,507 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2017-03-30 05:03:24,524 [main] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor [.gz]
2017-03-30 05:03:24,609 [main] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved output of task 'attempt__0001_m_000001_1' to hdfs://sandbox.technocrafty:8020/tmp/temp-607255022/tmp-1815565156/_temporary/0/task__0001_m_000001
2017-03-30 05:03:24,646 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2017-03-30 05:03:24,655 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2017-03-30 05:03:24,656 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(MillionSongSubset/0000755000175000017500000000000011516357374014450 5ustar  thierrythierryMillionSongSubset/AdditionalFiles/0000755000175000017500000000000011516366075017501 5ustar  thierrythierryMillionSongSubset/AdditionalFiles/subset_unique_tracks.txt0000644000175000017500000317201311516365717024515 0ustar  thierrythierryTRAAAAW128F429D538&amp;lt;SEP&amp;gt;SOMZWCG12A8C13C480&amp;lt;SEP&amp;gt;Casual&amp;lt;SEP&amp;gt;I Didn't Mean To)
(TRAAABD128F429CF47&amp;lt;SEP&amp;gt;SOCIWDW12A8C13D406&amp;lt;SEP&amp;gt;The Box Tops&amp;lt;SEP&amp;gt;Soul Deep)
(TRAAADZ128F9348C2E&amp;lt;SEP&amp;gt;SOXVLOJ12AB0189215&amp;lt;SEP&amp;gt;Sonora Santanera&amp;lt;SEP&amp;gt;Amor De Cabaret)
(TRAAAEF128F4273421&amp;lt;SEP&amp;gt;SONHOTT12A8C13493C&amp;lt;SEP&amp;gt;Adam Ant&amp;lt;SEP&amp;gt;Something Girls)
(TRAAAFD128F92F423A&amp;lt;SEP&amp;gt;SOFSOCN12A8C143F5D&amp;lt;SEP&amp;gt;Gob&amp;lt;SEP&amp;gt;Face the Ashes)
(TRAAAMO128F1481E7F&amp;lt;SEP&amp;gt;SOYMRWW12A6D4FAB14&amp;lt;SEP&amp;gt;Jeff And Sheri Easter&amp;lt;SEP&amp;gt;The Moon And I (Ordinary Day Album Version))
(TRAAAMQ128F1460CD3&amp;lt;SEP&amp;gt;SOMJBYD12A6D4F8557&amp;lt;SEP&amp;gt;Rated R&amp;lt;SEP&amp;gt;Keepin It Real (Skit))
(TRAAAPK128E0786D96&amp;lt;SEP&amp;gt;SOHKNRJ12A6701D1F8&amp;lt;SEP&amp;gt;Tweeterfriendly Music&amp;lt;SEP&amp;gt;Drop of Rain)
(TRAAARJ128F9320760&amp;lt;SEP&amp;gt;SOIAZJW12AB01853F1&amp;lt;SEP&amp;gt;Planet P Project&amp;lt;SEP&amp;gt;Pink World)
(TRAAAVG12903CFA543&amp;lt;SEP&amp;gt;SOUDSGM12AC9618304&amp;lt;SEP&amp;gt;Clp&amp;lt;SEP&amp;gt;Insatiable (Instrumental Version))&lt;/PRE&gt;&lt;P&gt;Thanks !!&lt;/P&gt;</description>
    <pubDate>Thu, 30 Mar 2017 12:17:45 GMT</pubDate>
    <dc:creator>shalini_goel</dc:creator>
    <dc:date>2017-03-30T12:17:45Z</dc:date>
    <item>
      <title>Load MillionSongsSubset data in Pig</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-MillionSongsSubset-data-in-Pig/m-p/180056#M58526</link>
      <description>&lt;P&gt;Hi All,&lt;/P&gt;&lt;P&gt;I have downloaded the MillionSongSubset data from &lt;A href="http://static.echonest.com/millionsongsubset_full.tar.gz"&gt;http://static.echonest.com/millionsongsubset_full.tar.gz&lt;/A&gt; and tried to load it and print a sample:&lt;/P&gt;&lt;PRE&gt;songs = LOAD '/user/root/datasets/millionsongsubset_full.tar.gz';
songs_limit = LIMIT songs 10;
DUMP songs_limit;&lt;/PRE&gt;&lt;P&gt;The records are displayed as below. Please suggest how to load the downloaded data in the right format.&lt;/P&gt;&lt;PRE&gt;grunt&amp;gt; DUMP songs_limit;
2017-03-30 05:03:24,383 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
2017-03-30 05:03:24,458 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2017-03-30 05:03:24,459 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2017-03-30 05:03:24,474 [main] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 1
2017-03-30 05:03:24,479 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2017-03-30 05:03:24,507 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2017-03-30 05:03:24,507 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2017-03-30 05:03:24,524 [main] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor [.gz]
2017-03-30 05:03:24,609 [main] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved output of task 'attempt__0001_m_000001_1' to hdfs://sandbox.technocrafty:8020/tmp/temp-607255022/tmp-1815565156/_temporary/0/task__0001_m_000001
2017-03-30 05:03:24,646 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2017-03-30 05:03:24,655 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2017-03-30 05:03:24,656 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(MillionSongSubset/0000755000175000017500000000000011516357374014450 5ustar  thierrythierryMillionSongSubset/AdditionalFiles/0000755000175000017500000000000011516366075017501 5ustar  thierrythierryMillionSongSubset/AdditionalFiles/subset_unique_tracks.txt0000644000175000017500000317201311516365717024515 0ustar  thierrythierryTRAAAAW128F429D538&amp;lt;SEP&amp;gt;SOMZWCG12A8C13C480&amp;lt;SEP&amp;gt;Casual&amp;lt;SEP&amp;gt;I Didn't Mean To)
(TRAAABD128F429CF47&amp;lt;SEP&amp;gt;SOCIWDW12A8C13D406&amp;lt;SEP&amp;gt;The Box Tops&amp;lt;SEP&amp;gt;Soul Deep)
(TRAAADZ128F9348C2E&amp;lt;SEP&amp;gt;SOXVLOJ12AB0189215&amp;lt;SEP&amp;gt;Sonora Santanera&amp;lt;SEP&amp;gt;Amor De Cabaret)
(TRAAAEF128F4273421&amp;lt;SEP&amp;gt;SONHOTT12A8C13493C&amp;lt;SEP&amp;gt;Adam Ant&amp;lt;SEP&amp;gt;Something Girls)
(TRAAAFD128F92F423A&amp;lt;SEP&amp;gt;SOFSOCN12A8C143F5D&amp;lt;SEP&amp;gt;Gob&amp;lt;SEP&amp;gt;Face the Ashes)
(TRAAAMO128F1481E7F&amp;lt;SEP&amp;gt;SOYMRWW12A6D4FAB14&amp;lt;SEP&amp;gt;Jeff And Sheri Easter&amp;lt;SEP&amp;gt;The Moon And I (Ordinary Day Album Version))
(TRAAAMQ128F1460CD3&amp;lt;SEP&amp;gt;SOMJBYD12A6D4F8557&amp;lt;SEP&amp;gt;Rated R&amp;lt;SEP&amp;gt;Keepin It Real (Skit))
(TRAAAPK128E0786D96&amp;lt;SEP&amp;gt;SOHKNRJ12A6701D1F8&amp;lt;SEP&amp;gt;Tweeterfriendly Music&amp;lt;SEP&amp;gt;Drop of Rain)
(TRAAARJ128F9320760&amp;lt;SEP&amp;gt;SOIAZJW12AB01853F1&amp;lt;SEP&amp;gt;Planet P Project&amp;lt;SEP&amp;gt;Pink World)
(TRAAAVG12903CFA543&amp;lt;SEP&amp;gt;SOUDSGM12AC9618304&amp;lt;SEP&amp;gt;Clp&amp;lt;SEP&amp;gt;Insatiable (Instrumental Version))&lt;/PRE&gt;&lt;P&gt;Thanks !!&lt;/P&gt;</description>
      <pubDate>Thu, 30 Mar 2017 12:17:45 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-MillionSongsSubset-data-in-Pig/m-p/180056#M58526</guid>
      <dc:creator>shalini_goel</dc:creator>
      <dc:date>2017-03-30T12:17:45Z</dc:date>
    </item>
    <item>
      <title>Re: Load MillionSongsSubset data in Pig</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-MillionSongsSubset-data-in-Pig/m-p/180057#M58527</link>
      <description>&lt;P&gt;Unfortunately, the dataset is not in a simple field-delimited format, i.e., one where each line is a record consisting of fields separated by a delimiter such as a comma, pipe, or tab. If it were, you could define the delimiter on LOAD with USING PigStorage('&lt;EM&gt;delim&lt;/EM&gt;'), where &lt;EM&gt;delim&lt;/EM&gt; would be an actual delimiter like , or | or \t.&lt;/P&gt;&lt;P&gt;The Million Song data is stored in the HDF5 format, a complex hierarchical structure containing both metadata and field data. See &lt;A href="https://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/FileSchema.pdf" target="_blank"&gt;https://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/FileSchema.pdf&lt;/A&gt;&lt;/P&gt;&lt;P&gt;You need to use a wrapper API to work with it:&lt;/P&gt;&lt;UL&gt;
&lt;LI&gt;&lt;A href="https://labrosa.ee.columbia.edu/millionsong/pages/hdf-what" target="_blank"&gt;https://labrosa.ee.columbia.edu/millionsong/pages/hdf-what&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;&lt;A href="https://support.hdfgroup.org/downloads/" target="_blank"&gt;https://support.hdfgroup.org/downloads/&lt;/A&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;In your case, you would need to use the wrapper API to iterate over the data and write it out in a delimited format. Then you could load it into Pig as described above.&lt;/P&gt;&lt;P&gt;In addition to the links above, this page is generally useful for your dataset: &lt;A href="https://labrosa.ee.columbia.edu/millionsong/faq" target="_blank"&gt;https://labrosa.ee.columbia.edu/millionsong/faq&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 30 Mar 2017 19:28:38 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-MillionSongsSubset-data-in-Pig/m-p/180057#M58527</guid>
      <dc:creator>gkeys</dc:creator>
      <dc:date>2017-03-30T19:28:38Z</dc:date>
    </item>
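As a follow-up to the advice above about writing the data out in a delimited format: the text files in the archive, such as subset_unique_tracks.txt, separate fields with a multi-character SEP token, while PigStorage() accepts only a single-character delimiter. A minimal, hypothetical Python sketch (the field layout is assumed from the DUMP output in the question) that rewrites one such record as a tab-separated line:

```python
# Hypothetical preprocessing step: convert one SEP-delimited record
# into a tab-separated line that PigStorage('\t') can load.
# The delimiter is the literal string "SEP" wrapped in angle
# brackets; chr(60) is the opening angle bracket character.
SEP = chr(60) + "SEP>"

def to_tsv_line(line):
    """Split one record on the SEP token and re-join its fields with tabs."""
    return "\t".join(line.rstrip("\n").split(SEP))

# First record from the DUMP output in the question:
record = SEP.join(["TRAAAAW128F429D538", "SOMZWCG12A8C13C480",
                   "Casual", "I Didn't Mean To"])
print(to_tsv_line(record))
```

After a pass like this over the whole file, the result could be loaded along the lines of: songs = LOAD 'subset_unique_tracks.tsv' USING PigStorage('\t'); (path and alias are illustrative only).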
  </channel>
</rss>