
Load MillionSongsSubset data in Pig

Solved


New Contributor

Hi All,

I have downloaded the Million Song Subset data from http://static.echonest.com/millionsongsubset_full.tar.gz, uploaded it to HDFS, and tried to print a sample:

songs = LOAD '/user/root/datasets/millionsongsubset_full.tar.gz';
songs_limit = LIMIT songs 10;
DUMP songs_limit;

The records come out garbled, as shown below. Please suggest how to load the downloaded data in the right format.

grunt> DUMP songs_limit;
2017-03-30 05:03:24,383 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
2017-03-30 05:03:24,458 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2017-03-30 05:03:24,459 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2017-03-30 05:03:24,474 [main] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 1
2017-03-30 05:03:24,479 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2017-03-30 05:03:24,507 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2017-03-30 05:03:24,507 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2017-03-30 05:03:24,524 [main] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor [.gz]
2017-03-30 05:03:24,609 [main] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved output of task 'attempt__0001_m_000001_1' to hdfs://sandbox.technocrafty:8020/tmp/temp-607255022/tmp-1815565156/_temporary/0/task__0001_m_000001
2017-03-30 05:03:24,646 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2017-03-30 05:03:24,655 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2017-03-30 05:03:24,656 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(MillionSongSubset/0000755000175000017500000000000011516357374014450 5ustar  thierrythierryMillionSongSubset/AdditionalFiles/0000755000175000017500000000000011516366075017501 5ustar  thierrythierryMillionSongSubset/AdditionalFiles/subset_unique_tracks.txt0000644000175000017500000317201311516365717024515 0ustar  thierrythierryTRAAAAW128F429D538<SEP>SOMZWCG12A8C13C480<SEP>Casual<SEP>I Didn't Mean To)
(TRAAABD128F429CF47<SEP>SOCIWDW12A8C13D406<SEP>The Box Tops<SEP>Soul Deep)
(TRAAADZ128F9348C2E<SEP>SOXVLOJ12AB0189215<SEP>Sonora Santanera<SEP>Amor De Cabaret)
(TRAAAEF128F4273421<SEP>SONHOTT12A8C13493C<SEP>Adam Ant<SEP>Something Girls)
(TRAAAFD128F92F423A<SEP>SOFSOCN12A8C143F5D<SEP>Gob<SEP>Face the Ashes)
(TRAAAMO128F1481E7F<SEP>SOYMRWW12A6D4FAB14<SEP>Jeff And Sheri Easter<SEP>The Moon And I (Ordinary Day Album Version))
(TRAAAMQ128F1460CD3<SEP>SOMJBYD12A6D4F8557<SEP>Rated R<SEP>Keepin It Real (Skit))
(TRAAAPK128E0786D96<SEP>SOHKNRJ12A6701D1F8<SEP>Tweeterfriendly Music<SEP>Drop of Rain)
(TRAAARJ128F9320760<SEP>SOIAZJW12AB01853F1<SEP>Planet P Project<SEP>Pink World)
(TRAAAVG12903CFA543<SEP>SOUDSGM12AC9618304<SEP>Clp<SEP>Insatiable (Instrumental Version))

Thanks !!

1 ACCEPTED SOLUTION

Re: Load MillionSongsSubset data in Pig

Guru

Unfortunately the dataset is not in a simple field-delimited format, i.e., where each line is a record consisting of fields separated by a delimiter such as a comma, pipe, or tab. If it were, you could define the delimiter on LOAD with USING PigStorage('delim'), where delim would be an actual delimiter like , or | or \t.
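
For example, if you had a tab-delimited file, the load would look like this (the path and schema here are purely illustrative):

songs = LOAD '/user/root/datasets/songs.tsv' USING PigStorage('\t')
    AS (track_id:chararray, artist:chararray, title:chararray, duration:double);
songs_limit = LIMIT songs 10;
DUMP songs_limit;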

The million song data is structured in the HDF5 format, a complex hierarchical format holding both metadata and field data. See https://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/FileSchema.pdf

You need to use a wrapper API to work with it; the dataset's site links to wrapper code in several languages, Python among them.

In your case, you would need to use the wrapper API to iterate the data and output it into a delimited format. Then you could load it into Pig as described above.
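
As a rough illustration, here is a minimal sketch of such a conversion using the h5py Python package (rather than the dataset's own wrapper). The group and field names below (metadata/songs, analysis/songs, artist_name, title, track_id, duration) are taken from the file schema linked above and should be verified against your files before relying on them:

# convert_msd.py -- minimal sketch: walk the extracted MillionSongSubset
# directory, read each .h5 song file with h5py, and write one
# tab-delimited line per song. Group and field names are assumptions
# based on FileSchema.pdf; verify them against your own files.
import os
import h5py

DATA_DIR = "MillionSongSubset/data"   # path after un-tarring the download
OUT_FILE = "songs.tsv"

with open(OUT_FILE, "w") as out:
    for root, _dirs, files in os.walk(DATA_DIR):
        for name in files:
            if not name.endswith(".h5"):
                continue
            with h5py.File(os.path.join(root, name), "r") as h5:
                meta = h5["metadata"]["songs"][0]      # compound row of song metadata
                analysis = h5["analysis"]["songs"][0]  # compound row of audio analysis
                track_id = analysis["track_id"].decode()
                artist = meta["artist_name"].decode()
                title = meta["title"].decode()
                duration = analysis["duration"]
                out.write("%s\t%s\t%s\t%s\n" % (track_id, artist, title, duration))

After copying songs.tsv into HDFS, you could load it with USING PigStorage('\t') exactly as in the example above.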

In addition to the links above, the dataset's FAQ is generally useful: https://labrosa.ee.columbia.edu/millionsong/faq

