08-14-2015 07:17 AM - edited 08-14-2015 07:22 AM
Using Hadoop commands, I'm able to create a Hadoop Archive (HAR).
Using Spark, I'm only able to read a file from the archive.
Using Spark, is it possible to create a HAR?
Thanks for your help,
08-18-2015 08:28 PM
HARs are special and not really a file format. The implementation is part of the Hadoop FileSystem layer (e.g. listing files in an archive is done via hdfs dfs -ls har:///har_file.har).
Why do you want to create HAR files? Using a SequenceFile or some other container format to store the files might be much easier.
I am not sure that the code in Spark will handle the har: URI.
08-18-2015 11:46 PM
Thanks for your answer, Wilfred.
I'm working on a Spark application that may handle not only big data but also plenty of small files. As HDFS is not optimized for handling small files, and as this can impact NameNode performance, I would like to explore grouping small files into functional Hadoop Archives. And, if possible, I would like to manage those HARs directly from a Spark application rather than from the Hadoop shell...
It may not be possible at all, but this is something I'm interested in. And since Spark is able to read the content of a HAR, maybe it can also create HARs... But there is a lack of documentation around this kind of feature.
08-19-2015 12:16 AM
This is not a case of missing documentation: a HAR file is created by running a MapReduce job (the hadoop archive command). When accessing it you use the har: URI and are really just following pointers.
I would suggest that you look at sequence files rather than HAR archives. Sequence files are the solution for the issue you are describing, and they can be created and accessed using the standard API.