Support Questions

Grg · ‎08-14-2015

Hello 🙂

Using Hadoop commands, I'm able to:

- build a HAR (Hadoop Archive) that is stored in the HDFS cluster,
- list the content of the archive,
- access a file in the archive.

Using Spark, I am only able to read a file from the archive.

Using Spark, is it possible to:

- build a Hadoop Archive to be stored in the HDFS cluster?
- list the content of a Hadoop Archive?

Thanks for your help,

Greg.

Wilfred · ‎08-18-2015

HARs are special and not really a file format. The implementation is part of the FileSystem layer (e.g. listing the files in an archive is done via hdfs dfs -ls har:///har_file.har).
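To make the har: scheme concrete, here is a minimal Python sketch that builds the URI and the listing command quoted above. The helper names (`har_uri`, `list_har_command`, `list_har`) and any paths passed to them are made up for illustration, and actually running the command assumes the `hdfs` binary is on the PATH.

```python
import subprocess


def har_uri(archive_path, inner_path=""):
    """Return a har:/// URI (default filesystem) for an archive or a file inside it."""
    uri = "har:///" + archive_path.strip("/")
    if inner_path:
        uri += "/" + inner_path.strip("/")
    return uri


def list_har_command(archive_path):
    """Return the argv for `hdfs dfs -ls` on the archive's har: URI."""
    return ["hdfs", "dfs", "-ls", har_uri(archive_path)]


def list_har(archive_path):
    """Run the listing and return its raw text output (needs a Hadoop client)."""
    result = subprocess.run(
        list_har_command(archive_path),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Only `list_har` touches the cluster; the other two helpers just assemble strings, so they can be checked without Hadoop installed.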

Why do you want to create HAR files? Using a sequence file or some other container format to store the files might be much easier.

I am not sure that the code in Spark will handle the har: URI.

Wilfred

Grg · ‎08-18-2015

Thanks for your answer Wilfred,

I'm working on a Spark application that may handle not only big data but also plenty of small files. As HDFS is not optimized for handling small files, and as they may hurt NameNode performance, I would like to study the idea of grouping small files into functional Hadoop Archives. And, if possible, I would like to manage those HARs directly from a Spark application rather than from the Hadoop shell...

It may not be possible at all, but it is something I'm interested in. And as Spark is able to read the content of a HAR, maybe it can also create HARs... But there is a lack of documentation around this kind of feature.

Wilfred · ‎08-19-2015

This is not a case of not being documented: a HAR file is created by running a MapReduce job. When accessing it, you use the har: URI and are really just following pointers.
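Since the archive is produced by a MapReduce job, an application can only trigger that job, for instance by shelling out to the `hadoop archive` tool. A rough sketch in Python, assuming the `hadoop` binary is on the PATH; the helper names and any paths passed in are hypothetical:

```python
import subprocess


def build_har_command(archive_name, parent_dir, sources, dest_dir):
    """Return the argv for `hadoop archive`, which launches the MapReduce job."""
    if not archive_name.endswith(".har"):
        raise ValueError("archive name must end with .har")
    return [
        "hadoop", "archive",
        "-archiveName", archive_name,
        "-p", parent_dir,   # parent path; the sources are relative to it
        *sources,
        dest_dir,           # directory in which the archive is created
    ]


def build_har(archive_name, parent_dir, sources, dest_dir):
    """Run the archive job and raise CalledProcessError if it fails."""
    subprocess.run(
        build_har_command(archive_name, parent_dir, sources, dest_dir),
        check=True,
    )
```

This only wraps the command-line tool; it is not a native Spark way of writing HARs.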

I would suggest that you look at sequence files rather than HAR archives. Sequence files are the solution for the issue you are looking at, and they can be created and accessed using the standard API.
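As an illustration of the sequence file route from Spark, here is a small PySpark sketch. It is one possible approach under stated assumptions: the application is written in Python, the small files are text, and `sc` is an existing SparkContext. The function names and directory arguments are hypothetical.

```python
def pack_small_files(sc, src_dir, out_path):
    """Pack every text file under src_dir into one SequenceFile output directory."""
    # wholeTextFiles yields one (file path, file content) pair per input file.
    files = sc.wholeTextFiles(src_dir)
    # Each pair is converted to Hadoop Writables and stored as a key/value record.
    files.saveAsSequenceFile(out_path)


def read_packed_files(sc, out_path):
    """Return an RDD of (original file path, file content) pairs."""
    return sc.sequenceFile(out_path)
```

Keeping the original path as the key means a single file can later be looked up with a filter on the RDD returned by `read_packed_files`. Binary files would need a different reader than `wholeTextFiles`.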

Wilfred

Grg · ‎08-19-2015

Thanks for your suggestions and documentation links Wilfred, I will have a look at sequence files today.

Grg · ‎08-20-2015

Hello,

I still need to dig into this, but I will also check MapFiles, which are indexed SequenceFiles. I'll provide my feedback then 🙂

Greg.

Cloudera Community

Build a HAR (Hadoop Archive) using Spark