What is the best way to store small files in Hadoop and retrieve them when required?

Expert Contributor

We have a situation where lots of small XML files reside on a Unix NAS, with their associated metadata in an Oracle DB.

We want to combine the XML files that are 3 months old, together with their metadata, into one archive file (about 10 GB) and store it in Hadoop.

What is the best way to implement this in Hadoop? Note that after creating one big archive, we will still have many small files inside it (each 1 MB or less), so I am thinking of reducing the block size, for example to 32 MB.

I have read about Hadoop archive (.har) files and about storing the data in HBase.

I would like to know the pros and cons from the Hadoop community's experience, and the recommended practice for situations like this.

Can you please advise?

Also, how does reducing the HDFS block size to 32 MB look as a way to cater to this requirement?

I want to be able to read this data from Hadoop whenever needed without affecting performance.

Thanks in advance.

1 ACCEPTED SOLUTION

Super Collaborator

@ripunjay godhani

I also answered your other post on changing the block size and explained why you should refrain from doing so. So here I will simply address the other ways you can overcome this small-file problem.
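For a rough sense of scale behind the block-size question, here is a small Python sketch that only counts HDFS blocks. It assumes a 128 MB default block size (the Hadoop 2.x default; your cluster may differ) and takes the 10 GB archive and 1 MB file size from the question. It does not model NameNode memory or read performance.

```python
MB = 1024 * 1024
GB = 1024 * MB

def block_count(file_size: int, block_size: int) -> int:
    """Number of HDFS blocks one file occupies (the last block may be partial)."""
    return -(-file_size // block_size)  # integer ceiling division

archive_size = 10 * GB      # one archive, as described in the question
small_file_size = 1 * MB    # typical XML file size, per the question
small_files = archive_size // small_file_size

# One 10 GB archive stored with the assumed 128 MB default block size
blocks_archive_128 = block_count(archive_size, 128 * MB)
# The same archive with the block size lowered to 32 MB
blocks_archive_32 = block_count(archive_size, 32 * MB)
# The same data stored as individual 1 MB files (one block per file)
blocks_loose = small_files * block_count(small_file_size, 128 * MB)

print(blocks_archive_128, blocks_archive_32, blocks_loose)
```

Under these assumptions, lowering the block size to 32 MB quadruples the archive's block count, while storing the XML files individually would create roughly ten thousand files and blocks to track.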

The primary questions to ask when picking a data archive strategy are:

  • How am I going to access this data?
  • How often am I going to access this archived data?
  • Am I going to be bound by stringent SLAs?

The answers to these questions will help you figure out, from a hardware perspective, whether you need low-density spinning disks or SSDs, and whether to put this data into HBase (memory intensive) or keep it as plain old files.

You put data into HBase when you have very stringent SLAs, like sub-second response, and have the luxury of clustering a lot of nodes with high memory (RAM). That doesn't seem to be the case from your explanation above.

So here are my two suggestions (in order of preference):

  1. Put the data into Hive. There are ways to put XML data into Hive. At a quick-and-dirty level you have the xpath UDFs to work on XML data in Hive, or you can package it more luxuriously by converting the XML to Avro and then using a SerDe to map the fields to column names. (Let me know if you want to go over this in more detail and I can help you there.)
  2. Combine a bunch of files, zip them up, and upload the archive to HDFS. This option is good if your access is very cold (once in a while) and you are going to retrieve the files physically (like hadoop fs -get).
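As a minimal sketch of the second suggestion, the Python function below packs every XML file under a directory into a single zip archive using only the standard library. The function name bundle_xml and the directory layout are illustrative assumptions, not something from the original setup.

```python
import zipfile
from pathlib import Path

def bundle_xml(src_dir: str, archive_path: str) -> int:
    """Pack every *.xml file under src_dir into one zip; return the file count."""
    src = Path(src_dir)
    count = 0
    # ZIP_DEFLATED compresses each member; zipfile enables ZIP64 by default,
    # so archives larger than 4 GB are allowed.
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for xml_file in sorted(src.rglob("*.xml")):
            # Store paths relative to src_dir so the archive unpacks cleanly
            zf.write(xml_file, arcname=xml_file.relative_to(src).as_posix())
            count += 1
    return count
```

The resulting archive could then be uploaded with something like hadoop fs -put bundle.zip <hdfs-dir> and fetched back later with hadoop fs -get (the file and directory names here are placeholders).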

Let me know if you have further questions.

Lastly, if you find this answer to be helpful, please upvote and accept my answer. Thank you!!

3 REPLIES

Expert Contributor

Thanks so much for answering. I think I am closer to an answer.

Please elaborate on the solutions below that you advised, as I am not very familiar with Hive:

  1. Put the data into Hive. There are ways to put XML data into Hive. At a quick-and-dirty level you have the xpath UDFs to work on XML data in Hive, or you can package it more luxuriously by converting the XML to Avro and then using a SerDe to map the fields to column names. (Let me know if you want to go over this in more detail and I can help you there.)
  2. Combine a bunch of files, zip them up, and upload the archive to HDFS. This option is good if your access is very cold (once in a while) and you are going to retrieve the files physically (like hadoop fs -get).

FYI, below:

Please advise on how to store and read archived data in Hive.

While storing data in Hive, should I save it as .har in HDFS?

Our application generates small XML files, which are stored on the NAS, with the associated XML metadata in the DB.

The plan is to extract the metadata from the DB into one file and compress the XML into one huge archive. Suppose each archive is 10 GB and the data is 3 months old.

So I wanted to know the best solution for storing and accessing this archived data in Hadoop: HDFS, Hive, or HBase.

Please advise which you think will be the better approach for reading this archived data.

Suppose I store this archived data in Hive. How do I retrieve it?

Please guide me on storing archived data in Hive, and also on retrieving/reading it from Hive when needed.

Super Collaborator

Hive is very similar to a database design - so as a first step you can create a Hive table using syntax like this (in its simplest form):

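-- The partition column is declared only in "partitioned by"; Hive rejects a
-- column that appears both there and in the column list. It is named dt here
-- because date is a reserved word in recent Hive versions.
create table table_name (
  id    int,
  name  string
)
partitioned by (dt string);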

There are many options you can add to this table creation, such as where it is stored, how it is delimited, etc., but in my opinion keep it simple first and then expand as your mastery grows. This link (the one that I always refer to) covers the DDL syntax and its different options in detail - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

Once you have this taken care of, you can start inserting data into Hive. The different options available for this are explained in the DML documentation - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML

These 2 links are a good starting point for getting familiar with Hive in general.

Then, specifically for your question on loading XML data: you can either load the whole XML file as a single column and read it with the xpath UDFs at read time, or break each XML tag out into a separate column at write time. I will go through both of those options here in a little detail:

Writing XML data as a single column: you can simply create a table like

CREATE TABLE xmlfiles (id int, xmlfile string);

and then put the entire XML document into the string column. Then, at read time, you can use the xpath UDFs (user-defined functions that come with Hive) to read the data. Details here - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+XPathUDF

This approach makes it easy to write data, but may have some performance overhead at read time (as well as limitations on running aggregates over the result set).
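One practical detail with the single-column approach: a plain text Hive table treats each line as one row, so each XML document has to fit on a single line. Here is a minimal Python sketch that prepares such a load file for the xmlfiles table, assuming the table uses Hive's default text format (Ctrl-A field delimiter, newline row delimiter) and that the XML files are UTF-8. The function names and the sequential id scheme are illustrative assumptions.

```python
from pathlib import Path

FIELD_SEP = "\x01"  # Hive's default field delimiter for text tables (Ctrl-A)

def to_hive_row(row_id: int, xml_text: str) -> str:
    """Return one text row: id, delimiter, XML collapsed onto a single line."""
    # Collapses every run of whitespace (including newlines) to one space.
    # Note: this also alters whitespace inside text nodes.
    one_line = " ".join(xml_text.split())
    if FIELD_SEP in one_line:
        raise ValueError("XML contains the field delimiter")
    return f"{row_id}{FIELD_SEP}{one_line}"

def write_load_file(xml_dir: str, out_path: str) -> int:
    """Write one row per *.xml file in xml_dir; return the number of rows."""
    files = sorted(Path(xml_dir).glob("*.xml"))
    with open(out_path, "w", encoding="utf-8", newline="\n") as out:
        for row_id, xml_file in enumerate(files, start=1):
            row = to_hive_row(row_id, xml_file.read_text(encoding="utf-8"))
            out.write(row + "\n")
    return len(files)
```

The resulting file could then be loaded with one of the options from the DML documentation (for example LOAD DATA), and individual values pulled out at read time with a UDF such as xpath_string.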

Writing XML data as columnar values into Hive: this approach is a little more drawn out at write time, but easier and more flexible for read operations.

Here you first convert your XML data into either Avro or JSON and then use one of the SerDes (serializer/deserializer) to write the data to Hive. This will give you some context - https://community.hortonworks.com/repos/30883/hive-json-serde.html
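To illustrate the conversion step, here is a minimal Python sketch that flattens one XML record into a single-line JSON object, the kind of input a JSON SerDe typically reads (one object per line). It assumes flat XML where each direct child of the root becomes a field; the tag names in the sample are made up, and all values are kept as strings.

```python
import json
import xml.etree.ElementTree as ET

def xml_to_json_line(xml_text: str) -> str:
    """Flatten the direct children of the root element into one JSON object."""
    root = ET.fromstring(xml_text)
    # Each child tag becomes a key; nested elements and attributes are ignored
    record = {child.tag: (child.text or "").strip() for child in root}
    return json.dumps(record, sort_keys=True)

# Hypothetical sample record, for illustration only
sample = "<record><id>42</id><name>invoice-001</name></record>"
print(xml_to_json_line(sample))  # {"id": "42", "name": "invoice-001"}
```

Real XML with nesting, attributes, or repeated tags would need a richer mapping than this, which is part of why this route takes more work at write time.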

Hope this makes sense.

If you find this answer helpful, please 'Accept' my initial answer above