Created 01-22-2017 06:23 PM
We have a situation where we have lots of small XML files residing on a Unix NAS, with their associated metadata in an Oracle DB.
We want to combine the XML files that are 3 months old, together with their associated metadata, into one archive file (about 10 GB) and store it in Hadoop.
What is the best way to implement this in Hadoop? Note that after creating one big archive, there will be many small files (each 1 MB or less) inside the archive, so I was thinking I might reduce the block size to 32 MB, for example.
I have read about Hadoop archive (.har) files and about storing the data in HBase.
I would like to know the pros/cons from the Hadoop community's experience, and what the recommended practice is for such situations.
Can you please advise?
Also, what about reducing the HDFS block size to 32 MB to cater to this requirement? How does that look?
I want to read this data from Hadoop whenever needed without affecting performance.
Thanks in advance.
Created 01-23-2017 05:57 AM
I also answered your other post on changing the block size and why you should refrain from doing so, so here I will simply address the other ways you can overcome this small-file problem.
The primary questions that need to be asked when picking a data archiving strategy are how often the archived data will be read back and how quickly it needs to be returned when it is.
The answers to these questions will lead you to figure out whether, from a hardware perspective, you need low-density spinning disks or some kind of SSDs, and whether you are going to put this data into HBase (memory intensive) or just plain old files.
You put data into HBase when you have very stringent SLAs - like sub-second response - and have the luxury of a cluster with lots of nodes and high memory (RAM); that doesn't seem to be the case from your explanation above.
So here are my two suggestions (in order of preference):
Let me know if you have further questions.
Lastly, if you find this answer to be helpful, please upvote and accept my answer. Thank you!!
Created 01-29-2017 03:46 PM
Thanks so much for answering; I think I am closer to the answer.
Please elaborate on the solutions you advised below, as I am not very familiar with Hive.
FYI below:
Please advise on how to store and read archived data from Hive.
While storing data in Hive, should I save it as .har in HDFS?
Our application generates small XML files which are stored on the NAS, with the XML's associated metadata in the DB.
The plan is to extract the metadata from the DB into one file and compress the XML into one huge archive of, say, 10 GB; suppose each archive is 10 GB and the data is 3 months old.
So I wanted to know the best solution for storing and accessing this archived data in Hadoop --> HDFS/Hive/HBase.
Please advise what you think will be the better approach for reading this archived data.
Suppose I am storing this archived data in Hive - how do I retrieve it?
Please guide me on storing archived data in Hive,
and also on retrieving/reading archived data from Hive when needed.
Created 02-01-2017 04:12 PM
Hive is very similar to a database design, so as a first step you can create a Hive table using syntax like this (in its simplest form):
CREATE TABLE table_name (id INT, name STRING) PARTITIONED BY (load_date STRING);
There are many variants you can add to this table creation, such as where it is stored, how it is delimited, etc., but in my opinion keep it simple first and then you can expand your mastery. This link (the one that I always refer to) talks in detail about the syntax (for DDL operations), the different options, etc. - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
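For example, here is a minimal sketch of the same table with a couple of those variants filled in; the delimiter, file format and HDFS location are just illustrative assumptions, not recommendations:
CREATE TABLE table_name (id INT, name STRING)
PARTITIONED BY (load_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','    -- assumes comma-delimited text records
STORED AS TEXTFILE                               -- could equally be ORC, AVRO, etc.
LOCATION '/data/archive/table_name';             -- hypothetical HDFS path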
Once you have got this taken care of, you can then start inserting data into Hive. The different options available for this are explained in the DML documentation - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
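As a rough illustration (the staging path, partition value and staging_table below are hypothetical placeholders), loading data into that table could look like:
LOAD DATA INPATH '/staging/archive_extract.txt'
INTO TABLE table_name PARTITION (load_date='2017-01-01');    -- moves the file from the staging path into the table's partition

INSERT INTO TABLE table_name PARTITION (load_date='2017-01-01')
SELECT id, name FROM staging_table;                          -- or populate the partition from another table/query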
So these two links will be a good starting point for getting closer to Hive in general.
Then, specifically for your question on loading XML data: you can either load the whole XML file as a single column and read it with the XPath UDFs at read time, or break each XML tag out into a separate column at write time. I will go through both of those options here in a little detail:
Writing XML data as a single column: you can simply create a table like
CREATE TABLE xmlfiles (id int, xmlfile string)
and then put the entire XML content into the string column. Then at the time of reading, you can use the XPath UDFs (user-defined functions that come along with Hive) to read the data. Details here - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+XPathUDF
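For instance, a read might look like the query below; the XPath expressions are made-up placeholders for whatever structure your XML actually has:
SELECT id,
       xpath_string(xmlfile, '/record/customer/name') AS customer_name
FROM xmlfiles
WHERE xpath_int(xmlfile, '/record/year') = 2016;    -- xpath_string / xpath_int are built-in Hive XPath UDFs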
This approach makes it easy to write the data, but it may have some performance overhead at the time of reading (as well as limitations on doing some aggregates on the result set).
Writing XML data as columnar values into Hive: This approach is a little more drawn out at the time of writing data, but easier and more flexible for read operations.
Here you first convert your XML data into either Avro or JSON and then use one of the SerDes (serializer/deserializer) to write the data to Hive. This will give you some context - https://community.hortonworks.com/repos/30883/hive-json-serde.html
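As a rough sketch, assuming you convert each XML record into one JSON document per line and use the JSON SerDe bundled with Hive's HCatalog (a third-party SerDe such as the one in the link above works along the same lines), the table could look like:
CREATE TABLE xml_as_json (
  id   INT,
  name STRING,
  city STRING                                   -- columns mirror the fields of your converted JSON (assumed names)
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;

-- each line of the underlying file is then one JSON document, e.g.
-- {"id": 1, "name": "example", "city": "Austin"}
With this layout each tag becomes a real column, so you can filter and aggregate on it directly without XPath at read time.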
Hope this makes sense.
If you find this answer helpful, please 'Accept' my initial answer above