Created 01-25-2017 02:34 PM
Hi everybody, I would like to know whether HDFS and HAR files can be used as a real electronic archive that is revision safe. Does anybody have practical experience with that? Any input would be welcome, because I found very little information on this topic. Thanks
Created 01-25-2017 04:03 PM
Hi @Alexander Lösel, can you expand on what you mean by "revision safe"? If you want read-only access for users on those files, you can specify that within Ranger.
You can manually set HDFS ACL permissions via the command line, but Ranger is the way to go if you're planning access in a multi-tenant environment.
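For the command-line route, here is a minimal sketch that only builds the `hdfs dfs -setfacl -m` argument list for a read-only ACL entry. The path `/archive/2017` and the user `alice` are made-up examples, the helper name is hypothetical, and it assumes ACLs are enabled on the NameNode. It does not run the command, so you can inspect it first.

```python
import shlex


def build_setfacl_command(path, user, perms="r--", recursive=False):
    """Build the argument list for `hdfs dfs -setfacl -m user:<user>:<perms> <path>`.

    perms is a three-character string such as "r--" or "r-x".
    """
    allowed = ("r-", "w-", "x-")
    if len(perms) != 3 or any(c not in ok for c, ok in zip(perms, allowed)):
        raise ValueError(f"invalid permission string: {perms!r}")
    cmd = ["hdfs", "dfs", "-setfacl"]
    if recursive:
        cmd.append("-R")  # apply to everything below the path as well
    cmd += ["-m", f"user:{user}:{perms}", path]
    return cmd


# Hypothetical example: read and list access (no write) on an archive directory.
cmd = build_setfacl_command("/archive/2017", "alice", "r-x", recursive=True)
print(shlex.join(cmd))
```

On directories you normally need `r-x` rather than `r--`, because listing and traversing a directory requires the execute bit. To actually apply it, you could pass the list to `subprocess.run(cmd, check=True)` on a machine with the Hadoop client installed.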
Created 01-25-2017 04:15 PM
Hi Ameet,
thanks for your quick reply!
By revision safe I mean several aspects, for example:
- changes have to be tracked and also the original record/archive has to be preserved when overwritten (this could be done through an additional application layer) but:
- also deletion of an HAR file must be tracked in protocols that are also safe against manipulation
- it must be possible to preserve the deletion of archives for a defined period of time (for example 10 years)
If much of that functionality has to be done in an additional application, then the communication between that application and HDFS (HAR) has to be restricted so that no other application could do that.
I know this question is very general in its formulation, but I want to get an idea of what HAR can do and what it can't.
Thanks,
Alex
Created 01-25-2017 04:10 PM
HDFS is a platform to store data, while HAR is a file format (I am assuming you mean the HTTP Archive format here, not Hadoop Archives, which share the abbreviation). You can store HAR files in HDFS. As for version safety, the answer is yes, and there are a couple of ways to do it. If your HAR files are small, then maybe use HBase, as it offers a built-in mechanism to store versioned data (it will not be in HAR format anymore, but that shouldn't matter).
If you must store HAR files, then you can store them in HDFS, but make sure subsequent versions of the same HAR file are named differently (for example myarchivev1.har, myarchivev2.har, and so on). Data in HDFS cannot be modified in place, so your versions are safe, but a file can be overwritten by a new file with the same name, so each new file must get a different name to retain the previous version.
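The naming scheme above could be enforced by a small helper in your application layer. This is only a sketch: the function name and the `<base>v<N>.har` pattern are my assumptions based on the myarchivev1.har example, and the list of existing names would come from a directory listing you obtain yourself.

```python
import re

# Matches names like "myarchivev12.har" -> base "myarchive", version 12.
VERSION_PATTERN = re.compile(r"^(?P<base>.+)v(?P<num>\d+)\.har$")


def next_version_name(base, existing_names):
    """Return the next unused versioned file name for the given base name."""
    highest = 0
    for name in existing_names:
        match = VERSION_PATTERN.match(name)
        if match and match.group("base") == base:
            highest = max(highest, int(match.group("num")))
    return f"{base}v{highest + 1}.har"


existing = ["myarchivev1.har", "myarchivev2.har", "otherv7.har"]
print(next_version_name("myarchive", existing))
```

As an extra guard, `hdfs dfs -put` refuses to overwrite an existing destination unless you pass `-f`, so leaving out `-f` in your upload step helps protect earlier versions.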
Last but not least, HAR files contain HTTP session data, which tends not to be too large. Are your HAR files going to be small or large? Do not bring your archive into HDFS if you are going to have small files of around 1 MB. Your ideal minimum file size should be above 100 MB (or at least 64 MB).
If your files are small, then it might make sense to use HBase (remember that the stored format will not be HAR anymore in this case).
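If you stay with HDFS, one way to respect the size guidance is to group small files into batches before landing them. The sketch below is only an illustration of that idea: the function name, the input format (name and size in bytes), and the simple in-order grouping are my assumptions, and how you actually combine each batch into one file is up to your application.

```python
MB = 1024 * 1024


def batch_files(files, min_batch_bytes=64 * MB):
    """Group (name, size_in_bytes) pairs, in order, into batches of at least min_batch_bytes.

    The last batch may be smaller if the remaining files do not reach the minimum.
    """
    batches = []
    current, current_size = [], 0
    for name, size in files:
        current.append(name)
        current_size += size
        if current_size >= min_batch_bytes:
            batches.append(current)
            current, current_size = [], 0
    if current:
        batches.append(current)  # leftover files below the minimum
    return batches


files = [("a.har", 40 * MB), ("b.har", 30 * MB), ("c.har", 10 * MB)]
print(batch_files(files))
```

You might hold back the undersized last batch until more files arrive instead of writing it right away.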
Created 01-25-2017 04:23 PM
Hi @mqureshi, thanks for the quick response
ok. I've some additional questions:
- is it possible to track changes in HDFS or HBase (change, delete, ...)?
- how can user permissions be applied to items stored in HDFS / HBase?
Thanks in advance,
Alex
Created 01-25-2017 07:29 PM
Please see my replies inline below:
- is it possible to track changes in HDFS or HBase (change, delete, ...)?
Do you mean the kind of change tracking a version control system does? Then the answer is no. HBase and HDFS simply give you the ability to store your data. HBase provides a lot of features, including the ability to store multiple versions of the same data. This means you can create an application that tells you what changes occurred between two versions. HBase by itself will not tell you that. The same goes for HDFS.
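To give an idea of what such an application would do, here is a minimal sketch that compares two versions of a text record with Python's standard `difflib`. The function name and the sample records are made up, and reading the two versions out of HBase or HDFS is left out on purpose.

```python
import difflib


def describe_changes(old_text, new_text):
    """Return unified-diff lines describing what changed between two versions."""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="v1",
        tofile="v2",
        lineterm="",
    )
    return list(diff)


# Hypothetical record contents for two stored versions.
version_1 = "customer: 42\nstatus: open\n"
version_2 = "customer: 42\nstatus: closed\n"

for line in describe_changes(version_1, version_2):
    print(line)
```

Lines starting with `-` come only from the older version and lines starting with `+` only from the newer one. For binary content you would need a different comparison, for example checksums per version.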
- how can user permissions be applied to items stored in HDFS / HBase?
This is easy. Use Apache Ranger for comprehensive authorization. End-to-end security with proper authorization can easily be implemented for both HBase and HDFS; look for Ranger support for HDFS and HBase. Beyond that, you can also audit and track the lineage of the data, all the way to the source if you use NiFi to ingest. Without NiFi, you can still track lineage using Ranger and Atlas back to whichever source you are ingesting data into HBase/HDFS from. NiFi, on the other hand, lets you capture data from your web server, so you can track lineage all the way to where the data was created without writing a single line of code (NiFi is also UI based, with a REST API if you need it for your use case).
Created 01-25-2017 04:42 PM
- also deletion of an HAR file must be tracked in protocols that are also safe against manipulation
Ranger has audit capabilities and it integrates with AD/LDAP services.
- it must be possible to preserve the deletion of archives for a defined period of time (for example 10 years)
Within Ranger you can remove users' access to the files, and you can use HDFS for archival (it's pretty good for that ;). If users need to access this "cold" data again, just re-enable the permission within Ranger. From the user's perspective the file was "deleted."
To @mqureshi's point, you'll need to think about the application layer. You don't want to load these small files one at a time into HDFS, and the app you pick can help you enforce some of your requirements. You can also use NiFi to acquire, route, and transform the HAR data prior to landing it in HDFS, so look into that as well.
Hope this helps,