Is it advisable to use Ceph storage to provide the underlying storage for an HDFS system, or should one steer clear of this?
@Neeraj Sabharwal That is fairly interesting, actually. What made me think there was some sort of official support is that the Ceph docs (albeit for 1.1) describe a process for doing it:
There is also this tutorial from:
It keeps using the word "shim", so I can see where you are coming from on this. Thanks for the guidance!
There isn't really much in the way of Ceph integration. There is a published filesystem client JAR which, if you get it on your classpath, should let you refer to data using ceph:// paths. You also appear to need its native library on the library path, which is a bit trickier.
This comes from the Ceph team, not the Hadoop project, and:
1. I don't know how up to date or in sync it is with recent Hadoop versions.
2. It isn't released or tested by the Hadoop team: we don't know how well it works, or how it fails.
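For reference, wiring that shim in would look something like the following in core-site.xml. The property names and class name here follow the Ceph plugin's own documentation, and should be treated as an assumption rather than something verified against a current Hadoop release; the monitor host is a placeholder:

```xml
<!-- Hedged sketch of core-site.xml entries for the CephFS Hadoop shim.
     Taken from the Ceph plugin docs; not tested or supported by the
     Hadoop project. mon-host:6789 is a placeholder for your monitor. -->
<configuration>
  <property>
    <!-- Tell Hadoop which class backs the ceph:// scheme -->
    <name>fs.ceph.impl</name>
    <value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
  </property>
  <property>
    <!-- Make ceph:// the default filesystem (optional) -->
    <name>fs.default.name</name>
    <value>ceph://mon-host:6789/</value>
  </property>
  <property>
    <!-- Point the shim's native client at the cluster's ceph.conf -->
    <name>ceph.conf.file</name>
    <value>/etc/ceph/ceph.conf</value>
  </property>
</configuration>
```

Even with this in place you would still need the plugin JAR on the classpath and its JNI library on java.library.path, per the point above.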
Filesystems are an interesting topic in Hadoop. The filesystem is a core, critical part of the system: you don't want to lose data. And while there's lots of support for different filesystem implementations in Hadoop (s3n, wasb, ftp, swift, file:), HDFS is the one everything is built and tested against. Object stores (s3, swift) are not real filesystems, and cannot be used in place of HDFS as the direct output of MR, Tez or Spark jobs; and absolutely never to run HBase or Accumulo atop.
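The "direct output" caveat comes down largely to rename semantics: job commit in MR/Tez/Spark moves finished task output into place with a directory rename, which real filesystems do as an atomic metadata operation but object stores can only emulate with a copy-then-delete. A minimal, self-contained sketch of the atomic-rename side (plain java.nio on a local filesystem, nothing Hadoop-specific):

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class RenameCommit {
    public static void main(String[] args) throws Exception {
        // Task attempts write under a temporary directory, then "commit"
        // by renaming the finished file into its final location.
        Path work = Files.createTempDirectory("commit-demo");
        Path attemptDir = Files.createDirectory(work.resolve("_temporary"));
        Path attemptOut = attemptDir.resolve("part-00000");
        Files.write(attemptOut, "results".getBytes(StandardCharsets.UTF_8));

        // On a real filesystem this is a single atomic metadata operation:
        // readers see either the old path or the new one, never a partial copy.
        Path committed = work.resolve("part-00000");
        Files.move(attemptOut, committed, StandardCopyOption.ATOMIC_MOVE);

        System.out.println(Files.exists(committed));
        System.out.println(Files.exists(attemptOut));
        // An object store has no rename primitive; a client shim fakes it
        // with copy + delete, which is O(data) and not atomic -- hence the
        // warning about pointing job output (or HBase) at s3/swift paths.
    }
}
```

How Ceph's shim implements rename is exactly the kind of detail that, per the points above, nobody on the Hadoop side has tested.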
I don't know where Ceph fits in here. It's probably safe to use as a source of data; it's as the destination that the differences usually show up.
Finally: HDP is not tested on Ceph, so it cannot be supported. We do test on HDFS, against Azure storage (in HDInsight), and on other filesystems (e.g. Isilon). I don't know of anyone else who tests Hadoop on Ceph the way, say, Red Hat does with GlusterFS.