I am writing a Java MapReduce job that reads through a variety of files and directories. I have no way of knowing ahead of time whether the data storage type is plain text, ORC, Avro, or something else. Is there some way to determine the storage format of a file programmatically? I haven't found anything in any of the file system APIs.
Unfortunately, there's no file system API to do that.
Apache Tika does file type detection (https://tika.apache.org/1.1/detection.html) and provides base APIs that you can extend to create a detector for ORC, Avro, etc.
To detect whether a file is a given type you can:
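One option, sketched here with plain Java rather than as a Tika detector, is to read the first few bytes of the file and compare them with each format's magic number: ORC files start with "ORC", Avro object container files with "Obj" followed by the byte 1, Parquet files with "PAR1", and Hadoop SequenceFiles with "SEQ". The class and method names below are hypothetical, and the list of formats is only a starting point. Anything that matches none of the signatures is reported as UNKNOWN, which may well be plain text.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: guesses a storage format from a file's leading bytes.
public class FormatSniffer {
    public enum Format { ORC, AVRO, PARQUET, SEQUENCE_FILE, UNKNOWN }

    private static final byte[] ORC_MAGIC = {'O', 'R', 'C'};
    private static final byte[] AVRO_MAGIC = {'O', 'b', 'j', 1};
    private static final byte[] PARQUET_MAGIC = {'P', 'A', 'R', '1'};
    private static final byte[] SEQ_MAGIC = {'S', 'E', 'Q'};

    // True if the first magic.length bytes of data equal magic.
    private static boolean startsWith(byte[] data, int length, byte[] magic) {
        if (length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Classifies a header buffer; length is the number of valid bytes in it.
    public static Format detect(byte[] header, int length) {
        if (startsWith(header, length, AVRO_MAGIC)) return Format.AVRO;
        if (startsWith(header, length, PARQUET_MAGIC)) return Format.PARQUET;
        if (startsWith(header, length, ORC_MAGIC)) return Format.ORC;
        if (startsWith(header, length, SEQ_MAGIC)) return Format.SEQUENCE_FILE;
        return Format.UNKNOWN; // possibly plain text, or a format not listed here
    }

    // Reads up to four bytes from the stream and classifies them.
    // On HDFS the stream could come from FileSystem.open(path).
    public static Format detect(InputStream in) throws IOException {
        byte[] header = new byte[4];
        int filled = 0;
        while (filled < header.length) {
            int n = in.read(header, filled, header.length - filled);
            if (n < 0) {
                break; // file is shorter than the header buffer
            }
            filled += n;
        }
        return detect(header, filled);
    }
}
```

A magic-number check only says what the file claims to be. Compressed text files (gzip, bzip2 and so on) and empty files will come back as UNKNOWN, so treat that result as "needs another look" rather than as a firm "plain text".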
Create a job builder that constructs and initializes the MapReduce driver based on the file format detected using the logic above.
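A rough sketch of the selection step in such a builder might look like the following. It only maps a detected format to the fully qualified name of an InputFormat class, so it compiles without Hadoop on the classpath. The enum and method names are hypothetical, and the class names assume the usual Hadoop, avro-mapred, orc-mapreduce and parquet-hadoop artifacts are what your job uses, so check them against your own dependencies.

```java
// Hypothetical helper: picks an InputFormat class name for a detected format.
public class InputFormatChooser {
    public enum StorageFormat { TEXT, SEQUENCE_FILE, AVRO, ORC, PARQUET }

    // Returns the fully qualified InputFormat class name assumed for each format.
    public static String inputFormatClassName(StorageFormat format) {
        switch (format) {
            case SEQUENCE_FILE:
                return "org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat";
            case AVRO:
                return "org.apache.avro.mapreduce.AvroKeyInputFormat";
            case ORC:
                return "org.apache.orc.mapreduce.OrcInputFormat";
            case PARQUET:
                return "org.apache.parquet.hadoop.ParquetInputFormat";
            case TEXT:
            default:
                return "org.apache.hadoop.mapreduce.lib.input.TextInputFormat";
        }
    }
}
```

In the driver, the returned name could be loaded with Class.forName and passed to Job.setInputFormatClass. Each format also needs its own mapper key/value types and, for Avro and Parquet, schema or read-support configuration, which this sketch leaves out.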
An important point to remember is that each of these formats needs careful input splitting, and the split criteria vary from one format to another.
Hope this helps!