Programmatic Way of Determining Data Storage Format?


I am writing a Java MapReduce job that reads through a variety of files and directories. I have no way of knowing ahead of time whether the data storage format is plain text, ORC, Avro, or something else. Is there some way to determine the storage format of a file programmatically? I haven't found anything in any of the file system APIs.


Re: Programmatic Way of Determining Data Storage Format?

Unfortunately there's no File System API to do that.

Apache Tika does file type detection and provides base APIs that you can extend to create a detector for ORC, Avro, etc.

To detect whether a file is a given type you can:
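One common approach (a minimal sketch, not necessarily the only way to do it) is to read the first few bytes of the file and compare them against each format's magic number. ORC files begin with the ASCII bytes "ORC", Avro object container files with "Obj" followed by the byte 0x01, Parquet files with "PAR1", and Hadoop SequenceFiles with "SEQ". Plain text has no magic number, so it ends up as the fallback. The class name FormatSniffer and the Format enum below are made up for illustration; in a real job the InputStream would come from FileSystem.open(path).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: guesses a storage format from a file's leading bytes.
final class FormatSniffer {
    enum Format { ORC, AVRO, PARQUET, SEQUENCE_FILE, UNKNOWN }

    private static final byte[] ORC_MAGIC = "ORC".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] AVRO_MAGIC = {'O', 'b', 'j', 1};
    private static final byte[] PARQUET_MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] SEQ_MAGIC = "SEQ".getBytes(StandardCharsets.US_ASCII);

    // Classify a header that has already been read into memory.
    static Format detect(byte[] header) {
        if (startsWith(header, ORC_MAGIC)) return Format.ORC;
        if (startsWith(header, AVRO_MAGIC)) return Format.AVRO;
        if (startsWith(header, PARQUET_MAGIC)) return Format.PARQUET;
        if (startsWith(header, SEQ_MAGIC)) return Format.SEQUENCE_FILE;
        // No known magic number: possibly plain text, possibly something else.
        return Format.UNKNOWN;
    }

    // Read up to four bytes from the stream and classify them.
    static Format detect(InputStream in) throws IOException {
        byte[] header = new byte[4];
        int filled = 0;
        while (filled < header.length) {
            int n = in.read(header, filled, header.length - filled);
            if (n < 0) break; // file is shorter than four bytes
            filled += n;
        }
        return detect(Arrays.copyOf(header, filled));
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }
}
```

Keep in mind this is only a heuristic: a text file that happens to start with "ORC" or "SEQ" would be misclassified, and compressed text files are not handled here.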

Create a job builder to construct and initialize the MapReduce driver based on the file format detected using the logic above.
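As a rough sketch of that idea, the builder could map each detected format to the InputFormat class it should configure on the job. The class InputFormatChooser and its Format enum are hypothetical names used only for this example, and treating an unrecognized file as plain text is an assumption. The class names returned are the standard new-API (mapreduce) input formats shipped with Hadoop, ORC, Avro, and Parquet, which need the matching jars on the classpath.

```java
// Hypothetical helper: picks an InputFormat class name for a detected format.
final class InputFormatChooser {
    enum Format { ORC, AVRO, PARQUET, SEQUENCE_FILE, UNKNOWN }

    static String inputFormatClassFor(Format format) {
        switch (format) {
            case ORC:
                return "org.apache.orc.mapreduce.OrcInputFormat";
            case AVRO:
                return "org.apache.avro.mapreduce.AvroKeyInputFormat";
            case PARQUET:
                return "org.apache.parquet.hadoop.ParquetInputFormat";
            case SEQUENCE_FILE:
                return "org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat";
            default:
                // Assumption: anything without a known magic number is plain text.
                return "org.apache.hadoop.mapreduce.lib.input.TextInputFormat";
        }
    }
}
```

In the driver, the resolved class would then be passed to job.setInputFormatClass(...), for example after loading it with Class.forName(name).asSubclass(InputFormat.class).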

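An important point to remember is that each of these formats needs careful input splitting, and the split criteria vary by format: text splits on line boundaries, Avro on sync markers, and ORC on stripe boundaries. Using the format's own InputFormat takes care of this for you.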

Hope this helps!
