I have been looking for a way to secure Parquet files at the column level for Spark access. Ideally, it would work the same way Ranger works for Hive, i.e., a sysadmin defines the access policies for different groups and columns.
I have been trying Ranger through HDP; however, it seems that plug-ins for Spark and Parquet are not there yet.
I have also been able to devise a solution using Apache Drill and its views capability; however, it is not acceptable right now, mainly because community support is still scarce.
Has anyone faced the same requirement, or does anyone have directions for a solution?
Please create a Hive table on those Parquet files. If Hive can access them securely with Ranger, Spark can too, via SPARK-LLAP.
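As a rough sketch of the first step: the helper below only builds a HiveQL `CREATE EXTERNAL TABLE ... STORED AS PARQUET` statement that points Hive at Parquet files already in place. The table name `sales`, the column list, and the HDFS path are made up for illustration, so substitute your own.

```python
def parquet_external_table_ddl(table, columns, location):
    """Build a HiveQL statement exposing existing Parquet files as an external table.

    columns is a list of (name, hive_type) pairs; location is the directory
    that already holds the Parquet files.
    """
    if not columns:
        raise ValueError("at least one column is required")
    cols = ",\n".join(f"  `{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"{cols}\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table, columns, and path, for illustration only.
ddl = parquet_external_table_ddl(
    "sales",
    [("customer_id", "BIGINT"), ("email", "STRING"), ("amount", "DOUBLE")],
    "hdfs:///data/sales_parquet",
)
print(ddl)
```

The generated statement can be run through beeline or any other Hive client. Because the table is external, the files stay where they are; the Ranger policies are then defined on the Hive table and its columns.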
Hi @Dongjoon Hyun. That definitely works; however, it does not address the requirements I had. I was looking for a solution that works on Parquet files the same way Ranger works with Hive. I'd like to go to Ranger and set specific permissions directly on Parquet columns without having to first load the files into Hive.
After better understanding how Ranger works, I realized that this is not possible: Ranger works through hooks (plug-ins) in the tools it secures (HDFS, HBase, Hive, etc.), and Parquet is simply a file format. A solution I started to investigate is an extension to the HDFS plug-in that could act on Parquet files, filtering access as specified through Ranger. With that solution, Parquet files could be secured at the column level directly from Ranger, as long as they are stored in HDFS.
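To make the idea concrete, here is a toy model of the check such an extension would have to perform on each read. It is not the Ranger API: the policy structure, the group names, and the column names are all invented for illustration.

```python
# Hypothetical policy: group name -> set of columns that group may read.
POLICY = {
    "analysts": {"customer_id", "amount"},
    "admins": {"customer_id", "email", "amount"},
}


def allowed_columns(user_groups, requested, policy=POLICY):
    """Return the requested columns that any of the user's groups may read."""
    permitted = set()
    for group in user_groups:
        permitted |= policy.get(group, set())
    # Preserve the order in which the columns were requested.
    return [col for col in requested if col in permitted]


def check_access(user_groups, requested, policy=POLICY):
    """Raise PermissionError if any requested column is not permitted."""
    allowed = allowed_columns(user_groups, requested, policy)
    denied = [col for col in requested if col not in allowed]
    if denied:
        raise PermissionError("access denied for columns: " + ", ".join(denied))
    return allowed


visible = allowed_columns(["analysts"], ["customer_id", "email", "amount"])
print(visible)
```

The hard part is not this check itself but where to enforce it: HDFS sees only files and blocks, so the extension would have to understand the Parquet layout to map a read to specific columns.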
Anyway, thank you very much for the replies and also for checking on the result.