Created 11-08-2017 01:01 AM
I have been looking for a way to secure Parquet files, column-wise, for Spark access. Ideally, that would work the same way Ranger works for Hive, i.e., a Sysadmin defines the access policies for different groups and columns.
I have been trying Ranger through HDP; however, it seems that plug-ins for Spark and Parquet are not available yet.
I was also able to devise a solution using Apache Drill and its views capability; however, it is not acceptable right now, mainly because community support is still scarce.
Has anyone faced the same requirement, or does anyone have directions for a solution?
Created 11-08-2017 06:39 PM
Could you try the following SPARK-LLAP? It uses Hive LLAP and Ranger inside Spark.
Created 11-09-2017 04:35 PM
Hi @Dongjoon Hyun, thanks for the reply.
The tutorial is great and very clear. However, how could I apply that to Parquet files? (Sorry if this is a newbie question, but I'm indeed a newbie 🙂)
Created 11-09-2017 04:42 PM
Please create a Hive table on those Parquet files. If Hive can access them securely with Ranger, Spark can too, via SPARK-LLAP.
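For illustration, here is a minimal sketch of what "create a Hive table on those Parquet files" could look like. It only builds the HiveQL text in Python; the table name, column list, and HDFS path are hypothetical placeholders, not values from this thread. An external table leaves the Parquet files where they are, and the resulting statement would still have to be run in Hive (for example through Beeline) before Ranger policies could be defined on its columns.

```python
def build_external_parquet_ddl(table, columns, location):
    """Return a HiveQL statement mapping an external table onto existing Parquet files.

    table    -- hypothetical table name, e.g. "secure_db.customers"
    columns  -- list of (column_name, hive_type) pairs matching the Parquet schema
    location -- directory that already contains the Parquet files
    """
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


# Hypothetical example values; replace them with your own schema and path.
ddl = build_external_parquet_ddl(
    "secure_db.customers",
    [("id", "BIGINT"), ("name", "STRING"), ("ssn", "STRING")],
    "hdfs:///data/customers_parquet",
)
print(ddl)
```

Once such a table exists, column-level policies are set in Ranger on the Hive table rather than on the files themselves.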
Created 12-04-2017 06:58 PM
@Felipe Melo Does it solve your problem?
Created 12-04-2017 11:58 PM
Hi @Dongjoon Hyun. That definitely works; however, it does not address the requirements I had. I was looking for a solution that works on Parquet files the same way Ranger works with Hive, for instance. I'd like to go to Ranger and set specific permissions directly on Parquet columns without having to first load the files into Hive.
After better understanding how Ranger works, I realized that this is not possible: Ranger works through hooks (plug-ins) in the tools it secures (HDFS, HBase, Hive, etc.), and Parquet is simply a file format. A solution I started to investigate is an extension to the HDFS plug-in that could act on Parquet files, filtering access as specified through Ranger. With that solution, Parquet files could be secured at the column level directly from Ranger, as long as they are stored in HDFS.
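To make the idea of column-level filtering concrete, here is a rough sketch of the decision such an extension would have to make. This is not Ranger's API and not an existing plug-in; the policy structure, group names, and function names are all hypothetical, and a real implementation would also have to read the Parquet schema and enforce the result inside the HDFS access path.

```python
# Hypothetical policy: group name -> set of columns that group may read.
POLICY = {
    "analysts": {"id", "name"},
    "auditors": {"id", "name", "ssn"},
}


def allowed_columns(groups, requested, policy=POLICY):
    """Return the requested columns granted to at least one of the user's groups."""
    granted = set()
    for group in groups:
        granted |= policy.get(group, set())
    # Preserve the order in which the columns were requested.
    return [col for col in requested if col in granted]


def check_access(groups, requested, policy=POLICY):
    """Return the requested columns, or raise PermissionError if any is not granted."""
    permitted = allowed_columns(groups, requested, policy)
    denied = [col for col in requested if col not in permitted]
    if denied:
        raise PermissionError(f"access denied to columns: {denied}")
    return permitted
```

The two functions reflect two possible behaviours: silently projecting away the columns a user may not see, or rejecting the whole read when a forbidden column is requested.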
Anyway, thank you very much for the replies and also for checking on the result.
Created 12-05-2017 12:05 AM
I see. Yes, that is how Ranger and Parquet work. I believe you can find a way to meet your requirements!