Securing Parquet Files Column-wise

I have been looking for a way to secure Parquet files, column-wise, for Spark access. Ideally, it would work the same way Ranger works for Hive, i.e., a sysadmin defines the access policies for different groups and columns.

I have been trying Ranger through HDP; however, it seems that plug-ins for Spark and Parquet are not available yet.

I was also able to devise a solution using Apache Drill and its views capability; however, it is not acceptable right now, mainly because community support is still scarce.

Has anyone faced the same requirement, or can anyone point me toward a solution?

6 REPLIES

Expert Contributor

Could you try SPARK-LLAP, described in the tutorial below? It uses Hive LLAP and Ranger inside Spark.

Row/Column-level Security in SQL for Apache Spark

Hi @Dongjoon Hyun, thanks for the reply.

The tutorial is great and very clear. However, how could I apply that to Parquet files? (Sorry if this is a newbie question, but I am indeed a newbie 🙂)

Expert Contributor

Please create a Hive table on those Parquet files. If Hive can access them securely with Ranger, Spark can too, via SPARK-LLAP.
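
To make the suggestion concrete, here is a minimal Python sketch that renders the kind of Hive DDL that puts an external table on top of existing Parquet files without moving them. The helper name, table name, columns, and HDFS path are all made up for illustration; only the `CREATE EXTERNAL TABLE ... STORED AS PARQUET LOCATION ...` form is standard Hive syntax. The resulting statement would be run in Hive (for example through Beeline), after which Ranger column policies can be defined on the table.

```python
def external_parquet_table_ddl(table, columns, location):
    """Render a Hive CREATE EXTERNAL TABLE statement for existing Parquet files.

    table, columns, and location are illustrative inputs, not values from
    the thread. columns is a list of (name, hive_type) pairs.
    """
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {cols}\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table over Parquet files that already sit in HDFS.
ddl = external_parquet_table_ddl(
    "secure_db.customers",
    [("id", "BIGINT"), ("name", "STRING"), ("ssn", "STRING")],
    "hdfs:///data/customers",
)
print(ddl)
```

Because the table is external, dropping it later removes only the Hive metadata and leaves the Parquet files in place.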

Expert Contributor

@Felipe Melo Does it solve your problem?

Hi @Dongjoon Hyun. That definitely works; however, it does not address the requirements I had. I was looking for a solution that works on Parquet files the same way Ranger works with Hive. I'd like to go to Ranger and set specific permissions directly on Parquet columns without having to first load the files into Hive.

After better understanding how Ranger works, I realized that this is not possible, because Ranger works through hooks (plug-ins) into the tools it secures (HDFS, HBase, Hive, etc.), and Parquet is simply a file format. A solution I started to investigate is an extension to the HDFS plug-in that could act on Parquet files, filtering access as specified through Ranger. With that solution, Parquet files could be secured at column level directly from Ranger, as long as they are stored in HDFS.
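
To illustrate the idea only, here is a rough Python sketch of the check such an extension would have to make before letting a reader project columns out of a Parquet file. Nothing here is Ranger's API: the `POLICY` mapping, the group names, the column names, and the `allowed_columns` function are all hypothetical stand-ins for policies that would really be defined in Ranger.

```python
# Hypothetical policy: group name -> set of columns that group may read.
POLICY = {
    "analysts": {"id", "name"},
    "auditors": {"id", "name", "ssn"},
}


def allowed_columns(user_groups, requested, policy=POLICY):
    """Return the requested columns if every one is granted, else raise.

    A column is granted when at least one of the user's groups lists it.
    """
    granted = set()
    for group in user_groups:
        granted |= policy.get(group, set())
    denied = [col for col in requested if col not in granted]
    if denied:
        raise PermissionError(f"access denied to columns: {denied}")
    return list(requested)


# An analyst may read id and name, but a request that includes ssn is rejected.
print(allowed_columns(["analysts"], ["id", "name"]))
try:
    allowed_columns(["analysts"], ["id", "ssn"])
except PermissionError as err:
    print(err)
```

Since Parquet is a columnar format, a reader that passed this check could then load only the permitted columns; the hard part, as noted above, is that the enforcement point has to live in a component that understands the file format.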

Anyway, thank you very much for the replies and also for checking on the result.

Expert Contributor

I see. Yes, that is how Ranger and Parquet work. I believe you can find a way to meet your requirements!