Confusion in documentation: Configuring Spark for Wire Encryption?

Expert Contributor

Hi all,

I was going through the latest documentation on the Hortonworks website: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_spark-component-guide/content/spark-encry...

I am unable to understand the following passage:

Configuring Spark for Wire Encryption

"You can configure Spark to protect sensitive data in transit by enabling wire encryption. Spark supports SSL for broadcast and file server protocols, and it uses SASL encryption for the block transfer service. Note, however, that wire encryption is not yet supported for shuffle files, cached data, and other application files."

- "Protect sensitive data in transit by enabling wire encryption": this seems to be about the data ingestion part, but how? From Kafka?

- "Spark supports SSL for broadcast and file server protocols, ...": OK.

- "however, that wire encryption is not yet supported for shuffle files, cached data, and other application files."

So where is the data encrypted, and where is it secured or unsecured, between the start of the job and the end of its execution?

Can someone please enlighten me on this?

BTW: In Spark's context, where does wire encryption come into the picture?

Many thanks,

Accepted Solutions

Expert Contributor

Encryption applies to data read from external sources by the job, intermediate data written by the job and data moved across the network within the job.

Connectors to external sources need to take care of their own encrypted access. Apache Spark plays no role there. E.g. HDFS allows transparent encrypted data access, so Spark jobs can read encrypted data from HDFS.

Intermediate data within a job can be written to local disk by the executors (e.g. when the data does not fit in memory). Apache Spark does not support encrypted data on local disk. For that, the recommendation is to enable OS-level local disk encryption.

Data moved across the network within a Spark job can be encrypted by Spark itself. So data moved between the executors and the driver (e.g. during collect()) or between executors (e.g. during shuffle) can be encrypted by Apache Spark.

So in the context of Spark, wire encryption comes into the picture whenever data moves across the network between Spark processes.
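
To make this more concrete, here is a minimal sketch of the kind of properties involved. The property names are the ones listed in the Apache Spark 1.6 security documentation. The helper functions, the keystore/truststore paths and the passwords are hypothetical placeholders for illustration, not something taken from the HDP guide, so check them against the documentation for your version.

```python
# Sketch only: property names follow the Spark 1.6 security docs.
# Function names, paths and passwords below are hypothetical placeholders.

def wire_encryption_conf(keystore, keystore_password,
                         truststore, truststore_password):
    """Return Spark properties that enable authentication, SASL and SSL."""
    return {
        # Shared-secret authentication; SASL encryption depends on it.
        "spark.authenticate": "true",
        # SASL encryption for the block transfer service.
        "spark.authenticate.enableSaslEncryption": "true",
        # SSL for the protocols that support it.
        "spark.ssl.enabled": "true",
        "spark.ssl.keyStore": keystore,
        "spark.ssl.keyStorePassword": keystore_password,
        "spark.ssl.trustStore": truststore,
        "spark.ssl.trustStorePassword": truststore_password,
    }


def to_submit_args(conf):
    """Flatten a property dict into spark-submit style --conf arguments."""
    args = []
    for key in sorted(conf):
        args.extend(["--conf", f"{key}={conf[key]}"])
    return args


if __name__ == "__main__":
    conf = wire_encryption_conf(
        "/path/to/keystore.jks", "changeit",      # placeholder values
        "/path/to/truststore.jks", "changeit",    # placeholder values
    )
    print(" ".join(to_submit_args(conf)))
```

In practice you would more likely put these properties in spark-defaults.conf than pass passwords on the command line; the helper above is only meant to show which settings belong together.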

Expert Contributor

Thanks @bikas, @lgeorge

Does that mean that by following "Configuring Spark for Wire Encryption" in the documentation http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_spark-component-guide/content/spark-encry... we get encryption for data that is moved across the network within a single job (between executors) as part of its tasks, i.e. during the internal transit of data among the nodes inside the cluster?

Does "wire encryption for Spark" cover any other avenues or bring other benefits?

Many thanks,

SS

Expert Contributor

Yes. It means encrypting all network transfers within the Spark job. There are no other avenues for wire encryption within Spark. Starting with Spark 2.0, enabling wire encryption also enables HTTPS on the history server UI for browsing historical job data.

Expert Contributor

@Smart Solutions, there is also some related info for Apache Spark version 1.6.2 (shipped with HDP 2.5) at https://spark.apache.org/docs/1.6.2/security.html#encryption

Expert Contributor

The HDP Spark Component Guide (versions 2.5.0+) has been updated per Bikas's clarification:

http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_spark-component-guide/content/spark-encry...