Created 12-22-2023 01:05 PM
Hello,
Studying the documentation below I found a good example of how to use Spark with JDBC to connect to external databases.
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
I tested the example and it worked well. However, on closer inspection I saw that the program handles the password directly as plain text. In my tests the password showed up in clear text in the Spark log, and if the connection is intercepted on the network it can be read there as well.
I'm looking for a way to make this connection safely. One mechanism that comes to mind is the Hadoop Credential Provider, which I already used with good old Sqoop. With Sqoop the password was protected and not visible in any form, even if the connection was intercepted.
Could anyone tell me what mechanisms I can use to make this connection securely, without exposing credentials? It can be a credential provider or something else.
I didn't find anything like that in the Spark documentation.
Created 01-09-2024 03:40 AM
As I was already using the Hadoop Credential Provider, I found a solution that does not require decrypting the password as follows:
PySpark code:
from pyspark.sql import SparkSession

# Spark session (config() takes the key and the value as separate arguments)
spark = SparkSession.builder \
    .config("spark.yarn.keytab", "/etc/security/keytabs/<APPLICATION_USER>.keytab") \
    .appName('SPARK_TEST') \
    .master("yarn") \
    .getOrCreate()

credential_provider_path = 'jceks://hdfs/<PATH>/<CREDENTIAL_FILE>.jceks'
credential_name = 'PASSWORD.ALIAS'
# Hadoop credential
conf = spark.sparkContext._jsc.hadoopConfiguration()
conf.set('hadoop.security.credential.provider.path', credential_provider_path)
# getPassword returns a Java char array; join it into a Python string
credential_raw = conf.getPassword(credential_name)
password = ''
for i in range(credential_raw.__len__()):
    password = password + str(credential_raw.__getitem__(i))
The important point above is the .config() line in SparkSession. You must enter the keytab to access the password. Otherwise you will get the encrypted value.
I can't say that I'm very happy with being able to directly manipulate the password value in the code. I would like to delegate this to some component in a way that the programmer does not have direct access to the password value.
Maybe what I'm looking for is some kind of authentication provider, but for now the solution above works for me.
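One way to narrow where the clear-text value lives is to wrap the lookup in a small function and hand the result straight to the JDBC reader. The following is only a minimal sketch, not a tested recipe: java_chars_to_str, read_alias and read_table are hypothetical helper names, and the URL, table and user are whatever your job uses. It assumes, as in the code above, that getPassword returns a Java char array that Py4J exposes with a length and index access.

```python
def java_chars_to_str(chars):
    """Join a sequence of single characters (e.g. a Py4J char array) into a str."""
    if chars is None:
        # Assumption: a missing alias comes back as None through Py4J.
        raise LookupError("credential alias not found in the configured providers")
    return "".join(str(chars[i]) for i in range(len(chars)))


def read_alias(spark, provider_path, alias):
    """Resolve `alias` through the Hadoop Credential Provider API."""
    # _jsc is a private PySpark attribute, used here exactly as in the post above.
    conf = spark.sparkContext._jsc.hadoopConfiguration()
    conf.set("hadoop.security.credential.provider.path", provider_path)
    return java_chars_to_str(conf.getPassword(alias))


def read_table(spark, url, table, user, provider_path, alias):
    """Read a JDBC table without binding the password to a named variable."""
    return (
        spark.read.format("jdbc")
        .option("url", url)
        .option("dbtable", table)
        .option("user", user)
        .option("password", read_alias(spark, provider_path, alias))
        .load()
    )
```

This does not hide the password from someone who can edit the job, since the value still exists as clear text in the driver process. It only keeps it out of module-level variables, which makes accidental printing or logging less likely.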
Created 12-26-2023 10:18 PM
You're right to be concerned about password security, especially in a distributed environment. Spark doesn't inherently provide a built-in secure password handling mechanism like the Hadoop Credential Provider, but there are several approaches you can consider to enhance security when dealing with passwords:
Credential Providers: While Spark itself doesn't have a native credential provider, you can use Hadoop Credential Providers in combination with Spark. Store sensitive values such as passwords in a keystore managed through Hadoop's CredentialProvider API, then look them up by alias from your Spark job.
Environment Variables: You can set the password as an environment variable on the cluster or machine running Spark. Accessing environment variables in your Spark code helps avoid directly specifying passwords in code or configuration files.
Key Management Services (KMS): Some cloud providers offer key management services that allow you to securely store and manage credentials. You can retrieve these credentials dynamically in your Spark application.
Secure Storage Systems: Leverage secure storage systems or secret management tools like HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault. These tools provide secure storage for sensitive information and offer APIs to retrieve credentials when needed by your application.
Secure File Systems: Utilize secure file systems or encryption mechanisms to protect sensitive configuration files. These files could contain the necessary credentials, and access could be restricted using appropriate permissions.
Encryption and Secure Communication: Ensure that communication between Spark and external systems is encrypted (e.g., using SSL/TLS) to prevent eavesdropping on the network.
Token-Based Authentication: Whenever possible, consider using token-based authentication mechanisms instead of passwords. Tokens can be time-limited and are generally safer for communication over the network.
When implementing these measures, it's crucial to balance security with convenience and operational complexity. Choose the approach that aligns best with your security policies, deployment environment, and ease of management for your use case.
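As an illustration of the environment-variable option above, here is a minimal PySpark-side sketch. The function names and the variable name JDBC_PASSWORD are assumptions made for the example, not something Spark defines. It builds the JDBC options from the environment and offers a masked copy for logging.

```python
import os


def jdbc_options_from_env(url, table, user, env=os.environ, var="JDBC_PASSWORD"):
    """Build JDBC reader options, taking the password from an environment variable."""
    password = env.get(var)
    if not password:
        # Fail early rather than attempting a connection with an empty password.
        raise RuntimeError(f"environment variable {var} is not set")
    return {"url": url, "dbtable": table, "user": user, "password": password}


def redacted(options):
    """Return a copy of the options that is safe to log: the password is masked."""
    return {k: ("***" if k == "password" else v) for k, v in options.items()}
```

With a running session the options can be passed as spark.read.format("jdbc").options(**opts).load(), and only redacted(opts) should ever be written to a log. Note that environment variables are readable by anyone who can inspect the process, so this mainly keeps the password out of source code and configuration files.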
Created 12-27-2023 04:37 PM
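There is a two-step process to achieve your scenario: encrypt the password externally and decrypt it within the Spark code. EncryptionUtil in the snippets below is a helper class you supply yourself; it is not part of Spark.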
Step1: Encrypt the Password
val password = "YOUR_PASSWORD"
val salt_key = "YOUR_SALT_KEY"
val encryptedPassword = EncryptionUtil.encrypt(password, salt_key)
Step2: Decrypt the password inside the Spark code and Read JDBC data:
val encryptedPassword = sys.env("ENCRYPTED_PASSWORD")
val saltKey = sys.env("SALT_KEY")
val decryptedPassword = EncryptionUtil.decrypt(encryptedPassword, saltKey)
val options = Map(
  "url" -> "jdbc:mysql://your_database_url",
  "driver" -> "com.mysql.jdbc.Driver",
  "dbtable" -> "your_table",
  "user" -> "your_username",
  "password" -> decryptedPassword
)
// DataFrameReader.jdbc expects java.util.Properties, so pass the Map through options() instead
val df = spark.read.format("jdbc").options(options).load()
Created 01-08-2024 05:34 PM
@cardozogp Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. Thanks.
Regards,
Diana Torres