
Issue running spark jobs with Airflow

Explorer

Hi everyone,

 

So I've inherited a kerberized Cloudera cluster and I'm learning as I go. Right now I'm trying to get Airflow to work with our Spark jobs, but without success. As I understand it, Airflow was installed by our OS team only after the cluster was configured by Cloudera. It runs on our edge node, from where we run our jobs.

 

Basically I'm using bash operators for my test DAG with the following tasks:

 

Task 1:

Kinit the user that is running the script:
"echo 'password' | kinit user@domain"

 

Task 2:

Download some files from some location.

 

Task 3:
spark-submit /path/to/script.py 
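
Put together, the test DAG looks roughly like this (Airflow 2-style imports; the dag_id, URL, and paths are simplified placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="spark_kerberos_test",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:

    # Task 1: obtain a Kerberos ticket for the submitting user
    kinit = BashOperator(
        task_id="kinit",
        bash_command="echo 'password' | kinit user@domain",
    )

    # Task 2: fetch the input files (location simplified)
    download = BashOperator(
        task_id="download_files",
        bash_command="wget -P /tmp/input https://example.com/data/files.csv",
    )

    # Task 3: submit the Spark job
    submit = BashOperator(
        task_id="spark_submit",
        bash_command="spark-submit /path/to/script.py",
    )

    kinit >> download >> submit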

 

Task 1 and 2 work fine, but task 3 fails with the following:

 

py4j.protocol.Py4JJavaError: An error occurred while calling o32.csv.
: org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
 
I am a bit confused by this, as I am authenticating the user as a first step. This exact workflow executes just fine when I run it manually on the command line.
 
Has anyone dealt with a similar issue? Any input would be appreciated as we need to transition to using Airflow.
 
Many thanks,
Mario
 
 
1 ACCEPTED SOLUTION

Master Collaborator

There are two solutions you can try.

1. Run the kinit and the spark-submit in one and the same shell operator task. Each task can run in its own shell (and, depending on your executor, even on a different host), so a ticket cache created by a separate kinit task is not necessarily visible to the task that calls spark-submit. See the sketch below.

2. Pass the keytab and principal to spark-submit via --keytab and --principal.
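
A minimal sketch of option 1 (Airflow 2-style import; the keytab path and principal are placeholders for your environment):

from airflow.operators.bash import BashOperator

# kinit and spark-submit share one shell, so the ticket cache
# obtained by kinit is visible when spark-submit runs
kinit_and_submit = BashOperator(
    task_id="kinit_and_submit",
    bash_command=(
        "kinit -kt /path/to/keytab/file.keytab user@domain && "
        "spark-submit /path/to/script.py"
    ),
)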


6 REPLIES

Master Collaborator

Hi @imule 

 

In step 3, could you please pass --keytab <keytab_path> --principal <principal_name> to the spark-submit command?
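
For example (the master, paths, and principal below are placeholders for your environment; --keytab and --principal are supported when running on YARN):

spark-submit \
  --master yarn \
  --keytab /path/to/user.keytab \
  --principal user@domain \
  /path/to/script.py

With these options Spark logs in from the keytab itself and can also renew the ticket for long-running jobs.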

 

Note: In CDP, Airflow integration is not yet supported.

Explorer

Hi @RangaReddy ,

Is there a way to generate the keytab file myself, or do I need to contact our Active Directory administrators for that?

Thank you

Master Collaborator

Hi @imule 

You can follow the steps below to generate the keytab; if you don't have permission, please check with your admin team.

https://docs.cloudera.com/data-hub/cloud/access-clusters/topics/dh-retrieving-keytabs.html
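
If your KDC policy allows it and you know the account password, one common way to create a keytab yourself is with the MIT Kerberos ktutil tool (shown as an example; in an Active Directory environment the admins may generate it with ktpass instead, and the key version number and encryption type below must match what your KDC uses):

ktutil
addent -password -p user@domain -k 1 -e aes256-cts-hmac-sha1-96
wkt /path/to/user.keytab
quit

You can then verify it with: kinit -kt /path/to/user.keytab user@domain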

 

Community Manager

@imule, has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.



Regards,

Vidya Sargur,
Community Manager



Explorer

Hi again @RangaReddy ,

I'm sorry for the huge delay in replying; unfortunately, this triggered a lengthy discussion between us and the AD team.

In the end we managed to get our hands on a keytab file, and we confirmed it works by manually running the command below:

kinit -k -t /path/to/keytab/file.keytab username

Unfortunately when we attempt to pass this with a bash operator from an Airflow DAG we get the same error:

py4j.protocol.Py4JJavaError: An error occurred while calling o32.csv.
: org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
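
For context, the relevant tasks in the DAG currently look roughly like this (simplified):

kinit = BashOperator(
    task_id="kinit",
    bash_command="kinit -k -t /path/to/keytab/file.keytab username",
)

submit = BashOperator(
    task_id="spark_submit",
    bash_command="spark-submit /path/to/script.py",
)

kinit >> submit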
 
Thank you,
Mario
