######## DON'T
### Standalone Spark / CDH Spark
spark = SparkSession \
    .builder \
    .appName("Load Data") \
    .config("spark.executor.cores", "6") \
    .config("spark.executor.memory", "10g") \
    .config("spark.executor.instances", "2") \
    .enableHiveSupport() \
    .getOrCreate()
######## DO
### CDE Spark
spark = SparkSession \
    .builder \
    .appName("Load Data") \
    .enableHiveSupport() \
    .getOrCreate()
# Equivalent CDE CLI command for the resource settings above
cde job create --application-file <path-to-file> --name <job_name> --num-executors 2 --executor-cores 6 --executor-memory "10g" --type spark
## Note: the job name is a CDE property and does not correspond to appName
cde job create --application-file <path-to-file> --name <job_name> --conf spark.kerberos.access.hadoopFileSystems=s3a://nyc-tlc,s3a://blaw-sandbox2-cdp-bucket --type spark
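Once the buckets are granted this way, the job's Spark code can read them directly over the s3a connector. A minimal sketch, reusing the `spark` session from the DO example above (the object path is a placeholder, not taken from the article):

# Read a file from one of the buckets granted via spark.kerberos.access.hadoopFileSystems
df = spark.read.csv("s3a://nyc-tlc/<path-to-file>.csv", header=True, inferSchema=True)
df.show(5)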
cde resource create --name my_custom_env --type python-env
cde resource upload --name my_custom_env --local-path requirements.txt
FROM container.repository.cloudera.com/cloudera/dex/dex-spark-runtime-3.1.1:1.7.0-b129
cde resource create --type="custom-runtime-image" --image-engine="<spark2 or 3>" --name="<runtime_name>" --image="<path_to_your_repo_and_image>"
cde job create --type spark --name <your_job_name> --runtime-image-resource-name <runtime_name> --application-file <your_job_file>
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.9")
cde spark submit <scala_job>.scala --job-name <your_job_name>
A handy reference for building cron schedule expressions (this one fires every 30 minutes): https://crontab.guru/#*/30_*_*_*_*
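The `CDEJobRunOperator` below attaches to a DAG object. As a point of reference, here is a minimal sketch of how a `cde_process_dag` with that 30-minute schedule might be declared; the import path and arguments are assumptions, so check the Airflow documentation for your CDE virtual cluster:

from datetime import datetime
from airflow import DAG
from cloudera.cdp.airflow.operators.cde_operator import CDEJobRunOperator

# Sketch: a DAG that fires every 30 minutes (see the crontab.guru link above)
cde_process_dag = DAG(
    'cde_process_dag',
    schedule_interval='*/30 * * * *',  # every 30 minutes
    start_date=datetime(2021, 1, 1),
    catchup=False
)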
my_airflow_task = CDEJobRunOperator(
    task_id='loader',          # a name to identify the task
    dag=cde_process_dag,       # the DAG that this operator will be attached to
    job_name='load_data_job'   # the job that we are running
)
Note: The `job_name` must match how the job was set up in CDE.
start >> load_data_job >> etl_job >> end
This means that the execution order of the operators is `start`, then `load_data_job`, then `etl_job`, and finally `end`. We can also define the dependencies across multiple lines. For example, to do branching, we could define something as follows:
start >> load_data_job
load_data_job >> etl_job_1
load_data_job >> etl_job_2
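Airflow also accepts a list on the right-hand side of `>>`, so the two branching lines above can be collapsed into a single, equivalent statement:

load_data_job >> [etl_job_1, etl_job_2]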
Created on 08-23-2022 04:26 AM
Hello,
I am trying to interact with HDFS from my Spark application running on Cloudera Data Engineering, to list directories and check the size of a directory.
Do you have any idea how to do that?
Thank you
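One possible approach (a sketch, not an official answer): PySpark exposes the Hadoop FileSystem API through its JVM gateway, which can list paths and report directory sizes; the `/tmp` path below is a placeholder.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("List HDFS").getOrCreate()

# Get the Hadoop FileSystem bound to the job's Hadoop configuration
jvm = spark.sparkContext._jvm
conf = spark.sparkContext._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.FileSystem.get(conf)

# List the immediate children of a directory (placeholder path)
for status in fs.listStatus(jvm.org.apache.hadoop.fs.Path("/tmp")):
    print(status.getPath().toString(), status.getLen())

# Total size of a directory in bytes
print(fs.getContentSummary(jvm.org.apache.hadoop.fs.Path("/tmp")).getLength())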
Created on 08-23-2022 12:37 PM
Thanks for the blog!