Support Questions


16 billion Rows merged with 30 million rows on Daily Basis

New Contributor

Hi Team,

I have a job that merges 30 million rows daily into an existing table of 16 billion rows (covering years 2018–2022). Both tables are partitioned by year, month, and date. I am using Spark 2.3.x on HDP version 2.5, I believe. I am facing an ingestion bottleneck: the job effectively rewrites the 16 billion rows into the Hive table every day and takes very long to do so. We have allocated close to 20 TB to the YARN queue and use almost 400 executors, but there is still no improvement. The SLA is 5 AM daily, and all jobs must complete within a 23-hour window in the job plan (UC4).
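For context on where the time goes: with year/month/date partitioning, a 30-million-row daily increment normally touches only a handful of partitions, so one common fix is to overwrite only those partitions instead of the full history (in Spark 2.3+ this is what `spark.sql.sources.partitionOverwriteMode=dynamic` enables). A minimal sketch, using plain Python and hypothetical row dictionaries rather than the real table, of how the affected partition set can be derived from the daily increment:

```python
# Sketch: derive the set of (year, month, date) partitions touched by a
# daily increment, so only those partitions need to be overwritten.
# The row layout and values below are hypothetical illustrations.

def touched_partitions(rows):
    """Return the distinct (year, month, date) partition keys in `rows`."""
    return sorted({(r["year"], r["month"], r["date"]) for r in rows})

# A daily batch usually spans one day, plus a few late-arriving records:
daily_rows = [
    {"year": 2022, "month": 6, "date": 1, "id": 1},
    {"year": 2022, "month": 6, "date": 1, "id": 2},
    {"year": 2022, "month": 5, "date": 31, "id": 3},  # late arrival
]

print(touched_partitions(daily_rows))
# With dynamic partition overwrite, only these partitions are rewritten,
# not the full 2018-2022 history.
```

The same idea applies on the Spark side: filter the 16-billion-row table down to just these partition keys before merging, and write the result back with dynamic partition overwrite.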


Could you please help me resolve this bottleneck?



