Spark create table from multiple jobs vs single job method
Labels:
- Apache Hive
- Apache YARN
- Cloudera Search
Created on 07-06-2019 09:57 AM - edited 09-16-2022 07:29 AM
Hi,
I have a table with a lot of data. I want to create a new table based on some column values from this table. Which method is the most efficient and the most friendly to cluster resources?
Pseudo-code:

1. Single job:

```sql
INSERT INTO myNewTable
SELECT * FROM myOldTable
WHERE a = xxx -- etc.
```

2. Two jobs:

```scala
// job 1: create a DataFrame from the select statement
val df = spark.sql("SELECT * FROM myOldTable WHERE a = xxx") // etc.
// job 2: write the DataFrame out as the new table
df.write.insertInto("myNewTable")
```
Created 07-15-2019 03:12 AM
I do not think there is any difference. Spark executes statements lazily, so your second, two-job version will behave the same way as the first single-job version, in my opinion.
Cheers
Eric
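A quick way to see this, as a minimal sketch: it assumes a SparkSession named `spark` and reuses the question's placeholder table names and filter. The DataFrame step in option 2 only builds a logical plan; no cluster work happens until the write action.

```scala
// Option 2 as a runnable sketch: spark.sql only builds a logical plan here,
// so no cluster resources are consumed at this point.
val df = spark.sql("SELECT * FROM myOldTable WHERE a = 'xxx'")

// Inspect the planned query without executing it.
df.explain(true)

// Only this action triggers jobs; Spark plans the filter and the insert
// together, just as it does for the single INSERT ... SELECT statement.
df.write.insertInto("myNewTable")
```

Comparing the output of `df.explain(true)` with the plan of the single `INSERT ... SELECT` statement should show essentially the same work being scheduled in both cases.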