Spark create table from multiple jobs vs single job method
Labels:
- Apache Hive
- Apache YARN
- Cloudera Search
Created on 07-06-2019 09:57 AM - edited 09-16-2022 07:29 AM
Hi,
I have a table with a lot of data. I want to create a new table based on some column values from this table. Which method is the most efficient and the most friendly to cluster resources?
Pseudo-code:

1. Single job:

```sql
INSERT INTO myNewTable
SELECT * FROM myOldTable
WHERE a = xxx -- etc.
```

2. Two jobs:

```scala
// job 1: create a DataFrame from the select statement
val df = spark.sql("SELECT * FROM myOldTable WHERE a = xxx") // etc.
// job 2: write the DataFrame out as the new table
df.write.insertInto("myNewTable")
```
Created 07-15-2019 03:12 AM
I do not think there is any difference. Spark executes statements lazily, so your second, two-job version will behave the same way as the first single-job version, in my opinion.
Cheers
Eric
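A quick way to see this, as a minimal sketch: it assumes a SparkSession named `spark` and reuses the question's placeholder table names and filter. The DataFrame step in option 2 only builds a logical plan; no cluster work happens until the write action.

```scala
// Option 2 as a runnable sketch: spark.sql only builds a logical plan here,
// so no cluster resources are consumed at this point.
val df = spark.sql("SELECT * FROM myOldTable WHERE a = 'xxx'")

// Inspect the planned query without executing it.
df.explain(true)

// Only this action triggers jobs; Spark plans the filter and the insert
// together, just as it does for the single INSERT ... SELECT statement.
df.write.insertInto("myNewTable")
```

Comparing the output of `df.explain(true)` with the plan of the single `INSERT ... SELECT` statement should show essentially the same work being scheduled in both cases.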