Spark: create table from multiple jobs vs. single job method

Contributor

Hi,

I have a table with a lot of data, and I want to create a new table based on some column values from this table.

Which method is more efficient and friendlier to cluster resources?

 

Pseudo-code:

1. Single job:

        INSERT INTO myNewTable
        SELECT * FROM myOldTable
        WHERE a = xxx etc.

2. Two jobs:

    Job 1: create a DataFrame from the select statement:

        SELECT * FROM myOldTable
        WHERE a = xxx etc.

    Job 2: write the DataFrame out as the new table:

        INSERT INTO myNewTable SELECT * FROM dataframe
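For illustration, here is roughly how the two variants look in Spark's Scala API (a sketch only; myOldTable, myNewTable, and the 'xxx' filter value are the placeholders from the pseudo-code above, and the session is assumed to have Hive support for saved tables):

    import org.apache.spark.sql.SparkSession

    // Assumed session setup; enableHiveSupport is only needed for Hive-backed tables.
    val spark = SparkSession.builder()
      .appName("create-table-example")
      .enableHiveSupport()
      .getOrCreate()

    // 1. Single job: one SQL statement does both the select and the insert.
    spark.sql("INSERT INTO myNewTable SELECT * FROM myOldTable WHERE a = 'xxx'")

    // 2. Two steps: build a DataFrame first, then write it into the table.
    val df = spark.sql("SELECT * FROM myOldTable WHERE a = 'xxx'")
    df.write.mode("append").insertInto("myNewTable")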

1 ACCEPTED SOLUTION

Super Guru
Hi,

I do not think there is any difference. Spark executes statements lazily: in your second, two-job version the DataFrame is only a query plan until the write is triggered, so it will behave the same way as the first, single-job version, in my opinion.
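One way to see this for yourself (a sketch, reusing the placeholder tables and the spark session from the question): explain() prints the same filter-and-scan plan for both variants, and no job actually runs until the write.

    // Nothing executes here; the DataFrame is just a lazily built query plan.
    val df = spark.sql("SELECT * FROM myOldTable WHERE a = 'xxx'")

    // Both variants print the same physical plan.
    df.explain()
    spark.sql("SELECT * FROM myOldTable WHERE a = 'xxx'").explain()

    // Only this write actually triggers a Spark job.
    df.write.mode("append").insertInto("myNewTable")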

Cheers
Eric
