Support Questions

Find answers, ask questions, and share your expertise
Announcements
Check out our newest addition to the community, the Cloudera Data Analytics (CDA) group hub.

How you can increase the speed of data processing

Explorer

How you can increase the speed of data processing?

I have script

select 
pgw_in.numrec,pgw_in.datavolumefbcdownlink,pgw_in.datavolumefbcuplink,pgw_in.duration,pgw_in.pgwaddress,pgw_in.servedimeisv,pgw_in.servedimsi,pgw_in.servedmsisdn,pgw_in.timeoffirstusage,pgw_in.timeoflastusage,pgw_in.userlocationinformation,pgw_oracle.idfrom 
pgw.pgw_in,pgw.pgw_oracle where 
(cast(unix_timestamp(substr(pgw_in.timeoffirstusage,2,9),"yyyy-M-dd") as 
 timestamp)>=pgw_oracle.firstdatetime 
 and cast(unix_timestamp(substr(pgw_in.timeoffirstusage,2,9),"yyyy-M-dd") as 
 timestamp)<=pgw_oracle.seconddatetime 
 and ( ( pgw_in.servedimsi=pgw_oracle.servedimsi and 
 pgw_in.servedimeisv=pgw_oracle.servedimeisv and 
 pgw_in.servedmsisdn=pgw_oracle.servedmsisdn)
 or ( pgw_oracle.servedimeisv="null" and 
 pgw_in.servedimsi=pgw_oracle.servedimsi and 
 pgw_oracle.servedmsisdn=pgw_in.servedmsisdn)
 or ( pgw_oracle.servedimeisv=pgw_in.servedimeisv and 
 pgw_oracle.servedimsi="null" and 
 pgw_oracle.servedmsisdn=pgw_in.servedmsisdn)
 or ( pgw_oracle.servedimsi="null" and pgw_oracle.servedimeisv="null" and 
 pgw_oracle.servedmsisdn=pgw_in.servedmsisdn))) 

Tables - pgw_in,pgw_oracle.
The script runs slowly, about 15 minutes to process data.

4 REPLIES 4

Cloudera Employee
Please upload a query profile.

Explorer
Explain String

  Explain String

1Estimated Per-Host Requirements: Memory=1.88GB VCores=2
2WARNING: The following tables are missing relevant table and/or column statistics.
3pgw.pgw_in, pgw.pgw_oracle
4 
5PLAN-ROOT SINK
6|
704:EXCHANGE [UNPARTITIONED]
8|
902:NESTED LOOP JOIN [INNER JOIN, BROADCAST]
10|  predicates: ((CAST(unix_timestamp(substr(pgw_in.timeoffirstusage, 2, 9), 'yyyy-M-dd') AS TIMESTAMP) >= pgw_oracle.firstdatetime AND CAST(unix_timestamp(substr(pgw_in.timeoffirstusage, 2, 9), 'yyyy-M-dd') AS TIMESTAMP) <= pgw_oracle.seconddatetime) OR (CAST(unix_timestamp(substr(pgw_in.timeoffirstusage, 2, 10), 'yyyy-M-dd') AS TIMESTAMP) >= pgw_oracle.firstdatetime AND CAST(unix_timestamp(substr(pgw_in.timeoffirstusage, 2, 10), 'yyyy-M-dd') AS TIMESTAMP) <= pgw_oracle.seconddatetime)), (pgw_in.servedimsi = pgw_oracle.servedimsi AND ((pgw_in.servedimeisv = pgw_oracle.servedimeisv AND pgw_in.servedmsisdn = pgw_oracle.servedmsisdn) OR (pgw_oracle.servedimeisv = 'null' AND pgw_oracle.servedmsisdn = pgw_in.servedmsisdn)) OR (pgw_oracle.servedimeisv = pgw_in.servedimeisv AND pgw_oracle.servedimsi = 'null' AND pgw_oracle.servedmsisdn = pgw_in.servedmsisdn) OR (pgw_oracle.servedimsi = 'null' AND pgw_oracle.servedimeisv = 'null' AND pgw_oracle.servedmsisdn = pgw_in.servedmsisdn))
11|
12|--03:EXCHANGE [BROADCAST]
13|  |
14|  01:SCAN HDFS [pgw.pgw_oracle]
15|     partitions=1/1 files=0 size=0B
16|
1700:SCAN HDFS [pgw.pgw_in]
18   partitions=1/1 files=677 size=648.06GB

Explorer
Partitioning may help if entire table scan is not required, also some processing can be saved instead of (CAST(unix_timestamp(substr(pgw_in.timeoffirstusage, 2, 9), 'yyyy-M-dd') AS TIMESTAMP
convert pgw_oracle.firstdatetime into unix_timestamp.

Explorer
Apply:

INVALIDATE METADATA pgw.pgw_in;
ALTER TABLE pgw.pgw_in RECOVER PARTITIONS;
COMPUTE STATS pgw.pgw_in;

(WARNING: The following tables are missing relevant table and/or column statistics.)
Take a Tour of the Community
Don't have an account?
Your experience may be limited. Sign in to explore more.