Support Questions

Find answers, ask questions, and share your expertise

How you can increase the speed of data processing

Explorer

How you can increase the speed of data processing?

I have script

select 
pgw_in.numrec,pgw_in.datavolumefbcdownlink,pgw_in.datavolumefbcuplink,pgw_in.duration,pgw_in.pgwaddress,pgw_in.servedimeisv,pgw_in.servedimsi,pgw_in.servedmsisdn,pgw_in.timeoffirstusage,pgw_in.timeoflastusage,pgw_in.userlocationinformation,pgw_oracle.idfrom 
pgw.pgw_in,pgw.pgw_oracle where 
(cast(unix_timestamp(substr(pgw_in.timeoffirstusage,2,9),"yyyy-M-dd") as 
 timestamp)>=pgw_oracle.firstdatetime 
 and cast(unix_timestamp(substr(pgw_in.timeoffirstusage,2,9),"yyyy-M-dd") as 
 timestamp)<=pgw_oracle.seconddatetime 
 and ( ( pgw_in.servedimsi=pgw_oracle.servedimsi and 
 pgw_in.servedimeisv=pgw_oracle.servedimeisv and 
 pgw_in.servedmsisdn=pgw_oracle.servedmsisdn)
 or ( pgw_oracle.servedimeisv="null" and 
 pgw_in.servedimsi=pgw_oracle.servedimsi and 
 pgw_oracle.servedmsisdn=pgw_in.servedmsisdn)
 or ( pgw_oracle.servedimeisv=pgw_in.servedimeisv and 
 pgw_oracle.servedimsi="null" and 
 pgw_oracle.servedmsisdn=pgw_in.servedmsisdn)
 or ( pgw_oracle.servedimsi="null" and pgw_oracle.servedimeisv="null" and 
 pgw_oracle.servedmsisdn=pgw_in.servedmsisdn))) 

Tables - pgw_in,pgw_oracle.
The script runs slowly, about 15 minutes to process data.

4 REPLIES 4

Cloudera Employee
Please upload a query profile.

Explorer
Explain String

  Explain String

1Estimated Per-Host Requirements: Memory=1.88GB VCores=2
2WARNING: The following tables are missing relevant table and/or column statistics.
3pgw.pgw_in, pgw.pgw_oracle
4 
5PLAN-ROOT SINK
6|
704:EXCHANGE [UNPARTITIONED]
8|
902:NESTED LOOP JOIN [INNER JOIN, BROADCAST]
10|  predicates: ((CAST(unix_timestamp(substr(pgw_in.timeoffirstusage, 2, 9), 'yyyy-M-dd') AS TIMESTAMP) >= pgw_oracle.firstdatetime AND CAST(unix_timestamp(substr(pgw_in.timeoffirstusage, 2, 9), 'yyyy-M-dd') AS TIMESTAMP) <= pgw_oracle.seconddatetime) OR (CAST(unix_timestamp(substr(pgw_in.timeoffirstusage, 2, 10), 'yyyy-M-dd') AS TIMESTAMP) >= pgw_oracle.firstdatetime AND CAST(unix_timestamp(substr(pgw_in.timeoffirstusage, 2, 10), 'yyyy-M-dd') AS TIMESTAMP) <= pgw_oracle.seconddatetime)), (pgw_in.servedimsi = pgw_oracle.servedimsi AND ((pgw_in.servedimeisv = pgw_oracle.servedimeisv AND pgw_in.servedmsisdn = pgw_oracle.servedmsisdn) OR (pgw_oracle.servedimeisv = 'null' AND pgw_oracle.servedmsisdn = pgw_in.servedmsisdn)) OR (pgw_oracle.servedimeisv = pgw_in.servedimeisv AND pgw_oracle.servedimsi = 'null' AND pgw_oracle.servedmsisdn = pgw_in.servedmsisdn) OR (pgw_oracle.servedimsi = 'null' AND pgw_oracle.servedimeisv = 'null' AND pgw_oracle.servedmsisdn = pgw_in.servedmsisdn))
11|
12|--03:EXCHANGE [BROADCAST]
13|  |
14|  01:SCAN HDFS [pgw.pgw_oracle]
15|     partitions=1/1 files=0 size=0B
16|
1700:SCAN HDFS [pgw.pgw_in]
18   partitions=1/1 files=677 size=648.06GB

Explorer
Partitioning may help if entire table scan is not required, also some processing can be saved instead of (CAST(unix_timestamp(substr(pgw_in.timeoffirstusage, 2, 9), 'yyyy-M-dd') AS TIMESTAMP
convert pgw_oracle.firstdatetime into unix_timestamp.

Explorer
Apply:

INVALIDATE METADATA pgw.pgw_in;
ALTER TABLE pgw.pgw_in RECOVER PARTITIONS;
COMPUTE STATS pgw.pgw_in;

(WARNING: The following tables are missing relevant table and/or column statistics.)