Created on 06-17-2016 07:12 AM - edited 09-16-2022 03:25 AM
I have a query that is a join between 2 tables. One table has 1,000 rows (the broadcast, right-side table) and the other has 15 million rows. The result of the join is 28 million rows. I assume that shouldn't take 33 minutes with Impala. The part of the profile that shows the most time taken...
Averaged Fragment F00:(Total: 17m36s, non-child: 0.000ns, % non-child: 0.00%)
split sizes: min: 20.40 MB, max: 250.81 MB, avg: 135.60 MB, stddev: 115.20 MB
completion times: min:2m2s max:33m11s mean: 17m37s stddev:15m34s
execution rates: min:128.98 KB/sec max:170.00 KB/sec mean:149.49 KB/sec stddev:20.51 KB/sec
num instances: 2
- AverageThreadTokens: 1.93
- BloomFilterBytes: 2.00 MB (2097152)
- PeakMemoryUsage: 39.61 MB (41537580)
- PerHostPeakMemUsage: 41.61 MB (43634732)
- PrepareTime: 142.003ms
- RowsProduced: 14.41M (14412000)
- TotalCpuTime: 34m58s
- TotalNetworkReceiveTime: 0.000ns
- TotalNetworkSendTime: 0.000ns
- TotalStorageWaitTime: 18.000ms
BlockMgr:
- BlockWritesOutstanding: 0 (0)
- BlocksCreated: 32 (32)
- BlocksRecycled: 1 (1)
- BufferedPins: 0 (0)
- BytesWritten: 0
- MaxBlockSize: 8.00 MB (8388608)
- MemoryLimit: 6.85 GB (7360164864)
- PeakMemoryUsage: 32.00 KB (32768)
- TotalBufferWaitTime: 0.000ns
- TotalEncryptionTime: 0.000ns
- TotalIntegrityCheckTime: 0.000ns
- TotalReadBlockTime: 0.000ns
CodeGen:(Total: 736.017ms, non-child: 736.017ms, % non-child: 100.00%)
- CodegenTime: 0.000ns
- CompileTime: 184.004ms
- LoadTime: 0.000ns
- ModuleBitcodeSize: 1.85 MB (1940340)
- OptimizationTime: 414.010ms
- PrepareTime: 136.003ms
HdfsTableSink:(Total: 5s532ms, non-child: 5s532ms, % non-child: 100.00%)
- BytesWritten: 1.74 MB (1824896)
- CompressTimer: 14.000ms
- EncodeTimer: 5s506ms
- FilesCreated: 0 (0)
- FinalizePartitionFileTimer: 32.000ms
- HdfsWriteTimer: 6.000ms
- PartitionsCreated: 0 (0)
- PeakMemoryUsage: 2.52 MB (2637684)
- RowsInserted: 14.41M (14412000)
- TmpFileCreateTimer: 12.000ms
HASH_JOIN_NODE (id=2):(Total: 17m30s, non-child: 17m30s, % non-child: 99.98%)
- BuildPartitionTime: 0.000ns
- BuildRows: 1.00K (1000)
- BuildRowsPartitioned: 1.00K (1000)
- BuildTime: 2.000ms
- GetNewBlockTime: 0.000ns
- HashBuckets: 2.05K (2048)
- HashCollisions: 0 (0)
- LargestPartitionPercent: 100 (100)
- MaxPartitionLevel: 0 (0)
- NumRepartitions: 0 (0)
- PartitionsCreated: 16 (16)
- PeakMemoryUsage: 2.05 MB (2149536)
- PinTime: 0.000ns
- ProbeRows: 2.59M (2590143)
- ProbeRowsPartitioned: 0 (0)
- ProbeTime: 17m30s
- RowsReturned: 14.41M (14412000)
- RowsReturnedRate: 7.28 K/sec
- SpilledPartitions: 0 (0)
- UnpinTime: 0.000ns
EXCHANGE_NODE (id=3):
- BytesReceived: 10.60 KB (10850)
- ConvertRowBatchTime: 0.000ns
- DeserializeRowBatchTimer: 0.000ns
- FirstBatchArrivalWaitTime: 0.000ns
- PeakMemoryUsage: 0
- RowsReturned: 1.00K (1000)
- RowsReturnedRate: 0
- SendersBlockedTimer: 0.000ns
- SendersBlockedTotalTimer(*): 0.000ns
HDFS_SCAN_NODE (id=0):(Total: 238.005ms, non-child: 238.005ms, % non-child: 100.00%)
- AverageHdfsReadThreadConcurrency: 0.00
- AverageScannerThreadConcurrency: 0.93
- BytesRead: 23.78 MB (24934683)
- BytesReadDataNodeCache: 0
- BytesReadLocal: 23.78 MB (24934683)
- BytesReadRemoteUnexpected: 0
- BytesReadShortCircuit: 23.78 MB (24934683)
- DecompressionTime: 40.000ms
- MaxCompressedTextFileLength: 0
- NumColumns: 9 (9)
- NumDisksAccessed: 1 (1)
- NumRowGroups: 1 (1)
- NumScannerThreadsStarted: 1 (1)
- PeakMemoryUsage: 31.07 MB (32577052)
- PerReadThreadRawHdfsThroughput: 1.08 GB/sec
- RemoteScanRanges: 0 (0)
- RowsRead: 7.96M (7959877)
- RowsReturned: 2.59M (2590143)
- RowsReturnedRate: 8.73 M/sec
- ScanRangesComplete: 1 (1)
- ScannerThreadsInvoluntaryContextSwitches: 6.36K (6360)
- ScannerThreadsTotalWallClockTime: 17m21s
- MaterializeTupleTime(*): 17m20s
- ScannerThreadsSysTime: 74.049ms
- ScannerThreadsUserTime: 12s312ms
- ScannerThreadsVoluntaryContextSwitches: 2.55K (2552)
- TotalRawHdfsReadTime(*): 20.000ms
- TotalReadThroughput: 27.75 KB/sec
Filter 0:
- Rows processed: 16.38K (16383)
- Rows rejected: 0 (0)
- Rows total: 16.38K (16384)
I see that the execution rates are in the range of 140 KB/sec. I have seen other profile dumps posted in questions here, and some were in the range of 5 MB/sec or even 25 MB/sec... I'm not sure whether this rate is the root cause, but I wanted to know what it means and how I can improve it, if at all...
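I haven't pasted the actual SQL here, but the overall shape is a single join of the big table to the small table, along these lines (table and column names are placeholders, and the real query has more columns and expressions):
-- Rough placeholder sketch of the join shape only, not the real query:
SELECT b.*, s.some_attr
FROM big_tbl b -- ~15 million rows (probe side)
JOIN small_tbl s -- ~1,000 rows; Impala broadcasts this side (build side)
ON b.join_key = s.join_key;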
Created 06-17-2016 09:19 AM
There's something strange going on here: the profile reports that the scan took around 12 seconds of CPU time but 17 minutes of wall-clock time. So, for whatever reason, the scan is spending most of its time swapped out and unable to execute.
- MaterializeTupleTime(*): 17m20s
- ScannerThreadsSysTime: 74.049ms
- ScannerThreadsUserTime: 12s312ms
Is the system under heavy load or is it swapping to disk?
Created 06-17-2016 10:08 AM
There are no other jobs running on this cluster. This is a dev cluster with 5 nodes; the join, however, runs on only one node, and for as long as it runs the load average on that node is consistently > 1. Host CPU usage, however, stays at a consistent 50%. The swap rate, disk latency, etc. did not spike during the run. I can run it again and check. Is there anything in particular you want me to capture?
There are 2 cores on that node (the join only runs on one, right?).
Probably not relevant: this is a "create table as select..." query. I ran it without the "create table" part and it still took the same amount of time, so writing to disk is *probably* not the issue.
Created 06-17-2016 11:23 AM
Maybe run 'perf top' to see where it's spending the time?
I'd expect the scan to run on one core and the join and insert to run on a different core.
Created 06-17-2016 03:39 PM
This is pretty much how it looks for the entire time... give or take a few percentage points for the top 2 rows.
36.50% perf-18476.map [.] 0x00007f3c1d634b82
14.61% impalad [.] impala::ExprContext::GetValue(impala::Expr*, impala::TupleRow*)
6.77% impalad [.] impala::HiveUdfCall::Evaluate(impala::ExprContext*, impala::TupleRow*)
5.26% libjvm.so [.] jni_invoke_nonstatic(JNIEnv_*, JavaValue*, _jobject*, JNICallType, _jmethodID*, JNI_ArgumentPusher*, Thread*)
3.02% libjvm.so [.] JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*)
2.57% impalad [.] impala::ScalarFnCall::GetDoubleVal(impala::ExprContext*, impala::TupleRow*)
Created 06-17-2016 03:48 PM
After the query completes, this is how it looks -
8.69% python2.7 [.] 0x00000000000eec22
5.70% python2.7 [.] PyEval_EvalFrameEx
5.24% perf-18476.map [.] 0x00007f3c1d634b82
4.39% [unknown] [.] 0x0000000000487bd0
4.11% perf-12892.map [.] 0x00007f8e45020431
3.74% perf [.] 0x00000000000caa5b
3.64% libc-2.19.so [.] 0x000000000015fc60
1.96% impalad [.] impala::ExprContext::GetValue(impala::Expr*, impala::TupleRow*)
Created 06-17-2016 04:04 PM
OK, that's interesting. The nonsense-looking symbol at the top is probably JIT-compiled code generated for your query, likely an expression or something similar.
36.50% perf-18476.map [.] 0x00007f3c1d634b82
The other symbols, like GetDoubleVal(), may be what is calling this expensive function. It looks like ProbeTime in the profile is possibly the culprit.
Can you share the SQL for your query at all? I'm guessing that there's some expression in your query that's expensive to evaluate, e.g. joining on some complex expression or doing some kind of expensive computation.
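For example, something along these lines (made-up table, column, and function names) would force an expensive function to be evaluated once for every probe row inside the join, which is exactly the kind of thing that shows up as a huge ProbeTime:
-- Hypothetical example of an expensive expression in the probe path of the join:
SELECT b.*, s.some_attr
FROM big_tbl b
JOIN small_tbl s
ON my_hive_udf(b.some_col) = s.join_key; -- evaluated for each of the ~15M probe rows
Joining on a plain column, or materializing the expression's result ahead of time, avoids paying that cost per probe row.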
Created 06-17-2016 08:30 PM
Thank you for explaining it. It's a function call. I'm changing it and will see the impact. I'll come back with results...
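As a rough sketch of the kind of change I mean (placeholder names again, and the exact rewrite depends on what the function computes), the idea is to stop calling the function once per row in the big query and precompute its result instead:
-- Hypothetical sketch: evaluate the expensive function once up front,
-- then join/select against the stored result instead of calling it per row.
CREATE TABLE small_tbl_precomputed AS
SELECT s.*, my_hive_udf(s.some_col) AS precomputed_val
FROM small_tbl s; -- ~1,000 evaluations done once, instead of millions per query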