Support Questions

Find answers, ask questions, and share your expertise

How to measure network traffic for a query?

Explorer

I would like to investigate the total amount of bytes which are sent over the network for a query (especially the exchanges). I came across the "impala_query_bytes_streamed_rate" metric but I don't get any data. 

 

My query summary looks like following:

 

Operator             #Hosts   Avg Time   Max Time    #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail                              
--------------------------------------------------------------------------------------------------------------------------------------
08:EXCHANGE               1    0.000ns    0.000ns    1.02K      38.14K   12.57 MB              0  UNPARTITIONED                       
04:HASH JOIN              6  494.689ms  584.026ms   91.14K      38.14K  112.07 MB       17.00 MB  INNER JOIN, PARTITIONED             
|--07:EXCHANGE            6   38.668ms   68.002ms    5.51M     934.26K  200.00 KB              0  HASH(pp.event_id,pp.track_track_id) 
|  03:HASH JOIN           6    6s391ms    7s244ms    5.51M     934.26K  352.42 MB      138.01 MB  INNER JOIN, BROADCAST               
|  |--05:EXCHANGE         6  117.338ms  136.006ms    7.36M      13.16M   12.68 MB              0  BROADCAST                           
|  |  00:SCAN HDFS        6  209.343ms  400.022ms    7.36M      13.16M   86.28 MB      352.00 MB  default.protoparticles_32 pp        
|  02:SCAN HDFS           6    1s887ms    3s516ms  198.27M      31.47M    3.54 GB      880.00 MB  default.track_state_32 ts           
06:EXCHANGE               6    6.000ms   16.000ms    1.48M      13.08M   12.67 MB              0  HASH(tr.event_id,tr.track_id)       
01:SCAN HDFS              6    1s174ms    1s344ms    2.26M      13.08M  329.22 MB      528.00 MB  default.track_32 tr

I guess the exchange times do only include the hashing?!

 

In the profile I can see all DataStreamSenders like the one for 05 EXCHANGE:

 

KrpcDataStreamSender (dst_id=5):(Total: 1s280ms, non-child: 188.008ms, % non-child: 14.69%)
          BytesSent(500.000ms): 3.70 MB, 38.77 MB, 63.08 MB
           - EosSent: 6 (6)
           - NetworkThroughput: 16.01 MB/sec
           - PeakMemoryUsage: 69.80 KB (71472)
           - RowsSent: 1.34M (1342523)
           - RpcFailure: 0 (0)
           - RpcRetry: 0 (0)
           - SerializeBatchTime: 80.003ms
           - TotalBytesSent: 73.06 MB (76604592)
           - UncompressedRowBatchSize: 145.96 MB (153047622)

 

So this is one out of six DataStreamSenders for the 05 Exchange. Is TotalBytesSent the amount of bytes this node sents to node5? What are BytesSent and the comma separated list of MBs? What is the time in brackets?
If I add all six TotalBytesSent values up, I get ~400MB. Do I have to calculate the time myself (based on the given throughput)?

 

Additionally I have six EXCHANGE_NODEs: 

EXCHANGE_NODE (id=5):(Total: 1s488ms, non-child: 120.005ms, % non-child: 8.06%)
             - ConvertRowBatchTime: 104.004ms
             - PeakMemoryUsage: 12.68 MB (13292376)
             - RowsReturned: 7.36M (7358720)
             - RowsReturnedRate: 4.95 M/sec

All six exchange nodes look roughly the same. What is the Total time of 1s488ms in this context?

 

 

I would be very thankful if anyone could clearify this to me. Thank you very much!

 

 

Version: Cloudera Express 5.15.0  

9 REPLIES 9

Explorer

Can anyone give me a hint why impala_query_bytes_streamed_rate is empty or provide me any other examples of how to get these data?

Cloudera Employee

@ImpalaStorm wrote:

 


Q: I guess the exchange times do only include the hashing?!

 

A: The hashing (if any) is actually done in the DataStreamSender. I cannot tell from the profile if the data shuffling strategy is broadcast or hash-partitioned.

  

Q: Is TotalBytesSent the amount of bytes this node sents to node5?

 

A: Yes, this is the total byte sent to all instances of node5.

 

Q: What are BytesSent and the comma separated list of MBs? What is the time in brackets?

 

Q: This is a time series counter of the value TotalBytesSent. The value in the bracket is the period in which the samples were taken. By defult, samples are taken every 500ms. However, as there is a bound for the maximum number of samples kept in the time series, we start merging samples once the maximum length of the time series is reached and that's when you'll see a value different from the default sampling period.

 

Q: If I add all six TotalBytesSent values up, I get ~400MB. Do I have to calculate the time myself (based on the given throughput)?

 

A: Yes, dividing the total bytes sent / total throughput should give an approximated network time for a particular DataStreamSender. Please note that a DataStreamSender may send to multiple receivers in parallel so the network time may not necessarily map to wall clock time.

 

Q: All six exchange nodes look roughly the same. What is the Total time of 1s488ms in this context?

 

A: This is the total time spent in the exchange node, including the wait time in the receiver waiting for data to arrive. The non-child time is essentially measuring the active time spent executing any code in exchange node. The rest will mostly be due to wait time. "DataWaitTime" in profile records the amount of time the DataStreamReceiver spent waiting for data to arrive.

Explorer

Thank you very much for the explanation!

I get following values for an instance:

 

Instance 7b4bacec742ce503:affa1dd300000008 (host=slave-cn02:22000):(Total: 1s676ms, non-child: 8.000ms, % non-child: 0.48%)
ThreadUsage(500.000ms): 3, 2, 2
...
- RowsProduced: 1.34M (1342523)
- TotalNetworkReceiveTime: 0.000ns
- TotalNetworkSendTime: 1s156ms
...

KrpcDataStreamSender (dst_id=5):(Total: 1s328ms, non-child: 172.009ms, % non-child: 12.95%)
BytesSent(500.000ms): 14.88 MB, 53.23 MB, 65.52 MB
- EosSent: 6 (6)
- NetworkThroughput: 15.87 MB/sec
- PeakMemoryUsage: 69.80 KB (71472)
- RowsSent: 1.34M (1342523)
- RpcFailure: 0 (0)
- RpcRetry: 0 (0)
- SerializeBatchTime: 84.004ms
- TotalBytesSent: 73.06 MB (76604592)
- UncompressedRowBatchSize: 145.96 MB (153047622)

HDFS_SCAN_NODE (id=0):(Total: 280.015ms, non-child: 280.015ms, % non-child: 100.00%)
...
- BytesRead: 27.57 MB (28912212)
- BytesReadDataNodeCache: 0
- BytesReadLocal: 25.83 MB (27084442)
- BytesReadRemoteUnexpected: 0
- BytesReadShortCircuit: 25.83 MB (27084442)
...
- RowsReturned: 1.34M (1342523)
...

So in total 27.57 MB are read from HDFS. 25.83 MB are read via ShortCircuit which should result in no network traffic. 1,74 MB must be remote reads then. When do remote reads occur? Is it only when there're no co-located data or also when one node (thread) has finished reading its local data and could help another node reading the data?

 

Up to this point I only have 1,74MB network traffic, right? Now the whole 27.57 MB must be sent over network to node 5. Where do the TotalBytesSent of 73.06 MB in the KrpcDataStreamSender come from? RowsSent in the Sender and RowsReturned in the scan node are identical. Shouldn't it only be 27.57 MB?

 

If I make the calculation 73.06 MB / 15.87 MB/sec I get 4.6 sec for the data transfer. Is the NetworkThroughput correct? Actually I have a 10GB/s network. The highest value I got for a Datastreamer are 351 MB/sec.

Explorer

Sorry for the circumstances but I hope you can help me with the composition of "TotalBytesSent" in the DataStreamer and the creation of remote bytes. I would expect the value BytesRead(27.57 MB) * 6 (nodes) = .165,42 MB instead of 73.06 MB. Or maybe multiplied with 5 instead of 6 because the Streamer doesn't need to sent the data to itself. 

 

Am I correct that all the data sent over network consist of: remote read bytes + bytes sent by Datastreamers?

Remote bytes = BytesRead - BytesReadLocal
Time = Remote bytes / NetworkThroughput

Bytes sent by DataStreamers = TotalBytesSent
Time = TotalBytesSent / TotalReadThroughput

 

Thanks for your time and help!

Cloudera Employee

The remote read statistics should be recorded in BytesReadRemoteUnexpected.

 

Please note that the 27.57MB read from HDFS is the raw HDFS block but the actual table data may be compressed. The scan node will unpack the block, parse it based on the file format and convert them into row batches. So, the 27.57MB read can be unpacked into something larger. The 73.06MB recorded in the DataStreamSender is the total number of bytes sent across the network. If we are broadcasting the data to all destination exchange nodes, the total bytes sent will be (row batch size * num destination nodes). If we are using hash partitioning or random, the number of bytes sent should equal to row batch size. I cannot tell from the quote above whether we are using hash partitioning or broadcasting. It may help if you attach the entire query profile. Please also note that DataStreamSender will compress the row batches before sending them and TotalBytesSent correspond to the actual number of bytes which hit the network after compression. The total size before compression is recorded in UncompressedRowBatchSize.

 

The TotalNetworkThroughput seems a bit wonky and I filed IMPALA-7449 to fix it.

Explorer

Thank you very much!

 

"Please note that the 27.57MB read from HDFS is the raw HDFS block"

Okay this makes sense as there are also HDFS scans with RowsReturned: 0 but BytesRead: 500.00 KB. This seems to be the block overhead then. This would also fit to the following calculation:
HDFS:

 

- BytesRead: 27.57 MB (28912212)
- BytesReadDataNodeCache: 0
- BytesReadLocal: 25.83 MB (27084442)
- BytesReadRemoteUnexpected: 0
- BytesReadShortCircuit: 25.83 MB (27084442)

Sender:

- TotalBytesSent: 73.06 MB (76604592)
- UncompressedRowBatchSize: 145.96 MB (153047622)

 

"If we are broadcasting the data to all destination exchange nodes, the total bytes sent will be (row batch size * num destination nodes)"

Yes, it's a broadcast at this position and it says instances = 6.

UncompressedRowBatchSize / 6 = Single UncompressedRowBatchSize 

145.96 MB / 6 = 24.33 MB

 

But 24.33 MB is even smaller then the compressed BytesRead of 27.57 MB (also if I subtract 0.5 MB for the block overhead if this should be correct).  Am I missing something here?

 

"The remote read statistics should be recorded in BytesReadRemoteUnexpected."

BytesReadRemoteUnexpected is always 0. Isn't it only the amount of bytes which weren't expected to be read remotely by the planner (like the name suggests)?

I have BytesRead of 27.57 MB in total and BytesReadLocal of 25.83 MB. So 1,74 MB are not listed. Isn't this the amount of remote bytes? So only these bytes are sent over network while scanning, right? And is there another possible scenario of why remote bytes can occur except of when a block is not present on the local node?

"If we are using hash partitioning or random, the number of bytes sent should equal to row batch size."
I have a scenario with a hash exchange (also six instances):

HDFS:

 

 -TotalBytesSent: 8.68 MB (9101183)
 - UncompressedRowBatchSize: 16.21 MB (17000896)
        

Sender:

 

 

HDFS_SCAN_NODE (id=1):(Total: 976.044ms, non-child: 976.044ms, % non-child: 100.00%)
          Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:11/1.13 GB 
          ExecOption: PARQUET Codegen Enabled
          Runtime filters: Not all filters arrived (arrived: [], missing [0, 1]), waited for 903ms
          BytesRead(16s000ms): 242.40 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB
           - FooterProcessingTime: (Avg: 26.910ms ; Min: 12.000ms ; Max: 44.002ms ; Number of samples: 11)
           - AverageHdfsReadThreadConcurrency: 0.01 
           - AverageScannerThreadConcurrency: 4.00 
           - BytesRead: 268.32 MB (281349378)
           - BytesReadDataNodeCache: 0
           - BytesReadLocal: 0
           - BytesReadRemoteUnexpected: 0
           - BytesReadShortCircuit: 0

Don't I have to divide UncompressedRowBatchSize by 6 in order to get the single UncompressedRowBatchSize (value of uncompressed BytesRead)? Because UncompressedRowBatchSize is the total value and I also have six instances. And I guess BytesRead is too huge here because the filters haven't arrived yet. 

But how can it be that BytesRead are 268.32 MB and BytesReadLocal and all other values are 0?

 

Thank you very much in advance! I think this is everything in order to analyse the network traffic in detail. 

 

Complete query profile: Profile (was to large to append it here).

 

 

 

Explorer

I think it's just fine to sum up the TotalBytesSent for each DatastreamSender. If I'm correct, SerializeBatchTime + DeserializeRowBatchTime is the time to de+compress the data. Does Impala consider the network bandwith for the compression? I use a 10Gb/s network and have an exchange with TotalBytesSent of 975 MB. De+compression take ~ 1,38 sec according to the values I mentioned. Wouldn't it be faster to ignore the compression in this case (even if the size doubles)?

 Exchange ID:6 --> Profile

Explorer

@kwho: Sorry for the inconvenience, but the question from my last post is very important to me. You can ignore the post before but the compresison time vs. bandwidth case is really important to me. I hope you can help me with that!

Cloudera Employee

Really sorry for the late reply. This somehow fell off my radar. To answer your question, for the exchange node 6, you can sum the TotalBytesSent for each DatastreamSender for the total bytes shuffled between all instances of F00 to all instances of exchange node 6.

 

Impala currently doesn't consider network bandwidth to decide whether it will compress row batches or not. It's true that if the network bandwidth is plentiful, we may save some CPU time by skipping the compression/decompression step. For multi-rack deployment, the network bandwidth is likely bound by the throughput of top-of-rack switch so compression usually helps in this case. Please let me know if I may have missed other questions.