
slow reported throughput in DataStreamSender


Explorer

We have one specific query, "insert overwrite select * from a partitions", that usually takes less than 2 minutes.

Occasionally, it takes over 5 minutes for no apparent reason.

Looking at the query profile, we see the following below. 

Throughput of less than 600 KiB/s is unexpected since we have a 10GB network...

So what could explain this?

Is it related to ThriftTransmitTime?

Thanks

 

DataStreamSender (dst_id=1) (5.0m)

  • AsyncTotalTime: 0ns
  • BytesSent: 169.9 MiB
  • InactiveTotalTime: 0ns
  • NetworkThroughput(*): 595.0 KiB/s
  • OverallThroughput: 585.7 KiB/s
  • PeakMemoryUsage: 72.0 KiB
  • SerializeBatchTime: 2.77s
  • ThriftTransmitTime(*): 4.9m
  • TotalTime: 5.0m
  • UncompressedRowBatchSize: 547.1 MiB
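
As a sanity check, the reported throughput figures can be roughly reproduced from the other counters. The sketch below assumes that OverallThroughput is BytesSent divided by TotalTime and that NetworkThroughput(*) is BytesSent divided by ThriftTransmitTime(*); the helper name is made up for illustration, and the small gap to the reported values would come from the rounded times shown in the profile.

```python
def throughput_kib_per_s(bytes_sent_mib: float, elapsed_minutes: float) -> float:
    """Average throughput in KiB/s for a given volume (MiB) and duration (minutes)."""
    return bytes_sent_mib * 1024 / (elapsed_minutes * 60)


# Counter values copied from the DataStreamSender profile above.
BYTES_SENT_MIB = 169.9
TOTAL_TIME_MIN = 5.0
THRIFT_TRANSMIT_TIME_MIN = 4.9

# Assumed definitions: bytes sent divided by the respective timer.
overall = throughput_kib_per_s(BYTES_SENT_MIB, TOTAL_TIME_MIN)
network = throughput_kib_per_s(BYTES_SENT_MIB, THRIFT_TRANSMIT_TIME_MIN)

print(f"over TotalTime:          {overall:.1f} KiB/s")
print(f"over ThriftTransmitTime: {network:.1f} KiB/s")
```

Both results land within a few KiB/s of the reported 585.7 KiB/s and 595.0 KiB/s, so the low throughput is just about 170 MiB divided by roughly five minutes. The open question is what the sender spent those five minutes waiting on.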

Re: slow reported throughput in DataStreamSender

Master Collaborator

Impala executes queries in a pipelined manner, which means that if an operator further up the tree is slow, it will create back-pressure and slow down all other operators in the pipeline. ThriftTransmitTime includes time spent waiting for upstream operators to process their queued input.


So it's possible there's some network issue (we would expect to get much higher throughput than that if the network is healthy), but it's probably the upstream insert that is slow.
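
To illustrate the back-pressure effect, here is a minimal sketch that is not Impala code: a sender thread pushes batches into a small bounded queue while a receiver drains it with an artificial per-batch delay. The function name, queue size, and delays are all assumptions chosen for the demo. The point is only that the sender's measured "transmit" time grows with the receiver's processing time, even though the hand-off itself is instant.

```python
import queue
import threading
import time


def sender_wait_time(n_batches: int, consume_delay: float, queue_size: int = 2) -> float:
    """Return the total seconds the sender spent blocked handing off batches."""
    q = queue.Queue(maxsize=queue_size)

    def receiver() -> None:
        while True:
            batch = q.get()
            if batch is None:  # sentinel: no more batches
                return
            time.sleep(consume_delay)  # stand-in for slow work, e.g. encoding

    worker = threading.Thread(target=receiver)
    worker.start()

    waited = 0.0
    for i in range(n_batches):
        start = time.perf_counter()
        q.put(i)  # blocks whenever the bounded queue is full
        waited += time.perf_counter() - start

    q.put(None)
    worker.join()
    return waited


slow = sender_wait_time(n_batches=30, consume_delay=0.01)
fast = sender_wait_time(n_batches=30, consume_delay=0.0)
print(f"sender blocked with slow receiver: {slow:.3f}s")
print(f"sender blocked with fast receiver: {fast:.3f}s")
```

With the slow receiver, nearly all of the sender's elapsed time is spent blocked on the queue, which is the same shape as a ThriftTransmitTime that is almost equal to TotalTime.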

Re: slow reported throughput in DataStreamSender

Explorer

Thanks for the quick help.

One last question. This snippet is from the same query as my initial question, and I am wondering if this is the cause. EncodeTimer of 4 min... Does this represent the time taken to encode to Parquet?

And could it be where the slowdown is?

 

HdfsTableSink (4.4m)

  • AsyncTotalTime: 0ns
  • BytesWritten: 875.0 MiB
  • CompressTimer: 9.18s
  • EncodeTimer: 4.1m
  • FilesCreated: 4
  • FinalizePartitionFileTimer: 4.02s
  • HdfsWriteTimer: 3.93s
  • InactiveTotalTime: 0ns
  • PartitionsCreated: 1
  • PeakMemoryUsage: 319.9 MiB
  • RowsInserted: 3,720,826
  • TmpFileCreateTimer: 55ms
  • TotalTime: 4.4m
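
One quick way to see where the sink's time goes is to express each timer as a share of TotalTime. This is a rough sketch using the values pasted above; the parser is a hypothetical helper that only handles the single-unit duration format shown in this thread (for example "4.1m", "9.18s", "55ms"), not every format a profile can print.

```python
import re

# Seconds per unit for the single-unit durations seen in this profile.
UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0}


def to_seconds(text: str) -> float:
    """Convert a duration such as '4.1m' or '55ms' to seconds."""
    match = re.fullmatch(r"([0-9.]+)(ns|us|ms|s|m|h)", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration format: {text!r}")
    value, unit = match.groups()
    return float(value) * UNIT_SECONDS[unit]


# Timer values copied from the HdfsTableSink profile above.
timers = {
    "CompressTimer": "9.18s",
    "EncodeTimer": "4.1m",
    "FinalizePartitionFileTimer": "4.02s",
    "HdfsWriteTimer": "3.93s",
    "TmpFileCreateTimer": "55ms",
}
total_seconds = to_seconds("4.4m")

# Fraction of the sink's TotalTime attributed to each timer.
shares = {name: to_seconds(value) / total_seconds for name, value in timers.items()}

for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:28s} {share:6.1%}")
```

On these numbers EncodeTimer accounts for a little over 93% of the sink's TotalTime, while compression, HDFS writes, and file finalization together account for only a few percent.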

Re: slow reported throughput in DataStreamSender

Master Collaborator

That's exactly right. It looks like encoding the Parquet file is taking all the time.

Re: slow reported throughput in DataStreamSender

Explorer

Thanks for the help.

4 min for Parquet encoding (800 MB) seems high. Luckily it's infrequent.

We are on Impala 2.2. Is this something that got improved in more recent versions?