Member since: 03-11-2018
Posts: 12
Kudos Received: 0
Solutions: 0
08-22-2018
12:47 AM
@kwho: Sorry for the inconvenience, but the question from my last post is very important to me. You can ignore the post before that one, but the compression time vs. bandwidth question really matters to me. I hope you can help me with that!
08-19-2018
04:58 PM
I think it's fine to simply sum up TotalBytesSent for each DataStreamSender. If I understand correctly, SerializeBatchTime + DeserializeRowBatchTime is the time spent compressing and decompressing the data. Does Impala take the network bandwidth into account for the compression? I use a 10 Gb/s network and have an exchange with a TotalBytesSent of 975 MB. Compression plus decompression takes ~1.38 sec according to the values I mentioned. Wouldn't it be faster to skip the compression in this case (even if the size doubles)? Exchange ID: 6 --> Profile
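For reference, here is the back-of-envelope comparison I have in mind, written out as a small Python sketch. It is only my own estimate: I assume the uncompressed size would be roughly twice the 975 MB that were actually sent, that the 10 Gb/s link could really be driven at line rate, and that the 1.38 sec is pure (de)compression cost.

# Rough trade-off: send compressed data (and pay the codec time) vs. send it uncompressed.
# Numbers are from the profile above, except the 2x expansion factor, which is my assumption.
link_mb_per_s = 10 * 1000 / 8            # 10 Gb/s ~= 1250 MB/s ideal line rate

compressed_mb = 975.0                    # TotalBytesSent of the exchange
codec_time_s = 1.38                      # SerializeBatchTime + DeserializeRowBatchTime
uncompressed_mb = 2 * compressed_mb      # assumed ~2x size without compression

t_compressed = compressed_mb / link_mb_per_s + codec_time_s
t_uncompressed = uncompressed_mb / link_mb_per_s
print(f"compressed:   {t_compressed:.2f} s (transfer + codec)")
print(f"uncompressed: {t_uncompressed:.2f} s (transfer only)")

At an ideal 1250 MB/s this gives roughly 2.2 s vs. 1.6 s, which is why I wonder whether skipping compression could pay off here; with the much lower effective throughputs I see in the profiles, the picture would of course flip again.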
08-15-2018
05:19 AM
Thank you very much!

"Please note that the 27.57MB read from HDFS is the raw HDFS block"

Okay, this makes sense, as there are also HDFS scans with RowsReturned: 0 but BytesRead: 500.00 KB. That seems to be the block overhead then. It would also fit the following calculation:

HDFS:
- BytesRead: 27.57 MB (28912212)
- BytesReadDataNodeCache: 0
- BytesReadLocal: 25.83 MB (27084442)
- BytesReadRemoteUnexpected: 0
- BytesReadShortCircuit: 25.83 MB (27084442)

Sender:
- TotalBytesSent: 73.06 MB (76604592)
- UncompressedRowBatchSize: 145.96 MB (153047622)

"If we are broadcasting the data to all destination exchange nodes, the total bytes sent will be (row batch size * num destination nodes)"

Yes, it's a broadcast at this position and it says instances = 6.

UncompressedRowBatchSize / 6 = single UncompressedRowBatchSize
145.96 MB / 6 = 24.33 MB

But 24.33 MB is even smaller than the compressed BytesRead of 27.57 MB (even if I subtract 0.5 MB for the block overhead, assuming that is correct). Am I missing something here?

"The remote read statistics should be recorded in BytesReadRemoteUnexpected."

BytesReadRemoteUnexpected is always 0. Isn't it only the amount of bytes that the planner did not expect to be read remotely (as the name suggests)? I have a total BytesRead of 27.57 MB and a BytesReadLocal of 25.83 MB, so 1.74 MB are not listed anywhere. Isn't that the amount of remote bytes? So only these bytes are sent over the network while scanning, right? And is there any other scenario in which remote bytes can occur, apart from a block not being present on the local node?

"If we are using hash partitioning or random, the number of bytes sent should equal to row batch size."

I have a scenario with a hash exchange (also six instances):

Sender:
- TotalBytesSent: 8.68 MB (9101183)
- UncompressedRowBatchSize: 16.21 MB (17000896)

HDFS:
HDFS_SCAN_NODE (id=1):(Total: 976.044ms, non-child: 976.044ms, % non-child: 100.00%)
Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:11/1.13 GB
ExecOption: PARQUET Codegen Enabled
Runtime filters: Not all filters arrived (arrived: [], missing [0, 1]), waited for 903ms
BytesRead(16s000ms): 242.40 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB, 268.32 MB
- FooterProcessingTime: (Avg: 26.910ms ; Min: 12.000ms ; Max: 44.002ms ; Number of samples: 11)
- AverageHdfsReadThreadConcurrency: 0.01
- AverageScannerThreadConcurrency: 4.00
- BytesRead: 268.32 MB (281349378)
- BytesReadDataNodeCache: 0
- BytesReadLocal: 0
- BytesReadRemoteUnexpected: 0
- BytesReadShortCircuit: 0

Don't I have to divide UncompressedRowBatchSize by 6 in order to get the per-instance UncompressedRowBatchSize (i.e. the uncompressed equivalent of BytesRead)? UncompressedRowBatchSize is the total value, and I also have six instances here. I guess BytesRead is so large because the runtime filters hadn't arrived yet. But how can BytesRead be 268.32 MB while BytesReadLocal and all the other read counters are 0?

Thank you very much in advance! I think this is everything needed to analyse the network traffic in detail. Complete query profile: Profile (it was too large to append here).
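To show exactly what I'm computing, here is the broadcast arithmetic as a small Python sketch (just my own bookkeeping with the counter values from above; the division by the instance count is precisely the step I'm unsure about, and the 0.5 MB block overhead is only my guess):

# Broadcast case: if TotalBytesSent ~= compressed row batches * number of destination
# nodes, then UncompressedRowBatchSize / instance count should be the uncompressed
# size of the data produced by one instance.
instances = 6

uncompressed_total_mb = 145.96      # UncompressedRowBatchSize of the broadcast sender
bytes_read_mb = 27.57               # BytesRead of the feeding HDFS scan (compressed Parquet)
block_overhead_mb = 0.5             # rough overhead seen on scans with RowsReturned: 0

per_instance_uncompressed = uncompressed_total_mb / instances   # ~24.33 MB
print(f"{per_instance_uncompressed:.2f} MB uncompressed per instance vs. "
      f"{bytes_read_mb - block_overhead_mb:.2f} MB compressed bytes read")

The uncompressed 24.33 MB coming out smaller than the ~27 MB of compressed data read from HDFS is exactly the part that does not add up for me.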
08-14-2018
01:35 PM
Sorry for the circumstances, but I hope you can help me with the composition of "TotalBytesSent" in the DataStreamSender and with how remote bytes come about. I would expect BytesRead (27.57 MB) * 6 (nodes) = 165.42 MB instead of 73.06 MB, or maybe multiplied by 5 instead of 6, because the sender doesn't need to send the data to itself. Am I correct that all the data sent over the network consists of the remote read bytes plus the bytes sent by the DataStreamSenders? That is (see also the small sketch below):

Remote bytes = BytesRead - BytesReadLocal
Time = Remote bytes / NetworkThroughput

Bytes sent by DataStreamSenders = TotalBytesSent
Time = TotalBytesSent / NetworkThroughput

Thanks for your time and help!
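P.S.: to make the bookkeeping above concrete, here it is as a small Python sketch (my own illustration, not anything Impala provides; the counter values are the ones from the profile I posted):

# My mental model of the network traffic of one fragment instance:
# remote scan reads plus whatever the DataStreamSender ships out.
def estimated_network_traffic_mb(bytes_read, bytes_read_local, total_bytes_sent):
    """All arguments in MB; returns (remote_read_mb, total_sent_mb)."""
    remote_read = bytes_read - bytes_read_local   # scan bytes that had to come over the wire
    return remote_read, total_bytes_sent

remote_mb, sent_mb = estimated_network_traffic_mb(27.57, 25.83, 73.06)

network_throughput = 15.87   # MB/s, NetworkThroughput counter of the sender
print(f"remote reads: {remote_mb:.2f} MB")
print(f"sender:       {sent_mb:.2f} MB -> {sent_mb / network_throughput:.1f} s at the reported throughput")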
08-12-2018
03:09 PM
Thank you very much for the explanation! I get the following values for one instance:

Instance 7b4bacec742ce503:affa1dd300000008 (host=slave-cn02:22000):(Total: 1s676ms, non-child: 8.000ms, % non-child: 0.48%)
ThreadUsage(500.000ms): 3, 2, 2
...
- RowsProduced: 1.34M (1342523)
- TotalNetworkReceiveTime: 0.000ns
- TotalNetworkSendTime: 1s156ms
...
KrpcDataStreamSender (dst_id=5):(Total: 1s328ms, non-child: 172.009ms, % non-child: 12.95%)
BytesSent(500.000ms): 14.88 MB, 53.23 MB, 65.52 MB
- EosSent: 6 (6)
- NetworkThroughput: 15.87 MB/sec
- PeakMemoryUsage: 69.80 KB (71472)
- RowsSent: 1.34M (1342523)
- RpcFailure: 0 (0)
- RpcRetry: 0 (0)
- SerializeBatchTime: 84.004ms
- TotalBytesSent: 73.06 MB (76604592)
- UncompressedRowBatchSize: 145.96 MB (153047622)
HDFS_SCAN_NODE (id=0):(Total: 280.015ms, non-child: 280.015ms, % non-child: 100.00%)
...
- BytesRead: 27.57 MB (28912212)
- BytesReadDataNodeCache: 0
- BytesReadLocal: 25.83 MB (27084442)
- BytesReadRemoteUnexpected: 0
- BytesReadShortCircuit: 25.83 MB (27084442)
...
- RowsReturned: 1.34M (1342523)
...

So in total 27.57 MB are read from HDFS. 25.83 MB are read via short-circuit reads, which should cause no network traffic, so the remaining 1.74 MB must be remote reads. When do remote reads occur? Only when there is no co-located data, or also when one node (thread) has finished reading its local data and helps another node with its reads? Up to this point I only have 1.74 MB of network traffic, right?

Now the whole 27.57 MB must be sent over the network to exchange 5. Where do the 73.06 MB of TotalBytesSent in the KrpcDataStreamSender come from? RowsSent in the sender and RowsReturned in the scan node are identical, so shouldn't it only be 27.57 MB?

If I calculate 73.06 MB / 15.87 MB/sec, I get 4.6 sec for the data transfer. Is the NetworkThroughput correct? I actually have a 10 Gb/s network, and the highest value I have seen for a DataStreamSender is 351 MB/sec.
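Just to show how I arrive at my doubt about the 15.87 MB/sec: this is the cross-check I do, as a small Python sketch (my own calculation; I am assuming that TotalNetworkSendTime is a reasonable proxy for the time the sender actually spent shipping data):

# Cross-check the reported NetworkThroughput against the other sender counters.
total_bytes_sent_mb = 73.06       # TotalBytesSent
total_send_time_s = 1.156         # TotalNetworkSendTime (1s156ms)
reported_throughput = 15.87       # NetworkThroughput counter, MB/s

effective_throughput = total_bytes_sent_mb / total_send_time_s   # ~63 MB/s
print(f"effective: {effective_throughput:.1f} MB/s, reported: {reported_throughput} MB/s")

# Even the effective ~63 MB/s is far below what a 10 Gb/s link (~1250 MB/s) could do,
# so either my interpretation of the counters or my expectation is off.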
08-10-2018
03:15 AM
Can anyone give me a hint why impala_query_bytes_streamed_rate is empty, or provide any other example of how to get this data?
08-06-2018
08:54 AM
I would like to investigate the total number of bytes that are sent over the network for a query (especially for the exchanges). I came across the "impala_query_bytes_streamed_rate" metric, but I don't get any data for it. My query summary looks like the following:

Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail
--------------------------------------------------------------------------------------------------------------------------------------
08:EXCHANGE 1 0.000ns 0.000ns 1.02K 38.14K 12.57 MB 0 UNPARTITIONED
04:HASH JOIN 6 494.689ms 584.026ms 91.14K 38.14K 112.07 MB 17.00 MB INNER JOIN, PARTITIONED
|--07:EXCHANGE 6 38.668ms 68.002ms 5.51M 934.26K 200.00 KB 0 HASH(pp.event_id,pp.track_track_id)
| 03:HASH JOIN 6 6s391ms 7s244ms 5.51M 934.26K 352.42 MB 138.01 MB INNER JOIN, BROADCAST
| |--05:EXCHANGE 6 117.338ms 136.006ms 7.36M 13.16M 12.68 MB 0 BROADCAST
| | 00:SCAN HDFS 6 209.343ms 400.022ms 7.36M 13.16M 86.28 MB 352.00 MB default.protoparticles_32 pp
| 02:SCAN HDFS 6 1s887ms 3s516ms 198.27M 31.47M 3.54 GB 880.00 MB default.track_state_32 ts
06:EXCHANGE 6 6.000ms 16.000ms 1.48M 13.08M 12.67 MB 0 HASH(tr.event_id,tr.track_id)
01:SCAN HDFS 6 1s174ms 1s344ms 2.26M 13.08M 329.22 MB 528.00 MB default.track_32 tr

I guess the exchange times only include the hashing?! In the profile I can see all DataStreamSenders, like this one for the 05 EXCHANGE:

KrpcDataStreamSender (dst_id=5):(Total: 1s280ms, non-child: 188.008ms, % non-child: 14.69%)
BytesSent(500.000ms): 3.70 MB, 38.77 MB, 63.08 MB
- EosSent: 6 (6)
- NetworkThroughput: 16.01 MB/sec
- PeakMemoryUsage: 69.80 KB (71472)
- RowsSent: 1.34M (1342523)
- RpcFailure: 0 (0)
- RpcRetry: 0 (0)
- SerializeBatchTime: 80.003ms
- TotalBytesSent: 73.06 MB (76604592)
- UncompressedRowBatchSize: 145.96 MB (153047622)

So this is one out of six DataStreamSenders for the 05 exchange. Is TotalBytesSent the number of bytes this node sends to exchange node 5? What are BytesSent and the comma-separated list of MB values, and what is the time in brackets? If I add up all six TotalBytesSent values, I get ~400 MB. Do I have to calculate the transfer time myself (based on the given throughput)?

Additionally, I have six EXCHANGE_NODEs:

EXCHANGE_NODE (id=5):(Total: 1s488ms, non-child: 120.005ms, % non-child: 8.06%)
- ConvertRowBatchTime: 104.004ms
- PeakMemoryUsage: 12.68 MB (13292376)
- RowsReturned: 7.36M (7358720)
- RowsReturnedRate: 4.95 M/sec

All six exchange nodes look roughly the same. What is the Total time of 1s488ms in this context? I would be very thankful if anyone could clarify this for me. Thank you very much!

Version: Cloudera Express 5.15.0
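For context, this is how I arrive at the ~400 MB figure: I copy the six KrpcDataStreamSender (dst_id=5) sections of the profile into a text file and sum their TotalBytesSent counters with a quick throwaway script (nothing official; the file name is just an example, and the regex only matches lines like "- TotalBytesSent: 73.06 MB (76604592)"):

import re

# Sum TotalBytesSent over the sender sections pasted into a text file.
PATTERN = re.compile(r"TotalBytesSent:.*?\((\d+)\)")

def total_bytes_sent_mb(profile_text: str) -> float:
    """Summed TotalBytesSent of all senders found in the text, in MB."""
    return sum(int(m.group(1)) for m in PATTERN.finditer(profile_text)) / (1024 * 1024)

with open("exchange05_senders.txt") as f:     # file name is just an example
    total_mb = total_bytes_sent_mb(f.read())

print(f"{total_mb:.2f} MB sent in total for exchange 05")
# With six senders at 73.06 MB each this gives roughly 438 MB, i.e. the ~400 MB I mentioned.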
Labels:
- Apache Impala
- Cloudera Manager
06-25-2018
04:50 AM
Thanks for your response. I used a statically linked Impala this time. create-test-configuration.sh gives me the following output:

philipp@philipp:/media/philipp/f5de6362-e43f-4f0e-ab1a-4b13bedebfc2/Impala$ bin/create-test-configuration.sh
The minikdc is not running.
Creating node-3 at /media/philipp/f5de6362-e43f-4f0e-ab1a-4b13bedebfc2/Impala/testdata/cluster/cdh5/node-3
node-3 will use ports DATANODE_PORT=31000, DATANODE_HTTP_PORT=31010, DATANODE_IPC_PORT=31020, DATANODE_HTTPS_PORT=31030, NODEMANAGER_PORT=31100, NODEMANAGER_LOCALIZER_PORT=31120, NODEMANAGER_WEBUI_PORT=31140, KUDU_TS_RPC_PORT=31200, and KUDU_TS_WEBUI_PORT=31300
Creating node-2 at /media/philipp/f5de6362-e43f-4f0e-ab1a-4b13bedebfc2/Impala/testdata/cluster/cdh5/node-2
node-2 will use ports DATANODE_PORT=31001, DATANODE_HTTP_PORT=31011, DATANODE_IPC_PORT=31021, DATANODE_HTTPS_PORT=31031, NODEMANAGER_PORT=31101, NODEMANAGER_LOCALIZER_PORT=31121, NODEMANAGER_WEBUI_PORT=31141, KUDU_TS_RPC_PORT=31201, and KUDU_TS_WEBUI_PORT=31301
Creating node-1 at /media/philipp/f5de6362-e43f-4f0e-ab1a-4b13bedebfc2/Impala/testdata/cluster/cdh5/node-1
node-1 will use ports DATANODE_PORT=31002, DATANODE_HTTP_PORT=31012, DATANODE_IPC_PORT=31022, DATANODE_HTTPS_PORT=31032, NODEMANAGER_PORT=31102, NODEMANAGER_LOCALIZER_PORT=31122, NODEMANAGER_WEBUI_PORT=31142, KUDU_TS_RPC_PORT=31202, and KUDU_TS_WEBUI_PORT=31302
Config dir: /media/philipp/f5de6362-e43f-4f0e-ab1a-4b13bedebfc2/Impala/fe/src/test/resources
Current user: philipp
Metastore DB: hive_impala
/media/philipp/f5de6362-e43f-4f0e-ab1a-4b13bedebfc2/Impala/fe/src/test/resources /media/philipp/f5de6362-e43f-4f0e-ab1a-4b13bedebfc2/Impala
Linking core-site.xml from local cluster
Linking hdfs-site.xml from local cluster
Generated /media/philipp/f5de6362-e43f-4f0e-ab1a-4b13bedebfc2/Impala/fe/src/test/resources/hive-site.xml
Generated /media/philipp/f5de6362-e43f-4f0e-ab1a-4b13bedebfc2/Impala/fe/src/test/resources/log4j.properties
Generated /media/philipp/f5de6362-e43f-4f0e-ab1a-4b13bedebfc2/Impala/fe/src/test/resources/hive-log4j.properties
Generated /media/philipp/f5de6362-e43f-4f0e-ab1a-4b13bedebfc2/Impala/fe/src/test/resources/hbase-site.xml
Generated /media/philipp/f5de6362-e43f-4f0e-ab1a-4b13bedebfc2/Impala/fe/src/test/resources/authz-policy.ini
Generated /media/philipp/f5de6362-e43f-4f0e-ab1a-4b13bedebfc2/Impala/fe/src/test/resources/sentry-site.xml
/media/philipp/f5de6362-e43f-4f0e-ab1a-4b13bedebfc2/Impala
Completed config generation
Searching for auxiliary tests, workloads, and datasets (if any exist).
No auxiliary tests found at: /media/philipp/f5de6362-e43f-4f0e-ab1a-4b13bedebfc2/Impala/../Impala-auxiliary-tests/testdata/workloads
No auxiliary tests found at: /media/philipp/f5de6362-e43f-4f0e-ab1a-4b13bedebfc2/Impala/../Impala-auxiliary-tests/testdata/datasets
No auxiliary tests found at: /media/philipp/f5de6362-e43f-4f0e-ab1a-4b13bedebfc2/Impala/../Impala-auxiliary-tests/tests

Then, start-impala-cluster.py results in:

Starting State Store logging to /media/philipp/f5de6362-e43f-4f0e-ab1a-4b13bedebfc2/Impala/logs/cluster/statestored.INFO
Starting Catalog Service logging to /media/philipp/f5de6362-e43f-4f0e-ab1a-4b13bedebfc2/Impala/logs/cluster/catalogd.INFO
Starting Impala Daemon logging to /media/philipp/f5de6362-e43f-4f0e-ab1a-4b13bedebfc2/Impala/logs/cluster/impalad.INFO
Starting Impala Daemon logging to /media/philipp/f5de6362-e43f-4f0e-ab1a-4b13bedebfc2/Impala/logs/cluster/impalad_node1.INFO
Starting Impala Daemon logging to /media/philipp/f5de6362-e43f-4f0e-ab1a-4b13bedebfc2/Impala/logs/cluster/impalad_node2.INFO
MainThread: Found 3 impalad/1 statestored/1 catalogd process(es)
MainThread: Getting num_known_live_backends from philipp:25000
MainThread: Debug webpage not yet available.
MainThread: Debug webpage not yet available.
MainThread: Debug webpage not yet available.
MainThread: Debug webpage not yet available.
MainThread: Debug webpage not yet available.
MainThread: Debug webpage not yet available.
MainThread: Debug webpage not yet available.
...

/testdata/run-all.sh:
Killing running services...
Starting cluster services...
Stopping kudu
Stopping kms
Stopping yarn
Stopping hdfs
Starting hdfs (Web UI - http://localhost:5070)
Namenode started
Starting yarn (Web UI - http://localhost:8088)
Starting kms (Web UI - http://localhost:9600)
Waiting for ntpd to synchronize... OK!
Starting kudu (Web UI - http://localhost:8051)
hdfs-datanode is not running on node-1
hdfs-datanode is not running on node-2
hdfs-datanode is not running on node-3
Error in /media/philipp/f5de6362-e43f-4f0e-ab1a-4b13bedebfc2/Impala/testdata/bin/run-mini-dfs.sh at line 40: $IMPALA_HOME/testdata/cluster/admin start_cluster
Error in testdata/bin/run-all.sh at line 42: tee ${IMPALA_CLUSTER_LOGS_DIR}/run-mini-dfs.log

Unfortunately, there's still the error with start_cluster.
06-18-2018
08:06 AM
I get the same error with Ubuntu 16.04 and the newest Impala version.
06-15-2018
06:46 AM
Does anyone have an idea what the problem might be? I'm stuck at this point 😕
06-13-2018
03:45 AM
Hi! I'm trying to set up a mini cluster with CDH 5.13.1 for some experiments, but unfortunately I always receive this error when running "run-all.sh":

impala@impala:/Impala$ testdata/bin/run-all.sh
Killing running services...
Starting cluster services...
Stopping kudu
Stopping kms
Stopping yarn
Stopping hdfs
Starting hdfs (Web UI - http://localhost:5070)
Failed to start hdfs-datanode. The end of the log (/Impala/testdata/cluster/cdh5/node-1/var/log/hdfs-datanode.out) is:
Failed to start hdfs-datanode. The end of the log (/Impala/testdata/cluster/cdh5/node-3/var/log/hdfs-datanode.out) is:
Failed to start hdfs-datanode. The end of the log (/Impala/testdata/cluster/cdh5/node-2/var/log/hdfs-datanode.out) is:
Namenode started
Error in /Impala/testdata/bin/run-mini-dfs.sh at line 40: $IMPALA_HOME/testdata/cluster/admin start_cluster
Error in testdata/bin/run-all.sh at line 42: tee ${IMPALA_CLUSTER_LOGS_DIR}/run-mini-dfs.log

As you can see, all node logs are empty. run-mini-dfs.log provides the following information:

Error in /Impala/testdata/bin/run-mini-dfs.sh at line 40: $IMPALA_HOME/testdata/cluster/admin start_cluster

I don't know what's wrong with the start_cluster function.

Further information:
Build: ./buildall.sh -notests -so -start_minicluster
OS: Ubuntu 14.04

I hope you can help me fix the problem.
Labels:
- Apache Impala