We have an Impala cluster with 44 data nodes and 1 name node. We ran a SQL query, analyzed the profile, and found that two nodes sink far more data than the others: the mean is around 96 MB per node, but for Datanode4 and Datanode40 the sink amount is 8816.64 MB (roughly 90 times the mean). Datanode4 is the coordinator; Datanode40 is an ordinary data node. Any suggestions?

Jun

p.s. Details (sink amount per node, in MB):

Datanode10  96.58    Datanode11  96.74    Datanode12  96.82    Datanode13  96.66    Datanode14  96.70
Datanode15  97.25    Datanode16  96.77    Datanode17  96.79    Datanode18  96.62    Datanode19  96.61
Datanode20  96.85    Datanode21  96.77    Datanode22  97.31    Datanode23  97.24    Datanode25  96.62
Datanode26  96.77    Datanode27  96.97    Datanode28  96.67    Datanode29  96.33    Datanode30  96.84
Datanode32  96.92    Datanode33  97.06    Datanode34  97.17    Datanode35  96.79    Datanode36  97.47
Datanode37  97.11    Datanode38  96.90    Datanode39  96.92    Datanode4  8816.64   Datanode40 8816.64
Datanode41  97.22    Datanode42  96.60    Datanode43  96.49    Datanode44  96.56    Datanode45  97.07
Datanode46  96.89    Datanode47  96.52    Datanode48  96.81    Datanode5   97.08    Datanode6   96.63
Datanode7   96.70    Datanode8   96.61    Datanode9   96.58    tool        96.54
Given that you have a good number of nodes, and it sounds like all of them are accessed by the query, you might try rebalancing the data across the cluster (e.g. with the HDFS balancer) to improve query performance.
If your HDFS replication factor is 2, it's possible that hosts 4 and 40 hold replicas of the same blocks, which would explain the identical sink amounts. In any case, you probably want to determine which data is causing the skew.
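If the skew comes from a hot key, a quick way to spot it is to count rows per key value in a suspect table. The table and column names below are placeholders for your own schema:

```sql
-- Hypothetical names: replace suspect_table / join_key with your own.
-- Heavily over-represented key values are likely sources of the skew.
SELECT join_key, COUNT(*) AS row_cnt
FROM suspect_table
GROUP BY join_key
ORDER BY row_cnt DESC
LIMIT 20;
```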
If you can identify a few tables that are responsible for the skew, you may be able to split that data into more partitions during ETL so it distributes better across the cluster. Also consider storing the data in Parquet format (if it isn't already) to get compression and hence reduce the overall size of the data, and look at the PARQUET_FILE_SIZE query option used when writing the Parquet data, to split it into smaller files.

Best,
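P.S. A sketch of the PARQUET_FILE_SIZE idea, assuming a partitioned target table; all table, column, and partition names here are made up, and the 128m size is just an example to tune against your block size:

```sql
-- Write smaller Parquet files so the data spreads over more blocks/nodes.
SET PARQUET_FILE_SIZE=128m;

-- Hypothetical partitioned Parquet table.
CREATE TABLE sales_parquet (item STRING, amount DOUBLE)
PARTITIONED BY (day STRING)
STORED AS PARQUET;

-- Dynamic-partition insert: the partition column goes last in the SELECT.
INSERT OVERWRITE sales_parquet PARTITION (day)
SELECT item, amount, day FROM sales_raw;
```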