I am a student, and this is my first time posting a question here.
If this question belongs somewhere else, please point me to the right place.
Thanks, all!
Here is my question:
I am doing research on Hadoop running in VMs (using VMware vSphere).
My environment is:
A Hadoop cluster of 7 VMs, acting as 1 NameNode + 6 DataNodes.
Each VM has the following virtual hardware: 4 CPU cores, 16 GB of memory, and 2.7 TB of disk space.
The operating system is CentOS 6.6 64-bit.
I previously tested the same configuration on another cluster, on different physical hosts running VMware ESXi.
This time, however, I ran into the problem shown in the following picture:
In this test, as in my previous one, I used the TestDFSIO benchmark in Hadoop.
I used TestDFSIO to write 1 TB of data (1000 x 1 GB files) across the VMs.
About 5 to 15 minutes after the test starts, the network I/O speed drops from 300~400 MB/s to about 50 MB/s.
I ran the test 10 or more times, and every run behaved like the one in the picture:
the network I/O stays fast for 5 to 15 minutes, and then it slows down (to only about 50 MB/s).
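
To pin down exactly when the slowdown starts (so it can be lined up against host or storage events), a small helper can scan a series of throughput samples. This is only a rough sketch: the function name, the 5-sample baseline, the 50% threshold, and the 3-sample "sustained" rule are all assumptions chosen for illustration, not values from the test above.

```python
from typing import Optional, Sequence


def find_sustained_drop(samples_mb_s: Sequence[float],
                        baseline_window: int = 5,
                        drop_ratio: float = 0.5,
                        sustain: int = 3) -> Optional[int]:
    """Return the index where throughput first stays below
    drop_ratio * baseline for `sustain` consecutive samples, or None.

    The baseline is the mean of the first `baseline_window` samples.
    All default values are illustrative assumptions.
    """
    if len(samples_mb_s) < baseline_window + sustain:
        return None  # not enough data to judge
    baseline = sum(samples_mb_s[:baseline_window]) / baseline_window
    threshold = baseline * drop_ratio
    run = 0  # length of the current run of below-threshold samples
    for i in range(baseline_window, len(samples_mb_s)):
        if samples_mb_s[i] < threshold:
            run += 1
            if run == sustain:
                return i - sustain + 1  # first sample of the slow run
        else:
            run = 0  # a single fast sample resets the run
    return None
```

With one sample per minute, the returned index is the minute at which the slow phase began; a short dip that recovers is ignored.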
I also tried replacing the DataNodes with VMs on other physical hosts (the ones that behaved normally in my previous test),
and the problem did not occur.
I also used the scp command in CentOS to copy a 4.4 GB file from the NameNode to a DataNode.
Until the transfer reached 70% (that is, until about 3 GB had been sent to the other VM),
the speed held at about 110 MB/s.
After 70%, the speed gradually fell: 100 MB/s... 90 MB/s... 80 MB/s..., and finally down to 15 MB/s,
where it stayed until the transfer finished.
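
For anyone trying to reproduce this, the scp progress meter is hard to log, so one option is to sample the kernel's interface byte counters from /proc/net/dev on the receiving VM while the copy runs. The sketch below is a minimal example under assumptions: the function names are made up for illustration, and the interface name (for example eth0) depends on the VM.

```python
from typing import Dict, Tuple


def parse_net_dev(text: str) -> Dict[str, Tuple[int, int]]:
    """Map interface name -> (rx_bytes, tx_bytes) from /proc/net/dev text."""
    counters: Dict[str, Tuple[int, int]] = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue  # skip malformed lines
        # Field 0 is received bytes; field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def rate_mb_s(before: int, after: int, seconds: float) -> float:
    """Throughput in MB/s (10^6 bytes) between two byte-counter readings."""
    return (after - before) / seconds / 1e6


def read_counters(interface: str) -> Tuple[int, int]:
    """Read the current (rx_bytes, tx_bytes) for one interface."""
    with open("/proc/net/dev") as f:
        return parse_net_dev(f.read())[interface]
```

Calling read_counters once per second in a loop and passing consecutive readings to rate_mb_s gives a per-second log that shows whether the drop is gradual (as scp suggests) and whether it appears on the sender, the receiver, or both.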
I cannot find the cause of this problem; it is a really strange situation.
Have you run into the same or a similar problem before?
Thank you.
Not sure whether you have found anything since, but I am hitting the same behavior. Our network admins have not been able to find any abnormalities either.
Please share the cause and resolution if you have found them.
In our case, the primary cause was the virtual machine (OpenStack instance) flavor.
If the underlying infrastructure uses SR-IOV ( https://docs.openstack.org/neutron/rocky/admin/config-sriov.html ) for its network configuration,
then it is very important to enable huge pages on the instance (via the instance flavor) that will host the Hadoop services.
We bounced the cluster, created a new flavor with huge pages enabled, and launched the Hadoop instances on it. After that, Hadoop performance was as expected and the network was stable. All kinds of file transfers and hdfs get/put operations were very stable.
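
For reference, in OpenStack Nova huge pages are normally requested through the flavor extra spec hw:mem_page_size (for example, the value large); I am assuming that is the setting meant here, since the exact property is not spelled out above. A flavor like that can only be scheduled if the compute host has huge pages reserved, which can be checked from /proc/meminfo on the host. The helper names below are hypothetical and this is just a minimal sketch of that check:

```python
from typing import Dict


def parse_hugepages(meminfo_text: str) -> Dict[str, int]:
    """Extract huge page counters from /proc/meminfo text.

    HugePages_Total and HugePages_Free are page counts;
    Hugepagesize is reported by the kernel in kB.
    """
    wanted = ("HugePages_Total", "HugePages_Free", "Hugepagesize")
    result: Dict[str, int] = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted and rest.split():
            result[key] = int(rest.split()[0])
    return result


def hugepages_available(info: Dict[str, int]) -> bool:
    """True if the host has huge pages reserved and some are still free."""
    return info.get("HugePages_Total", 0) > 0 and info.get("HugePages_Free", 0) > 0
```

Reading /proc/meminfo on the compute host and passing its contents to parse_hugepages shows how many pages are reserved and free; if HugePages_Total is 0, a huge page flavor has nothing to draw from.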
Hope this helps.