I am using the HDP 2.3.2 sandbox on VirtualBox. I have observed a weird problem while uploading a file using Ambari.
I am uploading a simple text file with 150 records to HDFS using the Ambari Files view (Ambari -> HDFS Files -> /user/spark directory). The upload works fine, but when I download the same file, I get extra records in it.
Even after deleting the file multiple times and restarting the VM, I am getting the same issue.
Please upload a file with some other name this time, like "movie3" (instead of "movie2"), to see whether you face the same issue.
I am suggesting the above test because the Ambari Files view does not do that (it does not automatically write to a file). Some other job might be running in the background and repeatedly updating the file with other "movie2" data that has more records, which could explain what you are seeing.
Yes, I did that too, but I faced the same problem. I have been at this for around two days and finally got to the root of the issue.
I cleaned up everything, made sure no other job was running in the background, and emptied the trash as well.
If I upload a CSV with 10 records, there is no issue, but if the CSV has more than 10 records and I upload it from the Ambari HDFS Files view, the file gets corrupted. When I uploaded the same file using the command line, everything was fine.
Basically, I first copied the CSV to /home/sparkFiles using WinSCP and then moved it into HDFS with hdfs dfs -put; that way it worked fine.
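For anyone hitting the same thing, one way to confirm whether the corruption happens during upload is to compare the record count of the local file with the copy in HDFS. A minimal sketch, assuming the file is named movie2.csv and the paths from this thread:

```shell
# Count records in the local copy staged via WinSCP
wc -l < /home/sparkFiles/movie2.csv

# Upload via the command line instead of the Files view
hdfs dfs -put /home/sparkFiles/movie2.csv /user/spark/

# Count records in the HDFS copy; the two numbers should match
hdfs dfs -cat /user/spark/movie2.csv | wc -l
```

If the command-line counts match but a Files view upload of the same file shows a higher count, that isolates the problem to the view.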
Is this happening in other directories as well (other than the /user/spark directory)?
After uploading the file, do you see the "Last Modified" column value changing?
Please check the HDFS audit log. Better yet, perform the same operation again, and this time keep hdfs-audit.log in tail mode to see who is actually changing the file:
# tail -f /var/log/hadoop/hdfs/hdfs-audit.log
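Since the audit log can be noisy, it may help to filter it down to operations on the file in question. A sketch, assuming the file under /user/spark is named "movie2" as in this thread:

```shell
# Watch the audit log for operations that touch the suspect file only;
# any unexpected create/append entries would reveal the culprit
tail -f /var/log/hadoop/hdfs/hdfs-audit.log | grep '/user/spark/movie2'
```

Each matching line should show the user, the command (create, open, append, etc.), and the source IP, which should tell you whether something other than the Files view is writing to the file.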