Created 03-13-2017 05:27 PM
Hi @Artem Ervits, can you help here?
Created 03-13-2017 06:11 PM
Here's a slightly modified script from a Stack Overflow thread:
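#!/bin/bash
# List HDFS directories under [directory] that are older than [days].
usage="Usage: dir_diff.sh [directory] [days]"
if [[ $# -ne 2 ]]
then
  echo "$usage"
  exit 1
fi
now=$(date +%s)
# Keep only directory entries (lines starting with "d") from the recursive listing.
hadoop fs -ls -R "$1" | grep "^d" | while read -r f; do
  # Field 6 of the listing is the modification date (YYYY-MM-DD).
  dir_date=$(echo "$f" | awk '{print $6}')
  difference=$(( ( now - $(date -d "$dir_date" +%s) ) / (24 * 60 * 60) ))
  if [ "$difference" -gt "$2" ]; then
    echo "$f"
  fi
done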
I don't have files older than 10 days on my HDFS, so I run it with a 1-day argument like so:
sudo sh dir_diff.sh /tmp 1
drwx------ - ambari-qa hdfs 0 2017-03-11 15:41 /tmp/ambari-qa
drwx------ - ambari-qa hdfs 0 2017-03-11 15:41 /tmp/ambari-qa/staging
drwxr-xr-x - hdfs hdfs 0 2017-03-11 15:39 /tmp/entity-file-history
drwxr-xr-x - yarn hadoop 0 2017-03-11 15:39 /tmp/entity-file-history/active
drwx------ - hive hdfs 0 2017-03-11 15:42 /tmp/hive/hive/17c0213c-358a-4c89-b803-800762144a21
drwx------ - hive hdfs 0 2017-03-11 15:42 /tmp/hive/hive/17c0213c-358a-4c89-b803-800762144a21/_tmp_space.db
drwx------ - hive hdfs 0 2017-03-11 15:42 /tmp/hive/hive/96049638-4aee-42cc-95f6-0652b3a66cae
drwx------ - hive hdfs 0 2017-03-11 15:42 /tmp/hive/hive/96049638-4aee-42cc-95f6-0652b3a66cae/_tmp_space.db
drwx------ - hive hdfs 0 2017-03-11 15:41 /tmp/hive/hive/e4fe18d1-5cb4-4088-93ff-cf4aac410301
drwx------ - hive hdfs 0 2017-03-11 15:41 /tmp/hive/hive/e4fe18d1-5cb4-4088-93ff-cf4aac410301/_tmp_space.db
drwxr-xr-x - ambari-qa hdfs 0 2017-03-11 15:41 /tmp/tezsmokeinput
On my 2.5 Sandbox, it returns this:
sh dir_diff.sh /tmp 10
drwxr-xr-x - hdfs hdfs 0 2016-10-25 07:48 /tmp/entity-file-history
drwxr-xr-x - yarn hadoop 0 2016-10-25 07:48 /tmp/entity-file-history/active
drwxrwxrwx - guest hdfs 0 2017-01-12 18:42 /tmp/freewheel
drwxrwxrwx - guest hdfs 0 2017-01-12 18:46 /tmp/freewheel/hdfs
drwx-wx-wx - ambari-qa hdfs 0 2016-10-25 07:51 /tmp/hive
drwx------ - ambari-qa hdfs 0 2016-10-25 08:09 /tmp/hive/ambari-qa
drwx------ - hive hdfs 0 2017-01-23 20:51 /tmp/hive/hive/_tez_session_dir
drwx------ - hive hdfs 0 2017-01-16 16:03 /tmp/hive/hive/ff5fb9ba-01db-45d3-b924-e1bd6ee5203b
drwx------ - hive hdfs 0 2017-01-16 16:03 /tmp/hive/hive/ff5fb9ba-01db-45d3-b924-e1bd6ee5203b/_tmp_space.db
Once you get a list of those files, you can issue
hdfs dfs -mv file newdir
We're adding some new Grafana dashboards in the next release of Ambari that show, at a granular level, who the HDFS users are and what files they're creating. There's also an Activity Explorer dashboard you can check out in the latest Ambari + SmartSense for other HDFS file statistics, which is especially useful when you're looking for small files.
Created 03-14-2017 10:36 AM
Thanks a lot for the info. Let me try it and I'll get back to you.
Created 03-15-2017 01:02 PM
Thanks a lot, @Artem Ervits. It is working fine.
Created 03-16-2017 02:44 PM
Hi @Artem Ervits, moving the files still takes manual effort, since the script only lists the old data. Please advise.
Created 03-16-2017 02:51 PM
To do this on a continuous basis, you can set up an Oozie job that runs a script to identify old data and move it to a new location. Alternatively, you can use Apache NiFi to watch a directory for old data and move it to a new location. There's nothing out of the box that will do that for you.
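The script that such a scheduled job runs could look roughly like the sketch below. It is not a tested production tool: it assumes GNU `date`, the listing format shown earlier in this thread, and the `hadoop`/`hdfs` CLIs on the PATH. The function names `old_dirs` and `move_old_dirs` and the `DRY_RUN` variable are made up for illustration.

```shell
#!/bin/bash
# Sketch: move HDFS directories older than DAYS from under SRC into DEST.

# Read `hadoop fs -ls -R` lines on stdin and print the paths of directories
# older than DAYS. NOW_EPOCH is optional and defaults to the current time.
old_dirs() {
  local days="$1" now="${2:-$(date +%s)}"
  local perms repl owner group size day time path age
  while read -r perms repl owner group size day time path; do
    case $perms in
      d*) ;;            # directory entry, keep going
      *) continue ;;    # skip files and any other lines
    esac
    age=$(( (now - $(date -d "$day $time" +%s)) / 86400 ))
    if [ "$age" -gt "$days" ]; then
      printf '%s\n' "$path"
    fi
  done
}

# Print (DRY_RUN=1, the default) or run one `hdfs dfs -mv` per old directory.
move_old_dirs() {
  local src="$1" days="$2" dest="$3" path
  hadoop fs -ls -R "$src" | old_dirs "$days" | while read -r path; do
    if [ "${DRY_RUN:-1}" = 1 ]; then
      echo "hdfs dfs -mv $path $dest/"
    else
      hdfs dfs -mv "$path" "$dest/" </dev/null
    fi
  done
}
```

Calling `move_old_dirs /tmp 10 /archive` prints the commands it would run, and `DRY_RUN=0 move_old_dirs /tmp 10 /archive` executes them (`/archive` is a placeholder destination). Note that the recursive listing includes both parents and children, so once a parent directory has been moved, the `-mv` for its children will fail because those paths no longer exist. Directories with the same base name will also collide in the destination.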
Created 03-13-2017 08:07 PM
You can also use Apache Falcon to build data retention policies for HDFS.
Created 03-14-2017 10:37 AM
But a Falcon feed only looks at the feed path, which is created manually. It never does any operation based on the Unix timestamp.
Created 03-15-2017 04:53 AM
The date shown when you run
hdfs dfs -ls <directory_location>
is actually the date when the file was placed in HDFS. Even if the file is updated with the INSERT option using a Hive command, the date doesn't seem to change.
Example: the file was placed in HDFS about 10 days ago, and although it was altered today, the date remains the original one.
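One way to check which timestamp HDFS holds for a specific path is `hdfs dfs -stat`, whose `%Y` format prints the modification time in milliseconds since the epoch. The helper below is only a sketch: the function name `hdfs_age_days` is made up, and it assumes the `hdfs` CLI is on the PATH.

```shell
#!/bin/bash
# Print the age of an HDFS path in whole days, based on its modification time.
# NOW_EPOCH is optional and defaults to the current time.
hdfs_age_days() {
  local path="$1" now="${2:-$(date +%s)}" mtime_ms
  # %Y = modification time in milliseconds since 1970-01-01 UTC
  mtime_ms=$(hdfs dfs -stat %Y "$path") || return 1
  echo $(( (now - mtime_ms / 1000) / 86400 ))
}
```

Running `hdfs_age_days <file_location>` before and after the Hive INSERT shows whether the modification time of that particular file actually changed.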
Created 03-15-2017 12:55 PM
To change the actual date of a file, you need to rewrite it. That was not the original question, as far as I understand. Please open a new question with the exact requirements.