How to move HDFS files that are 10 days old from one directory to another

Hi @Artem Ervits, can you help here?

1 ACCEPTED SOLUTION

Master Mentor

@Amit Panda

Here's a slightly modified script from a Stack Overflow thread:

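#!/bin/bash
# List HDFS directories under [directory] last modified more than [days] days ago.
usage="Usage: dir_diff.sh [directory] [days]"

if [[ $# -ne 2 ]]
then
  echo "$usage"
  exit 1
fi

now=$(date +%s)
# grep "^d" keeps only directory entries from the recursive listing
hadoop fs -ls -R "$1" | grep "^d" | while read -r f; do
  # Field 6 of the listing is the modification date (yyyy-MM-dd)
  dir_date=$(echo "$f" | awk '{print $6}')
  # Age in whole days; "date -d" requires GNU date
  difference=$(( ( now - $(date -d "$dir_date" +%s) ) / (24 * 60 * 60) ))
  if [ "$difference" -gt "$2" ]; then
    echo "$f"
  fi
done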

I don't have files older than 10 days on my HDFS, so I ran it with a 1-day argument, like so:

sudo sh dir_diff.sh /tmp 1
drwx------ - ambari-qa hdfs 0 2017-03-11 15:41 /tmp/ambari-qa
drwx------ - ambari-qa hdfs 0 2017-03-11 15:41 /tmp/ambari-qa/staging
drwxr-xr-x - hdfs hdfs 0 2017-03-11 15:39 /tmp/entity-file-history
drwxr-xr-x - yarn hadoop 0 2017-03-11 15:39 /tmp/entity-file-history/active
drwx------ - hive hdfs 0 2017-03-11 15:42 /tmp/hive/hive/17c0213c-358a-4c89-b803-800762144a21
drwx------ - hive hdfs 0 2017-03-11 15:42 /tmp/hive/hive/17c0213c-358a-4c89-b803-800762144a21/_tmp_space.db
drwx------ - hive hdfs 0 2017-03-11 15:42 /tmp/hive/hive/96049638-4aee-42cc-95f6-0652b3a66cae
drwx------ - hive hdfs 0 2017-03-11 15:42 /tmp/hive/hive/96049638-4aee-42cc-95f6-0652b3a66cae/_tmp_space.db
drwx------ - hive hdfs 0 2017-03-11 15:41 /tmp/hive/hive/e4fe18d1-5cb4-4088-93ff-cf4aac410301
drwx------ - hive hdfs 0 2017-03-11 15:41 /tmp/hive/hive/e4fe18d1-5cb4-4088-93ff-cf4aac410301/_tmp_space.db
drwxr-xr-x - ambari-qa hdfs 0 2017-03-11 15:41 /tmp/tezsmokeinput

On my 2.5 Sandbox, it returns this:

sh dir_diff.sh /tmp 10
drwxr-xr-x - hdfs hdfs 0 2016-10-25 07:48 /tmp/entity-file-history
drwxr-xr-x - yarn hadoop 0 2016-10-25 07:48 /tmp/entity-file-history/active
drwxrwxrwx - guest hdfs 0 2017-01-12 18:42 /tmp/freewheel
drwxrwxrwx - guest hdfs 0 2017-01-12 18:46 /tmp/freewheel/hdfs
drwx-wx-wx - ambari-qa hdfs 0 2016-10-25 07:51 /tmp/hive
drwx------ - ambari-qa hdfs 0 2016-10-25 08:09 /tmp/hive/ambari-qa
drwx------ - hive hdfs 0 2017-01-23 20:51 /tmp/hive/hive/_tez_session_dir
drwx------ - hive hdfs 0 2017-01-16 16:03 /tmp/hive/hive/ff5fb9ba-01db-45d3-b924-e1bd6ee5203b
drwx------ - hive hdfs 0 2017-01-16 16:03 /tmp/hive/hive/ff5fb9ba-01db-45d3-b924-e1bd6ee5203b/_tmp_space.db

Once you have that list, you can move each entry with:

hdfs dfs -mv file newdir
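
As a rough sketch of that step (not part of the original script), the listing lines can be piped into a small helper that extracts the path and calls hdfs dfs -mv on it. The function names extract_path and move_listed and the DRY_RUN variable are hypothetical, and the sketch assumes each input line has the same eight-column layout as the output above. Note that once a parent directory has been moved, its child paths no longer exist at the old location, so the move for those lines will fail.

```shell
#!/bin/bash
# Sketch: move every path listed by dir_diff.sh into a target directory.
# Assumption: each input line looks like "hadoop fs -ls" output, where the
# path starts at the 8th whitespace-separated field.

# Print only the path portion of one listing line.
extract_path() {
  # Drop the first 7 fields (permissions, replication, owner, group,
  # size, date, time); what remains is the path, which may contain spaces.
  printf '%s\n' "$1" | sed -E 's/^([^ ]+ +){7}//'
}

# Read listing lines on stdin and move each path into directory $1.
# With DRY_RUN=1 the command is printed instead of executed.
move_listed() {
  local target="$1" line path
  while IFS= read -r line; do
    [ -z "$line" ] && continue
    path=$(extract_path "$line")
    if [ "${DRY_RUN:-0}" = "1" ]; then
      printf 'hdfs dfs -mv %s %s\n' "$path" "$target"
    else
      # Redirect stdin so the command cannot consume the remaining lines.
      hdfs dfs -mv "$path" "$target" < /dev/null
    fi
  done
}
```

With the functions sourced into the current shell, a dry run against a hypothetical target directory /archive/tmp would be: sh dir_diff.sh /tmp 10 | DRY_RUN=1 move_listed /archive/tmp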

We're adding some new Grafana dashboards in the next release of Ambari that show, at a granular level, who the HDFS users are and what files they're creating. There's also an activity explorer dashboard you can check out in the latest Ambari + SmartSense for other HDFS file statistics, which is especially useful when you're looking for small files.

9 REPLIES

Thanks a lot for the info. Let me try it and I will get back to you.

Thanks a lot, @Artem Ervits. It is working fine.

Hi @Artem Ervits. Moving the files still takes manual effort, though, because the script only lists the old data. Please advise.

Master Mentor
@Amit Panda

To do this on a continuous basis, you need to either set up an Oozie job that runs a script to find old data and move it to a new location, or use Apache NiFi to watch a directory for old data and move it to a new location. There's nothing out of the box that will do that for you.
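
A minimal sketch of the kind of script such a scheduled job could run is below. It is an illustration under stated assumptions, not a tested production job: the function names select_old_files and move_old_files are hypothetical, GNU date is assumed, and "old" is taken to mean a modification date earlier than the calendar day N days ago. It moves files (not directories) and places them all directly in the target directory, so the source directory structure is not preserved.

```shell
#!/bin/bash
# Sketch of a script that a scheduler could run to move HDFS files older
# than N days. Assumes GNU date and the standard "hdfs dfs -ls -R" layout.

# Read "hdfs dfs -ls -R" output on stdin; print the paths of regular files
# whose modification date (field 6, yyyy-MM-dd) is earlier than $1.
select_old_files() {
  awk -v cutoff="$1" '
    $1 ~ /^-/ && $6 < cutoff {
      # Strip the first seven fields; the remainder is the path.
      for (i = 1; i <= 7; i++) sub(/^[^ ]+ +/, "")
      print
    }'
}

# Move files under $1 that are older than $3 days into directory $2.
move_old_files() {
  local source_dir="$1" target_dir="$2" days="$3" cutoff path
  # yyyy-MM-dd strings sort chronologically, so a string comparison works.
  cutoff=$(date -d "${days} days ago" +%Y-%m-%d)
  hdfs dfs -ls -R "$source_dir" | select_old_files "$cutoff" |
    while IFS= read -r path; do
      # Redirect stdin so the command cannot consume the remaining paths.
      hdfs dfs -mv "$path" "$target_dir" < /dev/null
    done
}
```

For example, move_old_files /tmp /archive/tmp 10 would move files under /tmp that are older than 10 days into a hypothetical /archive/tmp directory.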

Master Guru

You can also use Apache Falcon and build data retention policies for HDFS.

But a Falcon feed only looks at the feed path, which is created manually. It never does any operation based on the Unix timestamp.

Contributor

Hi @Artem Ervits,

The date shown when you run

hdfs dfs -ls <directory_location>

is actually the date when the file was placed in HDFS. Even if the file is updated with the INSERT option using a Hive command, the date doesn't seem to change.

Example: a file was placed in HDFS about 10 days ago, and although it was altered today, the date remains the original one.
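
To check the modification time HDFS has recorded for a single path, hdfs dfs -stat with the %Y format prints it in milliseconds since the epoch. The sketch below turns that value into an age in whole days; the function names age_in_days and hdfs_age_in_days are hypothetical.

```shell
#!/bin/bash
# Sketch: report how many whole days ago an HDFS path was last modified.

# Convert a modification time in epoch milliseconds ($1) into an age in
# whole days, relative to the current time in epoch seconds ($2).
age_in_days() {
  echo $(( ($2 - $1 / 1000) / 86400 ))
}

# Print the age in days of the HDFS path given as $1.
# "hdfs dfs -stat %Y" prints the modification time in epoch milliseconds.
hdfs_age_in_days() {
  local mtime_ms
  mtime_ms=$(hdfs dfs -stat "%Y" "$1") || return 1
  age_in_days "$mtime_ms" "$(date +%s)"
}
```

For example, hdfs_age_in_days /tmp/hive prints the number of whole days since /tmp/hive was last modified.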

Master Mentor

To change the actual date of a file, you need to rewrite it. That was not the original question, as far as I understand. Please open a new question with your exact requirements.