All HDFS file names older than N days

New Contributor

Dear all,

 

Greetings!

I have a requirement to get all HDFS file names that are older than N days.

I was able to get all the last-level directories that are older, but the requirement is all file names.

Kindly help with an HDFS command, script, or code for this.

 

Also, please advise: if we query a Hive table and do a SUM on one column or a JOIN with another table, will this change the timestamp of the underlying HDFS files of that Hive table? Obviously, a write or update to the table will change the timestamp of the respective HDFS files.

 

I would appreciate a swift response.

3 ACCEPTED SOLUTIONS

Master Collaborator

Hi @DA-Ka

The example below is inspired by this link.

1) Use -t -R to list files recursively with timestamps:

# sudo -u hdfs hdfs dfs -ls -t -R /warehouse/tablespace/managed/hive/sample_07
drwxrwx---+ - hive hive 0 2021-10-20 06:14 /warehouse/tablespace/managed/hive/sample_07/.hive-staging_hive_2021-10-20_06-13-50_654_7549698524549477159-1
drwxrwx---+ - hive hive 0 2021-10-20 06:13 /warehouse/tablespace/managed/hive/sample_07/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hive 48464 2021-10-20 06:13 /warehouse/tablespace/managed/hive/sample_07/delta_0000001_0000001_0000/000000_0

2) Filter the files older than a timestamp:

# sudo -u hdfs hdfs dfs -ls -t -R /warehouse/tablespace/managed/hive/sample_07 | awk -v dateA="2021-10-20 06:13" '{if (($6" "$7) <= dateA) {print ($6" "$7" "$8)}}'
2021-10-20 06:13 /warehouse/tablespace/managed/hive/sample_07/delta_0000001_0000001_0000
2021-10-20 06:13 /warehouse/tablespace/managed/hive/sample_07/delta_0000001_0000001_0000/000000_0
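To address the original requirement of files older than N days (rather than a fixed timestamp), here is a minimal sketch along the same lines. It assumes a Linux client with GNU date and the hdfs client on the PATH; the target path and variable names are placeholders:

#!/bin/bash
# List HDFS files (not directories) older than N days under a given path.
N=${1:-30}                                            # number of days, default 30
TARGET_DIR=${2:-/warehouse/tablespace/managed/hive}   # placeholder path, adjust as needed
CUTOFF=$(date -d "-${N} days" "+%Y-%m-%d %H:%M")      # cutoff timestamp (GNU date)

# In the -ls -R output, column 1 starts with 'd' for directories (skip them),
# columns 6 and 7 are the modification date and time, and column 8 is the path.
hdfs dfs -ls -R "$TARGET_DIR" | awk -v cutoff="$CUTOFF" '$1 !~ /^d/ && ($6" "$7) <= cutoff {print $6" "$7" "$8}'

This relies on plain string comparison of the yyyy-MM-dd HH:mm timestamps, the same trick as the awk filter above.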

 

Regarding your last question, whether a SUM or JOIN could change the timestamp, I'm not sure; please try it and then use the above commands to check the timestamps.
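A quick way to check is to capture a file's modification time with hdfs dfs -stat before and after running the query, for example (the path is just the sample file from above):

# %y prints the modification time of the given path
sudo -u hdfs hdfs dfs -stat "%y" /warehouse/tablespace/managed/hive/sample_07/delta_0000001_0000001_0000/000000_0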

 

Regards,

Will

If the answer helps, please accept as solution and click thumbs up.


Master Collaborator

Hi @DA-Ka,

SUM and JOIN won't change the timestamps of the underlying files; these are read-only queries, so Hive only reads the table's data files and does not rewrite them.

Example:

create table mytable (i int,j int,k int);
insert into mytable values (1,2,3),(4,5,6),(7,8,9);
create table mytable2 (i int,j int,k int);
insert into mytable2 values (1,2,6),(3,5,7),(4,8,9);

select * from mytable;
+------------+------------+------------+
| mytable.i | mytable.j | mytable.k |
+------------+------------+------------+
| 1 | 2 | 3 |
| 4 | 5 | 6 |
| 7 | 8 | 9 |
+------------+------------+------------+

select * from mytable2;
+-------------+-------------+-------------+
| mytable2.i | mytable2.j | mytable2.k |
+-------------+-------------+-------------+
| 1 | 2 | 6 |
| 3 | 5 | 7 |
| 4 | 8 | 9 |
+-------------+-------------+-------------+

# sudo -u hdfs hdfs dfs -ls -R /warehouse/tablespace/managed/hive/mytable
drwxrwx---+ - hive hive 0 2021-10-20 15:11 /warehouse/tablespace/managed/hive/mytable/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hive 743 2021-10-20 15:12 /warehouse/tablespace/managed/hive/mytable/delta_0000001_0000001_0000/bucket_00000_0

# sudo -u hdfs hdfs dfs -ls -R /warehouse/tablespace/managed/hive/mytable2
drwxrwx---+ - hive hive 0 2021-10-20 15:23 /warehouse/tablespace/managed/hive/mytable2/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hive 742 2021-10-20 15:23 /warehouse/tablespace/managed/hive/mytable2/delta_0000001_0000001_0000/bucket_00000_0

1. Sum, timestamp is unchanged

select pos+1 as col,sum (val) as sum_col
from mytable t lateral view posexplode(array(*)) pe
group by pos;

+------+----------+
| col | sum_col |
+------+----------+
| 2 | 15 |
| 1 | 12 |
| 3 | 18 |
+------+----------+

# sudo -u hdfs hdfs dfs -ls -R /warehouse/tablespace/managed/hive/mytable
drwxrwx---+ - hive hive 0 2021-10-20 15:11 /warehouse/tablespace/managed/hive/mytable/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hive 743 2021-10-20 15:12 /warehouse/tablespace/managed/hive/mytable/delta_0000001_0000001_0000/bucket_00000_0

2. Inner Join, timestamp is unchanged

select * from
(select * from mytable)T1
join
(select * from mytable2)T2
on T1.i=T2.i;

+-------+-------+-------+-------+-------+-------+
| t1.i | t1.j | t1.k | t2.i | t2.j | t2.k |
+-------+-------+-------+-------+-------+-------+
| 1 | 2 | 3 | 1 | 2 | 6 |
| 4 | 5 | 6 | 4 | 8 | 9 |
+-------+-------+-------+-------+-------+-------+

# sudo -u hdfs hdfs dfs -ls -R /warehouse/tablespace/managed/hive/mytable
drwxrwx---+ - hive hive 0 2021-10-20 15:11 /warehouse/tablespace/managed/hive/mytable/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hive 743 2021-10-20 15:12 /warehouse/tablespace/managed/hive/mytable/delta_0000001_0000001_0000/bucket_00000_0
# sudo -u hdfs hdfs dfs -ls -R /warehouse/tablespace/managed/hive/mytable2
drwxrwx---+ - hive hive 0 2021-10-20 15:23 /warehouse/tablespace/managed/hive/mytable2/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hive 742 2021-10-20 15:23 /warehouse/tablespace/managed/hive/mytable2/delta_0000001_0000001_0000/bucket_00000_0

 

Regards,

Will


Expert Contributor

@DA-Ka You need to use the HDFS Find tool "org.apache.solr.hadoop.HdfsFindTool" for that purpose.

Refer to the link below, which suggests a method to find old files; a rough usage sketch follows it.

- http://35.204.180.114/static/help/topics/search_hdfsfindtool.html
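As a rough usage sketch (the jar location below is an assumption and varies by CDH version; -type f and -mtime +30 follow the GNU-find-style expressions the tool supports):

# Hypothetical jar path; adjust to your search-mr installation.
hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-*-job.jar \
    org.apache.solr.hadoop.HdfsFindTool -find /warehouse/tablespace/managed/hive -type f -mtime +30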

 

However, the search-based HDFS find tool has been removed and is superseded in CDH 6 by the native "hdfs dfs -find" command, documented here: https://hadoop.apache.org/docs/r3.1.2/hadoop-project-dist/hadoop-common/FileSystemShell.html#find
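For reference, a minimal sketch of the native command. Note that, per the FileSystemShell documentation linked above, -find only supports name-based expressions (-name/-iname) plus -print, so filtering by age still needs an ls-based approach like the one in the first answer:

# Placeholder path; prints every item whose name matches the pattern.
hdfs dfs -find /warehouse/tablespace/managed/hive -name 'delta_*' -print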
