Created 10-19-2021 11:28 AM
Dears,
Greetings!
I have a requirement to get all HDFS filenames that are older than N days.
I managed to get all the last-level directories that are older, but the requirement is for all filenames.
Kindly share an hdfs command, script, or code for this.
Also, please advise: if we query a Hive table (for example, a SUM on one column or a JOIN with another table), will that change the timestamp of the underlying HDFS files of the table? Obviously a write or update to the table will change the timestamp of the corresponding HDFS files.
Appreciate a swift response.
Created 10-20-2021 01:01 AM
Hi @DA-Ka,
The example below is inspired by this link.
1) use -t -R to list files recursively, sorted by timestamp:
# sudo -u hdfs hdfs dfs -ls -t -R /warehouse/tablespace/managed/hive/sample_07
drwxrwx---+ - hive hive 0 2021-10-20 06:14 /warehouse/tablespace/managed/hive/sample_07/.hive-staging_hive_2021-10-20_06-13-50_654_7549698524549477159-1
drwxrwx---+ - hive hive 0 2021-10-20 06:13 /warehouse/tablespace/managed/hive/sample_07/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hive 48464 2021-10-20 06:13 /warehouse/tablespace/managed/hive/sample_07/delta_0000001_0000001_0000/000000_0
2) filter entries older than a timestamp (the cutoff is hard-coded here; the plain string comparison works because the YYYY-MM-DD HH:MM format sorts lexicographically):
# sudo -u hdfs hdfs dfs -ls -t -R /warehouse/tablespace/managed/hive/sample_07 | awk '($6" "$7) <= "2021-10-20 06:13" {print $6, $7, $8}'
2021-10-20 06:13 /warehouse/tablespace/managed/hive/sample_07/delta_0000001_0000001_0000
2021-10-20 06:13 /warehouse/tablespace/managed/hive/sample_07/delta_0000001_0000001_0000/000000_0
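To turn this into "older than N days" (the original ask) and to keep only files rather than directories, here is a minimal sketch; the warehouse path is illustrative and `date -d` assumes GNU date:

```shell
# Filter `hdfs dfs -ls -R` output: keep non-directory entries whose
# "date time" columns ($6 and $7) compare <= the cutoff string. The
# comparison works because YYYY-MM-DD HH:MM sorts lexicographically.
filter_older_than() {
  awk -v cutoff="$1" '$1 !~ /^d/ && ($6" "$7) <= cutoff {print $6, $7, $8}'
}

# List HDFS files under path $2 older than $1 days (GNU date assumed).
list_older_than_days() {
  cutoff=$(date -d "-$1 days" '+%Y-%m-%d %H:%M')
  hdfs dfs -ls -R "$2" | filter_older_than "$cutoff"
}

# example: list_older_than_days 30 /warehouse/tablespace/managed/hive/sample_07
```

The `$1 !~ /^d/` test drops directory entries (their permission string starts with `d`), so only filenames are printed.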
Regarding your last question (whether SUM or JOIN changes the timestamp): I'm not sure; please try it and then use the commands above to check the timestamps.
Regards,
Will
If the answer helps, please accept as solution and click thumbs up.
Created 10-20-2021 08:44 AM
Hi @DA-Ka,
SUM and JOIN won't change the timestamp of the underlying file.
Example:
create table mytable (i int,j int,k int);
insert into mytable values (1,2,3),(4,5,6),(7,8,9);
create table mytable2 (i int,j int,k int);
insert into mytable2 values (1,2,6),(3,5,7),(4,8,9);
select * from mytable;
+------------+------------+------------+
| mytable.i | mytable.j | mytable.k |
+------------+------------+------------+
| 1 | 2 | 3 |
| 4 | 5 | 6 |
| 7 | 8 | 9 |
+------------+------------+------------+
select * from mytable2;
+-------------+-------------+-------------+
| mytable2.i | mytable2.j | mytable2.k |
+-------------+-------------+-------------+
| 1 | 2 | 6 |
| 3 | 5 | 7 |
| 4 | 8 | 9 |
+-------------+-------------+-------------+
# sudo -u hdfs hdfs dfs -ls -R /warehouse/tablespace/managed/hive/mytable
drwxrwx---+ - hive hive 0 2021-10-20 15:11 /warehouse/tablespace/managed/hive/mytable/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hive 743 2021-10-20 15:12 /warehouse/tablespace/managed/hive/mytable/delta_0000001_0000001_0000/bucket_00000_0
# sudo -u hdfs hdfs dfs -ls -R /warehouse/tablespace/managed/hive/mytable2
drwxrwx---+ - hive hive 0 2021-10-20 15:23 /warehouse/tablespace/managed/hive/mytable2/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hive 742 2021-10-20 15:23 /warehouse/tablespace/managed/hive/mytable2/delta_0000001_0000001_0000/bucket_00000_0
1. Sum, timestamp is unchanged
select pos+1 as col, sum(val) as sum_col
from mytable t lateral view posexplode(array(i,j,k)) pe as pos, val
group by pos;
+------+----------+
| col | sum_col |
+------+----------+
| 2 | 15 |
| 1 | 12 |
| 3 | 18 |
+------+----------+
# sudo -u hdfs hdfs dfs -ls -R /warehouse/tablespace/managed/hive/mytable
drwxrwx---+ - hive hive 0 2021-10-20 15:11 /warehouse/tablespace/managed/hive/mytable/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hive 743 2021-10-20 15:12 /warehouse/tablespace/managed/hive/mytable/delta_0000001_0000001_0000/bucket_00000_0
2. Inner Join, timestamp is unchanged
select * from
(select * from mytable)T1
join
(select * from mytable2)T2
on T1.i=T2.i;
+-------+-------+-------+-------+-------+-------+
| t1.i | t1.j | t1.k | t2.i | t2.j | t2.k |
+-------+-------+-------+-------+-------+-------+
| 1 | 2 | 3 | 1 | 2 | 6 |
| 4 | 5 | 6 | 4 | 8 | 9 |
+-------+-------+-------+-------+-------+-------+
sudo -u hdfs hdfs dfs -ls -R /warehouse/tablespace/managed/hive/mytable
drwxrwx---+ - hive hive 0 2021-10-20 15:11 /warehouse/tablespace/managed/hive/mytable/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hive 743 2021-10-20 15:12 /warehouse/tablespace/managed/hive/mytable/delta_0000001_0000001_0000/bucket_00000_0
sudo -u hdfs hdfs dfs -ls -R /warehouse/tablespace/managed/hive/mytable2
drwxrwx---+ - hive hive 0 2021-10-20 15:23 /warehouse/tablespace/managed/hive/mytable2/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hive 742 2021-10-20 15:23 /warehouse/tablespace/managed/hive/mytable2/delta_0000001_0000001_0000/bucket_00000_0
Regards,
Will
Created 10-21-2021 12:49 AM
@DA-Ka You can use the HDFS Find tool "org.apache.solr.hadoop.HdfsFindTool" for that purpose.
Refer to the link below, which suggests a method to find the old files.
- http://35.204.180.114/static/help/topics/search_hdfsfindtool.html
However, the search-based HDFS find tool has been removed and is superseded in CDH 6 by the native "hdfs dfs -find" command, documented here: https://hadoop.apache.org/docs/r3.1.2/hadoop-project-dist/hadoop-common/FileSystemShell.html#find. Note that the native command currently supports only -name/-iname matching and has no modification-time test, so for age-based filtering the ls/awk approach above is still needed.