Created on 06-01-2019 05:51 PM - edited 06-01-2019 05:53 PM
Hello guys,
I hope I am posting in the right section.
I have the following Python script (I managed to run it locally):
#!/usr/bin/env python3

import folderstats

df = folderstats.folderstats('hdfs://quickstart.cloudera:8020/user/cloudera/files', hash_name='md5', ignore_hidden=True)
df.to_csv(r'hdfs://quickstart.cloudera:8020/user/cloudera/files.csv', sep=',', index=True)
The directory "files" exists at that location. I checked this through the command line and with HUE, and it's there.
(myproject) [cloudera@quickstart ~]$ hadoop fs -ls /user/cloudera
Found 1 items
drwxrwxrwx   - cloudera cloudera          0 2019-06-01 13:30 /user/cloudera/files
The problem is that the script can't access the directory.
I tried running it normally with python3 script.py, and also as the hdfs superuser with sudo -u hdfs python3 script.py. In both cases the output is:
Traceback (most recent call last):
  File "script.py", line 5, in <module>
    df = folderstats.folderstats('hdfs://quickstart.cloudera:8020/user/cloudera/files', hash_name='md5', ignore_hidden=True)
  File "/home/cloudera/miniconda3/envs/myproject/lib/python3.7/site-packages/folderstats/__init__.py", line 88, in folderstats
    verbose=verbose)
  File "/home/cloudera/miniconda3/envs/myproject/lib/python3.7/site-packages/folderstats/__init__.py", line 32, in _recursive_folderstats
    for f in os.listdir(folderpath):
FileNotFoundError: [Errno 2] No such file or directory: 'hdfs://quickstart.cloudera:8020/user/cloudera/files'
"No such file or directory: 'hdfs://quickstart.cloudera:8020/user/cloudera/files'"
Could you please help me clarify this issue?
Thank you!
Created on 06-03-2019 12:40 PM - edited 06-03-2019 12:42 PM
It looks like you are pointing to the HDFS ../cloudera/files folder when you declare df.
However, when you try to convert df to a csv, you are pointing to ../cloudera/files.csv, not the "files" folder.
Maybe this is causing the issue?
EDIT: I accidentally pressed the "Me Too" button. How do I uncheck it?
Created 06-03-2019 01:21 PM
Created 06-03-2019 03:58 PM
Hi @VladTheLad ,
I have done some research, and I wonder whether the issue is that the Python module you are using does not work with HDFS directories.
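A minimal sketch of why that would happen, assuming only what the traceback shows (folderstats passes the path straight to os.listdir): os.listdir works on the local filesystem and does not interpret URI schemes, so the hdfs:// string is looked up as an ordinary local path. The helper name try_local_listdir is hypothetical and used only for illustration.

```python
import os


def try_local_listdir(path):
    """Return (entries, None) on success, or (None, error) if the local listing fails."""
    try:
        return os.listdir(path), None
    except OSError as exc:  # FileNotFoundError is a subclass of OSError
        return None, exc


# os.listdir has no notion of HDFS: the URI is treated as a local path,
# which does not exist on the machine running the script.
entries, error = try_local_listdir("hdfs://quickstart.cloudera:8020/user/cloudera/files")
print(type(error).__name__ if error else entries)
```

Running the script as a different user (for example with sudo -u hdfs) does not change this, because the lookup never reaches the cluster.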
I found a couple of resources which I think could help in this situation:
and
Thanks, and I hope the above helps.
Li
Li Wang, Technical Solution Manager
Created 06-03-2019 04:19 PM
Hello @lwang
Thank you for your reply!
I already tried these options.
The first one, running hdfs commands through subprocesses, could be an option, but I am not very familiar with how to obtain the metadata I need: file_extension, creation_time, etc.
The second link is more about how to read/write a specific file, for example, .txt files.
I basically want to access a location (a directory) in HDFS, iterate over all the files inside it, and extract metadata about them.
If I find a working solution I can forget about that "folderstats" module and do it in another way.
Created 06-04-2019 09:57 PM
Hi @VladTheLad ,
You can probably explore the different options of the hdfs ls command:
# hdfs dfs -help ls
-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...] :
  List the contents that match the specified file pattern. If path is not
  specified, the contents of /user/<currentUser> will be listed. For a directory a
  list of its direct children is returned (unless -d option is specified).

  Directory entries are of the form:
        permissions - userId groupId sizeOfDirectory(in bytes)
  modificationDate(yyyy-MM-dd HH:mm) directoryName

  and file entries are of the form:
        permissions numberOfReplicas userId groupId sizeOfFile(in bytes)
  modificationDate(yyyy-MM-dd HH:mm) fileName

  -C  Display the paths of files and directories only.
  -d  Directories are listed as plain files.
  -h  Formats the sizes of files in a human-readable fashion
      rather than a number of bytes.
  -q  Print ? instead of non-printable characters.
  -R  Recursively list the contents of directories.
  -t  Sort files by modification time (most recent first).
  -S  Sort files by size.
  -r  Reverse the order of the sort.
  -u  Use time of last access instead of modification for
      display and sorting.
  -e  Display the erasure coding policy of files and directories.
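As a rough sketch of how that output could be turned into per-file metadata from Python, the snippet below parses lines in the format described above. It assumes the hdfs command is on the PATH and that -h and -e are not used; the function names parse_ls_line and list_hdfs and the dictionary keys are hypothetical. Note that the listing carries a modification time (or access time with -u), not a creation time.

```python
import posixpath
import subprocess
from datetime import datetime


def parse_ls_line(line):
    """Parse one line of `hdfs dfs -ls` output into a dict, or return None."""
    parts = line.strip().split(None, 7)  # the path comes last and may contain spaces
    if len(parts) != 8:
        return None  # e.g. the "Found N items" header
    perms, _replicas, owner, group, size, date, time, path = parts
    try:
        size = int(size)
        modified = datetime.strptime(date + " " + time, "%Y-%m-%d %H:%M")
    except ValueError:
        return None  # not an entry line in the expected format
    is_dir = perms.startswith("d")
    return {
        "path": path,
        "is_dir": is_dir,
        "extension": "" if is_dir else posixpath.splitext(path)[1],
        "size": size,
        "owner": owner,
        "group": group,
        "permissions": perms,
        "modified": modified,
    }


def list_hdfs(path):
    """Run `hdfs dfs -ls -R` on path (assumes hdfs is on the PATH) and parse the result."""
    result = subprocess.run(
        ["hdfs", "dfs", "-ls", "-R", path],
        stdout=subprocess.PIPE, universal_newlines=True, check=True,
    )
    entries = (parse_ls_line(line) for line in result.stdout.splitlines())
    return [entry for entry in entries if entry is not None]


# Parsing a line in the documented format, without contacting the cluster:
sample = "drwxrwxrwx   - cloudera cloudera          0 2019-06-01 13:30 /user/cloudera/files"
print(parse_ls_line(sample))
```

On the cluster, list_hdfs('/user/cloudera/files') would return one dictionary per file and subdirectory, which can then be loaded into a pandas DataFrame if a CSV is needed.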
Thanks,
Li
Li Wang, Technical Solution Manager