Can't access directory from HDFS inside a Python script
- Labels: HDFS
Created on 06-01-2019 05:51 PM - edited 06-01-2019 05:53 PM
Hello guys,
I hope I'm posting in the right section.
I have the following Python script (I managed to run it locally):
#!/usr/bin/env python3
import folderstats

df = folderstats.folderstats('hdfs://quickstart.cloudera:8020/user/cloudera/files', hash_name='md5', ignore_hidden=True)
df.to_csv(r'hdfs://quickstart.cloudera:8020/user/cloudera/files.csv', sep=',', index=True)
I have the directory "files" in that location. I checked this through the command line and even with HUE, and it's there.
(myproject) [cloudera@quickstart ~]$ hadoop fs -ls /user/cloudera
Found 1 items
drwxrwxrwx   - cloudera cloudera          0 2019-06-01 13:30 /user/cloudera/files
The problem is that the directory can't be accessed.
I tried to run the script normally with python3 script.py, and even as a superuser with sudo -u hdfs python3 script.py, and the output says:
Traceback (most recent call last):
  File "script.py", line 5, in <module>
    df = folderstats.folderstats('hdfs://quickstart.cloudera:8020/user/cloudera/files', hash_name='md5', ignore_hidden=True)
  File "/home/cloudera/miniconda3/envs/myproject/lib/python3.7/site-packages/folderstats/__init__.py", line 88, in folderstats
    verbose=verbose)
  File "/home/cloudera/miniconda3/envs/myproject/lib/python3.7/site-packages/folderstats/__init__.py", line 32, in _recursive_folderstats
    for f in os.listdir(folderpath):
FileNotFoundError: [Errno 2] No such file or directory: 'hdfs://quickstart.cloudera:8020/user/cloudera/files'
"No such file or directory: 'hdfs://quickstart.cloudera:8020/user/cloudera/files'"
Can you, please, help me to clarify this issue?
Thank you!
Created on 06-03-2019 12:40 PM - edited 06-03-2019 12:42 PM
It looks like you are pointing to the HDFS ../cloudera/files folder when you declare df.
However, when you try to convert df to a CSV, you are pointing to ../cloudera/files.csv, not the "files" folder.
Maybe this is causing the issue?
EDIT: I accidentally pressed the "Me Too" button. How do I uncheck it?
Created 06-03-2019 01:21 PM
It doesn't even get to that line.
The script cannot access the directory located at hdfs://quickstart.cloudera:8020/user/cloudera/files.
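For what it's worth, the traceback points at os.listdir, which only resolves local filesystem paths, so the hdfs:// URI is treated as a non-existent local directory. A minimal reproduction of the same error (path copied from the traceback):

import os

# os.listdir() works on the local filesystem only, so an hdfs:// URI is seen
# as a (non-existent) local directory name -- this is the same call that
# folderstats makes internally in _recursive_folderstats().
os.listdir('hdfs://quickstart.cloudera:8020/user/cloudera/files')
# FileNotFoundError: [Errno 2] No such file or directory: 'hdfs://quickstart.cloudera:8020/user/cloudera/files'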
Created 06-03-2019 03:58 PM
Hi @VladTheLad ,
I have done some research, and I am wondering if the issue is that the Python module you are using may not work well with HDFS directories.
I found a couple of resources which I think could help in this situation:
and
Thanks, and I hope the above helps.
Li
Li Wang, Technical Solution Manager
Created 06-03-2019 04:19 PM
Hello @lwang
Thank you for your reply!
I already tried those options.
The first one, using subprocesses to run some hdfs commands, could be an option, but I am not very familiar with how to obtain the metadata I need: file_extension, creation_time, etc.
The second link is more about how to read/write a specific file, for example .txt files.
I basically want to access a location (directory) in HDFS, iterate over all the files inside, and extract metadata about each file.
If I find a working solution, I can forget about the "folderstats" module and do it another way.
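One way to do that without folderstats is to go through WebHDFS using the hdfs Python package (HdfsCLI). The sketch below is only a rough illustration: it assumes the package is installed (pip install hdfs), that WebHDFS is reachable on the quickstart VM's default port 50070, and the output path is just an example.

from datetime import datetime
import os

import pandas as pd
from hdfs import InsecureClient  # pip install hdfs (HdfsCLI) -- assumed available

# WebHDFS endpoint of the quickstart VM (assumed default port 50070);
# adjust host/port/user for your cluster.
client = InsecureClient('http://quickstart.cloudera:50070', user='cloudera')

rows = []
for name, status in client.list('/user/cloudera/files', status=True):
    if status['type'] != 'FILE':
        continue
    rows.append({
        'name': name,
        'file_extension': os.path.splitext(name)[1],
        'size_bytes': status['length'],
        'owner': status['owner'],
        # HDFS keeps modification/access times (ms since epoch) but no true
        # creation time.
        'modified': datetime.fromtimestamp(status['modificationTime'] / 1000),
    })

df = pd.DataFrame(rows)
df.to_csv('/home/cloudera/files_metadata.csv', index=False)  # example local output path

client.list() only returns direct children; if the directory needs to be traversed recursively, client.walk() can be used instead.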
Created 06-04-2019 09:57 PM
Hi @VladTheLad ,
You can probably explore the different options of the ls command in HDFS:
# hdfs dfs -help ls
-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...] :
  List the contents that match the specified file pattern. If path is not
  specified, the contents of /user/<currentUser> will be listed. For a
  directory a list of its direct children is returned (unless -d option is
  specified).

  Directory entries are of the form:
        permissions - userId groupId sizeOfDirectory(in bytes) modificationDate(yyyy-MM-dd HH:mm) directoryName

  and file entries are of the form:
        permissions numberOfReplicas userId groupId sizeOfFile(in bytes) modificationDate(yyyy-MM-dd HH:mm) fileName

  -C  Display the paths of files and directories only.
  -d  Directories are listed as plain files.
  -h  Formats the sizes of files in a human-readable fashion rather than a number of bytes.
  -q  Print ? instead of non-printable characters.
  -R  Recursively list the contents of directories.
  -t  Sort files by modification time (most recent first).
  -S  Sort files by size.
  -r  Reverse the order of the sort.
  -u  Use time of last access instead of modification for display and sorting.
  -e  Display the erasure coding policy of files and directories.
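If you go the subprocess route, here is a rough, untested sketch that runs hdfs dfs -ls -R from Python and parses each line into the fields described in the help text above (it assumes the hdfs CLI is on the PATH, reuses the path from the question, and needs Python 3.7+ for capture_output):

import subprocess

# Run a recursive listing and parse the fields documented in the help text:
# permissions, replicas, owner, group, size, date, time, path.
out = subprocess.run(
    ['hdfs', 'dfs', '-ls', '-R', '/user/cloudera/files'],
    capture_output=True, text=True, check=True,
).stdout

files = []
for line in out.splitlines():
    parts = line.split(None, 7)
    if len(parts) != 8:
        continue  # skips blank lines and any "Found N items" header
    permissions, replicas, owner, group, size, date, time, path = parts
    if permissions.startswith('d'):
        continue  # skip directory entries, keep only files
    files.append({
        'path': path,
        'size_bytes': int(size),
        'owner': owner,
        'modified': f'{date} {time}',
    })

print(files)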
Thanks,
Li
Li Wang, Technical Solution Manager
