Can't access directory from HDFS inside a Python script

New Contributor

Hello guys,

I hope I am posting in the right section.

I have the following Python script (I managed to run it locally):

 

 

#!/usr/bin/env python3

import folderstats

df = folderstats.folderstats('hdfs://quickstart.cloudera:8020/user/cloudera/files', hash_name='md5', ignore_hidden=True)

df.to_csv(r'hdfs://quickstart.cloudera:8020/user/cloudera/files.csv', sep=',', index=True)

 

 

I have the directory: "files" in that location. I checked this through the command line and even with HUE, and it's there.

 

(myproject) [cloudera@quickstart ~]$ hadoop fs -ls /user/cloudera
Found 1 items
drwxrwxrwx   - cloudera cloudera          0 2019-06-01 13:30 /user/cloudera/files

The problem is that the script can't access the directory.

I tried running it normally (python3 script.py) and as the HDFS superuser (sudo -u hdfs python3 script.py). In both cases the output is:

 

 

Traceback (most recent call last):
  File "script.py", line 5, in <module>
    df = folderstats.folderstats('hdfs://quickstart.cloudera:8020/user/cloudera/files', hash_name='md5', ignore_hidden=True)
  File "/home/cloudera/miniconda3/envs/myproject/lib/python3.7/site-packages/folderstats/__init__.py", line 88, in folderstats
    verbose=verbose)
  File "/home/cloudera/miniconda3/envs/myproject/lib/python3.7/site-packages/folderstats/__init__.py", line 32, in _recursive_folderstats
    for f in os.listdir(folderpath):
FileNotFoundError: [Errno 2] No such file or directory: 'hdfs://quickstart.cloudera:8020/user/cloudera/files'

"No such file or directory: 'hdfs://quickstart.cloudera:8020/user/cloudera/files'"

 

Can you please help me understand this issue?

 

Thank you!

 

 

 


New Contributor

It looks like you are pointing to the HDFS ../cloudera/files folder when you declare df.

However, when you try to convert df to a csv, you are pointing to ../cloudera/files.csv, not the "files" folder.

 

Maybe this is causing the issue?

EDIT: I accidentally pressed the "Me Too" button. How do I uncheck it?

New Contributor
That is the location where I want the CSV file to be generated.
It doesn’t even get to that line.
The script cannot access the directory located at hdfs://quickstart.cloudera:8020/user/cloudera/files

Guru

Hi @VladTheLad ,

 

I have done some research, and I suspect the issue is that the Python module you are using may not work with HDFS directories.
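The traceback in the question points the same way: folderstats ends up calling os.listdir, and the functions in Python's os module only read the local filesystem. They do not interpret URI schemes, so an hdfs:// string is treated as an ordinary relative path. A minimal sketch of that behavior, reusing the path from the question (it needs no cluster, because nothing here talks to HDFS):

```python
import os

hdfs_uri = "hdfs://quickstart.cloudera:8020/user/cloudera/files"

# The os module knows nothing about "hdfs://": the whole string is taken
# as a path relative to the current working directory on the local disk.
print(os.path.exists(hdfs_uri))


def try_listdir(path):
    """Return the directory listing, or the name of the OSError raised."""
    try:
        return os.listdir(path)
    except OSError as exc:
        return type(exc).__name__


listing_error = try_listdir(hdfs_uri)
print(listing_error)
```

So the directory can exist in HDFS, as hadoop fs -ls shows, and still be invisible to any library that walks paths with os.listdir.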


I found a couple of resources which I think could help in this situation:

https://community.hortonworks.com/articles/92321/interacting-with-hadoop-hdfs-using-python-codes.htm...

and

https://creativedata.atlassian.net/wiki/spaces/SAP/pages/61177860/Python+-+Read+Write+files+from+HDF...

 

Thanks, and I hope the above helps.

Li

Li Wang, Technical Solution Manager


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.

Learn more about the Cloudera Community:

Terms of Service

Community Guidelines

How to use the forum

New Contributor

Hello @lwang 

 

Thank you for your reply!

 

I already tried these options.

 

The first one, which uses subprocesses to run hdfs commands, could be an option, but I am not sure how to obtain the metadata I need: file_extension, creation_time, etc.

 

The second link is more about how to read/write a specific file, for example, .txt files.

 

I basically want to access a location (a directory) in HDFS, iterate over all the files inside it, and extract metadata about them.

 

If I find a working solution I can forget about that "folderstats" module and do it in another way.
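One way to do that without folderstats is the WebHDFS REST API, whose LISTSTATUS operation returns per-entry metadata as JSON. The following is only a sketch under assumptions that are not confirmed anywhere in this thread: that WebHDFS is enabled on the cluster, and that the NameNode web port is 50070 (the Hadoop 2 default). The helper names are made up for illustration.

```python
import json
import posixpath
from datetime import datetime, timezone
from urllib.parse import quote
from urllib.request import urlopen

# Assumptions: WebHDFS is enabled and the NameNode web UI listens on 50070.
# Adjust host and port for your cluster.
WEBHDFS_BASE = "http://quickstart.cloudera:50070/webhdfs/v1"


def rows_from_liststatus(payload, parent):
    """Convert a decoded LISTSTATUS response into one dict per entry."""
    rows = []
    for status in payload["FileStatuses"]["FileStatus"]:
        name = status["pathSuffix"]
        rows.append({
            "path": posixpath.join(parent, name),
            "name": name,
            "extension": posixpath.splitext(name)[1].lstrip("."),
            "is_dir": status["type"] == "DIRECTORY",
            "size": status["length"],
            "owner": status["owner"],
            "group": status["group"],
            "permission": status["permission"],
            # WebHDFS reports times in milliseconds since the epoch.
            "mtime": datetime.fromtimestamp(
                status["modificationTime"] / 1000, tz=timezone.utc
            ),
        })
    return rows


def walk(parent):
    """Recursively yield one row per file or directory under `parent`."""
    url = f"{WEBHDFS_BASE}{quote(parent)}?op=LISTSTATUS"
    with urlopen(url) as response:
        payload = json.load(response)
    for row in rows_from_liststatus(payload, parent):
        yield row
        if row["is_dir"]:
            yield from walk(row["path"])
```

With a reachable cluster, pandas.DataFrame(walk("/user/cloudera/files")) would give a table that to_csv can write to a local file. Note that HDFS records modification and access times but no creation time, so a creation_time column has no direct equivalent. Depending on the cluster's security settings, the URL may also need a user.name parameter.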

Guru

Hi @VladTheLad ,

 

You could explore the options of the hdfs ls command:

# hdfs dfs -help ls
-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...] :
  List the contents that match the specified file pattern. If path is not
  specified, the contents of /user/<currentUser> will be listed. For a directory a
  list of its direct children is returned (unless -d option is specified).

  Directory entries are of the form:
  	permissions - userId groupId sizeOfDirectory(in bytes)
  modificationDate(yyyy-MM-dd HH:mm) directoryName

  and file entries are of the form:
  	permissions numberOfReplicas userId groupId sizeOfFile(in bytes)
  modificationDate(yyyy-MM-dd HH:mm) fileName

    -C  Display the paths of files and directories only.
    -d  Directories are listed as plain files.
    -h  Formats the sizes of files in a human-readable fashion
        rather than a number of bytes.
    -q  Print ? instead of non-printable characters.
    -R  Recursively list the contents of directories.
    -t  Sort files by modification time (most recent first).
    -S  Sort files by size.
    -r  Reverse the order of the sort.
    -u  Use time of last access instead of modification for
        display and sorting.
    -e  Display the erasure coding policy of files and directories.
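Because the entry format above is fixed, the output of a recursive listing can be parsed from Python. This is a rough sketch, not a tested tool: it assumes the hdfs command is on the PATH, that -h is not used (so sizes stay plain byte counts), and the function names are made up for illustration.

```python
import posixpath
import subprocess
from datetime import datetime


def parse_ls_line(line):
    """Parse one entry line of `hdfs dfs -ls`; return None for other lines."""
    # Split into at most 8 fields so that spaces inside the path survive.
    parts = line.split(None, 7)
    if len(parts) != 8 or parts[0][0] not in "-d":
        return None  # e.g. the "Found 1 items" header or a blank line
    permissions, replicas, owner, group, size, date, clock, path = parts
    return {
        "path": path,
        "extension": posixpath.splitext(path)[1].lstrip("."),
        "is_dir": permissions.startswith("d"),
        "permissions": permissions,
        # Directories show "-" instead of a replica count.
        "replicas": None if replicas == "-" else int(replicas),
        "owner": owner,
        "group": group,
        "size": int(size),
        "mtime": datetime.strptime(f"{date} {clock}", "%Y-%m-%d %H:%M"),
    }


def list_hdfs(path):
    """Run `hdfs dfs -ls -R` on `path` and return the parsed entries."""
    result = subprocess.run(
        ["hdfs", "dfs", "-ls", "-R", path],
        capture_output=True, text=True, check=True,
    )
    rows = (parse_ls_line(line) for line in result.stdout.splitlines())
    return [row for row in rows if row is not None]
```

On the cluster, list_hdfs("/user/cloudera/files") would return one dict per file or directory, which can be passed to pandas.DataFrame and written out with to_csv. The listing gives a modification time (or the access time with -u), not a creation time.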

Thanks,

Li

Li Wang, Technical Solution Manager

