Created on 03-31-2017 07:42 PM
Interacting with Hadoop HDFS using Python
This post will go through the following:
- Introducing the Python “subprocess” module
- Running HDFS commands with Python
- Examples of HDFS commands from Python
1-Introducing the Python “subprocess” module
The Python “subprocess” module allows us to:
- spawn new Unix processes
- connect to their input/output/error pipes
- obtain their return codes
To run a UNIX command we need to create a subprocess that executes it. The recommended approach is to use the module's convenience functions for all use cases they can handle; alternatively, the underlying Popen interface can be used directly.
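For instance, the convenience functions cover simple cases like the following. This is a minimal sketch; /tmp is just an example path:

```python
import subprocess

# call() runs a command, waits for it to finish, and returns its exit code
ret = subprocess.call(['ls', '-l', '/tmp'])

# check_output() captures stdout and raises CalledProcessError
# if the command exits with a non-zero code
out = subprocess.check_output(['ls', '-l', '/tmp'])
print(out)
```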
2-Running HDFS commands with Python
We will create a Python function called run_cmd that allows us to run any Unix/Linux command, and in our case hdfs dfs commands, as a subprocess, capturing stdout and stderr. The command is passed as a Python list of arguments rather than a single string, so there is no need to parse or escape shell characters.
```python
# import the python subprocess module
import subprocess

def run_cmd(args_list):
    """Run a linux command and return (return_code, stdout, stderr)."""
    print('Running system command: {0}'.format(' '.join(args_list)))
    proc = subprocess.Popen(args_list, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    s_output, s_err = proc.communicate()
    s_return = proc.returncode
    return s_return, s_output, s_err
```
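As a quick sanity check, run_cmd works for any command on the PATH, not just hdfs dfs; for example (using /tmp purely as an illustration):

```python
ret, out, err = run_cmd(['ls', '-l', '/tmp'])
if ret == 0:
    print(out)
else:
    print('Command failed with code {0}: {1}'.format(ret, err))
```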
3-Examples of HDFS commands from Python
Run Hadoop ls command in Python:

```python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-ls', 'hdfs_file_path'])
lines = out.split('\n')
```

Run Hadoop get command in Python:

```python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-get', 'hdfs_file_path', 'local_path'])
```

Run Hadoop put command in Python:

```python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-put', 'local_file', 'hdfs_file_path'])
```

Run Hadoop copyFromLocal command in Python:

```python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-copyFromLocal', 'local_file', 'hdfs_file_path'])
```

Run Hadoop copyToLocal command in Python:

```python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-copyToLocal', 'hdfs_file_path', 'local_file'])
```

Run Hadoop remove file command in Python. Usage: hdfs dfs -rm -skipTrash /path/to/file/you/want/to/remove/permanently

```python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-rm', 'hdfs_file_path'])
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-rm', '-skipTrash', 'hdfs_file_path'])
```

Run Hadoop remove directory command in Python. rm -r removes the entire directory and all of its content from HDFS. Usage: hdfs dfs -rm -r <path>

```python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-rm', '-r', 'hdfs_file_path'])
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-rm', '-r', '-skipTrash', 'hdfs_file_path'])
```

Check if a file exists in HDFS. Usage: hadoop fs -test -[defsz] URI

Options:
- -d: if the path is a directory, return 0.
- -e: if the path exists, return 0.
- -f: if the path is a file, return 0.
- -s: if the path is not empty, return 0.
- -z: if the file is zero length, return 0.

Example: hadoop fs -test -e filename

```python
hdfs_file_path = '/tmpo'
cmd = ['hdfs', 'dfs', '-test', '-e', hdfs_file_path]
ret, out, err = run_cmd(cmd)
print(ret, out, err)
if ret:
    print('file does not exist')
```
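Because run_cmd returns stdout as a single string (under Python 2.7, as in this article; on Python 3 you would decode the bytes first), the output of -ls can be post-processed directly in Python. A small sketch, assuming the standard eight-column hdfs dfs -ls output format and a placeholder path:

```python
# List a directory and extract just the full paths from the listing.
# hdfs dfs -ls prints a "Found N items" header followed by 8-column rows;
# the last column is the file path.
ret, out, err = run_cmd(['hdfs', 'dfs', '-ls', '/user/example'])
if ret == 0:
    for line in out.strip().split('\n'):
        fields = line.split()
        if len(fields) == 8:
            print(fields[7])
```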
These simple but very powerful lines of code allow us to interact with HDFS programmatically and can easily be scheduled as part of cron jobs.
Created on 04-03-2017 01:41 PM
In case you want structured results from HDFS commands, or to further reduce latency and overhead, also have a look at "snakebite", a pure Python implementation of HDFS client functionality:
https://github.com/spotify/snakebite
https://community.hortonworks.com/articles/26416/how-to-install-snakebite-in-hdp.html
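For reference, a minimal snakebite sketch; the NameNode host and port here ('namenode', 8020) are assumptions you would replace with your cluster's values, and snakebite targets Python 2:

```python
from snakebite.client import Client

# Talk to the NameNode directly over RPC -- no JVM or hdfs CLI involved
client = Client('namenode', 8020)

# ls() takes a list of paths and yields one dict per entry
for entry in client.ls(['/']):
    print(entry['path'])
```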
Created on 04-03-2017 07:09 PM
Thanks for the comment, Michael. I wrote these commands for HDP environments using standard Python 2.7, where we cannot do a pip install of snakebite (i.e., HDP clusters are behind the firewall in a secure zone where pip downloads are not allowed).
Created on 01-30-2018 10:21 AM
This was an excellent article on interacting with Hadoop HDFS using Python, and very useful. Thank you so much for gathering all this information in one post with examples; it will be extremely helpful for many people.
Created on 12-17-2019 07:54 AM
Hi all.
Here are all the steps for doing the same:
Link:
https://www.oreilly.com/library/view/hadoop-with-python/9781492048435/ch01.html
Thanks
HadoopHelp
Created on 05-26-2020 08:14 PM
Hi,
Thank you for this great article and the code snippets; they are really useful. I just face one problem sometimes while executing the commands: it gives OSError: argument list too long. Can you or someone please help me resolve this? My run_cmd call looks as follows:
run_cmd(['/usr/bin/hdfs', 'dfs', '-copyToLocal', 'hdfs://namenode:port/path/to/file', 'localPath'])
Thanks for any help in advance.
Regards,
Wasif
Created on 05-26-2020 10:57 PM
As this is an old article, you would have a better chance of receiving a useful response by starting a new thread. This will also provide you with the opportunity to provide details specific to your issue that could aid others in providing a more tailored answer to your question.
Regards,
Vidya
Created on 05-26-2020 11:52 PM - last edited on 05-27-2020 12:31 AM by VidyaSargur
@VidyaSargur Thank you for the response and the suggestion. I will create a new thread for my problem.
Edit: I have created my new question here.
Thanks and regards,
Wasif