read/write hdfs files with standalone python script

Explorer

Hello,

I have some standalone Python scripts that access data in the usual way:

with open("filename") as f:
   for lines in f:
[...]

 

I want to make the Python scripts able to run without changing too much of the code and, if possible, without extra dependencies. Right now I start the scripts as Spark programs in a workflow in HUE.

Are there built-in packages I can use? I tried to import pydoop and hdfs, but they are not installed.

My goal is to make these scripts run and be able to read/write files on HDFS.

Thanks for the help.

1 ACCEPTED SOLUTION

Expert Contributor

Hello Creaping,

Because HDFS is not a standard Unix filesystem, Python's native I/O functions such as open() cannot read it. HDFS is open source, so there are plenty of connectors available. You can also access HDFS through HttpFS, which exposes a REST interface.
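The REST route mentioned above needs nothing beyond the Python standard library, so a minimal sketch may help. It assumes an HttpFS (or WebHDFS) endpoint is reachable and accepts simple `user.name` authentication; a Kerberos-secured cluster would need more than this. The host name, port 14000 (the usual HttpFS default), user name, and the helper names `webhdfs_url` and `open_hdfs_text` are all placeholders chosen for illustration, not something from the original post.

```python
import io
import urllib.parse
import urllib.request

# Assumptions: placeholder host and user; 14000 is the usual HttpFS default port.
HTTPFS_BASE = "http://httpfs-host.example.com:14000"
HDFS_USER = "your_user"


def webhdfs_url(path, op, base=HTTPFS_BASE, user=HDFS_USER, **params):
    """Build a WebHDFS REST URL for an absolute HDFS path (hypothetical helper)."""
    query = {"op": op, "user.name": user}
    query.update(params)
    return "{}/webhdfs/v1{}?{}".format(
        base, urllib.parse.quote(path), urllib.parse.urlencode(query)
    )


def open_hdfs_text(path, encoding="utf-8"):
    """Return a text file object for an HDFS file, usable like open(path)."""
    # op=OPEN is the WebHDFS read operation; urllib follows any redirect.
    response = urllib.request.urlopen(webhdfs_url(path, "OPEN"))
    return io.TextIOWrapper(response, encoding=encoding)


# Usage, mirroring the loop from the question (needs a reachable endpoint):
#
#     with open_hdfs_text("/user/your_user/filename") as f:
#         for line in f:
#             ...
```

With this shape, only the `open(...)` call in each script changes and the loop body stays as it is. Writing goes through the `CREATE` operation with an HTTP PUT and is a bit more involved, so it is left out here.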

If you need to parse a large amount of data, none of these options is suitable, because the script itself still runs on a single machine. To solve that, you can rewrite your script with PySpark and use the Spark-provided utilities to manipulate the data. This solves both HDFS access and distribution of the workload.
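As a rough sketch of what such a rewrite could look like, the snippet below reads a text file from HDFS line by line, applies a per-line function, and writes the result back. The paths, the application name, and the `process_line` body are placeholders standing in for whatever the original loop does. PySpark is imported inside `main` only so that the per-line logic can be tried without a Spark installation.

```python
def process_line(line):
    """Placeholder for the work done inside the original 'for line in f' loop."""
    return line.strip().upper()


def main(input_path, output_path):
    # Imported here so process_line can be tested without Spark available.
    from pyspark import SparkContext

    sc = SparkContext(appName="hdfs-read-write-sketch")
    try:
        # textFile yields one record per line, like iterating over a file.
        lines = sc.textFile(input_path)
        # saveAsTextFile writes a directory of part files and fails if the
        # output path already exists.
        lines.map(process_line).saveAsTextFile(output_path)
    finally:
        sc.stop()


# Example call with hypothetical paths, e.g. from a spark-submit entry point:
#
#     main("hdfs:///user/your_user/filename", "hdfs:///user/your_user/output")
```

Since the scripts are already launched as Spark programs from HUE, this variant should not need any additional package installation, although that depends on how the cluster is set up.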

Zsolt


Explorer

Hello Zsolt,

Thanks for the reply. The problem was that I don't have permission to install Python packages like pydoop.

I was not sure whether there is a native way, but I will ask the sysadmin to install some packages.