
PySpark Logging to HDFS instead of local filesystem

Explorer

I would like to use Python's logging library, but I want the log output to land in HDFS instead of the local filesystem of the worker node. Is there a way to do that?

My code for setting up logging is below:

 

import logging

# Note: only the first basicConfig() call takes effect; later calls are
# no-ops, so filename, level, and format must be set together.
logging.basicConfig(filename='/var/log/DataFramedriversRddConvert.log',
                    level=logging.DEBUG,
                    format='%(asctime)s %(message)s')
logging.info('++++Started DataFramedriversRddConvert++++')

1 ACCEPTED SOLUTION

Champion

@aj

 

You can achieve this by giving a fully qualified path.

 

## To use HDFS path

hdfs://<cluster-node>:8020/user/<path>  

 

## To use Local path
file:///home/<path>
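
For example, from PySpark these fully qualified paths can be passed straight to the read/write APIs. A minimal sketch, assuming an active SparkContext named sc; the host name and paths below are placeholders:

# Read from HDFS using a fully qualified path
rdd = sc.textFile("hdfs://namenode.example.com:8020/user/aj/input.txt")

# Write to the node's local filesystem instead
rdd.saveAsTextFile("file:///home/aj/output")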

 

A few additional notes: keeping logs in HDFS is not recommended, for two reasons:

1. HDFS uses a replication factor of 3 by default, so every log write is stored three times.

2. If HDFS goes down, you cannot check the logs.
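
That said, if the goal is specifically to route Python's logging output into HDFS, be aware that the standard library's FileHandler opens its file with a plain local open() call, so passing an hdfs:// URI as the filename will not work on its own. One workaround is a custom handler that writes through an HDFS client. Below is a minimal sketch, assuming pyarrow with a working libhdfs/Hadoop client setup on the node; the handler name, host, port, and paths are hypothetical placeholders:

import logging
import pyarrow.fs as pafs

class HDFSLogHandler(logging.Handler):
    """Hypothetical handler: appends each formatted record to a file in HDFS."""
    def __init__(self, host, port, path):
        super().__init__()
        self.fs = pafs.HadoopFileSystem(host, port)  # needs libhdfs available
        self.path = path

    def emit(self, record):
        line = self.format(record) + "\n"
        # Opens an append stream per record; simple, but not efficient for
        # high log volumes (consider buffering in a real deployment).
        with self.fs.open_append_stream(self.path) as out:
            out.write(line.encode("utf-8"))

logger = logging.getLogger("DataFramedriversRddConvert")
logger.setLevel(logging.DEBUG)
logger.addHandler(HDFSLogHandler("namenode.example.com", 8020,
                                 "/user/aj/DataFramedriversRddConvert.log"))
logger.info("++++Started DataFramedriversRddConvert++++")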


2 REPLIES


New Contributor

This is not working. Please let me know how to use the full path.