Archives of Support Questions (Read Only)

This is an archived, read-only board kept for historical reference. Information and links may no longer be available or relevant. To ask a new question, please post a new topic on the appropriate active board.

PySpark logging to HDFS instead of the local filesystem

Explorer

I would like to use Python's logging library, but I want the log output to land in HDFS instead of on the worker node's local file system. Is there a way to do that?

My code for setting up logging is below:

 

import logging

# Note: logging.basicConfig() only takes effect on the first call,
# so filename, level, and format must all be passed together.
logging.basicConfig(filename='/var/log/DataFramedriversRddConvert.log',
                    level=logging.DEBUG,
                    format='%(asctime)s %(message)s')
logging.info('++++Started DataFramedriversRddConvert++++')


2 REPLIES

Champion (accepted solution)

@aj

 

You can achieve this by giving a fully qualified path.

 

## To use an HDFS path
hdfs://<namenode-host>:8020/user/<path>

## To use a local path
file:///home/<path>
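
For reference, fully qualified URIs like these work anywhere Spark itself reads or writes data. A quick sketch (the NameNode host, port, and user paths below are placeholders, and spark is assumed to be an active SparkSession):

# Spark's own readers and writers accept fully qualified URIs
df = spark.read.csv("hdfs://<namenode-host>:8020/user/<path>/input.csv")
df.write.parquet("hdfs://<namenode-host>:8020/user/<path>/output")

# The same APIs take local paths when qualified with file://
df.write.parquet("file:///home/<path>/output")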

 

Some additional notes: it is not recommended to keep logs in HDFS, for two reasons:

1. HDFS maintains a replication factor of 3 by default, so every log line is stored three times.

2. If HDFS itself goes down, you cannot check the logs.

New Member

This is not working. Please let me know how to use the full path.
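
One likely reason the fully qualified path does not work here: Python's standard logging handlers open files through the local file-system API, so a string beginning with hdfs:// is treated as an ordinary local path, not an HDFS location. Below is a minimal sketch of a workaround that ships each log record to HDFS over WebHDFS. It assumes the third-party hdfs package (pip install hdfs) and a reachable WebHDFS endpoint; the HdfsLogHandler class name, host, port, and paths are all hypothetical placeholders.

import logging
from hdfs import InsecureClient  # third-party WebHDFS client: pip install hdfs

class HdfsLogHandler(logging.Handler):
    # Hypothetical handler that appends each formatted record to a file in HDFS.
    def __init__(self, webhdfs_url, hdfs_path):
        super().__init__()
        # WebHDFS usually listens on port 9870 (Hadoop 3) or 50070 (Hadoop 2).
        self.client = InsecureClient(webhdfs_url)
        self.hdfs_path = hdfs_path
        # Create the file on the first run so later writes can append to it.
        if self.client.status(hdfs_path, strict=False) is None:
            self.client.write(hdfs_path, data='', encoding='utf-8')

    def emit(self, record):
        try:
            line = self.format(record) + '\n'
            self.client.write(self.hdfs_path, data=line,
                              append=True, encoding='utf-8')
        except Exception:
            self.handleError(record)

logger = logging.getLogger('DataFramedriversRddConvert')
logger.setLevel(logging.DEBUG)
handler = HdfsLogHandler('http://<namenode-host>:9870',
                         '/user/<path>/DataFramedriversRddConvert.log')
handler.setFormatter(logging.Formatter('%(asctime)s %(message)s'))
logger.addHandler(handler)
logger.info('++++Started DataFramedriversRddConvert++++')

Appending one record at a time over WebHDFS is slow, so in practice it is more common to log to the local disk and rely on YARN log aggregation to collect the driver and executor logs into HDFS after the application finishes.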