aj

PySpark Logging to HDFS instead of local filesystem

I would like to use Python's logging library, but I want the log output to land in HDFS instead of on the worker node's local file system. Is there a way to do that?

My code for setting up logging is below:

 

import logging

# basicConfig is a no-op after the first call that configures the root
# logger, so filename, format, and level must be set together in one call
logging.basicConfig(
    filename='/var/log/DataFramedriversRddConvert.log',
    format='%(asctime)s %(message)s',
    level=logging.DEBUG,
)
logging.info('++++Started DataFramedriversRddConvert++++')


Re: PySpark Logging to HDFS instead of local filesystem

@aj

 

You can achieve this by giving a fully qualified path.

 

## To use an HDFS path
hdfs://<cluster-node>:8020/user/<path>

## To use a local path
file:///home/<path>
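One caveat: Python's built-in FileHandler opens its target with the local open() call, so it cannot write to an hdfs:// URL directly. A workaround is a custom logging handler that pipes each record to the hdfs CLI's -appendToFile (with "-" meaning "read from stdin"). This is a rough sketch, not from the thread; the class name and the injectable runner parameter are my own, and spawning one process per record is slow, so in practice you would batch records before flushing:

```python
import logging
import subprocess

class HdfsAppendHandler(logging.Handler):
    """Sends each formatted log record to a file in HDFS by shelling out to
    `hdfs dfs -appendToFile - <path>`, which reads the data from stdin.
    `runner` is injectable so the handler can be exercised without a cluster."""

    def __init__(self, hdfs_path, runner=subprocess.run):
        super().__init__()
        self.hdfs_path = hdfs_path
        self.runner = runner

    def emit(self, record):
        line = self.format(record) + "\n"
        # "-" tells appendToFile to take its input from stdin
        self.runner(
            ["hdfs", "dfs", "-appendToFile", "-", self.hdfs_path],
            input=line.encode("utf-8"),
            check=True,
        )
```

To use it, attach the handler to a logger, e.g. `logging.getLogger().addHandler(HdfsAppendHandler("hdfs://<cluster-node>:8020/user/<path>/app.log"))`.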

 

Some additional notes: it is not recommended to keep logs in HDFS, for two reasons:

1. HDFS uses a replication factor of 3 by default, so every log file consumes three times its size in cluster storage.

2. If HDFS goes down, you cannot check the logs to diagnose the problem.
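If you still want the logs in HDFS despite these caveats, a lighter-weight pattern is to keep logging to the local file system as in your snippet and copy the finished file up once at the end of the job with `hdfs dfs -put`. A minimal sketch, assuming the hdfs CLI is on the PATH; the function names and the injectable runner are mine, and the paths are placeholders:

```python
import subprocess

def hdfs_put_command(local_path, hdfs_dir):
    # -f overwrites a copy left behind by a previous run
    return ["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir]

def ship_log(local_path, hdfs_dir, runner=subprocess.run):
    """Copies the completed local log file into an HDFS directory in one shot,
    avoiding a round-trip to the namenode for every log record."""
    runner(hdfs_put_command(local_path, hdfs_dir), check=True)
```

For example, call `ship_log('/var/log/DataFramedriversRddConvert.log', 'hdfs://<cluster-node>:8020/user/<path>')` after the Spark job finishes.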
