
PySpark Logging to HDFS instead of local filesystem


Explorer

I would like to use Python's logging library, but I want the log output to land in HDFS instead of the local filesystem of the worker node. Is there a way to do that?

My code for setting up logging is below:

 

import logging

# Note: basicConfig() is a no-op once the root logger is configured,
# so calling it twice silently drops the second call's format setting;
# filename, format, and level all have to go in a single call.
logging.basicConfig(filename='/var/log/DataFramedriversRddConvert.log',
                    format='%(asctime)s %(message)s',
                    level=logging.DEBUG)
logging.info('++++Started DataFramedriversRddConvert++++')

Accepted Solution

Re: PySpark Logging to HDFS instead of local filesystem

Champion

@aj

 

You can achieve this by giving a fully qualified path.

 

## To use HDFS path

hdfs://<cluster-node>:8020/user/<path>  

 

## To use Local path
file:///home/<path>
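
For example, the same fully qualified forms can be passed to any Hadoop-aware writer. A minimal PySpark sketch, with the host and paths as placeholders for your cluster:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('path-demo').getOrCreate()
df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'value'])

# Written into HDFS, addressed through the namenode RPC port.
df.write.mode('overwrite').csv('hdfs://<cluster-node>:8020/user/<path>/demo')

# Written to the local filesystem of the node(s) running the tasks.
df.write.mode('overwrite').csv('file:///home/<path>/demo')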

 

Some additional notes: it is not recommended to keep logs in HDFS, for two reasons:

1. HDFS uses a replication factor of 3 by default, so every log file is stored three times.

2. If HDFS goes down, you cannot check the logs.
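
One caveat worth adding: the standard logging.FileHandler in the question writes through ordinary local file I/O, so an hdfs:// URL is only understood by Hadoop-aware APIs such as the Spark writers above. If Python log records really must land in HDFS directly, one workaround is a custom handler that appends over WebHDFS. The sketch below is illustrative only, not the method from this thread: the third-party hdfs PyPI package, the WebHDFS URL, the user, and the log path are all assumptions.

# Illustrative sketch only: host, user, and paths are placeholders.
# Assumes `pip install hdfs` (a WebHDFS client package) and that
# append is enabled on the cluster.
import logging
from hdfs import InsecureClient


class HdfsLogHandler(logging.Handler):
    """Append each formatted log record to a file in HDFS via WebHDFS."""

    def __init__(self, client, hdfs_path):
        super().__init__()
        self.client = client
        self.hdfs_path = hdfs_path
        # Create the file once so later writes can append to it.
        if self.client.status(hdfs_path, strict=False) is None:
            self.client.write(hdfs_path, data='', overwrite=True,
                              encoding='utf-8')

    def emit(self, record):
        try:
            # One WebHDFS round-trip per record: fine for a driver-side
            # trace, too chatty for anything high-volume.
            self.client.write(self.hdfs_path,
                              data=self.format(record) + '\n',
                              append=True, encoding='utf-8')
        except Exception:
            self.handleError(record)


# 50070 is the default WebHDFS port on Hadoop 2 (9870 on Hadoop 3).
client = InsecureClient('http://<cluster-node>:50070', user='<user>')
handler = HdfsLogHandler(client, '/user/<path>/DataFramedriversRddConvert.log')
handler.setFormatter(logging.Formatter('%(asctime)s %(message)s'))

logger = logging.getLogger('DataFramedriversRddConvert')
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
logger.info('++++Started DataFramedriversRddConvert++++')

Given the per-record round-trip cost, in practice it is usually simpler to log locally and copy the file into HDFS at the end of the job.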
