PySpark Logging to HDFS instead of local filesystem

aj — Fri, 16 Sep 2022 11:20:47 GMT

I would like to use Python's logging library, but I want the log output to land in HDFS instead of the worker node's local file system. Is there a way to do that?

My code for setting up logging is below:

import logging
# basicConfig only takes effect on its first call, so the format is set in the same call.
logging.basicConfig(filename='/var/log/DataFramedriversRddConvert.log', level=logging.DEBUG, format='%(asctime)s %(message)s')
logging.info('++++Started DataFramedriversRddConvert++++')

Re: PySpark Logging to HDFS instead of local filesystem

saranvisa — Mon, 27 Mar 2017 19:30:34 GMT

@aj

You can achieve this by giving a fully qualified path.

## To use HDFS path

hdfs://<cluster-node>:8020/user/<path>

## To use Local path

file:///home/<path>
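
One caveat worth noting: Python's logging.FileHandler opens its target with the built-in open(), so an hdfs:// URI passed as filename is treated as a local path rather than being sent to HDFS. A possible workaround, sketched here under assumptions, is a custom handler that buffers records in memory and pipes them to `hdfs dfs -put -f - <uri>` when it is closed. The names HdfsBufferedHandler and put_to_hdfs are hypothetical (they are not part of logging or PySpark), and the sketch assumes the hdfs command-line client is on the PATH of the machine running the driver.

```python
import io
import logging
import subprocess


def put_to_hdfs(data, hdfs_uri):
    """Pipe bytes to HDFS via stdin. Assumes the `hdfs` CLI is installed and on PATH."""
    subprocess.run(["hdfs", "dfs", "-put", "-f", "-", hdfs_uri], input=data, check=True)


class HdfsBufferedHandler(logging.Handler):
    """Hypothetical handler: keeps records in memory and uploads them once on close()."""

    def __init__(self, hdfs_uri, uploader=put_to_hdfs):
        super().__init__()
        self.hdfs_uri = hdfs_uri
        self.uploader = uploader      # injectable, so the handler can be tried without a cluster
        self.buffer = io.StringIO()
        self._uploaded = False

    def emit(self, record):
        # Format the record and keep it in memory until close().
        self.buffer.write(self.format(record) + "\n")

    def close(self):
        # logging.shutdown() may call close() again at exit, so upload only once.
        if not self._uploaded:
            self._uploaded = True
            self.uploader(self.buffer.getvalue().encode("utf-8"), self.hdfs_uri)
        super().close()
```

To use it, attach the handler with logging.getLogger().addHandler(HdfsBufferedHandler('hdfs://<cluster-node>:8020/user/<path>')) and call its close() method (or logging.shutdown()) at the end of the job. Because everything is buffered in memory, this only suits modest log volumes, and nothing reaches HDFS if the process is killed before close() runs.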

Some additional notes: keeping logs in HDFS is not recommended, for two reasons:

1. HDFS uses a replication factor of 3 by default, so every log file is stored three times.

2. If HDFS goes down, you cannot check the logs.
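
Given those two concerns, one compromise is to keep logging to the local file system while the job runs and copy the finished file to HDFS only at the end. The following is a minimal sketch of that idea, not a definitive implementation: build_put_command and ship_log_to_hdfs are hypothetical helper names, and it assumes the hdfs command-line client is available on the machine where the log file lives.

```python
import logging
import subprocess


def build_put_command(local_path, hdfs_uri):
    """Return the `hdfs dfs -put` command that copies a local file to HDFS, overwriting."""
    return ["hdfs", "dfs", "-put", "-f", local_path, hdfs_uri]


def ship_log_to_hdfs(local_path, hdfs_uri, run=subprocess.run):
    """Flush the logging handlers, then copy the local log file to HDFS.

    Call this once, at the very end of the job: logging.shutdown() closes
    every handler, so later log calls may no longer be written.
    """
    logging.shutdown()
    run(build_put_command(local_path, hdfs_uri), check=True)
```

With the logging setup from the question, the last line of the driver script would be something like ship_log_to_hdfs('/var/log/DataFramedriversRddConvert.log', 'hdfs://<cluster-node>:8020/user/<path>'). The local copy stays readable even when HDFS is unavailable.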

Re: PySpark Logging to HDFS instead of local filesystem

KGF — Tue, 08 Sep 2020 17:33:46 GMT

This is not working. Please let me know how to use the full path.