Support Questions

Grg · ‎09-19-2015

Hello,

When writing a file to HDFS from a Spark application in Scala, I cannot find a way to limit the HDFS resources to be used.

I know I can pass a Hadoop configuration to my Hadoop FileSystem object, which is used for data manipulation such as deleting a file. Is there a way to tell it that, even if I have 3 datanodes and even if each written file should be distributed to at least 2 partitions, I would like to force it to be split and distributed across 3 partitions and datanodes?

I would like to be able to do this programmatically, rather than reconfiguring the Hadoop cluster and restarting it, which would impact all Spark applications.

Thanks in advance for your feedback 🙂

srowen · ‎09-19-2015

Replication is an HDFS-level configuration. It isn't something you
configure from Spark, and you don't have to worry about it from Spark.
AFAIK you set a global replication factor, but can set it per
directory too. I think you want to pursue this via HDFS.
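
For readers who still want to try this from application code, here is a minimal sketch, not part of the original answer, that goes through the Hadoop client API rather than Spark itself. It assumes the Hadoop client libraries are on the classpath and that the configuration points at the cluster. The object name `ReplicationSketch`, the helper names, and the example path are hypothetical. Note that this changes how many copies of each block HDFS keeps; it does not pin a file to specific datanodes.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper object, for illustration only.
object ReplicationSketch {

  // Ask HDFS to change the replication factor of an existing file.
  // FileSystem.setReplication returns true if the request was accepted;
  // the actual re-replication of blocks happens asynchronously.
  def setFileReplication(conf: Configuration, file: String, factor: Short): Boolean = {
    val fs = FileSystem.get(conf)
    fs.setReplication(new Path(file), factor)
  }

  // Set the client-side default replication for files created later
  // with this configuration. Assumption: this is done before the
  // FileSystem instance used for writing is first obtained.
  def setDefaultReplication(conf: Configuration, factor: Int): Unit =
    conf.setInt("dfs.replication", factor)
}
```

In a Spark application the `Configuration` would typically be `sc.hadoopConfiguration`, so a call might look like `ReplicationSketch.setFileReplication(sc.hadoopConfiguration, "/tmp/example/part-00000", 3.toShort)`, where the path is only a placeholder. Because the setting is applied by this client only, other Spark applications and the cluster-wide default should be unaffected, though that is worth verifying on your own cluster.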


Write file to HDFS: limit number of datanodes to be used

Writing parquet on HDFS using Spark Streaming

Writing files to Cloudera Machine Learning using A...

How to limit the size of ranger log and number of ...

How to write to HDFS remotely using pandas

Read/Write throughput HDFS JBOD disk

How to Move or Change HDFS DataNode Directories

Write Spark HQL Query output to HDFS

Datanode low number of blocks

Garbage Collection Pauses in Namenode and Datanode

Flume: HDFS sink: Can't write large files