Member since: 07-22-2017
Posts: 15
Kudos Received: 0
Solutions: 0
10-07-2017
10:17 AM
There are two ways to delete the output directory in MapReduce (not recommended):

1) Using the shell:

    bin/hadoop dfs -rmr /path/to/your/output/

   (On newer releases this form is deprecated in favor of: bin/hdfs dfs -rm -r /path/to/your/output/)

2) Using the Java API:

    // the configuration should contain a reference to your NameNode
    FileSystem fs = FileSystem.get(new Configuration());
    // true means the folder you gave is deleted recursively
    fs.delete(new Path("/path/to/your/output"), true);

If you want to overwrite the existing output instead, subclass the Hadoop OutputFormat class and override checkOutputSpecs so that the existence check is skipped:

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileAlreadyExistsException;
    import org.apache.hadoop.mapred.InvalidJobConfException;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextOutputFormat;
    import org.apache.hadoop.mapreduce.security.TokenCache;

    public class OverwriteOutputDirOutputFile<K, V> extends TextOutputFormat<K, V> {
        @Override
        public void checkOutputSpecs(FileSystem ignored, JobConf job)
                throws FileAlreadyExistsException, InvalidJobConfException, IOException {
            // Ensure that the output directory is set
            Path outDir = getOutputPath(job);
            if (outDir == null && job.getNumReduceTasks() != 0) {
                throw new InvalidJobConfException("Output directory not set in JobConf.");
            }
            if (outDir != null) {
                FileSystem fs = outDir.getFileSystem(job);
                // normalize the output directory
                outDir = fs.makeQualified(outDir);
                setOutputPath(job, outDir);
                // get delegation token for the outDir's file system
                TokenCache.obtainTokensForNamenodes(job.getCredentials(),
                        new Path[] {outDir}, job);
                // existence check deliberately disabled so an existing directory is accepted
                /* if (fs.exists(outDir)) {
                    throw new FileAlreadyExistsException("Output directory " + outDir +
                            " already exists");
                }*/
            }
        }
    }
09-27-2017
05:40 AM
HDFS clusters do not benefit from using RAID for DataNode storage. The redundancy that RAID provides is not required, because HDFS already handles it by replicating data across different DataNodes. RAID striping (RAID 0), which is normally used to increase performance, turns out to be slower than the JBOD (Just a Bunch Of Disks) layout used by HDFS, which round-robins blocks across all disks. This is because in a striped array the read/write operations are limited by the slowest disk in the array. In JBOD the disk operations are independent, so the average speed of operations is greater than that of the slowest disk. If a disk fails in JBOD, HDFS can continue to operate without it, but if a disk fails in a striped RAID array the whole array becomes unavailable. RAID is still recommended for the NameNode, to protect its metadata against corruption.
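The throughput argument can be sketched with a toy model. This is only an illustration under simplifying assumptions: a striped array is treated as running every disk at the pace of its slowest member, JBOD disks are treated as fully independent, and the class name, method names, and MB/s figures are all hypothetical.

```java
import java.util.Arrays;

/** Toy throughput model; names and numbers are illustrative, not measured. */
final class DiskLayoutModel {

    // Striped array: every stripe touches all disks, so each disk is
    // effectively paced by the slowest member of the array.
    static double stripedThroughput(double[] diskMBps) {
        double slowest = Arrays.stream(diskMBps).min().orElse(0.0);
        return slowest * diskMBps.length;
    }

    // JBOD: disks serve independent requests, so their throughputs add up.
    static double jbodThroughput(double[] diskMBps) {
        return Arrays.stream(diskMBps).sum();
    }

    public static void main(String[] args) {
        double[] disks = {100.0, 120.0, 80.0}; // hypothetical MB/s per disk
        System.out.println("striped: " + stripedThroughput(disks)); // striped: 240.0
        System.out.println("JBOD: " + jbodThroughput(disks));       // JBOD: 300.0
    }
}
```

Under this model the two layouts only tie when all disks are equally fast; any variation between disks favors JBOD.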
09-21-2017
12:32 PM
Apache Spark: Spark is a fast cluster computing tool. It is advertised as running applications up to 100x faster in memory and 10x faster on disk than Hadoop MapReduce. Spark makes this possible by reducing the number of read/write cycles to disk and keeping intermediate data in memory. To learn more about Spark, see: Spark tutorial for beginners
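The point about read/write cycles can be made concrete by counting disk passes for a multi-stage pipeline. This is a rough sketch, not a benchmark: it assumes each chained MapReduce job reads its input from disk and writes its output back, while an in-memory pipeline reads the input once and writes the final result once. It ignores shuffle spills and memory pressure, and the class and method names are hypothetical.

```java
/** Counts full disk passes over the data set; a simplified cost model. */
final class DiskPassModel {

    // Chained jobs: every stage reads from disk and writes back to disk.
    static int chainedJobDiskPasses(int stages) {
        return 2 * stages;
    }

    // In-memory pipeline: one initial read and one final write,
    // with intermediate results kept in memory between stages.
    static int inMemoryDiskPasses(int stages) {
        return stages > 0 ? 2 : 0;
    }

    public static void main(String[] args) {
        int stages = 5; // hypothetical pipeline length
        System.out.println(chainedJobDiskPasses(stages)); // 10
        System.out.println(inMemoryDiskPasses(stages));   // 2
    }
}
```

The gap grows with the number of stages, which is why iterative workloads tend to show the largest differences.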
09-06-2017
07:15 AM
Labels: Apache Hadoop
08-24-2017
04:17 AM
Labels: Apache Spark
08-21-2017
05:41 AM
Labels: Apache Spark