10-07-2017
10:17 AM
Two ways to delete the output directory (not recommended) in MapReduce:

1) Using the shell (the older "hadoop dfs -rmr" form is deprecated in current releases):

bin/hadoop fs -rm -r /path/to/your/output/

2) Using the Java API:

// the Configuration should contain a reference to your NameNode
FileSystem fs = FileSystem.get(new Configuration());
// the second argument (true) deletes the directory recursively
fs.delete(new Path("/path/to/your/output"), true);

If you want the job to overwrite an existing output directory instead, override Hadoop's OutputFormat class (here the old-API TextOutputFormat) so that the existence check is skipped:

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileAlreadyExistsException;
import org.apache.hadoop.mapred.InvalidJobConfException;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapreduce.security.TokenCache;

public class OverwriteOutputDirOutputFile<K, V> extends TextOutputFormat<K, V> {

    @Override
    public void checkOutputSpecs(FileSystem ignored, JobConf job)
            throws FileAlreadyExistsException, InvalidJobConfException, IOException {
        // Ensure that the output directory is set
        Path outDir = getOutputPath(job);
        if (outDir == null && job.getNumReduceTasks() != 0) {
            throw new InvalidJobConfException("Output directory not set in JobConf.");
        }
        if (outDir != null) {
            FileSystem fs = outDir.getFileSystem(job);
            // normalize the output directory
            outDir = fs.makeQualified(outDir);
            setOutputPath(job, outDir);
            // get delegation tokens for the outDir's file system
            TokenCache.obtainTokensForNamenodes(job.getCredentials(),
                    new Path[] { outDir }, job);
            // the existence check from FileOutputFormat is deliberately left
            // commented out, so an existing output directory no longer fails the job:
            // if (fs.exists(outDir)) {
            //     throw new FileAlreadyExistsException("Output directory " + outDir
            //             + " already exists");
            // }
        }
    }
}
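For completeness, here is a minimal driver sketch showing how the custom format would be plugged into a job via the old mapred API. MyDriver, MyMapper, and MyReducer are hypothetical placeholder classes, and the paths are just examples:

// minimal sketch, assuming the old org.apache.hadoop.mapred API
JobConf conf = new JobConf(MyDriver.class);
conf.setJobName("overwrite-output-demo");
conf.setMapperClass(MyMapper.class);
conf.setReducerClass(MyReducer.class);
// use the overriding format instead of plain TextOutputFormat
conf.setOutputFormat(OverwriteOutputDirOutputFile.class);
FileInputFormat.setInputPaths(conf, new Path("/path/to/input"));
FileOutputFormat.setOutputPath(conf, new Path("/path/to/your/output"));
JobClient.runJob(conf);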
09-27-2017
05:40 AM
HDFS clusters do not benefit from using RAID for data storage: the redundancy RAID provides is not needed, since HDFS already handles it by replicating data across different DataNodes. RAID striping (RAID 0), normally used to increase performance, actually turns out to be slower than the JBOD (Just a Bunch Of Disks) layout used by HDFS, which round-robins blocks across all disks. The reason is that in RAID, read/write operations are limited by the slowest disk in the array, whereas in JBOD the disk operations are independent, so the average speed across the disks is higher than that of the slowest one. If a disk fails in a JBOD setup, HDFS can simply continue operating without it, but if a disk fails in a striped RAID array, the whole array becomes unavailable. RAID is, however, recommended for the NameNode, to protect its metadata against corruption and disk failure.
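As a small illustration of the replication that makes per-disk RAID redundancy unnecessary, here is a sketch of reading and tuning the replication factor through the Hadoop API. The file path and replication value are just examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // HDFS replicates each block to this many DataNodes (default 3),
        // which is what replaces RAID-level redundancy
        System.out.println("dfs.replication = " + conf.get("dfs.replication"));
        // replication can also be changed per file
        fs.setReplication(new Path("/path/to/some/file"), (short) 2);
    }
}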
08-25-2017
06:54 AM
1 Kudo
@Sheetal Sharma LongWritable is the WritableComparable for longs; similarly, IntWritable is the WritableComparable for ints. These interfaces [1] & [2] are necessary for Hadoop/MapReduce: the Comparable part is used when the framework sorts keys on their way to the reducer, and the Writable part serializes the result so it can be written to local disk. Hadoop does not use Java's Serializable because Java serialization is too heavyweight for Hadoop; Writable serializes Hadoop objects in a much lighter way.

[1] https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/io/LongWritable.html
[2] https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/io/IntWritable.html#IntWritable()

"Comparable" is the interface whose abstract method gives us the flexibility to compare two objects. "Writable" is a serialization format meant for writing data to local disk; you can implement your own Writables in Hadoop. Java's built-in serialization is too bulky and slow, which is why the Hadoop community put Writable in place. "WritableComparable" is simply a combination of the two interfaces.

"int" is a primitive type, so it cannot be used as a key or value; Integer is the wrapper class around it. So the more precise question is: what is the difference between Integer and IntWritable? "IntWritable" is the Hadoop variant of Integer, optimized for serialization in the Hadoop environment, whereas an Integer would go through default Java serialization, which is very costly in a Hadoop environment.
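To make this concrete, below is a minimal sketch of a custom WritableComparable, essentially a hand-rolled IntWritable. The class name MyIntWritable is just illustrative:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// write()/readFields() give the compact on-disk/wire format,
// compareTo() lets the framework sort keys during the shuffle
public class MyIntWritable implements WritableComparable<MyIntWritable> {
    private int value;

    public MyIntWritable() {}                    // no-arg constructor required by Hadoop
    public MyIntWritable(int value) { this.value = value; }

    public int get() { return value; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(value);                     // exactly 4 bytes, no class metadata
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        value = in.readInt();
    }

    @Override
    public int compareTo(MyIntWritable o) {
        return Integer.compare(value, o.value);  // used when sorting keys for the reducer
    }
}

Note how write() emits only the raw 4 bytes of the int; default Java serialization would additionally record class metadata with every object, which is exactly the overhead Writable avoids.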