10-07-2017
10:17 AM
Two ways to delete the output directory (not recommended) in MapReduce:

1) Using the shell (the older "hadoop dfs -rmr" form is deprecated in current releases):

bin/hadoop fs -rm -r /path/to/your/output/

2) Using the Java API:

// the Configuration should contain a reference to your NameNode
FileSystem fs = FileSystem.get(new Configuration());
// the second argument (true) deletes the directory recursively
fs.delete(new Path("/path/to/your/output"), true);

If you want the job to overwrite an existing output directory instead, override Hadoop's OutputFormat class (here the old-API TextOutputFormat) so that the existence check is skipped:

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileAlreadyExistsException;
import org.apache.hadoop.mapred.InvalidJobConfException;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapreduce.security.TokenCache;

public class OverwriteOutputDirOutputFile<K, V> extends TextOutputFormat<K, V> {

    @Override
    public void checkOutputSpecs(FileSystem ignored, JobConf job)
            throws FileAlreadyExistsException, InvalidJobConfException, IOException {
        // Ensure that the output directory is set
        Path outDir = getOutputPath(job);
        if (outDir == null && job.getNumReduceTasks() != 0) {
            throw new InvalidJobConfException("Output directory not set in JobConf.");
        }
        if (outDir != null) {
            FileSystem fs = outDir.getFileSystem(job);
            // normalize the output directory
            outDir = fs.makeQualified(outDir);
            setOutputPath(job, outDir);
            // get delegation tokens for the outDir's file system
            TokenCache.obtainTokensForNamenodes(job.getCredentials(),
                    new Path[] { outDir }, job);
            // the existence check from FileOutputFormat is deliberately left
            // commented out, so an existing output directory no longer fails the job:
            // if (fs.exists(outDir)) {
            //     throw new FileAlreadyExistsException("Output directory " + outDir
            //             + " already exists");
            // }
        }
    }
}
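For completeness, here is a minimal driver sketch showing how the custom format would be plugged into a job via the old mapred API. MyDriver, MyMapper, and MyReducer are hypothetical placeholder classes, and the paths are just examples:

// minimal sketch, assuming the old org.apache.hadoop.mapred API
JobConf conf = new JobConf(MyDriver.class);
conf.setJobName("overwrite-output-demo");
conf.setMapperClass(MyMapper.class);
conf.setReducerClass(MyReducer.class);
// use the overriding format instead of plain TextOutputFormat
conf.setOutputFormat(OverwriteOutputDirOutputFile.class);
FileInputFormat.setInputPaths(conf, new Path("/path/to/input"));
FileOutputFormat.setOutputPath(conf, new Path("/path/to/your/output"));
JobClient.runJob(conf);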
09-27-2017
05:40 AM
HDFS clusters do not benefit from using RAID for data storage: the redundancy RAID provides is not needed, since HDFS already handles it by replicating data across different DataNodes. RAID striping (RAID 0), normally used to increase performance, actually turns out to be slower than the JBOD (Just a Bunch Of Disks) layout used by HDFS, which round-robins blocks across all disks. The reason is that in RAID, read/write operations are limited by the slowest disk in the array, whereas in JBOD the disk operations are independent, so the average speed across the disks is higher than that of the slowest one. If a disk fails in a JBOD setup, HDFS can simply continue operating without it, but if a disk fails in a striped RAID array, the whole array becomes unavailable. RAID is, however, recommended for the NameNode, to protect its metadata against corruption and disk failure.
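As a small illustration of the replication that makes per-disk RAID redundancy unnecessary, here is a sketch of reading and tuning the replication factor through the Hadoop API. The file path and replication value are just examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // HDFS replicates each block to this many DataNodes (default 3),
        // which is what replaces RAID-level redundancy
        System.out.println("dfs.replication = " + conf.get("dfs.replication"));
        // replication can also be changed per file
        fs.setReplication(new Path("/path/to/some/file"), (short) 2);
    }
}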
08-25-2017
06:54 AM
1 Kudo
@Sheetal Sharma LongWritable is the WritableComparable for longs; similarly, IntWritable is the WritableComparable for ints. These interfaces [1] & [2] are necessary for Hadoop/MapReduce: the Comparable part is used when the framework sorts keys on their way to the reducer, and the Writable part serializes the result so it can be written to local disk. Hadoop does not use Java's Serializable because Java serialization is too heavyweight for Hadoop; Writable serializes Hadoop objects in a much lighter way.

[1] https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/io/LongWritable.html
[2] https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/io/IntWritable.html#IntWritable()

"Comparable" is the interface whose abstract method gives us the flexibility to compare two objects. "Writable" is a serialization format meant for writing data to local disk; you can implement your own Writables in Hadoop. Java's built-in serialization is too bulky and slow, which is why the Hadoop community put Writable in place. "WritableComparable" is simply a combination of the two interfaces.

"int" is a primitive type, so it cannot be used as a key or value; Integer is the wrapper class around it. So the more precise question is: what is the difference between Integer and IntWritable? "IntWritable" is the Hadoop variant of Integer, optimized for serialization in the Hadoop environment, whereas an Integer would go through default Java serialization, which is very costly in a Hadoop environment.
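To make this concrete, below is a minimal sketch of a custom WritableComparable, essentially a hand-rolled IntWritable. The class name MyIntWritable is just illustrative:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// write()/readFields() give the compact on-disk/wire format,
// compareTo() lets the framework sort keys during the shuffle
public class MyIntWritable implements WritableComparable<MyIntWritable> {
    private int value;

    public MyIntWritable() {}                    // no-arg constructor required by Hadoop
    public MyIntWritable(int value) { this.value = value; }

    public int get() { return value; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(value);                     // exactly 4 bytes, no class metadata
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        value = in.readInt();
    }

    @Override
    public int compareTo(MyIntWritable o) {
        return Integer.compare(value, o.value);  // used when sorting keys for the reducer
    }
}

Note how write() emits only the raw 4 bytes of the int; default Java serialization would additionally record class metadata with every object, which is exactly the overhead Writable avoids.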