About shreyag1207

shreyag1207 · ‎10-07-2017

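There are two ways to delete the output directory before a MapReduce job runs (not recommended, because the existing results are lost):

1) Using the shell:

    bin/hadoop dfs -rmr /path/to/your/output/

   On recent Hadoop releases this form is deprecated; the equivalent is hadoop fs -rm -r /path/to/your/output/

2) Using the Java API:

    // the configuration should contain a reference to your NameNode
    FileSystem fs = FileSystem.get(new Configuration());
    // true means the folder you gave is deleted recursively
    fs.delete(new Path("/path/to/your/output"), true);

If you want to overwrite the existing output instead, you need to subclass the Hadoop OutputFormat class and override checkOutputSpecs so that the "already exists" check is skipped:

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileAlreadyExistsException;
    import org.apache.hadoop.mapred.InvalidJobConfException;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextOutputFormat;
    import org.apache.hadoop.mapreduce.security.TokenCache;

    public class OverwriteOutputDirOutputFile<K, V> extends TextOutputFormat<K, V> {

        @Override
        public void checkOutputSpecs(FileSystem ignored, JobConf job)
                throws FileAlreadyExistsException, InvalidJobConfException, IOException {
            // Ensure that the output directory is set
            Path outDir = getOutputPath(job);
            if (outDir == null && job.getNumReduceTasks() != 0) {
                throw new InvalidJobConfException("Output directory not set in JobConf.");
            }
            if (outDir != null) {
                FileSystem fs = outDir.getFileSystem(job);
                // normalize the output directory
                outDir = fs.makeQualified(outDir);
                setOutputPath(job, outDir);
                // get delegation token for the outDir's file system
                TokenCache.obtainTokensForNamenodes(job.getCredentials(),
                        new Path[] {outDir}, job);
                // The existence check is deliberately disabled so that an
                // existing output directory does not fail the job:
                /* if (fs.exists(outDir)) {
                    throw new FileAlreadyExistsException("Output directory " + outDir
                            + " already exists");
                } */
            }
        }
    }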

shreyag1207 · ‎09-27-2017

HDFS clusters do not benefit from using RAID for DataNode storage: the redundancy that RAID provides is not needed, because HDFS already handles it by replicating data across different DataNodes. RAID striping (RAID 0), which is commonly used to increase performance, turns out to be slower than the JBOD (just a bunch of disks) configuration used by HDFS, which round-robins across all disks. This is because in a striped array the read/write operations are limited by the slowest disk in the array. In JBOD the disk operations are independent, so the average speed of operations is greater than that of the slowest disk. If a disk fails in JBOD, HDFS can continue to operate without it, but if a disk fails in a RAID 0 array, the whole array becomes unavailable. RAID is still recommended for the NameNode, to protect its metadata against corruption.
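The "limited by the slowest disk" argument can be made concrete with a little arithmetic. The sketch below is a deliberately simplified model, not a benchmark: it assumes a striped array runs every disk at the pace of its slowest member, while JBOD lets each disk run at its own speed. The class name DiskLayoutModel and the per-disk speeds are hypothetical values chosen for illustration.

```java
import java.util.Arrays;

/** Simplified throughput model for striped RAID vs. JBOD (illustration only). */
final class DiskLayoutModel {

    /** Striped array: each stripe spans all disks, so every disk is paced by the slowest one. */
    static double stripedThroughput(double[] diskMBps) {
        double slowest = Arrays.stream(diskMBps).min().orElse(0.0);
        return slowest * diskMBps.length;
    }

    /** JBOD: disks are served independently, so aggregate throughput is the plain sum. */
    static double jbodThroughput(double[] diskMBps) {
        return Arrays.stream(diskMBps).sum();
    }

    public static void main(String[] args) {
        // Hypothetical sustained speeds in MB/s for four disks of varying health.
        double[] disks = {100.0, 120.0, 80.0, 110.0};
        System.out.println("Striped: " + stripedThroughput(disks) + " MB/s");
        System.out.println("JBOD:    " + jbodThroughput(disks) + " MB/s");
    }
}
```

With these made-up numbers the striped array is modeled at 4 x 80 = 320 MB/s, while JBOD reaches 100 + 120 + 80 + 110 = 410 MB/s, an average of 102.5 MB/s per disk against the slowest disk's 80 MB/s.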

shreyag1207 · ‎09-21-2017

Apache Spark is a lightning-fast cluster computing tool. According to the project, Spark runs applications up to 100x faster in memory and 10x faster on disk than Hadoop MapReduce. It achieves this by reducing the number of read/write cycles to disk and by keeping intermediate data in memory. To learn more about Spark, refer to the link below: Spark tutorial for beginners
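To see why fewer read/write cycles matter, here is a back-of-the-envelope count of disk passes for a pipeline of chained stages. It is a rough sketch under stated assumptions, not a description of either engine's internals: each MapReduce job is assumed to read its input from disk and write its output to disk, while an in-memory engine is assumed to read the input once and write the final result once. Shuffle spills and caching limits are ignored, and the class name DiskPassModel is hypothetical.

```java
/** Counts disk passes for a chain of processing stages (simplified model). */
final class DiskPassModel {

    /** Assumption: every chained MapReduce job does one disk read and one disk write. */
    static int mapReducePasses(int stages) {
        return 2 * stages;
    }

    /** Assumption: intermediate data stays in memory; only the first read and last write hit disk. */
    static int inMemoryPasses(int stages) {
        return stages > 0 ? 2 : 0;
    }

    public static void main(String[] args) {
        for (int stages : new int[] {1, 5, 10}) {
            System.out.println(stages + " stage(s): MapReduce-style = "
                    + mapReducePasses(stages) + " disk passes, in-memory = "
                    + inMemoryPasses(stages) + " disk passes");
        }
    }
}
```

Under this model a single-stage job gains nothing (2 passes either way), but a 10-stage iterative job drops from 20 disk passes to 2, which is where the speed-up for multi-step workloads comes from.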

shreyag1207 · ‎09-06-2017

shreyag1207 · ‎08-24-2017

shreyag1207 · ‎08-21-2017

Online	Offline
Last Visited	‎10-07-2017 10:17 AM

Member Since	‎07-22-2017 05:26 AM
Posts	15

Cloudera Community

Re: How to overwrite an existing output file/dir d...

Re: Should we use RAID with Hadoop?

Re: Why is spark has better speed than Hadoop

What are configuration files in Apache Hadoop?

What is Directed Acyclic Graph in Apache Spark?

Explain foreach() operation in apache spark