Member since: 10-17-2016
Posts: 8
Kudos Received: 0
Solutions: 0
02-27-2018 06:11 AM
Thank you for the information. The need for dfs -du is to check how much disk space is available (before starting the job) and to check how much data the job is generating.
02-26-2018 10:53 AM
Hi, I am using HDP 2.6 with Spark 2.1 (and also Spark 1.6), with YARN as the resource manager. I am trying out TeraSort benchmarking jobs on an experimental cluster. I want to run 'hdfs dfs -du' (or 'hadoop fs -du') every time before starting a Spark job to analyse the available disk space on the data nodes. From the following question I understand that running these commands is expensive on the cluster: https://community.hortonworks.com/questions/92214/can-hdfs-dfsadmin-and-hdfs-dsfs-du-be-taxing-on-my.html So I wanted to know whether Hadoop automatically runs dfs -du in the background whenever a new Spark job is started, or whether I need to run it manually. Thanks, Steev
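In case it helps, here is a rough sketch (not from this thread) of checking the same numbers from the Spark driver through the Hadoop FileSystem API instead of shelling out to 'hdfs dfs -du'. The directory path below is only a placeholder.

// A rough sketch of checking HDFS space from the driver with the FileSystem API.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsSpaceCheck {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())

    // Cluster-wide numbers, roughly what 'hdfs dfsadmin -report' summarises.
    val status = fs.getStatus
    println(s"capacity  = ${status.getCapacity} bytes")
    println(s"used      = ${status.getUsed} bytes")
    println(s"remaining = ${status.getRemaining} bytes")

    // Space consumed under one directory, roughly what 'hdfs dfs -du -s' shows.
    val summary = fs.getContentSummary(new Path("/user/steev/terasort-output"))  // placeholder path
    println(s"space consumed (with replication): ${summary.getSpaceConsumed} bytes")
  }
}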
Labels:
- Apache Hadoop
- Apache Spark
02-18-2017 04:06 AM
Thank you Steve for those insights. These are very helpful for beginners like me.
02-16-2017 11:19 AM
Thank you Stevel for the answer. Yes, let us assume that the hardware and the OS (CentOS in my case) support hot swapping. You say it is difficult in a 3-node cluster. So if I have 5 to 6 nodes, can I hot swap the disk without disturbing a currently running Spark job?
02-15-2017 10:17 AM
Hi, We have a 3-node HDP cluster with Ambari 2.4. We run TeraSort jobs for benchmarking. I would like to know how to hot swap a DataNode hard disk (a failed disk) without stopping the cluster services and without stopping an ongoing TeraSort job. Thanks, Steevan
Labels:
- Apache Hadoop
- Apache Spark
10-26-2016 10:48 AM
saveAsTextFile also does this job. I used the following code:

rdd.saveAsTextFile(outputFile, classOf[SnappyCodec])
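For context, a minimal self-contained version of this approach (the application name and output path below are placeholders, not from the original job):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.compress.SnappyCodec

object SnappyTextOutput {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SnappyTextOutput"))
    val outputFile = "hdfs:///tmp/snappy-text-demo"  // placeholder path
    val rdd = sc.parallelize(Seq("a", "b", "c"))
    // The two-argument overload of saveAsTextFile compresses each output part
    // file with the given codec (the Snappy native libraries must be available).
    rdd.saveAsTextFile(outputFile, classOf[SnappyCodec])
    sc.stop()
  }
}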
10-25-2016 09:43 AM
Thank you for the response. I will try the saveAsTextFile() API. The TeraSort Scala code has the following line, which does not generate compressed output by default:

sorted.saveAsNewAPIHadoopFile[TeraOutputFormat](outputFile)

Replacing the above line with the following code solved this issue for me:

sorted.saveAsHadoopFile(outputFile, classOf[Text], classOf[IntWritable],
  classOf[TextOutputFormat[Text, IntWritable]],
  classOf[org.apache.hadoop.io.compress.SnappyCodec])
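For completeness, a small self-contained sketch of the same call outside the actual TeraSort code (the pair RDD and output path are placeholders). Note that saveAsHadoopFile expects the old-API org.apache.hadoop.mapred.TextOutputFormat, not the mapreduce one:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.io.compress.SnappyCodec
import org.apache.hadoop.mapred.TextOutputFormat

object SnappySaveAsHadoopFile {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SnappySaveAsHadoopFile"))
    val outputFile = "hdfs:///tmp/snappy-hadoopfile-demo"  // placeholder path
    // saveAsHadoopFile is available on pair RDDs via PairRDDFunctions.
    val sorted = sc.parallelize(Seq(("a", 1), ("b", 2)))
      .map { case (k, v) => (new Text(k), new IntWritable(v)) }
    sorted.saveAsHadoopFile(outputFile, classOf[Text], classOf[IntWritable],
      classOf[TextOutputFormat[Text, IntWritable]],
      classOf[SnappyCodec])
    sc.stop()
  }
}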
10-22-2016 11:14 AM
Hi, I am running a TeraSort application on 1 GB of data using HDP 2.5 on a 3-node cluster with Spark 1.6. The sort completes fine, but the output data is not compressed: the output written to the HDFS file system is also 1 GB in size. I tried the following Spark configuration options while submitting the Spark job:

--conf spark.hadoop.mapred.output.compress=true
--conf spark.hadoop.mapred.output.compression.codec=true
--conf spark.hadoop.mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
--conf spark.hadoop.mapred.output.compression.type=BLOCK

This did not help. I also tried the following, and it did not help either:

--conf spark.hadoop.mapreduce.output.fileoutputformat.compress=true
--conf spark.hadoop.mapreduce.map.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
--conf spark.hadoop.mapred.output.fileoutputformat.compress.type=BLOCK

How do I get the output data written to HDFS in compressed format? Thanks, Steevan
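One thing worth trying (a hedged sketch, not something confirmed in this thread) is setting the new-API output compression keys directly on the driver's Hadoop configuration. They only take effect if the OutputFormat honours them, and the TeraSort example's TeraOutputFormat may simply ignore them:

import org.apache.spark.{SparkConf, SparkContext}

object CompressedOutputConf {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CompressedOutputConf"))
    // Equivalent to the spark.hadoop.* options above, set programmatically.
    val conf = sc.hadoopConfiguration
    conf.set("mapreduce.output.fileoutputformat.compress", "true")
    conf.set("mapreduce.output.fileoutputformat.compress.codec",
      "org.apache.hadoop.io.compress.GzipCodec")
    conf.set("mapreduce.output.fileoutputformat.compress.type", "BLOCK")
    // ... run the sort/save here; compression depends on the OutputFormat in use.
    sc.stop()
  }
}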
Labels:
- Apache Spark