Member since: 10-17-2016
Posts: 8
Kudos Received: 0
Solutions: 0
02-27-2018 06:11 AM
Thank you for the information. The need for dfs -du is to check how much disk space is available (before starting the job) and to check how much data the job is generating.
02-26-2018 10:53 AM
Hi, I am using HDP 2.6 with Spark 2.1 (and also Spark 1.6), with YARN as the resource manager. I am trying out TeraSort benchmarking jobs on an experimental cluster. I want to run 'hdfs dfs -du' (or 'hadoop fs -du') every time before starting a Spark job to analyse the available disk space on the data nodes. From the following question I understand that running these commands is expensive on the cluster: https://community.hortonworks.com/questions/92214/can-hdfs-dfsadmin-and-hdfs-dsfs-du-be-taxing-on-my.html So I wanted to know whether Hadoop automatically runs dfs -du in the background whenever a new Spark job is started, or whether I need to run it manually. Thanks, Steev
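In case it helps, here is a rough sketch (not from this thread) of checking the same numbers from the Spark driver through the Hadoop FileSystem API instead of shelling out to 'hdfs dfs -du'. The directory path below is only a placeholder.

// A rough sketch of checking HDFS space from the driver with the FileSystem API.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsSpaceCheck {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())

    // Cluster-wide numbers, roughly what 'hdfs dfsadmin -report' summarises.
    val status = fs.getStatus
    println(s"capacity  = ${status.getCapacity} bytes")
    println(s"used      = ${status.getUsed} bytes")
    println(s"remaining = ${status.getRemaining} bytes")

    // Space consumed under one directory, roughly what 'hdfs dfs -du -s' shows.
    val summary = fs.getContentSummary(new Path("/user/steev/terasort-output"))  // placeholder path
    println(s"space consumed (with replication): ${summary.getSpaceConsumed} bytes")
  }
}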
Labels:
- Apache Hadoop
- Apache Spark
02-18-2017 04:06 AM
Thank you Steve for those insights. These are very helpful for beginners like me.
02-16-2017 11:19 AM
Thank you Stevel for the answer. Yes, let us assume that the hardware and the OS (CentOS in my case) support hot swapping. You say it is difficult in a 3-node cluster. So if I have 5 to 6 nodes, can I hot swap the disk without disturbing a currently running Spark job?
02-15-2017 10:17 AM
Hi, We have a 3-node HDP cluster with Ambari 2.4. We run TeraSort jobs for benchmarking. I would like to know how to hot swap a DataNode hard disk (a failed disk) without stopping the cluster services and without stopping an ongoing TeraSort job. Thanks, Steevan
Labels:
- Apache Hadoop
- Apache Spark
10-26-2016 10:48 AM
saveAsTextFile also does this job. I used the following code:

rdd.saveAsTextFile(outputFile, classOf[SnappyCodec])
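For context, a minimal self-contained version of this approach (the application name and output path below are placeholders, not from the original job):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.compress.SnappyCodec

object SnappyTextOutput {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SnappyTextOutput"))
    val outputFile = "hdfs:///tmp/snappy-text-demo"  // placeholder path
    val rdd = sc.parallelize(Seq("a", "b", "c"))
    // The two-argument overload of saveAsTextFile compresses each output part
    // file with the given codec (the Snappy native libraries must be available).
    rdd.saveAsTextFile(outputFile, classOf[SnappyCodec])
    sc.stop()
  }
}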
10-25-2016 09:43 AM
Thank you for the response. I will try the saveAsTextFile() API. The TeraSort Scala code has the following line, which does not generate compressed output by default:

sorted.saveAsNewAPIHadoopFile[TeraOutputFormat](outputFile)

Replacing the above line with the following code solved this issue for me:

sorted.saveAsHadoopFile(outputFile, classOf[Text], classOf[IntWritable],
  classOf[TextOutputFormat[Text, IntWritable]],
  classOf[org.apache.hadoop.io.compress.SnappyCodec])
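For completeness, a small self-contained sketch of the same call outside the actual TeraSort code (the pair RDD and output path are placeholders). Note that saveAsHadoopFile expects the old-API org.apache.hadoop.mapred.TextOutputFormat, not the mapreduce one:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.io.compress.SnappyCodec
import org.apache.hadoop.mapred.TextOutputFormat

object SnappySaveAsHadoopFile {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SnappySaveAsHadoopFile"))
    val outputFile = "hdfs:///tmp/snappy-hadoopfile-demo"  // placeholder path
    // saveAsHadoopFile is available on pair RDDs via PairRDDFunctions.
    val sorted = sc.parallelize(Seq(("a", 1), ("b", 2)))
      .map { case (k, v) => (new Text(k), new IntWritable(v)) }
    sorted.saveAsHadoopFile(outputFile, classOf[Text], classOf[IntWritable],
      classOf[TextOutputFormat[Text, IntWritable]],
      classOf[SnappyCodec])
    sc.stop()
  }
}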
10-22-2016 11:14 AM
Hi, I am running a TeraSort application on 1 GB of data using HDP 2.5 on a 3-node cluster with Spark 1.6. The sort completes fine, but the output data is not compressed: the output written to the HDFS file system is also 1 GB in size. I tried the following Spark configuration options while submitting the Spark job:

--conf spark.hadoop.mapred.output.compress=true
--conf spark.hadoop.mapred.output.compression.codec=true
--conf spark.hadoop.mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
--conf spark.hadoop.mapred.output.compression.type=BLOCK

This did not help. I also tried the following, and it did not help either:

--conf spark.hadoop.mapreduce.output.fileoutputformat.compress=true
--conf spark.hadoop.mapreduce.map.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
--conf spark.hadoop.mapred.output.fileoutputformat.compress.type=BLOCK

How do I get the output data written to HDFS in compressed format? Thanks, Steevan
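One thing worth trying (a hedged sketch, not something confirmed in this thread) is setting the new-API output compression keys directly on the driver's Hadoop configuration. They only take effect if the OutputFormat honours them, and the TeraSort example's TeraOutputFormat may simply ignore them:

import org.apache.spark.{SparkConf, SparkContext}

object CompressedOutputConf {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CompressedOutputConf"))
    // Equivalent to the spark.hadoop.* options above, set programmatically.
    val conf = sc.hadoopConfiguration
    conf.set("mapreduce.output.fileoutputformat.compress", "true")
    conf.set("mapreduce.output.fileoutputformat.compress.codec",
      "org.apache.hadoop.io.compress.GzipCodec")
    conf.set("mapreduce.output.fileoutputformat.compress.type", "BLOCK")
    // ... run the sort/save here; compression depends on the OutputFormat in use.
    sc.stop()
  }
}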
Labels:
- Apache Spark