Member since: 03-22-2017
Posts: 26
Kudos Received: 3
Solutions: 0
11-08-2017
09:22 AM
Thanks for your suggestions. I will try to incorporate them and come back to you with more questions!
11-07-2017
12:42 PM
kgautam, Thanks for your reply.
1) Currently I'm not using any combiner. My map-phase output <key,value> pair is <string/text, string/text>. Because the value in the map output is text, I think it will be difficult to write a combiner; usually the combiner performs the same function as the reducer, and I cannot see how to write one for this particular problem.
2) Currently we tried compressing the map output with "-D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.Lz4Codec". Is this configuration enough to compress the map output? Do we have to modify or add any statements in our MapReduce code to use this compression? (See the configuration sketch below this reply.)
3) May I know where you got the rule of thumb "a reducer should process 1 GB of data"?
4) When I have 24 logical cores on one data node, why did you mention 20 * 7? I think it should be 24 * 7.
5) How do I handle a skewed key? Can I handle it using a partitioner, or is there another way? (See the partitioner sketch below this reply.)
Thanks.
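For point 2, here is the kind of configuration I have in mind; this is only a sketch in Scala against the Hadoop API (not our actual job code), and my assumption is that a separate flag has to turn map-output compression on in addition to the codec property. Both properties could equally be passed as -D options on the command line.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
// Sketch only: turn map-output compression on and pick the codec.
val conf = new Configuration()
conf.setBoolean("mapreduce.map.output.compress", true)
conf.set("mapreduce.map.output.compress.codec",
  "org.apache.hadoop.io.compress.Lz4Codec")
val job = Job.getInstance(conf, "my-job")   // the job name is a placeholder
For point 5, a sketch of the kind of partitioner I was thinking about; "HOT_KEY" is a hypothetical skewed key, and the sketch assumes more than one reducer.
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Partitioner
// Sketch only: route one known hot key to a dedicated reducer and hash the rest
// over the remaining reducers, so every key still goes to exactly one reducer.
class SkewAwarePartitioner extends Partitioner[Text, Text] {
  override def getPartition(key: Text, value: Text, numPartitions: Int): Int = {
    if (key.toString == "HOT_KEY") numPartitions - 1
    else (key.hashCode & Int.MaxValue) % (numPartitions - 1)
  }
}
// registered in the driver with job.setPartitionerClass(classOf[SkewAwarePartitioner])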
11-07-2017
11:08 AM
Dear Community, I have a MapReduce job that processes a 1.8 TB data set. My map tasks generate around 2.5 TB of intermediate data, and the number of distinct keys easily crosses a billion. I have set the split size to 128 MB, so the total number of splits generated is approximately 14,000. I have set the number of reducers to 166. My cluster has 8 nodes: 7 are data nodes and 1 is the name node. Each data node has 24 logical cores and 128 GB RAM. When the job runs with this configuration, the map phase completes, but the reduce phase gets stuck at 26%. May I know what split size and how many reducers I should use for this problem with my current cluster size? Please provide suggestions. Thanks.
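As a back-of-the-envelope check on these numbers (approximate figures only, using the ~1 GB-per-reducer rule of thumb mentioned elsewhere in this thread):
// Rough sizing sketch; all figures are approximations from the post above.
val inputBytes        = 1.8e12                                       // ~1.8 TB of input
val splitBytes        = 128.0 * 1024 * 1024                          // 128 MB split size
val numSplits         = math.ceil(inputBytes / splitBytes).toLong    // ~13,400 map tasks
val intermediateBytes = 2.5e12                                       // ~2.5 TB of map output
val bytesPerReducer   = 1e9                                          // rule-of-thumb target per reducer
val numReducers       = math.ceil(intermediateBytes / bytesPerReducer).toLong  // ~2,500 reducers vs. 166 configured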
Labels: Apache Hadoop
08-17-2017
08:46 AM
1 Kudo
Hello Forum, I have read the following statement at http://www.dummies.com/programming/big-data/hadoop/input-splits-in-hadoops-mapreduce/: "In cases where the last record in a block is incomplete, the input split includes location information for the next block and the byte offset of the data needed to complete the record." I would like to know whether this statement is true. Thanks
Labels: Apache Hadoop
08-08-2017
03:20 PM
Hi Ankit, I'm already using Gzip to compress my reduce task output. But if I use gzip compression for the map output, I will not be able to split the map output among reducers; please correct me if I am wrong! That is why I did not use compression for the map output. How do I update the sort algorithm? Do you have any tutorial for doing this? Also, can you explain how to set the io.sort.mb and io.sort.factor parameters?
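For reference, a minimal sketch of setting those two parameters (illustrative values only, not recommendations); in Hadoop 2 they go by the newer property names shown below, and they can also be passed as -D options or set in mapred-site.xml.
import org.apache.hadoop.conf.Configuration
// Sketch only: the values are placeholders.
val conf = new Configuration()
conf.setInt("mapreduce.task.io.sort.mb", 512)      // io.sort.mb: in-memory sort buffer per map task, in MB
conf.setInt("mapreduce.task.io.sort.factor", 50)   // io.sort.factor: number of spill files merged at once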
08-02-2017
10:10 AM
@Bala Vignesh N V The number of splits is already 192. Should I still increase it above 192? My splits are not all the same size, because the length of each line in my data set is not fixed. I used the linespermap property so that every map task gets the same number of lines to process, but since line lengths vary, the split sizes differ across mappers. In this situation, is it good or bad to use all the cores available in the cluster for map tasks? How about using Tachyon for this problem; will I see any performance improvement if I use Tachyon? Thanks in advance for your replies.
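For reference, a rough sketch of wiring up NLineInputFormat in a driver (the lines-per-split value is a placeholder, not what we actually use):
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat
// Placeholder value: every map task gets exactly this many input lines, but the
// byte size of each split still varies with line length, as described above.
val job = Job.getInstance()
job.setInputFormatClass(classOf[NLineInputFormat])
NLineInputFormat.setNumLinesPerSplit(job, 1000)    // same effect as mapreduce.input.lineinputformat.linespermap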
08-02-2017
03:44 AM
Hello, We are working on a problem that takes a 10.2 GB data set as input. We have written a MapReduce program that analyzes this 10.2 GB dataset; the map tasks produce 33 GB of intermediate data and the reduce tasks generate 25 GB of output. We use NLineInputFormat as the input format, and we run this MapReduce job on a 24-node Hadoop 2 cluster.
Each node has the following configuration: Intel i7 processor, 8 cores, 8 GB RAM, 360 GB hard disk, 1 Gbps network interface card, and a 1 Gbps switch. We are not using a dedicated switch for the cluster.
Since 24 * 8 = 192 cores are available, we use all 192 cores for our map tasks. That is, we divided the data set into 192 splits, so 192 map tasks are created and all 192 cores are used. We have set the number of reduce tasks to 170. Apart from setting the number of map and reduce tasks, we have not touched any Hadoop parameters. Currently the job takes 9 minutes 30 seconds to run.
We would like to know whether we are missing any Hadoop parameter settings that could improve the job's running time. It would be really helpful if you could help us improve the performance of this job. Thanks in advance.
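As a rough sanity check on the reducer count (the 0.95x and 1.75x multipliers are a commonly quoted starting point for sizing reducers against cluster capacity, not something specific to our job):
// Figures from the post above; 0.95x targets one wave of reducers,
// 1.75x trades a second wave for better load balancing.
val nodes         = 24
val coresPerNode  = 8
val parallelSlots = nodes * coresPerNode           // 192 tasks can run at once
val oneWave       = (0.95 * parallelSlots).toInt   // ~182 reducers
val twoWaves      = (1.75 * parallelSlots).toInt   // ~336 reducers (we currently use 170)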
Labels: Apache Hadoop
07-31-2017
01:44 PM
Hello All, I would like to know where a map task's intermediate data is written: does context.write() write the data to the hard disk or to the network immediately after it is generated? Which Hadoop parameters should be tuned when a map() task generates over 45 GB of intermediate data and a huge amount of data (for example, above 50 GB) has to be shuffled over the network in a multi-node cluster? Will I see any performance improvement if I increase the io.sort.mb parameter when the map() task generates a huge amount of data? Thanks in advance.
Labels: Apache Hadoop
04-27-2017
01:50 PM
Thanks for your reply. My input data set is 10.2 GB, but after the map function the amount of intermediate data generated is around 40 GB. The number of unique keys is around 4,55,55,500. I think Spark does not keep intermediate data this large in memory but spills it to the hard disk. What is the best way to tune Spark when we have to reduce a massive number of unique keys?
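For context, a minimal sketch of two common knobs for this situation: giving reduceByKey an explicit (larger) partition count and letting the shuffled result spill to disk. The RDD name and the partition count are placeholders, not from my actual job.
import org.apache.spark.storage.StorageLevel
// Sketch only: "kmers" stands for the (kmer, 1) pair RDD from the job below,
// and 400 partitions is an arbitrary example; more partitions mean smaller
// per-task aggregation state.
val counts = kmers
  .reduceByKey(_ + _, 400)                 // explicit shuffle partition count
  .persist(StorageLevel.MEMORY_AND_DISK)   // spill partitions that do not fit in RAM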
04-14-2017
10:26 AM
Hello Members, I have executed the following program (a k-mer counter) on a 2-node Spark cluster with a 10.2 GB data set. It takes around 35 minutes to run. In another experiment, I executed a Java program for the same k-mer counter problem on a 2-node Hadoop 2 cluster, and it also took around 35 minutes. Spark is generally expected to outperform Hadoop, but in my case the running time on both clusters is the same. I would like to know whether I am using all the available resources of the 2-node Spark cluster efficiently, and whether my Scala program for the k-mer counter can be improved over what I have now.
My hardware is as follows. CPU: Intel i7 with 8 cores on both machines in the cluster. RAM: 8 GB on both machines. OS: Ubuntu 16.04. Hard disk: 1 TB on the master node and 500 GB on the slave node. The nodes are connected through a switch. IP address of the master: 172.30.16.233; IP address of the slave: 172.30.17.15.
In case you need some explanation of the k-mer counter problem: if the given string is ATCGATGATT and the k-mer size is 5, then the following k-mers are generated:
ATCGA 1
TCGAT 1
CGATG 1
GATGA 1
ATGAT 1
TGATT 1
My execution command is as follows:
spark-submit --class Kmer1 --master spark://saravanan:7077 --executor-memory 5g /home/hduser/sparkapp/target/scala-2.11/sparkapp_2.11-0.1.jar hdfs://172.30.16.233:54310//input hdfs://172.30.16.233:54310//output
Sample input:
@SRR292770.1 FCB067LABXX:4:1101:1155:2103/1
GGAGTCATCATACGGCGCTGATCGAGACCGCAACGACTTTAAGGTCGCA
+
FFFFCFGDCGGGFCGBGFFFAEGFG;B7A@GEFBFGGFFGFGEFCFFFB
My code filters out the 1st, 3rd, and 4th lines of each record. The number of executors created is 2 and the number of partitions created is 78.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
object Kmer1 {
//
def main(args: Array[String]): Unit = {
//
val sparkConf = new SparkConf().setAppName("Kmer1")
val sc = new SparkContext(sparkConf)
val input = args(0)
val K = 25
val broadcastK = sc.broadcast(K)
val records = sc.textFile(input)
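// drop every 4th line of each 4-line FASTQ record (the quality line)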
val only_three = records.zipWithIndex.filter { case (_, i) => (i + 1) % 4 != 0 }.map { case (e, _) => e }
// remove records that are not actual sequence data
val filteredRDD = only_three.filter(line => {
!(
line.startsWith("@") ||
line.startsWith("+") ||
line.startsWith(";") ||
line.startsWith("!") ||
line.startsWith("~") ||
line.startsWith(">")
)
})
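// slide a window of length K over each read to emit (kmer, 1) pairs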
val kmers = filteredRDD.flatMap(_.sliding(broadcastK.value, 1).map((_, 1)))
// find frequencies of kmers
val kmersGrouped = kmers.reduceByKey(_ + _)
kmersGrouped.saveAsTextFile(args(1))
// done!
sc.stop()
}
}
Labels: Apache Spark