I have a CDH5 lab environment with 6 nodes, node[1-6], plus node7 as the NameNode.
node[1-5]: 8 GB RAM, 2 cores
node: 32 GB RAM, 8 cores
I am new to Spark and am simply trying to count the number of lines in our data. I have uploaded the data (5.3 GB) to HDFS.
When I submit my Spark job, it runs only 2 executors, and I can see it splitting the job into 161 tasks (there are 161 files in the directory).
In the code, I am reading all the files and doing the count on them.
data_raw = sc.textFile(path)
print data_raw.count()
On the CLI: spark-submit --master yarn-client file_name.py --num-executors 6 --executor-cores 1
I expect it to run 6 executors with 1 task running on each, but I see only 2 executors running. I cannot figure out the cause.
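One possible check, not a confirmed diagnosis: spark-submit's usage is `spark-submit [options] <app file> [app arguments]`, so anything placed after file_name.py is handed to the script as ordinary arguments rather than read by spark-submit. Two is also the default for --num-executors on YARN, which would be consistent with the flags not being applied. The sketch below could be dropped into the script to see whether the flags ended up in its own arguments; the helper name `stray_submit_flags` and the flag list are hypothetical, chosen only for illustration.

```python
import sys

# Flags that spark-submit itself understands. If any of them show up in
# sys.argv, they were passed to the script instead of being consumed by
# spark-submit. (Illustrative list, not exhaustive.)
SUBMIT_FLAGS = ("--num-executors", "--executor-cores", "--executor-memory")


def stray_submit_flags(argv):
    """Return the spark-submit flags found among the script's own arguments."""
    # argv[0] is the script name, so only the remaining entries are checked.
    return [arg for arg in argv[1:] if arg in SUBMIT_FLAGS]


if __name__ == "__main__":
    # An empty list means none of the listed flags reached the script.
    print(stray_submit_flags(sys.argv))
```

If the flags do appear in the output, the thing to try would be moving them ahead of the script: spark-submit --master yarn-client --num-executors 6 --executor-cores 1 file_name.py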
Any help would be greatly appreciated.