
If we set up MongoDB sharding on an HDP cluster, are DataNodes still needed in the cluster?


I installed an HDP cluster with 4 DataNodes and added the mongod/mongos/mongod config services, i.e. MongoDB with sharding. I tried to place them on the HDP DataNodes (2 shard groups, 2 nodes each). Somehow, mongod and mongos are unable to start.

So I started wondering:

1. With HDP and mongod/mongos, do we still need DataNodes in the system?

2. Can a DataNode and a MongoDB shard run on the same node?

3. If not, can we have 0 DataNodes in an HDP cluster?

4. Kafka is in the system and we need to use Kafka/Spark, so DataNodes are needed for Spark/Scala RDDs in this architecture, right?

If anyone has any clue on this, please help. Thank you very much.

Robin

1 Reply


According to the MongoDB blog post here,

https://www.mongodb.com/blog/post/in-the-loop-with-hadoop-take-advantage-of-data-locality-with-mongo...

This means that we can accomplish local reads by putting a DataNode process (for Hadoop) or a Spark executor on the same node with a MongoDB shard and a mongos. In other words, we have an equal number of MongoDB shards and Hadoop or Spark worker nodes. We provide the addresses to all these mongos instances to the connector, so that the connector will automatically tag each InputSplit with the address of the mongos that lives together with the shard that holds the data for the split.

The above says that we have an equal number of MongoDB shards and Hadoop or Spark worker nodes. I take that to mean the number of MongoDB shards equals the number of DataNodes running Spark. So my next step is to make sure Spark is installed on all DataNodes, and then place the MongoDB shards on those same worker nodes.
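To make the colocation concrete, here is a minimal Scala sketch of how the connector could be pointed at every mongos in such a layout. The host names, port, and database/collection names are hypothetical placeholders; the idea, per the blog post, is that listing all the mongos instances (one per worker node) lets the connector tag each InputSplit with the mongos that lives next to the shard holding that split's data.

```scala
// Hypothetical layout: four HDP worker nodes, each running a DataNode,
// a Spark executor, a MongoDB shard, and a mongos on the default port.
val workerNodes = Seq("worker1", "worker2", "worker3", "worker4")
val mongosPort  = 27017

// Build a single MongoDB connection URI that lists every mongos, so the
// connector can pick the co-located one for each split.
// "mydb.mycollection" is a placeholder namespace.
val inputUri = workerNodes
  .map(host => s"$host:$mongosPort")
  .mkString("mongodb://", ",", "/mydb.mycollection")

println(inputUri)
// This URI would then be handed to the mongo-hadoop / Spark connector
// configuration as the input URI (property names vary by connector version,
// so check the connector docs for the exact key).
```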

Let me know if you think differently.

Thanks,

Robin