How to connect MongoDB with Hadoop and Spark?

Contributor

I did a bit of research and learned about the Mongo-Hadoop project, but it is not clear to me whether the project also helps with connecting MongoDB to Spark.

1 ACCEPTED SOLUTION

Super Guru

@Jane Becker

The mongo-hadoop project connects both Hadoop and Spark to MongoDB. You can download a prebuilt release from the releases page (https://github.com/mongodb/mongo-hadoop/releases) or build it yourself from the source at https://github.com/mongodb/mongo-hadoop. If you decide to build it yourself, use the bundled Gradle wrapper (gradlew) as in the following steps, then copy the resulting jar into lib/:

# Download and unpack the r1.5.1 source release
wget -P /tmp/ https://github.com/mongodb/mongo-hadoop/archive/r1.5.1.tar.gz
mkdir -p mongo-hadoop
tar -xvzf /tmp/r1.5.1.tar.gz -C mongo-hadoop --strip-components=1

# Now build the mongo-hadoop-spark jars
cd mongo-hadoop
./gradlew jar
cd ..

# Copy the Spark connector jar into lib/ (created if it does not exist)
mkdir -p lib
cp mongo-hadoop/spark/build/libs/mongo-hadoop-spark-*.jar lib/
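
Once the jar is in lib/, Spark still has to be told about it at launch. The sketch below is one way to do that, not something taken from the mongo-hadoop documentation. It assumes the jars sit in lib/ as in the steps above; `jars_csv` is a hypothetical helper name and `my_job.py` is a placeholder for your own application. The helper exists because `spark-submit --jars` takes a comma-separated list, so a shell glob cannot be passed to it directly.

```shell
# jars_csv DIR: print every .jar in DIR as one comma-separated line,
# the format that spark-submit's --jars option expects.
# (Hypothetical helper, not part of mongo-hadoop or Spark.)
jars_csv() {
  dir=$1
  list=""
  for jar in "$dir"/*.jar; do
    # Skip the unexpanded pattern when DIR holds no jars.
    [ -e "$jar" ] || continue
    # Prepend a comma only when the list already has an entry.
    list="${list:+$list,}$jar"
  done
  printf '%s\n' "$list"
}

# Usage (assumes spark-submit is on PATH; my_job.py is a placeholder):
#   spark-submit --jars "$(jars_csv lib)" my_job.py
```

Depending on how the connector jar was built, you may also need the MongoDB Java driver jar in lib/ so that it is picked up by the same list.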

REPLIES

Super Guru

You did not specify your use case, but be aware of some limitations when working with .bson files: https://github.com/mongodb/mongo-hadoop/wiki/Using-.bson-Files

You may also want to connect PySpark to MongoDB. A good reference: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

Contributor

@Constantin Stanca

Thank you. I'll test it and let you know.

Super Collaborator

Hi @Jane Becker,

Apart from the answer above, on the Spark side I believe you can use JDBC to extract the data into a DataFrame.

Spark supports JDBC drivers for loading and saving data; see the Spark SQL documentation on JDBC data sources.

PS: I have not tested this with MongoDB. I am assuming it works, provided the MongoDB JDBC driver follows the generic JDBC driver standard.
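
For the JDBC route, Spark needs the driver jar on its classpath when it starts. The following is a minimal sketch under stated assumptions: lib/mongodb-jdbc-driver.jar is a placeholder path (no specific MongoDB JDBC driver is named in this thread), `jdbc_shell_cmd` is a hypothetical helper name, and none of this has been tested against MongoDB. The helper only prints the command line so it can be inspected before being run.

```shell
# jdbc_shell_cmd JAR: print a spark-shell command line that puts a JDBC
# driver jar on the driver classpath and ships it to the executors.
# (Hypothetical helper; JAR is whatever driver jar you have downloaded.)
jdbc_shell_cmd() {
  jar=$1
  printf 'spark-shell --driver-class-path %s --jars %s\n' "$jar" "$jar"
}

# Print the command for a placeholder jar path.
jdbc_shell_cmd lib/mongodb-jdbc-driver.jar
```

The same two options are what the Spark SQL guide uses when it shows how to make a JDBC driver visible to a session; which driver class and connection URL to pass afterwards depends on the driver you choose.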