Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

How to connect MongoDB with Hadoop and Spark?

avatar
New Member

I did a bit of research and learned about the Mongo-Hadoop project, but I am not clear whether the project is also helpful for connecting to Spark.

1 ACCEPTED SOLUTION

avatar
Super Guru

@Jane Becker

The mongo-hadoop project connects Hadoop AND SPARK with MongoDB. You can download it from the releases page (https://github.com/mongodb/mongo-hadoop/releases) or build it yourself from https://github.com/mongodb/mongo-hadoop. If you decide to build it yourself, you could do it using gradlew and the following steps, then copy the jar into lib/

wget -P /tmp/ https://github.com/mongodb/mongo-hadoop/archive/r1.5.1.tar.gz
mkdir mongo-hadoop
tar -xvzf /tmp/r1.5.1.tar.gz -C mongo-hadoop --strip-components=1

# Now build the mongo-hadoop-spark jars
cd mongo-hadoop
./gradlew jar
cd ..
cp mongo-hadoop/spark/build/libs/mongo-hadoop-spark-*.jar lib/

View solution in original post

4 REPLIES 4

avatar
Super Guru

@Jane Becker

The mongo-hadoop project connects Hadoop AND SPARK with MongoDB. You can download it from the releases page (https://github.com/mongodb/mongo-hadoop/releases) or build it yourself from https://github.com/mongodb/mongo-hadoop. If you decide to build it yourself, you could do it using gradlew and the following steps, then copy the jar into lib/

wget -P /tmp/ https://github.com/mongodb/mongo-hadoop/archive/r1.5.1.tar.gz
mkdir mongo-hadoop
tar -xvzf /tmp/r1.5.1.tar.gz -C mongo-hadoop --strip-components=1

# Now build the mongo-hadoop-spark jars
cd mongo-hadoop
./gradlew jar
cd ..
cp mongo-hadoop/spark/build/libs/mongo-hadoop-spark-*.jar lib/

avatar
Super Guru

You did not specify the use case, but be aware of some limitations on bson files: https://github.com/mongodb/mongo-hadoop/wiki/Using-.bson-Files

You may want also to connect pyspark to MongoDB. Good reference: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

avatar
New Member

@Constantin Stanca

Thank you. I'll test it and let you know.

avatar
Super Collaborator

Hi @Jane Becker,

Apart from above answer, on the spark note, I believe you can use JDBC to extract the data into DataFrame,

Spark does support jdbc driver to load or save data, and documentation can be found here

PS : I have not tested on mongoDB and hope that works as the mongoDB JDBC driver be in generic JDBC driver standerd.