Created 12-16-2016 11:56 PM
I did a bit of research and learned about the Mongo-Hadoop project, but I am not clear whether the project is also helpful for connecting to Spark.
Created 12-17-2016 12:05 AM
The mongo-hadoop project connects both Hadoop and Spark with MongoDB. You can download a release from the releases page (https://github.com/mongodb/mongo-hadoop/releases) or build it yourself from https://github.com/mongodb/mongo-hadoop. If you decide to build it yourself, you can do so with gradlew using the steps below, then copy the resulting jar into lib/:
wget -P /tmp/ https://github.com/mongodb/mongo-hadoop/archive/r1.5.1.tar.gz
mkdir mongo-hadoop
tar -xvzf /tmp/r1.5.1.tar.gz -C mongo-hadoop --strip-components=1
# Now build the mongo-hadoop-spark jars
cd mongo-hadoop
./gradlew jar
cd ..
cp mongo-hadoop/spark/build/libs/mongo-hadoop-spark-*.jar lib/
Created 12-17-2016 12:07 AM
You did not specify the use case, but be aware of some limitations when working with .bson files: https://github.com/mongodb/mongo-hadoop/wiki/Using-.bson-Files
You may also want to connect PySpark to MongoDB. A good reference: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
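For example, once the mongo-hadoop-spark jar built above (plus the MongoDB Java driver jar) is on the classpath (e.g. passed to pyspark via --jars), a collection can be read into an RDD through the connector's MongoInputFormat. This is only a minimal sketch along the lines of that blog post; the host, database, and collection names below are placeholders.

# Minimal sketch, assuming the mongo-hadoop-spark and MongoDB Java driver jars
# are on the Spark classpath, and that mydb.mycollection (placeholder names)
# exists on a local mongod.
from pyspark import SparkContext

sc = SparkContext(appName="mongo-hadoop-example")

mongo_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.io.MapWritable",
    conf={"mongo.input.uri": "mongodb://localhost:27017/mydb.mycollection"},
)

# Each element is a (document _id, document fields) pair.
print(mongo_rdd.take(1))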
Created 12-17-2016 12:46 AM
Thank you. I'll test it and let you know.
Created 05-16-2017 06:07 AM
Hi @Jane Becker,
Apart from the answer above, on the Spark side, I believe you can also use JDBC to extract the data into a DataFrame.
Spark supports loading and saving data through a JDBC driver; see the JDBC data source section of the Spark SQL documentation.
PS: I have not tested this with MongoDB, but I hope it works as long as the MongoDB JDBC driver conforms to the generic JDBC driver standard.
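For what it's worth, a minimal sketch of that approach using Spark's generic JDBC data source is below. As noted, this path is untested: the JDBC URL, driver class, and table name are placeholders that depend entirely on whichever third-party MongoDB JDBC driver you pick, and that driver's jar would also need to be on the classpath.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mongo-jdbc-sketch").getOrCreate()

# All option values below are placeholders for whatever the chosen MongoDB
# JDBC driver actually expects; only the read.format("jdbc") API itself is
# standard Spark.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mongodb://localhost:27017/mydb")   # driver-specific URL (assumption)
      .option("driver", "mongodb.jdbc.MongoDriver")           # hypothetical driver class
      .option("dbtable", "mycollection")
      .load())

df.show()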