Community Articles

lwang · ‎02-09-2019

Use Case:

We have data stored in a MongoDB from a third party application in Amazon.

Export from MongoDB to Parquet.

Moving data from a single purpose data silo to your Enterprise Data Lake is a common use case. Using Apache NiFi we can easily save your data from this remote silo and bring it streaming into your analytics store for machine learning and deep analytics with Impala, Hive and Spark. It doesn't matter which cloud which are coming from or going to or from cloud to on-premise or various Hybrid situations. Apache NiFi will work in all of these situations which full data lineage and provenance on what it did when.

I have created a mock dataset with Mockaroo. It's all about yummy South Jersey sandwiches.

Our Easy MongoDB Flows to Ingest Mongo data to our Date Lake and another flow to load MongoDB.

In our test, we loaded all the data from our Mock REST API into a MongoDB in the cloud. In the real world an application populated that dataset and now we need to bring it into our central data lake for analytics.

We use Jolt to replace the non-Hadoop friendly built-in MongoDB _id with a friendly name mongo_id.

Storing to Parquet on HDFS is Easy (Let's compress with Snappy)

Connecting to MongoDB is easy, setup a controller and specify the database and collection.

Our MongoDB Connection Service, just enter your URI with username/password@server.

GetHTTP URL
https://my.api.mockaroo.com/hoagie.json

GetHTTP Filename
${filename:append('hoagie.'):append(${now():format('yyyyMMddHHmmSS'):append(${md5}):append('.json')})}

JSON Path Expression
$.*

JOLT Chain
[{
"operation": "shift",
"spec": {
"_id": "mongo_id",
"*": "&"
}
}]

Mongo URI
mongodb://user:userpassword@server.cloud.com:13916/nifi

Many files stored in HDFS as Parquet

Cloudera Community

Community Articles

Exporting and Importing Data from MongoDB in the Cloud with Apache NiFi

Apache NiFi