Support Questions

data_geek · 11-29-2022

Hi,

I have a requirement where the data must end up in the Cloudera data warehouse, which then feeds a Tableau dashboard.
My data sources are PostgreSQL and MongoDB.

What would be the best way to ingest data from MongoDB? The document structure is very dynamic, with nested levels and different objects.
Should I convert the MongoDB documents to a relational form and then import the data, or import the documents as-is directly from MongoDB?
Would I be able to query the unstructured data if it is imported as-is?

Any help would be highly appreciated.

ggangadharan · 11-21-2023

Ingesting data from MongoDB into a Cloudera data warehouse, particularly Cloudera's CDH (Cloudera Distribution including Apache Hadoop), involves making decisions about data modeling and choosing the right approach based on your use case and requirements.

Considerations:

Schema Design:
MongoDB is a NoSQL database with a flexible schema, allowing documents in a collection to have different structures. If your goal is to maintain the flexibility and take advantage of the dynamic nature of MongoDB, you might consider storing documents as-is.

Data Modeling:
Decide whether you want to maintain a document-oriented model or convert the data to a more relational model. The decision may depend on your analysis and reporting requirements.
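
Before choosing a model, it can help to see how much the documents actually vary. The following is a minimal sketch in plain Python that profiles which field paths occur in a sample and which value types each path takes. The function name `profile_fields` and the sample documents are hypothetical; in practice the sample would come from your own MongoDB export.

```python
from collections import defaultdict


def profile_fields(documents):
    """Map each dotted field path to the sorted type names seen for it."""
    seen = defaultdict(set)

    def walk(value, path):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        elif isinstance(value, list):
            # All elements of an array share one path, marked with [].
            for child in value:
                walk(child, f"{path}[]")
        else:
            seen[path].add(type(value).__name__)

    for doc in documents:
        walk(doc, "")
    return {path: sorted(types) for path, types in seen.items()}


# Hypothetical sample standing in for documents read from a collection.
sample = [
    {"_id": 1, "name": "a", "details": {"score": 10}},
    {"_id": 2, "name": "b", "details": {"score": "high", "tags": ["x", "y"]}},
]
profile = profile_fields(sample)
for path, types in sorted(profile.items()):
    print(path, types)
```

Paths that show more than one type, or that appear in only a few documents, are the ones most likely to need special handling in a relational model.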

Storage Format:
In Cloudera environments, data is often stored in formats like Parquet or Avro. Consider the storage format that aligns with your performance and storage requirements.

HBaseStorageHandler:
Another option is to use Apache HBase along with the HBaseStorageHandler for ingesting data from MongoDB into Cloudera. This approach involves storing the data in HBase tables and using the HBaseStorageHandler to integrate HBase with Apache Hive.

Approaches:

Direct Import of MongoDB Documents:
In this approach, you ingest data directly from MongoDB into Cloudera. Tools like the MongoDB Connector for Hadoop or Apache NiFi can be used for this purpose. Apache Sqoop works over JDBC, so it suits the PostgreSQL source rather than MongoDB.
The documents are stored as-is behind Hive tables, so you can still query the semi-structured data.

Converting MongoDB Documents to Relational Model:
Another approach involves converting MongoDB documents to a more structured, tabular format before ingesting into Cloudera. This conversion could be done using an ETL (Extract, Transform, Load) tool or a custom script.
This approach may be suitable if you have a specific schema in mind or if you want to leverage traditional SQL querying.
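
As an illustration of the conversion step, here is a minimal sketch of a custom script that flattens nested objects into one level of columns. The function name `flatten`, the underscore separator, and the sample document are assumptions made for the example; arrays are simply kept as JSON text, which is only one of several possible choices.

```python
import json


def flatten(document, prefix="", sep="_"):
    """Flatten nested dicts into a single-level row of column -> value."""
    row = {}
    for key, value in document.items():
        column = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Embedded objects become prefixed columns.
            row.update(flatten(value, column, sep))
        elif isinstance(value, list):
            # Arrays have no single-column equivalent; keep them as JSON text.
            row[column] = json.dumps(value)
        else:
            row[column] = value
    return row


# Hypothetical document standing in for one MongoDB record.
doc = {
    "_id": 7,
    "name": "widget",
    "details": {"field1": "x", "size": {"w": 2, "h": 3}},
    "tags": ["a", "b"],
}
row = flatten(doc)
print(row)
```

Because documents differ, the set of columns produced can differ from row to row, so the target table needs the union of all columns with NULLs where a field is absent.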

Querying Unstructured Data:

If you choose to import MongoDB documents as-is, you can still query the semi-structured data using tools like Apache Hive or Apache Impala. Hive can read JSON through a JSON SerDe and its JSON functions; JSON support in Impala varies by release, so check the version you run. You can then navigate the document structure and query nested fields.

Steps:

Direct Import:
Use a tool like the MongoDB Connector for Hadoop or Apache NiFi to import the MongoDB data into Cloudera (Apache Sqoop covers JDBC sources such as PostgreSQL).
Define Hive external tables to map to the MongoDB collections.
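
If the documents are landed as files rather than read through a connector, one common layout is newline-delimited JSON, with one document per line, which is the shape Hive's JSON SerDe generally expects. The sketch below shows that serialization in plain Python. The function name `dump_json_lines` and the sample documents are hypothetical, and `default=str` is an assumed catch-all for values such as dates or ObjectIds that the `json` module cannot encode natively.

```python
import io
import json
from datetime import datetime


def dump_json_lines(documents, handle):
    """Write one JSON object per line to a file-like object; return the count."""
    count = 0
    for doc in documents:
        # json.dumps escapes embedded newlines, so each document stays on one line.
        handle.write(json.dumps(doc, default=str) + "\n")
        count += 1
    return count


# Hypothetical documents; real ones would come from a MongoDB cursor or export.
docs = [
    {"_id": "a1", "name": "first", "details": {"field1": "x"}},
    {"_id": "a2", "created": datetime(2023, 11, 21), "items": [1, 2]},
]
buffer = io.StringIO()
written = dump_json_lines(docs, buffer)
lines = buffer.getvalue().splitlines()
print(written, "documents written")
```

In a real job the `handle` would be a file that is later uploaded to the storage location behind the external table.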

Convert and Import:
If you choose to convert, use an ETL tool like Apache NiFi or custom scripts to transform MongoDB documents into a structured format.
Import the transformed data into Cloudera.
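
Arrays of embedded objects are usually the hardest part of the conversion, because one document becomes several rows. A minimal sketch of that step, assuming a parent table plus a child table keyed by the parent's `_id`, could look like this. The function name `split_parent_child`, the `parent_id` and `position` columns, and the sample order document are all hypothetical.

```python
def split_parent_child(document, array_field, key_field="_id"):
    """Return (parent_row, child_rows) for one array field of a document."""
    parent = {k: v for k, v in document.items() if k != array_field}
    children = []
    for position, item in enumerate(document.get(array_field) or []):
        child = {"parent_id": document[key_field], "position": position}
        # Embedded objects contribute their own columns; scalars get a "value" column.
        child.update(item if isinstance(item, dict) else {"value": item})
        children.append(child)
    return parent, children


# Hypothetical document with an array of embedded objects.
order = {
    "_id": 42,
    "customer": "acme",
    "lines": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}],
}
parent, children = split_parent_child(order, "lines")
print(parent)
print(children)
```

The parent and child rows can then be loaded into two tables and joined on `parent_id`, which tends to be easier for Tableau to consume than nested arrays.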

Querying:
Use Hive or Impala to query the imported data.
For complex nested structures, explore Hive's JSON functions such as get_json_object and json_tuple.

Direct Import into HBase:
Use tools like Apache NiFi or custom scripts to extract data from MongoDB.
Transform the data into a suitable format for HBase, keeping in mind HBase's column-family-oriented storage.
Import the transformed data directly into HBase tables.
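
The transformation step can be sketched as a function that turns one document into a row key plus a set of `family:qualifier` cells. This is only an illustration under stated assumptions: the function names, the `cf` and `details` column families, and the sample document are hypothetical, and the column families are assumed to exist on the HBase table already. Writing the cells to HBase is left to whichever client or NiFi processor you use.

```python
import json


def encode(value):
    """Store strings as-is and everything else as JSON text, encoded to bytes."""
    text = value if isinstance(value, str) else json.dumps(value, default=str)
    return text.encode("utf-8")


def to_hbase_row(document, key_field="_id", families=("details",), default_family="cf"):
    """Return (row_key, cells) where cells maps 'family:qualifier' to bytes."""
    row_key = str(document[key_field]).encode("utf-8")
    cells = {}
    for field, value in document.items():
        if field == key_field:
            continue
        if field in families and isinstance(value, dict):
            # A known embedded object gets its own family, one qualifier per key.
            for qualifier, inner in value.items():
                cells[f"{field}:{qualifier}"] = encode(inner)
        else:
            # Everything else lands in the default family; unknown objects stay JSON.
            cells[f"{default_family}:{field}"] = encode(value)
    return row_key, cells


# Hypothetical document standing in for one MongoDB record.
doc = {"_id": 101, "name": "widget", "details": {"field1": "x", "field2": 5}, "extra": {"a": 1}}
row_key, cells = to_hbase_row(doc)
print(row_key, cells)
```

Keeping the number of column families small and pushing the variability into qualifiers fits how HBase is normally used.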

Integration with Hive using HBaseStorageHandler:
Create an external Hive table using the HBaseStorageHandler.
Define the mapping between the Hive table and the HBase table.

Example:

Here's a simplified example of how you might create an external Hive table with HBaseStorageHandler:

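-- Create an external Hive table over an existing HBase table.
-- hbase.columns.mapping needs exactly one entry per Hive column, in order:
--   id      -> :key      (the HBase row key)
--   name    -> cf:col1   (a single column in family cf)
--   details -> details:  (the whole details family, one map entry per qualifier)
-- Further columns can be added as long as each gets a matching mapping entry.
CREATE EXTERNAL TABLE hbase_mongo_data (
  id INT,
  name STRING,
  details MAP<STRING, STRING>
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf:col1,details:"
)
TBLPROPERTIES (
  "hbase.table.name" = "your_hbase_table_name"
);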

Benefits and Considerations:

HBase's Schema Flexibility:
HBase provides schema flexibility, which can accommodate the dynamic structure of MongoDB documents.
Column qualifiers can be added dynamically, row by row; column families are part of the table definition and must be declared up front.

HBaseStorageHandler:
The HBaseStorageHandler allows you to interact with HBase tables using Hive, making it easier to query data using SQL-like syntax.

Integration with Cloudera Ecosystem:
HBase is part of the Cloudera ecosystem, and integrating it with Hive allows you to leverage the strengths of both technologies.

Querying Data:
Hive queries can directly access data in HBase tables using HBaseStorageHandler.
You can use Hive's SQL-like syntax for querying, and it provides some support for nested structures.

Connect Tableau to Hive:
Use Tableau to connect to the external Hive table with HBaseStorageHandler.
Tableau supports Hive as a data source, and you can visualize the data using Tableau's capabilities.

Optimize for Performance:
Depending on the size of your data, consider optimizing the HBase schema, indexing, and caching to enhance query performance.

Consideration for Tableau:

Tableau supports direct connectivity to Hive or Impala, allowing you to visualize and analyze the data stored in Cloudera. Ensure that the data format and structure are suitable for Tableau consumption.

Conclusion:

The best approach depends on your specific use case, requirements, and the level of flexibility you need in handling the MongoDB documents. If the dynamic nature of MongoDB documents is essential for your analysis, direct import with subsequent querying might be a suitable choice. If a more structured approach is needed, consider conversion before ingestion. Using HBase along with the HBaseStorageHandler in Hive is another flexible option for integrating MongoDB data into the Cloudera ecosystem: it combines HBase's schema flexibility with Hive's SQL access and works with tools like Tableau for visualization.
