Member since
01-15-2019
60
Posts
37
Kudos Received
2
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2975 | 07-20-2021 01:05 AM
 | 16556 | 11-28-2019 06:59 AM
08-19-2025
01:05 AM
Several keys need to be added. This is an example of the properties we used for Kafka Connect (KConnect) in a DataHub (DH) cluster:

1. producer.override.sasl.jaas.config = org.apache.kafka.common.security.plain.PlainLoginModule required username="<your-workload-name>" password="<password>";
2. producer.override.security.protocol = SASL_SSL
3. producer.override.sasl.mechanism = PLAIN
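For context, here is a minimal sketch of how these keys sit inside a connector configuration submitted to the Kafka Connect REST API. The connector name and class below are hypothetical placeholders, and note that the Connect worker must allow client overrides (connector.client.config.override.policy=All) for producer.override.* keys to be accepted.

{
  "name": "my-source-connector",
  "config": {
    "connector.class": "<your connector class>",
    "producer.override.security.protocol": "SASL_SSL",
    "producer.override.sasl.mechanism": "PLAIN",
    "producer.override.sasl.jaas.config": "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"<your-workload-name>\" password=\"<password>\";"
  }
}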
... View more
08-18-2025
08:52 AM
Hi @shubham_rai, have you had a chance to try the Custom Service on a CDP Base (on-premises) version? If you run it on CDP On-Premises, do you get the same error message?
... View more
08-17-2025
10:28 PM
@cnelson2 This is really helpful! Thanks!
... View more
05-23-2025
06:19 AM
1 Kudo
This guide provides a step-by-step approach to extracting data from SAP S/4HANA via OData APIs, processing it with Apache NiFi in Cloudera Data Platform (CDP), and storing it in an Iceberg-based Lakehouse for analytics and AI workloads.

1. Introduction

1.1 Why Move SAP S/4HANA Data to a Lakehouse?

SAP S/4HANA is a powerful ERP system designed for transactional processing, but it faces limitations when used for analytics, AI, and large-scale reporting:
- Performance impact: running complex analytical queries directly on SAP can degrade system performance.
- Limited scalability: SAP systems are not optimized for big data workloads (e.g., petabyte-scale analytics).
- High licensing costs: extracting and replicating SAP data for analytics can be expensive if done inefficiently.
- Lack of flexibility: SAP's data model is rigid, making it difficult to integrate with modern AI/ML tools.

A Lakehouse architecture (built on Apache Iceberg in CDP) solves these challenges by:
- Decoupling analytics from SAP: reduce operational load on SAP while enabling scalable analytics.
- Supporting structured and unstructured data: unlike SAP's tabular model, a Lakehouse can store JSON, text, and IoT data.
- Enabling ACID compliance: Iceberg ensures transactional integrity (critical for financial and inventory data).
- Reducing costs: store historical SAP data in cheaper object storage (S3, ADLS) rather than expensive SAP HANA storage.

1.2 Why Use the OData API for SAP Data Extraction?

SAP provides several data extraction methods, but OData (Open Data Protocol) is one of the most efficient for real-time replication:

Method | Pros | Cons | Best For
---|---|---|---
OData API | Real-time, RESTful, easy to use | Requires pagination handling | Incremental, near-real-time syncs
SAP BW/Extractors | SAP-native, optimized for BW | Complex setup, not real-time | Legacy SAP BW integrations
Database Logging (CDC) | Low latency, captures all changes | High SAP system overhead | Mission-critical CDC use cases
SAP SLT (Trigger-based) | Real-time, no coding needed | Expensive, SAP-specific | Large-scale SAP replication

Why OData wins for Lakehouse ingestion:
- REST-based: works seamlessly with NiFi's InvokeHTTP processor.
- Supports filtering ($filter): enables incremental extraction (e.g., modified_date gt '2024-01-01').
- JSON/XML output: easy to parse and transform in NiFi before loading into Iceberg.

1.3 Why Apache NiFi in Cloudera Data Platform (CDP)?

NiFi is the ideal tool for orchestrating SAP-to-Lakehouse pipelines because:
- Low-code UI: drag-and-drop processors simplify pipeline development (vs. writing custom Spark/PySpark code).
- HTTP-based SAP connectivity: use InvokeHTTP to call SAP S/4HANA OData services for deeper integrations.
- Scalability and fault tolerance: backpressure handling prevents SAP API overload, and automatic retries mean that if the SAP API fails, NiFi retries without data loss.

2. Prerequisites

Before building the SAP S/4HANA → NiFi → Iceberg pipeline, ensure the following components and access rights are in place:
- Cloudera Data Platform (CDP) with:
  - Apache NiFi (for data ingestion)
  - Apache Iceberg (as the Lakehouse table format)
  - Storage: HDFS or S3 (via Cloudera SDX)
- SAP S/4HANA access with OData API permissions:
  - T-Code SEGW: confirm OData services are exposed (e.g., API_MATERIAL_SRV).
  - Permissions: the SAP user role must include the S_ODATA and S_RFC authorizations.
  - Whitelist the NiFi IP if SAP has network restrictions.
- Test the OData endpoints:
  curl -u "USER:PASS" "https://sap-odata.example.com:443/sap/opu/odata/sap/API_SALES_ORDER_SRV/A_SalesOrder?$top=2"
  Validate:
  - Pagination ($skip, $top).
  - Filtering ($filter=LastModified gt '2025-05-01').
- Basic knowledge of NiFi flows, SQL, and Iceberg.

3. Architecture Overview

Data movement: SAP S/4HANA (OData API) → Apache NiFi (CDP) → Iceberg Tables (Lakehouse) → Analytics (Spark, Impala, Hive)

Architecture overview:

4. Step-by-Step Implementation

Step 1: Identify SAP OData Endpoints

SAP provides OData services for tables such as:
- MaterialMaster (MM)
- SalesOrders (SD)
- FinancialDocuments (FI)

Example endpoint:
https://<SAP_HOST>:<PORT>/sap/opu/odata/sap/API_SALES_ORDER_SRV/A_SalesOrder?$top=2

Step 2: Configure NiFi to Extract SAP Data
- Use the InvokeHTTP processor to call the SAP OData API.
- Configure authentication (Basic Auth).
- Handle pagination (the $skip and $top parameters).
- To get a JSON response, I added the Accept=application/json property.
- Parse JSON responses using EvaluateJsonPath or JoltTransformJSON.

Step 3: Transform Data in NiFi
- Filter and clean data using ReplaceText (for SAP-specific formatting) and QueryRecord (to convert JSON to Parquet/Avro).
- Enrich data (e.g., join with reference tables).
- Check the data using Provenance:

Step 4: Load into the Iceberg Lakehouse
- Use the PutIceberg processor (NiFi 1.23+) to write directly to Iceberg.
- Alternative option: write to HDFS/S3 as Parquet, then use Spark SQL to load into Iceberg (see the sketch after the conclusion).

CREATE TABLE iceberg_db.sap_materials (
material_id STRING,
material_name STRING,
created_date TIMESTAMP
)
STORED AS ICEBERG;

5. Conclusion

By leveraging Cloudera’s CDP, NiFi, and Iceberg, organizations can efficiently move SAP data into a modern Lakehouse, enabling real-time analytics, ML, and reporting without impacting SAP performance.

Next Steps
- Explore Cloudera Machine Learning (CML) for SAP data analytics.
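To make Step 4's alternative load path concrete, here is a minimal Spark SQL sketch. It assumes the Parquet files written by NiFi land under a hypothetical s3a://<bucket>/staging/sap_materials/ location and that the Spark session already has the Iceberg catalog configured; adjust names and paths for your environment.

-- Expose the NiFi-written Parquet staging files as a table (hypothetical names and location).
CREATE TABLE IF NOT EXISTS stage_db.sap_materials_parquet (
  material_id STRING,
  material_name STRING,
  created_date TIMESTAMP
)
USING PARQUET
LOCATION 's3a://<bucket>/staging/sap_materials/';

-- Append the staged rows into the Iceberg table created above.
INSERT INTO iceberg_db.sap_materials
SELECT material_id, material_name, created_date
FROM stage_db.sap_materials_parquet;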
... View more
Labels:
03-18-2025
01:14 AM
Hi @APentyala, thanks for pointing this out. The Impala driver also works well here; both the Impala and Hive drivers work for this. I will replace the images so that they match the descriptions 👍🏻
... View more
09-10-2024
05:34 PM
1 Kudo
In CDP Public Cloud CDW, Impala can only be accessed over HTTP + SSL, so you have to edit the config file to specify the ODBC driver parameters:

C:\Program Files\Microsoft Power BI Desktop\bin\ODBC Drivers\Cloudera ODBC Driver for Impala\lib\cloudera.impalaodbc.ini

[Driver]
AllowHostNameCNMismatch = 0
CheckCertRevocation = 0
TransportMode = http
AuthMech=3

For the full walkthrough, see: https://community.cloudera.com/t5/Community-Articles/How-to-Connect-to-CDW-Impala-VW-Using-the-Power-BI-Desktop/ta-p/393013#toc-hId-1805728480
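Since CDW only accepts HTTP + SSL, it may also help to enable SSL explicitly in the same [Driver] section. This is a sketch under my own assumption (the SSL key appears elsewhere in these driver .ini files, but setting it here is not part of the original post):

[Driver]
AllowHostNameCNMismatch = 0
CheckCertRevocation = 0
TransportMode = http
AuthMech = 3
# Assumption: explicitly enable TLS for the HTTP transport used by CDW
SSL = 1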
... View more
09-08-2024
10:36 PM
With Hive (newer than 2.2), you can use MERGE INTO:

MERGE INTO target_table AS target
USING source_table AS source
ON target.id = source.id
WHEN MATCHED THEN
UPDATE SET
target.name = source.name,
target.age = source.age
WHEN NOT MATCHED THEN
INSERT (id, name, age)
VALUES (source.id, source.name, source.age);
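One caveat worth adding (my note, not part of the original reply): MERGE only works against a transactional (ACID) target table. A minimal sketch of such a table, using the same hypothetical columns as the example above:

-- MERGE requires a full ACID table; in Hive that means ORC with transactional=true.
CREATE TABLE target_table (
  id INT,
  name STRING,
  age INT
)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');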
... View more
09-03-2024
06:27 PM
Summary

Last week I posted the article [How to Connect to Impala Using the Power BI Desktop + Cloudera ODBC Impala Driver with Kerberos Authentication]. As of September 2024, CDP Public Cloud CDW supports Basic Authentication (HTTP), so today I will share how to connect to a CDW Impala VW using the Power BI Desktop + Cloudera ODBC Impala Driver with Basic Authentication.

Pre-requisites
- Power BI Desktop Edition: https://www.microsoft.com/en-us/power-platform/products/power-bi/desktop
- Impala in CDP Public Cloud CDW
- Impala ODBC Connector 2.7.0 for Cloudera Enterprise: https://www.cloudera.com/downloads/connectors/impala/odbc/2-7-0.html

How-to in Power BI Desktop

Step 1: Install the [Impala ODBC Connector 2.7.0 for Cloudera Enterprise]

Step 2: Copy the ODBC folder to the Power BI Desktop folder

Assuming your Power BI Desktop is in [ C:\Program Files\Microsoft Power BI Desktop\ ], copy the ODBC driver from [ C:\Program Files\Cloudera ODBC Driver for Impala ] to [C:\Program Files\Microsoft Power BI Desktop\bin\ODBC Drivers\Cloudera ODBC Driver for Impala].

Step 3: Edit the config file to specify the ODBC driver

C:\Program Files\Microsoft Power BI Desktop\bin\ODBC Drivers\Simba Impala ODBC Driver.ini

[Simba Impala ODBC Driver]
# Originally Power BI uses its embedded driver; we change it to the Cloudera version
# Driver=Simba Impala ODBC Driver\ImpalaODBC_sb64.dll
Driver=Cloudera ODBC Driver for Impala\lib\ClouderaImpalaODBC64.dll

Step 4: Edit the config file to specify the ODBC driver parameters

C:\Program Files\Microsoft Power BI Desktop\bin\ODBC Drivers\Cloudera ODBC Driver for Impala\lib\cloudera.impalaodbc.ini

[Driver]
AllowHostNameCNMismatch = 0
CheckCertRevocation = 0
TransportMode = http
AuthMech=3

* A Cloudera CDW Impala VW doesn't need the [httpPath] parameter, while a Cloudera DataHub Impala cluster needs [httpPath=cliservice]. Please be careful.

Then save these two files and restart your Power BI Desktop.

How-to in Power BI Service (On-premises Data Gateway)

Step 1: Edit the config file to specify the driver

C:\Program Files\Microsoft Power BI Desktop\bin\ODBC Drivers\Simba Impala ODBC Driver.ini

[Simba Impala ODBC Driver]
# Driver=Simba Impala ODBC Driver\ImpalaODBC_sb64.dll
Driver=Cloudera ODBC Driver for Impala\lib\ClouderaImpalaODBC64.dll

Step 2: Edit the driver .ini file to specify the driver parameters

C:\Program Files\Microsoft Power BI Desktop\bin\ODBC Drivers\Cloudera ODBC Driver for Impala\lib\cloudera.impalaodbc.ini

[Driver]
AllowHostNameCNMismatch = 0
CheckCertRevocation = 0
TransportMode = http
AuthMech=3

Reference: https://community.fabric.microsoft.com/t5/Desktop/Power-BI-Impala-connector-SSL-certificate-error/m-p/2344481#M845491
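For reference, if you point at a DataHub Impala cluster instead of a CDW Virtual Warehouse, the same [Driver] section also needs the httpPath setting mentioned above; a minimal sketch, assuming the other values stay as shown in Step 4:

[Driver]
AllowHostNameCNMismatch = 0
CheckCertRevocation = 0
TransportMode = http
AuthMech = 3
# Needed only for a DataHub Impala cluster, not for a CDW Impala VW
httpPath = cliservice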
... View more
Labels:
08-29-2024
06:52 AM
Pre-requisites
Power BI Desktop Edition
Impala in CDP Private Cloud Base
Impala ODBC Connector 2.7.0 for Cloudera Enterprise
CDP Public Cloud Datahub + Kerberos + Power BI Desktop (in the future)
Process
Step 1: Install the [Impala ODBC Connector 2.7.0 for Cloudera Enterprise]
Step 2: Copy the ODBC folder to Power BI Desktop folder
Assuming your Power BI Desktop is in [ C:\Program Files\Microsoft Power BI Desktop\ ], copy the ODBC driver from [ C:\Program Files\Cloudera ODBC Driver for Impala ] to [C:\Program Files\Microsoft Power BI Desktop\bin\ODBC Drivers\Cloudera ODBC Driver for Impala].
Step 3: Edit the config file
[ C:\Program Files\Microsoft Power BI Desktop\bin\ODBC Drivers\Simba Impala ODBC Driver.ini ]
[Simba Impala ODBC Driver]
# Originally Power BI uses its embedded driver; we change it to the Cloudera version
# Driver=Simba Impala ODBC Driver\ImpalaODBC_sb64.dll
Driver=Cloudera ODBC Driver for Impala\lib\ClouderaImpalaODBC64.dll
# If you don't use SSL
SSL=0
Step 4: Run Power BI Desktop
If you want to use Windows AD Kerberos, then please don't install MIT Kerberos, and make sure you are using the Windows AD domain account to log in and run the Power BI Desktop application.
* Reference :
The [Cloudera-ODBC-Connector-for-Impala-Install-Guide.pdf] mentions the following:
==============
Configuring Kerberos Authentication for Windows
Active Directory
The Cloudera ODBC Connector for Apache Impala supports Active Directory Kerberos on Windows. There are two prerequisites for using Active Directory Kerberos on Windows:
MIT Kerberos is not installed on the client Windows machine.
The MIT Kerberos Hadoop realm has been configured to trust the Active Directory realm, according to Apache's documentation, so that users in the Active Directory realm can access services in the MIT Kerberos Hadoop realm.
==============
Step 5: Test Connection using ODBC
Open the [ODBC Data Source (64-bit)] application, and add a new DSN for testing.
I didn't use SSL, so I left the SSL option unchecked.
For debugging, you can set the log level to DEBUG.
As we will be using Kerberos to connect to Impala, please ensure that the AD server and DNS server are correctly configured.
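Before moving on, it can also help to confirm that your Windows session actually holds a Kerberos ticket for the AD realm (this check is my suggestion, not part of the original steps). From a command prompt, run:

C:\> klist

You should see a krbtgt/<YOUR-AD-REALM> ticket issued to your DOMAIN\account; if the list is empty, sign out and back in with the AD domain account before starting Power BI Desktop.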
Step 6: Fetch data in Power BI Desktop
In Power BI Desktop, use the [Get data -> more],
Input [Impala] to search for the connector,
Then, input the server info:
After that, you can use Windows Authentication (Kerberos) and input the:
DOMAIN\account
password
Then you can see the window below:
At last, you can see the data after loading:
... View more
Labels:
05-19-2024
12:43 AM
2 Kudos
Purpose:
Run SELECT to ingest data from Oracle 19c, and save the data into Azure ADLS Gen2 object storage, in Parquet format.
Steps
Step 1 Prepare the environment
Make sure the Oracle 19c environment works well.
Prepare an Oracle table:
CREATE TABLE demo_sample (
column1 NUMBER,
column2 NUMBER,
column3 NUMBER,
column4 VARCHAR2(10),
column5 VARCHAR2(10),
column6 VARCHAR2(10),
column7 VARCHAR2(10),
column8 VARCHAR2(10),
column9 VARCHAR2(10),
column10 VARCHAR2(10),
column11 VARCHAR2(10),
column12 VARCHAR2(10),
CONSTRAINT pk_demo_sample PRIMARY KEY (column1, column2, column3, column4, column5, column6, column7, column8, column9)
);
Prepare 20,000 records of data:
import cx_Oracle
import random
# Oracle database connection information
dsn = cx_Oracle.makedsn("<your Oracle database>", 1521, service_name="PDB1")
connection = cx_Oracle.connect(user="<your user name>", password="<your password>", dsn=dsn)
# Data insertion function
def insert_data():
cursor = connection.cursor()
sql = """
INSERT INTO demo_sample (
column1, column2, column3, column4, column5, column6,
column7, column8, column9, column10, column11, column12
) VALUES (
:1, :2, :3, :4, :5, :6, :7, :8, :9, :10, :11, :12
)
"""
batch_size = 10000
data = []
for i in range(20000): # 20,000 records
record = (
random.randint(1, 1000),
random.randint(1, 1000),
random.randint(1, 1000),
''.join(random.choices('ABCDEFGHIJKLMNOPQRSTUVWXYZ', k=10)),
''.join(random.choices('ABCDEFGHIJKLMNOPQRSTUVWXYZ', k=10)),
''.join(random.choices('ABCDEFGHIJKLMNOPQRSTUVWXYZ', k=10)),
''.join(random.choices('ABCDEFGHIJKLMNOPQRSTUVWXYZ', k=10)),
''.join(random.choices('ABCDEFGHIJKLMNOPQRSTUVWXYZ', k=10)),
''.join(random.choices('ABCDEFGHIJKLMNOPQRSTUVWXYZ', k=10)),
''.join(random.choices('ABCDEFGHIJKLMNOPQRSTUVWXYZ', k=10)),
''.join(random.choices('ABCDEFGHIJKLMNOPQRSTUVWXYZ', k=10)),
''.join(random.choices('ABCDEFGHIJKLMNOPQRSTUVWXYZ', k=10))
)
data.append(record)
if len(data) == batch_size:
cursor.executemany(sql, data)
connection.commit()
data = []
if data:
cursor.executemany(sql, data)
connection.commit()
cursor.close()
# Main process
try:
insert_data()
finally:
connection.close()
Step 2 Processor: ExecuteSQLRecord
This ExecuteSQLRecord processor uses two controller services:
A Database Connection Pooling Service: a DBCPConnectionPool, named EC2-DBCPConnectionPool.
A ParquetRecordSetWriter, named ParquetRecordSetWriter.
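For reference, a minimal sketch of the SQL Select Query I would set on ExecuteSQLRecord for this demo; the table name comes from Step 1, but the exact query text is my assumption:

SELECT * FROM demo_sample

In a production flow you would usually add a WHERE clause (or use GenerateTableFetch / QueryDatabaseTableRecord) to pull data incrementally instead of doing a full scan each run.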
Step 3: Create DBCPConnectionPool
Download the Oracle JDBC Driver from here https://www.oracle.com/jp/database/technologies/appdev/jdbc-downloads.html
Save the JDBC driver here (or anywhere your NiFi can access):
/Users/zzeng/Downloads/tools/Oracle_JDBC/ojdbc8-full/ojdbc8.jar
DBCPConnectionPool Properties:
Database Connection URL: the JDBC connection URI, e.g., jdbc:oracle:thin:@//ec2-54-222-333-444.compute-1.amazonaws.com:1521/PDB1
Database Driver Class Name: oracle.jdbc.driver.OracleDriver
Database Driver Location(s) : /Users/zzeng/Downloads/tools/Oracle_JDBC/ojdbc8-full/ojdbc8.jar
Database User: my Oracle user name, e.g., zzeng
Password: the password; it will be automatically encrypted by NiFi
Step 4: Create ParquetRecordSetWriter service
We can use default settings here.
Step 5: UpdateAttribute to set the file name in Azure
Add a value:
Key: azure.filename, Value: ${uuid:append('.ext')}
Step 6: Use PutAzureDataLakeStorage to save data into Azure
Step 7: Create ADLSCredentialsControllerService service for PutAzureDataLakeStorage so that we can save data into Azure
Storage Account Name: the value in your Azure account
SAS Token: The value in your Azure account
Step 8: Enable the 3 services
Step 9: Have a try
Choose `Run Once`
And you will find the files there:
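If you want to double-check outside NiFi, one option is the Azure CLI (this verification step is my addition; the account, container, path, and token below are placeholders):

az storage fs file list --account-name <your-storage-account> --file-system <your-container> --path <target-folder> --sas-token "<your SAS token>" --output table

Each flowfile should show up as a Parquet file named after its NiFi uuid with the .ext suffix set in Step 5.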
... View more
Labels: