Government agencies and commercial entities must retain data for several years, and they commonly face IT challenges as data volumes grow and new sources come online. Because of these factors, they are starting to see degraded performance in Security Information and Event Management (SIEM) tools like Splunk. To continue to meet mission needs, protect an increasing number of data sources, and manage costs, they have started researching strategies that complement their SIEM investments while looking for solutions that meet or exceed their organizations' policies.
To increase the performance of Splunk and maximize IT investment, a solution must support hybrid deployments, scale with growing data volumes, and complement existing SIEM investments.
This blog will focus on how agencies use Cloudera Data Flow for universal data distribution as a solution for Splunk optimization, with the technical details required to re-create this work. I will not provide details on how to deploy Cloudera Data Flow; instead, I have provided sample data files and a data flow template, which can be found here.
I have also included a data flow template demonstrating how to route, filter, and aggregate data from Windows Application Event or IP Stream logs. This approach is hybrid by design, meaning it can be used on-premises, in the cloud, or in a combination of both. Because of this flexibility, the architecture can be phased gradually into an existing Splunk deployment without interrupting service.
There are many reasons agencies are seeking more cost-effective data storage locations that let them continue to scale and meet business SLAs. They also need flexibility to support future architectures and applications. Object storage has become the industry standard for this role because it scales cost-effectively, supports long-term retention, and is accessible through standard APIs.
Many software and hardware vendors offer object storage products, and many of them provide an API that is compliant with the AWS S3 API. For this blog, I will be using the native capabilities of Cloudera Data Flow to demonstrate writing a file as an object into a specific bucket location. To complete this task, the NiFi processor PutS3Object is required.
In this image, I use the QueryRecord NiFi processor to query records and route the corresponding results to a file, which can be written in a different format (e.g., CSV or JSON). Once the output file has been created, we can write it to a pre-existing object storage bucket. This bucket can be hosted on-premises or on AWS cloud infrastructure.
In this set of images, I show the configuration values the PutS3Object NiFi processor needs to communicate with the object storage instance. This example uses an on-premises deployment of object storage built on Apache Ozone. The Endpoint Override URL must point to the host that exposes Ozone's S3-compatible interface (the Ozone S3 Gateway). The user will also need to create a bucket in this object storage instance.
If you're using S3 in AWS's cloud, you will need to provide the proper Region of the bucket and can leave the Endpoint Override URL value empty. The user may need to be set differently in your environment, but I am using the user "hadoop" for these values.
Please note that I have security enforcement on the AWS S3 protocol turned off. If security enforcement is turned on, the user must provide the values for Access Key ID and Secret Access Key that match their AWS credentials.
The configuration fields that are important to enter correctly are the Bucket, the Endpoint Override URL (for on-premises object storage such as Ozone), the Region (for AWS-hosted buckets), and, when security enforcement is enabled, the Access Key ID and Secret Access Key.
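To sanity-check the endpoint, bucket, and credentials outside of NiFi, the same object write that PutS3Object performs can be reproduced with a minimal boto3 sketch. The endpoint URL, bucket name, and credential values below are placeholders for your environment, not values from this deployment.

import boto3

# Point the client at an S3-compatible endpoint such as the Ozone S3 Gateway.
# For an AWS-hosted bucket, drop endpoint_url and set region_name to the
# bucket's Region instead, mirroring the PutS3Object configuration above.
s3 = boto3.client(
    "s3",
    endpoint_url="http://ozone-s3g.example.internal:9878",  # placeholder endpoint
    aws_access_key_id="hadoop",      # placeholder; only needed when security enforcement is on
    aws_secret_access_key="hadoop",  # placeholder
)

# Write a test object into a pre-existing bucket, as PutS3Object would.
s3.put_object(
    Bucket="siem-archive",  # placeholder bucket name
    Key="windows-events/sample.json",
    Body=b'{"result": {"severity": "Information"}}',
)

If this call succeeds, the PutS3Object processor should be able to reach the same bucket with the same configuration values.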
To improve the performance of Splunk or other SIEMs, it is critical to be selective about the data sent to be indexed and searched. Limiting the amount of data an analyst has to sift through later has many advantages: with less data to search, queries complete more quickly, and critical events can be given more attention or immediate action.
Cloudera Data Flow can filter data using native processors, based on any data or metadata associated with a flow file. In the following example, I use the event severity level as a filter to determine whether data should be routed to our SIEM or to object storage. The flow pulls the severity level, which is a metadata value, and sends only non-informational events to our SIEM (i.e., Splunk); informational events are routed to our object storage bucket.
To complete this task, I use the QueryRecord NiFi processor to query incoming records for severity level and event codes; it acts as a filter and routes the results based on each query. In this example, the following SQL statements are used for severity and event codes. Please note that the RPATH function is used to traverse nested JSON elements in the input file.
SELECT * FROM FLOWFILE WHERE RPATH ("result", '/severity') = 'Information'
SELECT * FROM FLOWFILE WHERE RPATH ("result", '/severity') = 'Warning'
SELECT * FROM FLOWFILE WHERE RPATH ("result", '/EventCode') = '102'
SELECT * FROM FLOWFILE WHERE RPATH ("result", '/EventCode') = '1001'
Removing unneeded or extra fields can also be accomplished with the built-in SQL capabilities of the QueryRecord NiFi processor: only the fields selected in the query are written to the output file that is sent to a SIEM or an object storage instance, as shown below.
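For example, a projection such as the following keeps only the fields of interest in the output; the field names are illustrative and should be replaced with fields from your own schema.

SELECT RPATH ("result", '/severity') AS severity, RPATH ("result", '/EventCode') AS event_code FROM FLOWFILE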
Log aggregation gives organizations greater control over the flow of data through their infrastructure, as well as over how data is written to its destination. In many ways, this can be considered the transformation step of an ETL process. By taking advantage of the native capabilities of the QueryRecord NiFi processor, the results of a query are merged together out of the box, allowing developers to collect all of the query's results into a single output file. The following images show the results of the query merged into a single output file.
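Beyond this default merging, QueryRecord can also aggregate records explicitly with standard SQL. As a hypothetical example, a roll-up that counts events per severity level could look like the following; support for grouping on RPATH expressions may vary by NiFi version.

SELECT RPATH ("result", '/severity') AS severity, COUNT(*) AS event_count FROM FLOWFILE GROUP BY RPATH ("result", '/severity')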
NiFi can replay any data that has passed through a processor. This capability is a valuable tool for development and debugging and is very easy to use. As shown in the following images, the user right-clicks on the processor of interest and selects the option to view data provenance. A list of events is displayed, and one of those events can be selected. Under the Content tab, the replay button is available and can be clicked to replay that data.
Moving data into a cloud-based Splunk instance can be accomplished using the built-in NiFi processor PutSplunk. This processor allows organizations to push data from their on-premises or cloud environment to Splunk instances regardless of location. By taking advantage of this capability, SIEM workloads can be moved to the cloud in parallel, as depicted below.
This processor must be configured with the hostname and port of the Splunk instance. The following images display the processor and its configuration screen; for this blog, the specific Splunk values have not been included.
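Because PutSplunk delivers events to a Splunk TCP or UDP network input, a quick way to verify that the listener is reachable from the NiFi host is a short socket test. The hostname and port below are placeholders; use the raw TCP data input configured in your Splunk instance.

import socket

SPLUNK_HOST = "splunk.example.internal"  # placeholder Splunk hostname
SPLUNK_PORT = 1514                       # placeholder; port of a raw TCP data input

# Open a TCP connection and send a single test event, mimicking what
# PutSplunk does for each flow file when configured for TCP.
with socket.create_connection((SPLUNK_HOST, SPLUNK_PORT), timeout=5) as sock:
    sock.sendall(b"test event from NiFi connectivity check\n")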
One of the significant advantages of using Cloudera Data Flow is the ability to standardize how new data sources get collected and moved throughout an enterprise - build once, use many times.
By giving teams a standard approach to ingesting new data sources, organizations can streamline their process to acquire new datasets and use data in their missions or feed downstream applications. This approach decouples data sources from their destinations and brings immediate business value.
Organizations require observability of their data pipelines. Observability is critical for validating chains of custody during audits, and it is also essential for performance. Cloudera Data Flow keeps a very granular level of detail about each piece of data it ingests. According to the Apache NiFi documentation, “As the data is processed through the system, transformed, routed, split, aggregated, and distributed to other endpoints, an audit trail is created and stored within Apache NiFi's Provenance Repository.” This means that every step used to process data can be stored and tracked.
We can select Data Provenance from the Global Menu. All provenance events will be listed, and the complete data provenance can be viewed by selecting the icon on the right-hand side of the event. The out-of-the-box observability capabilities of Cloudera Data Flow allow for a comprehensive view of data pipelines, which is critical for organizations today.
The following images detail how the complete data provenance listing can be selected for all events. The last image displays all of the steps used for a particular event from the moment the data was acquired and ingested into Cloudera Data Flow.
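Provenance can also be queried programmatically through NiFi's REST API, which is useful for feeding audit systems. The following is a rough sketch, assuming a NiFi instance at http://localhost:8080 with anonymous access; the exact request body and response fields may vary between NiFi versions.

import time
import requests

NIFI = "http://localhost:8080/nifi-api"  # placeholder NiFi endpoint

# Submit an asynchronous provenance query; NiFi returns a query resource
# that must be polled until it reports completion.
resp = requests.post(f"{NIFI}/provenance",
                     json={"provenance": {"request": {"maxResults": 100}}})
query = resp.json()["provenance"]

# Poll until the query finishes, then print the recorded events.
while not query["finished"]:
    time.sleep(0.5)
    query = requests.get(f"{NIFI}/provenance/{query['id']}").json()["provenance"]

for event in query["results"]["provenanceEvents"]:
    print(event["eventType"], event["componentName"])

# Clean up the query resource when done.
requests.delete(f"{NIFI}/provenance/{query['id']}")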
Government agencies and commercial entities will need to continue addressing growing requirements and IT challenges. They will need solutions that are flexible enough to enable hybrid deployments, can scale to the growing volume of data, and complement existing IT investments like Splunk. In addition, they will need a universal solution that can ingest data from new sources as they come online, such as IoT devices, and deliver data to destinations such as cloud-based applications or future storage devices. For these reasons, Cloudera Data Flow allows organizations to address performance degradation in their SIEMs and to continue to meet and exceed the future needs of the mission.
Wherever you are on your hybrid cloud journey, a first-class data distribution service is critical for successfully adopting a modern hybrid data stack. Cloudera Data Flow for the Public Cloud provides a universal, hybrid, and streaming-first data distribution service that enables customers to gain control of their data flows.
Take our interactive product tour to get an impression of Cloudera Data Flow in action or sign up for a free trial.