As a healthcare provider / public health official, I want to respond equitably to the COVID-19 pandemic as quickly as possible and serve all the communities that are adversely impacted in the state of California.
I want to use health equity data reported by California Department of Public Health (CDPH) to identify impacted members and accelerate the launch of outreach programs.
Collect - Ingest data from https://data.chhs.ca.gov/dataset/covid-19-equity-metrics using NiFi.
Enrich - Transform the dataset using Spark and load it into Hive tables.
Report - Gather insights using Hive tables and Data Visualization.
Predict - Connect to Hive tables and build Machine Learning (ML) models of your choice.
The steps to create this data pipeline are as follows:
Please note that this data pipeline is documented against CDP Runtime version 7.2.12.
Create or use a Data Hub Cluster with NiFi.
The following Data Hub cluster type can be used for this exercise: "7.2.12 - Flow Management Light Duty with Apache NiFi, Apache NiFi Registry".
Go to the NiFi user interface and upload NiFi-CDPH.xml as a template.
NiFi-CDPH.xml uses the PutS3Object processor to connect to an existing Amazon S3 bucket. Change the properties in this processor to use your own bucket.
If you don't use Amazon S3 storage, replace the PutS3Object processor with a processor of your choice. Refer to the NiFi documentation for details.
For quick reference, frequently used processors for writing to a file system include PutFile (local file system), PutHDFS (HDFS), PutAzureDataLakeStorage (ADLS Gen2), and PutGCSObject (Google Cloud Storage).
Execute the flow and ensure the InvokeHTTP processors are able to get covid19case_rate_by_social_det.csv and covid19demographicratecumulative.csv. Verify that these files are added to your storage bucket.
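To double-check the landing programmatically, here's a minimal sketch using boto3; the bucket name and key prefix are placeholders, so substitute your own.

```python
# Minimal sketch: list the ingested CSVs in the S3 bucket with boto3.
# "my-cdph-bucket" and the "cdph/" prefix are placeholders.
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="my-cdph-bucket", Prefix="cdph/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```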
Once you're satisfied with the function of this NiFi flow, download the flow definition.
For reference, here's a picture of the flow in the NiFi user interface:
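Next comes the Enrich step. As a rough sketch (the S3 paths and CSV read options are assumptions; the cdph.data_dictionary and member.member_profile tables queried below would be loaded the same way), Spark reads the ingested CSVs and saves them as Hive tables:

```python
# Minimal PySpark sketch of the Enrich step: read the ingested CSVs from
# the bucket and load them into Hive tables. Paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdph-enrich").enableHiveSupport().getOrCreate()
spark.sql("CREATE DATABASE IF NOT EXISTS cdph")

sources = [
    ("s3a://my-cdph-bucket/cdph/covid19case_rate_by_social_det.csv", "cdph.covid_rate_by_soc_det"),
    ("s3a://my-cdph-bucket/cdph/covid19demographicratecumulative.csv", "cdph.covid_demo_rate_cumulative"),
]
for path, table in sources:
    df = spark.read.option("header", True).option("inferSchema", True).csv(path)
    df.write.mode("overwrite").saveAsTable(table)
```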
Once the Hive tables are loaded, verify the raw data with a few quick queries:

```sql
-- Raw Data
select * from cdph.data_dictionary a;
select * from cdph.covid_rate_by_soc_det a;
select * from cdph.covid_demo_rate_cumulative a;
select * from member.member_profile a;
```
Go to the Data VIZ user interface.
Under the DATA tab, create the first dataset - COVID Rate by Social Determinants.
Dataset Details:
Update Dimensions & Measures to look like below:
Create the second dataset - COVID Demographic Rate Cumulative.
Dataset Details:
Update Dimensions & Measures to look like below:
Once Datasets are available, go to VISUALS tab and create a new dashboard.
Let's create the first visual in the dashboard, to show COVID-19 cases by income group. Select Default Hive VW and COVID Rate by Social Determinants from the drop-down menus, and create a new visual. Set the following parameters -
Let's create the second visual in the dashboard, to show COVID-19 related deaths by age group. Select Default Hive VW and COVID Demographic Rate Cumulative from the drop-down menus, and create a new visual. Set the following parameters -
These insights can then be used to identify the member segments to target for outreach, for example:

```sql
select * from member.target_mbrs_by_income a where social_tier = 'below $40K';
select * from member.target_mbrs_by_age_group a where demographic_set_category = '65+';
```
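How these target tables are derived depends on your member data model. As a purely hypothetical sketch (the annual_income and age columns are invented for illustration; adapt them to your actual member.member_profile schema), they could be built in Spark SQL:

```python
# Hypothetical sketch: derive outreach target tables from member profiles.
# Column names (annual_income, age) are invented for illustration only.
spark.sql("""
    CREATE TABLE IF NOT EXISTS member.target_mbrs_by_income AS
    SELECT m.*, 'below $40K' AS social_tier
    FROM member.member_profile m
    WHERE m.annual_income < 40000
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS member.target_mbrs_by_age_group AS
    SELECT m.*, '65+' AS demographic_set_category
    FROM member.member_profile m
    WHERE m.age >= 65
""")
```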
Go to the CML user interface. Under the ML Workspaces menu item, provision a new workspace. While provisioning, enable Advanced Options and check "Enable Public IP Address for Load Balancer". This can take ~45 minutes to finish.
Once the workspace is available, create a New Project. Under Initial Setup, the Template tab is selected by default, which works for most users. You also have the options to start from scratch (Blank), use existing Applied Machine Learning Prototypes (AMPs - see the AMPs navigation menu item for details), use local files (Local Files), or use a Git repository (Git).
Download covid_outreach.ipynb and upload it to your project.
If multiple people are going to work on this project, add them as collaborators with the appropriate role under the Collaborators menu item.
Once you have the project set up, start a New Session. Select JupyterLab in the Editor dropdown and check Enable Spark.
Execute the notebook and see the data in the Hive tables; a minimal sketch of what such a notebook cell does is shown below.
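This sketch assumes a Spark session with Hive support, which the Enable Spark option provides:

```python
# Minimal sketch: read the Hive tables from a CML session with Spark enabled.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("covid-outreach").enableHiveSupport().getOrCreate()

rates = spark.sql("SELECT * FROM cdph.covid_rate_by_soc_det")
rates.show(5)
```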
Now you're ready to explore the datasets and build your ML models; a purely illustrative example follows.
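This example is illustrative only (the feature and target column names are invented; replace them with real fields from the CDPH datasets):

```python
# Purely illustrative model sketch: the column names below are invented
# and must be replaced with real fields from the CDPH datasets.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

pdf = rates.toPandas().dropna(subset=["metric_value", "population_share"])
X = pdf[["population_share"]]
y = pdf["metric_value"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```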
Run an Experiment under the Experiments menu item when you have a draft model ready; a sketch of metric tracking follows.
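Experiment scripts can log metrics so runs are comparable; here's a minimal sketch using the cdsw library (the metric value is a placeholder):

```python
# Minimal sketch of metric tracking inside a CML experiment script.
import cdsw

accuracy = 0.87  # placeholder: compute this from your own model evaluation
cdsw.track_metric("accuracy", accuracy)
```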
When you're ready to deploy the model, go to the Models menu item and select New Model; a hypothetical scoring function is sketched below.
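A CML model wraps a Python function that takes a dictionary of inputs; here's a hypothetical predict.py (the file, function, and field names are placeholders you specify in the New Model dialog):

```python
# Hypothetical predict.py for CML model deployment. The function and its
# input fields are placeholders; point the New Model dialog at this file
# and function.
def predict(args):
    income = float(args["income"])
    age = float(args["age"])
    # Placeholder scoring logic: flag members for outreach.
    return {"outreach_recommended": income < 40000 or age >= 65}
```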
Once you're satisfied with the results of your model, create a New Job under the Jobs menu item to set up arguments, schedule, notifications, and so on.
For reference, see the menu items highlighted in the blue box that are referred to in the prior bullet points.