Community Articles

anarasimham · 02-26-2018

If you'd like to generate some data to test out the HDP/HDF platforms at a larger scale, you can use the following GitHub repository:

https://github.com/anarasimham/data-gen

This will allow you to generate two types of data:

Point-of-sale (POS) transactions, containing data such as transaction amount, timestamp, store ID, employee ID, part SKU, and quantity of product. These are the transactions you make when checking out at a store. For simplicity's sake, each shopper is assumed to buy only one product (though possibly more than one unit of it).
Automotive manufacturing parts production records, simulating the completion of parts on an assembly line. Imagine a warehouse completing different components of a car, such as the hood or the front bumper, at different points in time, with those parts being tested against heat and vibration thresholds. This data contains a timestamp of when the part was produced, thresholds for heat and vibration, tested values for heat and vibration, quantity of the produced part, a "short name" identifier for the part, a notes field, and a part location.
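
To make the two record types concrete, here is a minimal Python sketch of what generating one row of each might look like. It is not the repository's code: the field names, value ranges, part names, and locations are all assumptions chosen for illustration, and the authoritative schemas are the ones in the repository itself.

```python
import random
from datetime import datetime, timedelta

# Hypothetical value pools; the real generator may use different ones.
PART_NAMES = ["hood", "front_bumper", "rear_bumper", "door"]
LOCATIONS = ["line_1", "line_2", "line_3"]


def make_pos_record(rng, now):
    """Return one illustrative point-of-sale row (one product per shopper)."""
    return {
        "transaction_amount": round(rng.uniform(1.0, 500.0), 2),
        "time_stamp": now - timedelta(seconds=rng.randint(0, 86400)),
        "store_id": rng.randint(1, 50),
        "employee_id": rng.randint(1, 500),
        "part_sku": "SKU-%05d" % rng.randint(0, 99999),
        "quantity": rng.randint(1, 5),
    }


def make_manufacturing_record(rng, now):
    """Return one illustrative parts-production row with test readings."""
    return {
        "time_stamp": now - timedelta(seconds=rng.randint(0, 86400)),
        "heat_threshold": 100.0,
        "vibration_threshold": 5.0,
        # Readings average below the thresholds, so only some parts exceed them.
        "heat_tested": round(rng.gauss(90.0, 10.0), 2),
        "vibration_tested": round(rng.gauss(4.0, 1.0), 2),
        "quantity": rng.randint(1, 20),
        "short_name": rng.choice(PART_NAMES),
        "notes": "",
        "location": rng.choice(LOCATIONS),
    }


rng = random.Random(42)  # fixed seed so repeated runs give the same rows
now = datetime(2018, 2, 26, 12, 0, 0)
print(make_pos_record(rng, now))
print(make_manufacturing_record(rng, now))
```

Passing the random generator and the reference time in as arguments keeps the output reproducible, which is useful when the same test data set needs to be regenerated.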

Both schemas are documented in full in the code, in the file datagen/datagen.py in the repository above.

The application generates data and inserts it into one of two supported destinations:

Hive
MySQL

First, set up the target table: connect to the desired server and database as the desired user, then run one of the scripts in the repository's mysql folder.

Once that is done, copy the inserter/mysql.passwd.template file to inserter/mysql.passwd and edit it to provide the correct details. If you'd like to insert into Hive instead, do the same with the hive.passwd.template file. After editing, run the generator with the following command:

python main_manf.py 10 mysql

This will insert 10 rows of manufacturing data into the configured MySQL database table.
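
As a rough illustration of the insert step, the following Python sketch writes a batch of rows through the standard DB-API interface. It is only a sketch under stated assumptions: it uses the standard library's sqlite3 module with an in-memory database as a stand-in so that it runs without a server, and the table name, columns, and the insert_rows helper are hypothetical rather than taken from the repository.

```python
import sqlite3

# Hypothetical table and columns, loosely following the manufacturing fields.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS manufacturing_parts (
    short_name TEXT,
    heat_tested REAL,
    vibration_tested REAL,
    quantity INTEGER
)
"""
# sqlite3 uses "?" placeholders; MySQL drivers typically expect "%s" instead.
INSERT_SQL = "INSERT INTO manufacturing_parts VALUES (?, ?, ?, ?)"


def insert_rows(conn, rows):
    """Insert all rows in one batch and return the table's row count."""
    cur = conn.cursor()
    cur.execute(CREATE_SQL)
    # Parameterized statements let the driver handle quoting of values.
    cur.executemany(INSERT_SQL, rows)
    conn.commit()
    cur.execute("SELECT COUNT(*) FROM manufacturing_parts")
    return cur.fetchone()[0]


conn = sqlite3.connect(":memory:")  # stand-in for a real MySQL connection
rows = [("hood", 91.5, 3.8, 4), ("front_bumper", 102.3, 4.1, 2)]
print(insert_rows(conn, rows))
```

Batching the rows with executemany and committing once is usually much faster than committing per row, which matters when generating data at a larger scale.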

At this point, you're ready to explore your data in greater detail. Possible next steps include using NiFi to pull the data out of MySQL and push it into Druid for a dashboard-style data lookup workflow. You can also push it into Hive for ad-hoc analyses. These activities are out of scope for this article, but they are worth considering as follow-ups.

Point-of-sale and Manufacturing Data Generation for Performance Testing and Other Non-Production Usage