I have a small star schema in Hive with the following dimensions:
And this is the fact table:
Basically, I have a star schema for capacity management, and I'm trying to build a predictive business case on this data using Hive and Spark, but I can't find a good algorithm to apply to this structure.
The objective of the model is to predict that Drive A of Computer X at Customer Y will have Z space used at a given hour.
Does anyone have a good suggestion for an algorithm to apply to this?
Ideally, this kind of problem can be solved with linear regression, but the columns you described are not measurable quantities. ComputerID, CustomerID, DriveID and DateID are identifiers (labels), not features. You will need other, measurable features in order to predict the space used at a given point in time.
If you can capture some other variables such as the amount of I/O on the computer, that might help.
Hi anatva 🙂 Many thanks for your response! Yes, I can get that kind of information: the total space of each drive and the space used on the drives of each PC. When you mention linear regression, what exactly would it predict? 🙂 Thanks!!!
@Johnny, I meant predicting the space used on a computer/drive based on other values you can get from the drive or computer (I/O on the computer, network I/O on the NIC, number of apps running on the computer, etc.). Basically, we need the variables that may influence the space used on a disk. For example, a lot of I/O on the computer increases the probability that more space is used on the disk; a high disk drive temperature may indicate that more space is used on that disk; and the spindle speed at a given point in time also indicates how busy the disk is.
So, your features may be:
I/O on computer
Network I/O on NIC
disk drive temperature
disk spindle speed
Number of databases on the computer
Number of apps on the computer, etc.
Using the above features, you can predict the space used on the disk.
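
To make the idea concrete, here is a minimal sketch of a linear regression over a few of those features, written in plain Python so it runs without a cluster. Everything in it is an assumption for illustration: the three feature columns (disk I/O, network I/O, number of apps), the synthetic numbers, and the helper names are hypothetical and do not come from your schema. On Spark you would more likely assemble the same columns from the fact table and fit MLlib's LinearRegression instead of solving the equations by hand.

```python
# Hypothetical training rows, one per (drive, hour) observation.
# Columns: disk I/O (MB/s), network I/O (MB/s), number of apps running.
FEATURES = [
    [10.0, 2.0, 3.0],
    [20.0, 1.0, 5.0],
    [15.0, 4.0, 2.0],
    [30.0, 3.0, 8.0],
    [5.0, 6.0, 1.0],
    [25.0, 2.0, 6.0],
]

# Synthetic targets (space used, GB), generated from a known linear rule
# (5 + 2*io + 0.5*net + 3*apps) purely so the fit can be checked.
SPACE_USED_GB = [5.0 + 2.0 * io + 0.5 * net + 3.0 * apps
                 for io, net, apps in FEATURES]


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent; cannot fit")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear_regression(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y.

    Returns [intercept, w1, w2, ...].
    """
    design = [[1.0] + list(r) for r in rows]  # prepend intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(weights, row):
    """Predicted space used for one feature row."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))


weights = fit_linear_regression(FEATURES, SPACE_USED_GB)
print("weights:", weights)
print("predicted GB:", predict(weights, [12.0, 3.0, 4.0]))
```

The identifiers (ComputerID, DriveID, and so on) only say which drive a row belongs to; the measured columns are what go into the regression. With real data the fit will not be exact, so you would also want to hold out some rows and check the prediction error.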