
Business Case Definition using Hive and Spark


Hi guys,

I have a small star schema in Hive with the following dimensions:

  • Dim_Computer -> Computer_ID; Computer_Name; OperatingSystem_Desc
  • Dim_Drives -> Drive_ID; Drive_Name
  • Dim_Customer -> Customer_ID; Customer_Name
  • Dim_Date -> Date_ID; Day; Month; Quarter; Year
  • Dim_Time -> Time_ID;

And this is the fact table:

  • Computer_ID
  • Drive_ID
  • Customer_ID
  • Date_ID
  • Time_ID
  • Space_Used

Basically, I have a star schema for capacity management, and I'm trying to build a predictive business case on this data using Hive and Spark, but I can't find a good algorithm to apply to this structure.

The objective of the model is to predict that Drive A on Computer X at Customer Y will have a Space_Used of Z at a given hour.

Does anyone have a good suggestion for an algorithm to apply here?

Many thanks!!


Cloudera Employee

Hi Johnny,

Ideally, this kind of problem can be solved with a linear regression algorithm, but the columns you described are not measurable quantities. Computer_ID, Customer_ID, Drive_ID and Date_ID can serve as labels, but not as features. You may need other measurable features in order to predict the space used at a given point in time.

If you can capture some other variables such as the amount of I/O on the computer, that might help.


Hi anatva 🙂 Many thanks for your response! Yes, I can get that kind of information: the total space of each drive and the space used on the drives of each PC. When you mention linear regression, what would it predict? 🙂 Thanks!!!

Cloudera Employee

@Johnny, I meant predicting the space used on a computer/drive based on other values you can collect from the drive or computer (I/O on the computer, network I/O on the NIC, number of apps running on the computer, etc.). Basically, we need the variables that may influence the space used on a disk. For example, if there is a lot of I/O on the computer, it becomes more likely that more space is used on the disk; if the disk drive temperature is high, it may indicate that more space is used on that disk; and if you can get the spindle speed at a given point in time, that also indicates how busy the disk is.

So, your features may be:

  • I/O on the computer
  • Network I/O on the NIC
  • Disk drive temperature
  • Disk spindle speed
  • Number of databases on the computer
  • Number of apps on the computer, etc.

Using the above features, you can predict the space used on the disk.
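To make the idea concrete, here is a minimal sketch of fitting a linear regression on two such features, written in plain Python so it runs without a cluster. Everything in it is an assumption for illustration: the feature choice (disk I/O in MB/s, number of apps), the helper names, and above all the sample numbers, which are invented and were generated from a known formula so the fit can be checked. On real data in Spark you would more likely export a flat table from Hive and use the `LinearRegression` estimator in Spark MLlib rather than hand-rolled code like this.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so row operations apply to both sides.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; cannot fit")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear_regression(
    features: Sequence[Sequence[float]], targets: Sequence[float]
) -> List[float]:
    """Return [intercept, w1, ..., wk] minimizing the squared error."""
    # Prepend a constant 1.0 so the first weight acts as the intercept.
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    # Normal equations: (X^T X) w = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(weights: Sequence[float], feature_row: Sequence[float]) -> float:
    """Apply the fitted weights to one row of features."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], feature_row))


# Invented sample data (NOT real measurements): (disk I/O in MB/s, apps).
# Targets follow space_used_gb = 10 + 0.5 * io + 2 * apps exactly.
sample_features = [(20, 3), (40, 5), (10, 8), (60, 2), (30, 6)]
sample_space_used_gb = [26.0, 40.0, 31.0, 44.0, 37.0]

weights = fit_linear_regression(sample_features, sample_space_used_gb)
estimate = predict(weights, (50, 4))
print("weights:", [round(w, 3) for w in weights])
print("estimated space used (GB):", round(estimate, 3))
```

Because the sample targets were generated without noise, the fitted weights should come out close to the formula in the comment. Real measurements will be noisy, so you would also want to hold out part of the data and check the prediction error before trusting the model.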
