
Business Case Definition using Hive and Spark


New Contributor

Hi guys,

I have a small star schema in Hive with the following dimensions:

  • Dim_Computer -> Computer_ID; Computer_Name; OperatingSystem_Desc
  • Dim_Drives -> Drive_ID; Drive_Name
  • Dim_Customer -> Customer_ID; Customer_Name
  • Dim_Date -> Date_ID; Day; Month; Quarter; Year
  • Dim_Time -> Time_ID

And this is the fact table:

  • Computer_ID
  • Drive_ID
  • Customer_ID
  • Date_ID
  • Time_ID
  • Space_Used
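For illustration, most learning algorithms expect one flat row per observation, so the fact rows would first be joined with their dimension attributes. Below is a minimal sketch in plain Python of that flattening step; the sample values and the `flatten` helper are hypothetical, and in practice the same join would more likely be written as a Hive query or a Spark DataFrame join.

```python
# Hypothetical sample rows standing in for the Hive tables described above.
dim_computer = {1: {"Computer_Name": "PC-01", "OperatingSystem_Desc": "Linux"}}
dim_drives = {10: {"Drive_Name": "A"}}
dim_customer = {100: {"Customer_Name": "Acme"}}
dim_date = {20240101: {"Day": 1, "Month": 1, "Quarter": 1, "Year": 2024}}

fact = [
    {"Computer_ID": 1, "Drive_ID": 10, "Customer_ID": 100,
     "Date_ID": 20240101, "Time_ID": 9, "Space_Used": 120.5},
]


def flatten(row):
    """Join one fact row with the attributes of each dimension it references."""
    flat = {"Time_ID": row["Time_ID"], "Space_Used": row["Space_Used"]}
    flat.update(dim_computer[row["Computer_ID"]])
    flat.update(dim_drives[row["Drive_ID"]])
    flat.update(dim_customer[row["Customer_ID"]])
    flat.update(dim_date[row["Date_ID"]])
    return flat


# One flat record per measurement, ready to be turned into model input.
flat_rows = [flatten(r) for r in fact]
```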

Basically, I have a star schema for capacity management, and now I'm trying to build a predictive business case on this data using Hive and Spark, but I can't find a good algorithm to apply to this structure.

The objective of this model is to predict that Drive A of Computer X at Customer Y will have Z space used at a given hour.

Does anyone have a good suggestion for an algorithm to apply here?

Many thanks!!


Re: Business Case Definition using Hive and Spark

New Contributor

Hi Johnny,

Ideally, this kind of problem can be solved with a linear regression algorithm, but the fields you described are not measurable. Computer_ID, Customer_ID, Drive_ID and Date_ID can serve as labels, but not as features. You may need other, measurable features in order to predict the space used at a given point in time.

If you can capture some other variables such as the amount of I/O on the computer, that might help.


Re: Business Case Definition using Hive and Spark

New Contributor

Hi anatva :) Many thanks for your response! Yes, I can get that kind of information: the total space of each drive and the space used on the drives of each PC. When you mention linear regression, what would it predict? :) Thanks!!!

Re: Business Case Definition using Hive and Spark

New Contributor

@Johnny, I meant predicting the space used on a computer/drive based on other values you can get from the drive or computer, such as I/O on the computer, network I/O on the NIC, the number of apps running on the computer, etc. Basically, we need the variables that may influence the space used on a disk. For example, a lot of I/O on the computer increases the probability of more space being used on the disk; a high disk drive temperature may indicate that more space is used on that disk; and the spindle speed at a given point in time also indicates how busy the disk is.

So, your features may be:

  • I/O on the computer
  • Network I/O on the NIC
  • Disk drive temperature
  • Disk spindle speed
  • Number of databases on the computer
  • Number of apps on the computer, etc.

Using the above features, you can predict the space used on the disk.
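As a rough illustration of the idea, here is a minimal sketch of ordinary least squares linear regression in plain Python, fitted on two of the features mentioned above (I/O on the computer and number of apps). The sample numbers, units, and function names are made up for the example and are not measurements from any real system. On a Spark cluster, MLlib's `LinearRegression` would be the more natural choice; this version only shows what the fit computes.

```python
def fit_linear_regression(X, y):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y.

    An intercept column of 1s is prepended, so the result is
    [intercept, w1, w2, ...].
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(A[k][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for k in range(n):
            if k != col:
                factor = A[k][col]
                A[k] = [a - factor * b for a, b in zip(A[k], A[col])]
    return [A[i][n] for i in range(n)]


def predict(weights, features):
    """Apply the fitted intercept and coefficients to one feature vector."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))


# Hypothetical training data: [disk I/O, number of apps] -> space used.
X = [[10, 2], [20, 3], [30, 5], [40, 4], [50, 8]]
y = [76, 99, 125, 142, 174]

weights = fit_linear_regression(X, y)
estimate = predict(weights, [60, 6])  # estimated space used for a new reading
```

With real data the fit will not be exact, so it is worth holding back part of the measurements to check how far the predictions are from the observed space used.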
