11-04-2019
07:52 AM
2 Kudos
Introduction
In this article we're going to look at logistic regression, which is a technique widely used in industry to solve binary classification problems. Unlike the linear regression model, which is used to make predictions on a numerical target variable, logistic regression can deal with discrete-valued outcomes. For example, a numerical target might be a house price, whereas a discrete-valued outcome might be who will win the next presidential election. Logistic regression is most commonly used for binary classification problems such as tumor detection, credit card fraud detection, and email spam detection.
How the logistic regression model works
The logistic regression model makes use of the sigmoid function (also known as the logistic function) to measure the relationship between the input variables and the output variable by estimating probability scores for the outcome. Unlike linear regression, logistic regression passes the linear combination of inputs through a sigmoid function so that every prediction falls between 0 and 1.
What is a sigmoid function?
The main reason for using a sigmoid function is that it maps predicted values to probabilities, keeping every value in the range 0 to 1. The sigmoid function is defined as sigmoid(x) = 1/(1 + exp(-x)). Its plot is an S-shaped curve that approaches 0 for large negative x and 1 for large positive x.
Hypothesis representation of the logistic regression model
The hypothesis function in linear regression is represented as h(x) = β₀ + β₁X. In logistic regression there is a slight modification to the hypothesis function, so that it becomes hθ(x) = 1/(1 + e^-(β₀ + β₁X)). The major change is the additional sigmoid function wrapped around the linear term.
Decision boundary
A logistic regression classifier passes the inputs through the prediction function and returns a probability score between 0 and 1. For example, if you have two classes, say tumor and non-tumor samples, the model uses a threshold value: scores above the threshold are classified as class 1 and scores below it as class 2.
Fig 1: Threshold
For example, if 0.5 is the threshold and the prediction function returns 0.7, the sample is classified as class 1 (tumor). If the prediction function returns 0.2, it is classified as class 2 (non-tumor).
Cost function
The cost function measures the error between the actual value and the predicted value. Once the cost function is defined, the goal is to minimize that error with an optimization function; in this article we use gradient descent, which is widely used for this purpose. For logistic regression, the cost for a single example is defined piecewise: when the true value y is 1, the cost is −log(hθ(x)), which is zero when the prediction is 1 and grows as the prediction moves away from 1. Similarly, when the true value y is 0, the cost is −log(1 − hθ(x)), which is zero when the prediction is exactly 0 and grows as the prediction moves away from 0.
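To make the sigmoid, the threshold, and the piecewise cost concrete, here is a minimal NumPy sketch (not part of the original article); the scores and labels are made up purely for illustration.
# A minimal sketch (with made-up values) of the sigmoid, a 0.5 decision
# threshold, and the logistic regression cost.
import numpy as np

def sigmoid(x):
    # Maps any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def logistic_cost(y_true, y_prob):
    # Average of -log(h(x)) when y = 1 and -log(1 - h(x)) when y = 0
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 0])                      # hypothetical class labels
y_prob = sigmoid(np.array([3.0, -2.5, 1.2, -0.8]))   # hypothetical raw scores
y_class = (y_prob >= 0.5).astype(int)                # apply the 0.5 threshold
print(y_class, logistic_cost(y_true, y_prob))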
Gradient descent optimization
To minimize the cost function value, we use an optimization algorithm known as gradient descent, running the update step on each parameter. Below is the pseudo code for gradient descent:
Initialize the parameters
Repeat {
  Make a prediction on y
  Calculate the cost function
  Get the gradient of the cost function
  Update the parameters
}
Sample implementation of logistic regression
Let's look at the important steps in building a logistic regression model with Python's scikit-learn library.
Load a dataset: Import all the necessary libraries and load the dataset into a dataframe.
Apply feature scaling: Feature scaling is important to standardize the data that feeds the model.
Train/test split: Split the data into a training set and a test set.
Fit the logistic regression model: Instantiate the logistic regression model and train it on the training data.
Prediction and testing: Predict on the test set using the trained model.
Evaluation of metrics: Evaluate the model with a confusion matrix, precision/recall, or F-score, depending on the metric requirements.
Pseudo code for building a logistic regression model using scikit-learn
# Import all the necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
#import your dataset
dataset = pd.read_csv("example.csv")
#specify your input and output variables based on your data, varies on different datasets
X= dataset.iloc[:,[2,3]].values
Y= dataset.iloc[:,4].values
#Split the data into training and testing set
X_train,X_test,y_train,y_test=train_test_split(X,Y, test_size=0.2,random_state=42)
#Perform feature scaling
sc_X=StandardScaler()
X_train=sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
#Fit logistic regression model to the data
from sklearn.linear_model import LogisticRegression
classifier=LogisticRegression()
classifier.fit(X_train,y_train)
#predict results using testing set
y_pred=classifier.predict(X_test)
#Evaluate your results using metrics, here confusion matrix is being used
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Conclusion
Congratulations! In this article you have learned the basic concepts behind a logistic regression model. Learn more about building a logistic regression model for breast cancer analysis using Cloudera Data Science Workbench by completing Breast cancer analysis using a logistic regression model.
10-28-2019
12:27 PM
2 Kudos
Introduction
In this article we’re going to look at linear regression, which is a technique for estimating linear relationships between various features and a continuous target variable. Linear regression provides the ability to measure a correlation between your input data and a response variable. We’ll illustrate this concept with a simple example. Suppose you have collected data for employees’ years of experience and their corresponding salaries. To understand the distribution of this data we can generate a plot as shown in Fig 1. You can see that there’s a linear relationship between years of experience (x) and salary (y). This correlation allows us to build a linear regression model that outputs a salary given years of experience.
Fig 1: Points plotted using training examples (years of experience, salary)
Let's look at the internals of a linear regression algorithm. The algorithm fits a line of the form y = β0 + β1x to your data. Here x is called the independent variable or input variable and y the dependent variable or output variable; β1 represents the slope of the line and β0 the intercept. The goal of the algorithm is to find the line that best fits the set of points, that is, the line that minimizes the error, where the error is the distance from the points to the line.
Fig 2: The error is the distance between the actual value and the estimated regression line.
The line moves according to the parameter values (β0, β1). Similarly, you can perform multiple linear regression, using several input variables, to improve the general applicability of the learning algorithm.
Linear regression algorithm
Fig 3: Linear regression model with years of experience as the input variable and salary as the output variable
Hypothesis function: The equation h(x) = β0 + β1x, which maps the input x (years of experience) to the target y (salary).
Cost function: The cost function determines how well a line fits the data. For the hypothesis function h(x) = β0 + β1x, we choose β0 and β1 so that h(x) is close to y for the training examples (x, y). This is the optimization problem to solve.
Ordinary Least Squares method: The regression model uses the ordinary least squares methodology, which minimizes the sum of squared differences between the actual value (y) and the predicted value (y'), written as ∑(y − y')² over all training examples in the dataset, to find the best possible values of the regression coefficients β0 and β1.
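To make the least-squares idea concrete, here is a minimal sketch (not part of the original article) that fits a single-variable line with the closed-form OLS formulas; the years-of-experience and salary numbers are made up purely for illustration.
# A minimal sketch (with made-up data) of fitting y = b0 + b1*x by ordinary
# least squares, using the closed-form formulas for one input variable.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)         # hypothetical years of experience
y = np.array([40, 45, 52, 58, 66], dtype=float)    # hypothetical salaries (in thousands)

# Slope: covariance of x and y divided by the variance of x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# Intercept: makes the fitted line pass through the point of means
b0 = y.mean() - b1 * x.mean()

y_pred = b0 + b1 * x
sse = np.sum((y - y_pred) ** 2)                     # the quantity OLS minimizes
print(b0, b1, sse)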
Optimization algorithm
Gradient descent is one of the methods that can be used to reduce the error; it works by taking steps in the direction of the negative gradient. Gradient descent is an optimization algorithm that tweaks its parameters iteratively, and in machine learning it is used to update the parameters of a model. Let us relate gradient descent to a real-life analogy for better understanding. Think of a valley you would like to descend: you first take a step and check the slope of the valley, whether it goes up or down. Once you are sure of the downward direction, you follow it and repeat the step again and again until you have descended completely (reached the minimum). Gradient descent does exactly the same. The algorithm uses a learning rate parameter to size its steps. Choosing the learning rate is an important step: a large learning rate can overshoot the minimum, and a very small learning rate can take a very long time to reach it.
Fig 4: A very big learning rate can skip the minimum, and a very small learning rate might take a very large number of steps to reach it
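As a minimal sketch (not from the original article), the snippet below runs batch gradient descent on the two parameters of the line y = b0 + b1*x, minimizing the mean squared error on made-up data; the learning rate and iteration count are arbitrary choices for illustration.
# A minimal sketch (with made-up data) of batch gradient descent for the two
# parameters of the line y = b0 + b1*x.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)         # hypothetical years of experience
y = np.array([40, 45, 52, 58, 66], dtype=float)    # hypothetical salaries (in thousands)

b0, b1 = 0.0, 0.0
learning_rate = 0.01                                # too large overshoots, too small is slow
for _ in range(10000):
    y_pred = b0 + b1 * x
    error = y_pred - y
    # Gradients of the mean squared error with respect to b0 and b1
    grad_b0 = 2 * error.mean()
    grad_b1 = 2 * (error * x).mean()
    # Step in the direction of the negative gradient
    b0 -= learning_rate * grad_b0
    b1 -= learning_rate * grad_b1

print(b0, b1)                                       # approaches the least-squares solution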
Evaluating model performance
Once the model is built, the metrics used to determine model fit can have different values based on the type of data. Consider the real-life example from Fig 1, where we have each employee's years of experience and salary and want to find the correlation between them. Once the model is built, we calculate the R-squared score as a performance metric. An R-squared score of 0.92 means that 92% of the variability is explained by the model, which is good, because a high proportion of explained variability suggests the model generalizes well. The following are some metrics you can use to evaluate your regression model (a short sketch computing them follows the list):
R Square (Coefficient of Determination): This metric describes the proportion of variance explained by the model. It ranges between 0 and 1, and its value depends on the quality of the dataset. Generally a higher R² is desirable because it means the model generalizes better.
Adjusted R²: R square assumes that every single variable explains the variation in the dependent variable. The adjusted R square tells you the percentage of variation explained by only those independent variables that actually affect the dependent variable. A model that includes several predictors will return a higher R square value and may seem to be a better fit; however, this result can simply be due to it including more terms.
The adjusted R-squared compensates for the addition of variables: it only increases if a new predictor enhances the model beyond what would be obtained by chance, and it decreases when a predictor improves the model less than would be expected by chance.
Mean Squared Error (MSE): The average squared difference between the estimated values and the actual values. The MSE is a measure of the quality of an estimator; lower MSE values are desirable.
Mean Absolute Error (MAE): A measure of the difference between two continuous variables. It is more robust against the effect of outliers than MSE. Again, the lower the MAE value, the better.
Root Mean Square Error (RMSE): Interpreted as how far, on average, the residuals are from zero. It undoes the squaring in MSE by taking the square root, so the result is in the same units as the data. Again, the lower the better.
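The sketch below (not from the original article) computes these metrics for a set of made-up actual and predicted values; the adjusted R² is derived from R² using the number of samples n and the number of predictors p.
# A minimal sketch (with made-up values) computing the metrics above with
# scikit-learn and NumPy.
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_true = np.array([40, 45, 52, 58, 66], dtype=float)   # hypothetical actual salaries
y_pred = np.array([41, 46, 50, 59, 65], dtype=float)   # hypothetical predictions

r2 = r2_score(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mse)

# Adjusted R²: n = number of samples, p = number of predictors (1 here)
n, p = len(y_true), 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(r2, adj_r2, mse, mae, rmse)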
Pseudo code for building Linear Regression model using scikit-learn
# Import all the necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
#import your dataset in the place of example.csv
dataset = pd.read_csv("example.csv")
#specify your input and output variables based on your data, varies on different datasets
X= dataset.iloc[:,[2,3]].values
y= dataset.iloc[:,4].values
#Split the data into training and testing set using train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y, test_size=0.2,random_state=42)
#Perform feature scaling to transform the data before fitting the model
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
#Fit linear regression model to the data
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
#predict results using the testing set
y_pred = regressor.predict(X_test)
#Evaluate your results using metrics (here MSE and R2 score are calculated)
print("Mean squared error (MSE): %.2f" % np.mean((y_pred - y_test) ** 2))
print("R2-score: %.2f" % r2_score(y_test, y_pred))
Conclusion
Congratulations! In this article you have learned the basic concepts behind a linear regression model. Learn more about building a linear regression model for predicting house prices using Cloudera Data Science Workbench by completing Building a linear regression model for predicting house prices.