# Derivative of binary logistic regression

We wrap the sigmoid function around the same prediction function we used in multiple linear regression. Squaring this prediction, as we do in MSE, results in a non-convex function with many local minima. If our cost function has many local minima, gradient descent may not find the global minimum. Instead, we use cross-entropy loss, which can be divided into two separate cost functions: one for y = 1 and one for y = 0. These smooth monotonic functions (always increasing or always decreasing) make it easy to calculate the gradient and minimize cost.

The key thing to note is the cost function penalizes confident and wrong predictions more than it rewards confident and right predictions!

The corollary is that pushing predictions closer to 0 or 1 has diminishing returns on reducing cost, due to the logarithmic nature of our cost function.
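To make the asymmetry and the diminishing returns concrete, here is a minimal sketch of the per-example cross-entropy cost. The function name `cross_entropy` and the clipping constant are illustrative assumptions, not part of the original text.

```python
import numpy as np

def cross_entropy(p, y):
    """Per-example cost: -log(p) when y == 1, -log(1 - p) when y == 0."""
    # Clip to avoid log(0); the epsilon is an arbitrary illustrative choice.
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# True label is 1 in every case; only the predicted probability changes.
for p in (0.01, 0.5, 0.9, 0.99):
    print(f"predicted {p:.2f} -> cost {cross_entropy(p, 1):.4f}")
```

A confident wrong prediction (0.01) costs far more than a confident right one (0.99) saves, and moving from 0.9 to 0.99 reduces the cost much less than moving from 0.5 to 0.9.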

In both cases (y = 1 and y = 0) we only perform the operation we need to perform. To minimize our cost, we use gradient descent, just as we did for linear regression. Machine learning libraries like Scikit-learn hide their implementations so you can focus on more interesting things! One of the neat properties of the sigmoid function is that its derivative is easy to calculate.
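The sigmoid derivative has the closed form s'(z) = s(z)(1 - s(z)). The sketch below checks that identity against a central finite difference; the function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Closed-form derivative: s(z) * (1 - s(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# Compare against a central finite-difference approximation.
z = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.max(np.abs(numeric - sigmoid_prime(z))))
```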

Michael Nielsen also covers the topic in chapter 3 of his book. Notice how this gradient has the same form as the Mean Squared Error gradient; the only difference is the hypothesis function. Our training code is the same as we used for linear regression. Accuracy measures how correct our predictions were.
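As a rough sketch of how these pieces fit together, the following trains a binary classifier with batch gradient descent and then measures accuracy. The synthetic data, learning rate, iteration count, 0.5 threshold, and all function names are assumptions made for illustration, not the original article's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, w):
    # Mean cross-entropy; clipping guards against log(0).
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(X, y, w):
    # Same form as the MSE gradient, with sigmoid(X @ w) as the hypothesis.
    return X.T @ (sigmoid(X @ w) - y) / len(y)

def train(X, y, lr=0.1, iters=2000):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= lr * gradient(X, y, w)
    return w

def predict(X, w, threshold=0.5):
    return (sigmoid(X @ w) >= threshold).astype(float)

def accuracy(y_pred, y_true):
    # Fraction of predicted labels that match the true labels.
    return np.mean(y_pred == y_true)

# Synthetic one-feature data set (purely illustrative).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])  # intercept column + feature
y = (x + 0.5 * rng.normal(size=200) > 0).astype(float)

w = train(X, y)
print("cost:", cost(X, y, w))
print("accuracy:", accuracy(predict(X, w), y))
```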

In this case we simply compare predicted labels to true labels and divide by the total. Another helpful technique is to plot the decision boundary on top of our predictions to see how our labels compare to the actual labels. This involves plotting our predicted probabilities and coloring them with their true labels.

For multiclass problems, we basically re-run binary classification multiple times, once for each class.

Then we take the class with the highest predicted value.

Linear regression and logistic regression predict different things. Linear regression predictions are continuous numbers in a range. Logistic regression could help us predict whether the student passed or failed.

Unfortunately, most derivations like the ones in [Agresti, ] or [Hastie, et.

To make the discussion easier, we will focus on the binary response case. The logistic regression model assumes that the log-odds of an observation y can be expressed as a linear function of the K input variables x:

$$\log\frac{P(x)}{1 - P(x)} = b_0 + \sum_{j=1}^{K} b_j x_j$$

The left-hand side of the above equation is called the logit of P (hence the name logistic regression).

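Exponentiating both sides of the log-odds relation shows that the odds P/(1 - P) are a product of factors exp(b_j x_j), one per input. This immediately tells us that logistic models are multiplicative in their inputs (rather than additive, like a linear model), and it gives us a way to interpret the coefficients: exp(b_j) is the factor by which the odds change when x_j increases by one unit, all other things being equal.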

Suppose exp(b_j) = 2. If x_j is a binary variable (say, sex, with female coded as 1 and male as 0), then the odds of the response being true are twice as high for a female subject as for a male subject, all other things being equal.

Solving the log-odds relation for P gives P = 1 / (1 + exp(-z)), the sigmoid of z, where z is the linear combination of the inputs. The sigmoid maps the real line to the interval (0, 1) and is approximately linear near the origin. Later, we will want to take the gradient of P with respect to the set of coefficients b, rather than z.
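A small numeric sketch of this interpretation follows. The intercept of -1 and the coefficient log(2) are hypothetical values chosen so that exp(b_j) = 2; note that it is the odds that double, not the probability itself.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def odds(p):
    return p / (1.0 - p)

# Hypothetical coefficients: intercept, and a binary input with exp(b_j) = 2.
b0, bj = -1.0, np.log(2.0)

p_male = sigmoid(b0 + bj * 0)    # x_j = 0
p_female = sigmoid(b0 + bj * 1)  # x_j = 1

odds_ratio = odds(p_female) / odds(p_male)
prob_ratio = p_female / p_male
print("odds ratio:", odds_ratio)
print("probability ratio:", prob_ratio)
```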

The solution to a logistic regression problem is the set of parameters b that maximizes the likelihood of the data, which is expressed as the product of the predicted probabilities of the N individual observations. Because the logarithm is monotonic, maximizing the log-likelihood will maximize the likelihood.

The deviance (-2 times the log-likelihood) is analogous to the residual sum of squares (RSS) of a linear model. Ordinary least squares minimizes RSS; logistic regression minimizes deviance. A useful goodness-of-fit heuristic for a logistic regression model is to compare the deviance of the model with the so-called null deviance, the deviance of a constant model that always predicts the overall mean of the response. One minus the ratio of deviance to null deviance is sometimes called pseudo-R², and is used the way one would use R² to evaluate a linear model.
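These quantities can be sketched in a few lines. The code assumes the deviance is -2 times the log-likelihood and that the null model predicts the mean of the response for every observation; the function names and the toy data are illustrative.

```python
import numpy as np

def log_likelihood(y, p):
    # Sum over observations of the log of the probability assigned to the observed label.
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def deviance(y, p):
    return -2.0 * log_likelihood(y, p)

def pseudo_r2(y, p):
    # Null model: predict the overall mean response for every observation.
    null_p = np.full_like(p, y.mean())
    return 1.0 - deviance(y, p) / deviance(y, null_p)

# Toy labels and predicted probabilities (made up for illustration).
y = np.array([0, 0, 1, 1, 1, 0], dtype=float)
p = np.array([0.1, 0.3, 0.8, 0.7, 0.9, 0.2])

print("deviance:", deviance(y, p))
print("pseudo-R2:", pseudo_r2(y, p))
```

A model that does no better than the null model gets a pseudo-R² of 0, and a model that assigns probability close to 1 to every observed label approaches 1.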

Traditional derivations of logistic regression tend to start by substituting the logit function directly into the log-likelihood equations, and expanding from there. To maximize the log-likelihood, we take its gradient with respect to b. The maximum occurs where the gradient is zero. We can now cancel terms and set the gradient to zero.

This gives us the set of simultaneous equations that are true at the optimum. Notice that the equations to be solved are in terms of the probabilities P (which are a function of b), not directly in terms of the coefficients b themselves.
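For reference, the steps described above can be reconstructed as follows. This is a sketch under stated assumptions: it uses the standard binary log-likelihood, writes P_i for the predicted probability of observation i with z_i = b · x_i (a constant 1 is included in x_i for the intercept), and uses the sigmoid derivative dP/dz = P(1 - P). The index conventions are assumptions rather than the original notation.

$$
\begin{aligned}
\ell(b) &= \sum_{i=1}^{N} \left[ y_i \log P_i + (1 - y_i)\log(1 - P_i) \right] \\
\nabla_b \ell &= \sum_{i=1}^{N} \left[ \frac{y_i}{P_i} - \frac{1 - y_i}{1 - P_i} \right] P_i (1 - P_i)\, x_i
 = \sum_{i=1}^{N} (y_i - P_i)\, x_i \\
\nabla_b \ell = 0 &\;\Longrightarrow\; \sum_{i=1}^{N} y_i\, x_i = \sum_{i=1}^{N} P_i\, x_i
\end{aligned}
$$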

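This means that logistic models are coordinate-free: for a given set of input variables, the predicted probabilities stay the same if the variables are shifted, combined, or rescaled. Only the values of the coefficients will change.

The other thing to notice from the above equations is that the sum of probability mass across each coordinate of the x_i vectors is equal to the count of observations with that coordinate value for which the response was true. For example, suppose the jth input variable is 1 if the subject is female, 0 if the subject is male. Then the jth equation says that the predicted probabilities summed over all female subjects equal the number of female subjects for whom the response was true.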
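Both properties can be checked numerically. The sketch below fits an unregularized model with plain Newton steps on synthetic data; the variable names (`female`, `age`), the true coefficients, and the fixed iteration count are all assumptions made for illustration, and a production fit would add convergence checks.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_newton(X, y, iters=25):
    """Maximize the log-likelihood with Newton steps (no regularization)."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ b)
        grad = X.T @ (y - p)                       # gradient of the log-likelihood
        hess = X.T @ (X * (p * (1 - p))[:, None])  # negative Hessian, X^T W X
        b += np.linalg.solve(hess, grad)
    return b

# Synthetic data: intercept, a binary input, and a continuous input.
rng = np.random.default_rng(1)
n = 500
female = rng.integers(0, 2, size=n).astype(float)
age = rng.normal(size=n)
X = np.column_stack([np.ones(n), female, age])
y = (rng.random(n) < sigmoid(X @ np.array([-0.5, 0.7, 1.0]))).astype(float)

p = sigmoid(X @ fit_newton(X, y))

# Marginal preservation: probability mass over females equals the count of
# females with a true response.
mask = female == 1
print(p[mask].sum(), y[mask].sum())

# Coordinate-free: shifting and rescaling an input changes the coefficients
# but not the fitted probabilities.
X2 = np.column_stack([np.ones(n), female, 10.0 * age + 3.0])
p2 = sigmoid(X2 @ fit_newton(X2, y))
print(np.max(np.abs(p - p2)))
```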

This is what we mean when we say that logistic regression preserves the marginal probabilities of the training data.

Suppose you have a vector-valued function f: Assuming that we start with an initial guess b_0, we can take the Taylor expansion of f around b