
Saturday, August 24, 2019

Regression Analysis


In statistics, regression analysis consists of techniques for modeling the relationship between a dependent variable (also called the response variable) and one or more independent variables (also known as explanatory variables or predictors). In regression, the dependent variable is modeled as a function of the independent variables, corresponding regression parameters (coefficients), and a random error term that represents the variation in the dependent variable left unexplained by the function of the independent variables and coefficients. In linear regression, the dependent variable is modeled as a linear function of a set of regression parameters plus a random error. The parameters must be estimated so that the model gives the "best fit" to the data.

Linear Regression Analysis Theory and Computing

Overview: Regression analysis is a process used to estimate a function that predicts the value of the response variable in terms of the values of the independent variables. If the regression function is determined entirely by a set of parameters, the regression is called parametric regression. Many methods have been developed to determine various parametric relationships between the response variable and the independent variables; examples include linear regression, logistic regression, Poisson regression, and probit regression. These methods typically depend on the form of the parametric regression function and on the distribution of the error term in the regression model: each of these models assumes a different regression function and an error term from a corresponding underlying distribution. A generalization of linear regression models has been formalized in the "generalized linear model", which requires specifying a link function that relates the linear predictor to the mean of the distribution function.
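As a concrete illustration, here is a minimal sketch in Python of fitting a generalized linear model with statsmodels, assuming that library is available; the Poisson family with its canonical log link is one choice of link function, and the count data below are simulated purely for illustration.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=100)
y = rng.poisson(np.exp(0.5 + 0.8 * x))   # counts generated through a log link

X = sm.add_constant(x)                   # add the intercept column
# The Poisson family's canonical log link relates the linear predictor
# to the mean of the response distribution.
model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(model.params)                      # estimates near (0.5, 0.8)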


Topics

The topics on regression analysis covered in this book are distributed among 9 chapters. Chapter 1 briefly introduces the basic concept of regression and defines the linear regression model. Chapters 2 and 3 cover simple linear regression and multiple linear regression. Although simple linear regression is a special case of multiple linear regression, we present it without using matrices and give detailed derivations that highlight the fundamental concepts of linear regression. The presentation of multiple regression focuses on the concepts of vector spaces, linear projection, and linear hypothesis tests. Matrix theory is used extensively in the proofs of the statistical properties of the linear regression model. Chapters 4 through 6 discuss the diagnosis of a linear regression model. These chapters cover outlier detection, identification of influential observations, collinearity, confounding, regression on dummy variables, checking the equal-variance assumption, graphical display of residual diagnostics, and variable transformation techniques in linear regression analysis. Chapters 7 and 8 provide further discussion of generalizations of ordinary least squares estimation in linear regression. In these two chapters, we discuss how to extend the regression model to situations where the equal-variance assumption on the error term fails; to model regression data with unequal variance, the generalized least squares method is introduced. In Chapter 7, two shrinkage estimators, ridge regression and the LASSO, are introduced and discussed, and a brief discussion of the least-squares method for nonlinear regression is also included. Chapter 8 briefly introduces generalized linear models; in particular, Poisson regression for count data and logistic regression for binary data are discussed. Chapter 9 briefly discusses Bayesian linear regression models, and the Bayes averaging method is introduced and discussed.





About the Author:

Ed Neil O. Maratas is an instructor with regular status at Jose Rizal Memorial State University, Dapitan Campus, Philippines. He earned his Bachelor of Science in Statistics at Mindanao State University-Tawi-Tawi College of Technology and Oceanography in 2003 and finished his Master of Arts in Mathematics at Jose Rizal Memorial State University in 2009. He became a researcher and a data analyst, and has been engaged in several projects linked to the university as a data processor.


Prepared by: ednielmaratas@gmail.com, or you can visit the Facebook page Statistics For Fun for more details about statistics.


Friday, August 2, 2019

Types of Regression Analysis

Types of Regression

What are the types of regression? Here they are:

Linear Regression

Logistic Regression

Polynomial Regression

Stepwise Regression

Ridge Regression

Lasso Regression

ElasticNet Regression

1. Linear Regression

It is one of the most widely known modeling techniques. Linear regression is usually among the first few topics people pick up while learning predictive modeling. In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the regression line is linear in nature.

Linear Regression establishes a relationship between the dependent variable (Y) and one or more independent variables (X) using a best fit straight line (also known as a regression line).

It is represented by the equation Y = a + b*X + e, where a is the intercept, b is the slope of the line, and e is the error term. This equation can be used to predict the value of the target variable from given predictor variable(s).
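A minimal sketch of fitting this line by ordinary least squares in Python, using only numpy; the intercept, slope, and noisy data below are invented for illustration.

import numpy as np

rng = np.random.default_rng(1)
X = np.linspace(0, 10, 50)
Y = 2.0 + 3.0 * X + rng.normal(scale=1.5, size=X.size)  # true a = 2, b = 3

b, a = np.polyfit(X, Y, deg=1)   # for deg=1, polyfit returns (slope, intercept)
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")

Y_hat = a + b * X                # predicted values from the fitted line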

2. Logistic Regression

Logistic regression is used to find the probability of event = Success or event = Failure. We should use logistic regression when the dependent variable is binary (0/1, True/False, Yes/No) in nature. Here the value of Y ranges from 0 to 1, and the model can be represented by the following equations.

odds = p/(1-p) = probability of the event occurring / probability of the event not occurring
ln(odds) = ln(p/(1-p))
logit(p) = ln(p/(1-p)) = b0 + b1*X1 + b2*X2 + ... + bk*Xk

Above, p is the probability of the presence of the characteristic of interest. A question you should ask here is "why have we used log in the equation?".

Since we are working here with a binomial distribution (dependent variable), we need to choose the link function best suited for this distribution, and that is the logit function. In the equation above, the parameters are chosen to maximize the likelihood of observing the sample values rather than to minimize the sum of squared errors (as in ordinary regression).
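A minimal sketch of logistic regression in Python with scikit-learn, which fits the coefficients by maximum likelihood as described above; the binary outcome and single predictor are simulated for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 1))
p = 1 / (1 + np.exp(-(0.5 + 2.0 * X[:, 0])))  # true logit: 0.5 + 2.0*x
y = rng.binomial(1, p)

clf = LogisticRegression().fit(X, y)      # coefficients chosen by maximum likelihood
print(clf.intercept_, clf.coef_)          # roughly 0.5 and 2.0
print(clf.predict_proba(X[:3]))           # P(y=0) and P(y=1) for the first rows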

3. Polynomial Regression

A regression equation is a polynomial regression equation if the power of the independent variable is more than 1. The equation below represents a polynomial equation:

y = a + b*x^2

In this regression technique, the best-fit line is not a straight line; it is rather a curve that fits the data points.
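One common way to fit such a curve, sketched below in Python with scikit-learn: expand the predictor into polynomial features, then fit an ordinary linear model on them; the data are invented for illustration.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(3)
x = np.linspace(-3, 3, 80).reshape(-1, 1)
y = 1.0 + 0.5 * x[:, 0] ** 2 + rng.normal(scale=0.3, size=80)  # y = a + b*x^2 plus noise

# degree=2 adds the x^2 column; the model stays linear in its
# coefficients, so ordinary least squares still applies.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.predict(x[:3]))              # fitted curve evaluated at the first points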

4. Stepwise Regression

This form of regression is used when we deal with multiple independent variables. In this technique, the selection of independent variables is done with the help of an automatic process, which involves no human intervention.

This feat is achieved by observing statistical values like R-square, t-stats, and the AIC metric to discern significant variables. Stepwise regression basically fits the regression model by adding/dropping covariates one at a time based on a specified criterion. Some of the most commonly used stepwise regression methods are listed below:

Standard stepwise regression does two things: it adds and removes predictors as needed at each step.

Forward selection starts with the most significant predictor in the model and adds a variable at each step.

Backward elimination starts with all predictors in the model and removes the least significant variable at each step.

The aim of this modeling technique is to maximize the prediction power with a minimum number of predictor variables. It is one of the methods for handling higher-dimensional data sets.
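A rough sketch of forward selection in Python, using AIC from statsmodels as the stopping criterion; the data, in which only two of four predictors truly matter, are simulated for illustration.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("abcd"))
y = 1.0 + 2.0 * X["a"] - 1.5 * X["c"] + rng.normal(size=200)  # only a and c matter

selected, remaining = [], list(X.columns)
best_aic = np.inf
while remaining:
    # Try adding each remaining predictor; keep the one that lowers AIC most.
    aics = {v: sm.OLS(y, sm.add_constant(X[selected + [v]])).fit().aic
            for v in remaining}
    best_var = min(aics, key=aics.get)
    if aics[best_var] >= best_aic:
        break                            # no candidate improves the fit; stop
    best_aic = aics[best_var]
    selected.append(best_var)
    remaining.remove(best_var)

print(selected)                          # typically ['a', 'c']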

5. Ridge Regression

Ridge Regression is a technique used when the data suffer from multicollinearity (independent variables are highly correlated). Under multicollinearity, even though the ordinary least squares (OLS) estimates are unbiased, their variances are large, which can push the estimated values far from the true values. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors.

Above, we saw the equation for linear regression. Remember? It can be represented as:

y = a + b*x

This equation also has an error term. The complete equation becomes:

y = a + b*x + e (error term), [the error term is the value needed to correct for the prediction error between the observed and predicted values]

y = a + b1*x1 + b2*x2 + ... + e, for multiple independent variables.

In a linear equation, prediction errors can be decomposed into two sub-components: the first is due to bias and the second is due to variance. Prediction error can occur due to either of these two components or both. Here, we’ll discuss the error caused by variance.
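A minimal sketch of ridge regression in Python with scikit-learn, on invented data with two deliberately collinear predictors, to show the shrinkage at work.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(5)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # nearly a copy of x1: collinear
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(size=200)

print(LinearRegression().fit(X, y).coef_)    # OLS: unstable, can be wildly large
# alpha controls the L2 penalty: larger alpha means more shrinkage,
# trading a little bias for much smaller coefficient variance.
print(Ridge(alpha=1.0).fit(X, y).coef_)      # shrunk, stable estimates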

6. Lasso Regression


Similar to ridge regression, Lasso (Least Absolute Shrinkage and Selection Operator) also penalizes the size of the regression coefficients. In addition, it is capable of reducing the variability and improving the accuracy of linear regression models. Lasso regression differs from ridge regression in that it uses absolute values in the penalty function instead of squares. This penalty (equivalently, a constraint on the sum of the absolute values of the estimates) causes some of the parameter estimates to turn out exactly zero. The larger the penalty applied, the further the estimates are shrunk toward zero. This results in variable selection out of the given n variables.
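A minimal sketch of the lasso in Python with scikit-learn; the data are simulated so that only the first two of ten predictors matter, which makes the zeroing-out behavior visible.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(size=200)

# The L1 penalty drives small coefficients exactly to zero, so the
# fitted model performs variable selection as described above.
lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)                       # most entries come out exactly 0.0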

7. ElasticNet Regression

ElasticNet is a hybrid of the Lasso and Ridge regression techniques. It is trained with both L1 and L2 priors as regularizers. Elastic-net is useful when there are multiple correlated features: Lasso is likely to pick one of these at random, while elastic-net is likely to pick both.


A practical advantage of trading-off between Lasso and Ridge is that it allows Elastic-Net to inherit some of Ridge’s stability under rotation.
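A minimal sketch of elastic net in Python with scikit-learn, on invented data containing a pair of correlated predictors; the l1_ratio parameter sets the mix between the lasso and ridge penalties.

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(7)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)    # correlated with x1
X = np.column_stack([x1, x2, rng.normal(size=(200, 3))])
y = 2.0 * x1 + 2.0 * x2 + rng.normal(size=200)

# l1_ratio mixes the two penalties: 1.0 is pure lasso, 0.0 is pure ridge.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(enet.coef_)                        # tends to keep both correlated predictors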

How to select the right regression model?

Life is usually simple when you know only one or two techniques. One of the training institutes I know of tells its students: if the outcome is continuous, apply linear regression; if it is binary, use logistic regression! However, the higher the number of options at our disposal, the more difficult it becomes to choose the right one. A similar thing happens with regression models.

Among the many types of regression models, it is important to choose the technique best suited to the types of independent and dependent variables, the dimensionality of the data, and other essential characteristics of the data. Below are the key factors to consider in selecting the right regression model:

Data exploration is an inevitable part of building a predictive model. It should be your first step before selecting a model: identify the relationships among the variables and their impact.

To compare the goodness of fit of different models, we can analyze different metrics like the statistical significance of parameters, R-square, adjusted R-square, AIC, BIC, and the error term. Another is Mallow's Cp criterion, which essentially checks for possible bias in your model by comparing the model with all possible submodels (or a careful selection of them).

Cross-validation is the best way to evaluate models used for prediction. Here you divide your data set into two groups (train and validate). A simple mean squared difference between the observed and predicted values gives you a measure of prediction accuracy (a minimal sketch follows this list).

If your data set has multiple confounding variables, you should not choose the automatic model selection method because you do not want to put these in a model at the same time.

It will also depend on your objective. It can happen that a less powerful model is easier to implement than a highly statistically significant one.

Regression regularization methods (Lasso, Ridge, and ElasticNet) work well in cases of high dimensionality and multicollinearity among the variables in the data set.
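As promised above, a minimal sketch of cross-validated model comparison in Python with scikit-learn; the data and the two candidate models are invented for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(8)
X = rng.normal(size=(150, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(size=150)

for model in (LinearRegression(), Ridge(alpha=1.0)):
    # 5-fold CV: each fold is held out once as the validation set; the mean
    # squared error on it estimates out-of-sample prediction accuracy.
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(type(model).__name__, round(mse, 3))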

By now, I hope you have an overview of regression. These regression techniques should be applied with the conditions of the data in mind. One of the best tricks for finding out which technique to use is checking the family of your variables, i.e., discrete or continuous.



Friday, July 26, 2019

Regression Analysis: Overview

Overview

Suppose you’re a sales manager trying to predict next month’s numbers. You know that dozens, perhaps even hundreds, of factors (from the weather to a competitor’s promotion to the rumor of a new and improved model) can impact the number. Perhaps people in your organization even have a theory about what will have the biggest effect on sales. “Trust me. The more rain we have, the more we sell.” “Six weeks after the competitor’s promotion, sales jump.” The answer is to use regression analysis.

While there are many types of regression analysis, at their core they all examine the influence of one or more independent variables on a dependent variable.

What is regression analysis and what does it mean to perform a regression?

Regression analysis is a powerful statistical method that allows you to examine the relationship between two or more variables of interest. 

Regression analysis is a way of mathematically sorting out which of those variables does indeed have an impact. It answers the questions: Which factors matter most? Which can we ignore? How do those factors interact with each other? And, perhaps most importantly, how certain are we about all of these factors?


Regression analysis is a reliable method of identifying which variables have impact on a topic of interest. The process of performing a regression allows you to confidently determine which factors matter most, which factors can be ignored, and how these factors influence each other.

In order to understand regression analysis fully, it’s essential to comprehend the following terms:
Dependent Variable: This is the main factor that you’re trying to understand or predict. 
Independent Variables: These are the factors that you hypothesize have an impact on your dependent variable.

Consider an application training example: attendees’ satisfaction with the event is our dependent variable. The topics covered, the length of sessions, the food provided, and the cost of a ticket are our independent variables.


How does regression analysis work?

In order to conduct a regression analysis, you’ll need to define a dependent variable that you hypothesize is being influenced by one or several independent variables.

You’ll then need to establish a comprehensive dataset to work with. Administering surveys to your audiences of interest is a terrific way to establish this dataset. Your survey should include questions addressing all of the independent variables that you are interested in.

Let’s continue using our application training example. In this case, we’d want to measure the historical levels of satisfaction with the events from the past three years or so (or however long you deem statistically significant), as well as any information possible in regards to the independent variables. 

Perhaps we’re particularly curious about how the price of a ticket to the event has impacted levels of satisfaction. 

To begin investigating whether or not there is a relationship between these two variables, we would begin by plotting these data points on a chart, which would look like the following theoretical example.


(Plotting your data is the first step in figuring out if there is a relationship between your independent and dependent variables)

Our dependent variable (in this case, the level of event satisfaction) should be plotted on the y-axis, while our independent variable (the price of the event ticket) should be plotted on the x-axis.

Once your data is plotted, you may begin to see correlations. If the theoretical chart above did indeed represent the impact of ticket prices on event satisfaction, then we’d be able to confidently say that the higher the ticket price, the higher the levels of event satisfaction. 

But how can we tell the degree to which ticket price affects event satisfaction?

To begin answering this question, draw a line through the middle of all of the data points on the chart. This line is referred to as your regression line, and it can be precisely calculated using a standard statistics program like Excel.

We’ll use a theoretical chart once more to depict what a regression line should look like.


The regression line represents the relationship between your independent variable and your dependent variable. 

Excel will even provide a formula for the slope of the line, which adds further context to the relationship between your independent and dependent variables. 

The formula for a regression line might look something like Y = 100 + 7X + error term.

This tells you that if there is no “X”, then Y = 100. If X is our increase in ticket price, this informs us that with no increase in ticket price, event satisfaction will still be at 100 points.

You’ll notice that the slope formula calculated by Excel includes an error term. Regression lines always consider an error term because in reality, independent variables are never precisely perfect predictors of dependent variables. This makes sense while looking at the impact of ticket prices on event satisfaction — there are clearly other variables that are contributing to event satisfaction outside of price.

Your regression line is simply an estimate based on the data available to you. So, the larger your error term, the less definitively certain your regression line is.
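A minimal sketch of computing such a regression line and its error term in Python with numpy, mirroring what Excel's trendline does; the ticket prices and satisfaction scores below are invented for illustration.

import numpy as np

price = np.array([10, 15, 20, 25, 30, 35, 40], dtype=float)
satisfaction = np.array([165, 210, 245, 270, 310, 340, 385], dtype=float)

slope, intercept = np.polyfit(price, satisfaction, deg=1)
print(f"Y = {intercept:.0f} + {slope:.1f}X + error term")

# The residuals are the error term: the part of satisfaction the line
# fails to explain. Larger residuals mean a less certain regression line.
residuals = satisfaction - (intercept + slope * price)
print(residuals)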

Why should your organization use regression analysis?


Regression analysis is a helpful statistical method that can be leveraged across an organization to determine the degree to which particular independent variables influence dependent variables.

The possible scenarios for conducting regression analysis to yield valuable, actionable business insights are endless.

The next time someone in your business is proposing a hypothesis that states that one factor, whether you can control that factor or not, is impacting a portion of the business, suggest performing a regression analysis to determine just how confident you should be in that hypothesis! This will allow you to make more informed business decisions, allocate resources more efficiently, and ultimately boost your bottom line.