 Linear regression is a special form of function approximation, used to model a given set of random variables.
 Linear regression is about finding the relationship between a set of random variables so that new values can be predicted.

In linear regression we have the following scenario:
 One of the random variables is considered to be of particular interest; that random variable is referred to as the dependent variable, or response.
 The remaining random variables are called independent variables, or regressors; their role is to explain or predict the statistical behavior of the response.
 The dependence of the response on the regressors includes an additive error term, to account for uncertainties in the manner in which this dependence is formulated. This is called the expectational error or explanational error.
 Such a model is called a regression model.
 In mathematics and statistics, particularly in the fields of machine learning and inverse problems, regularization involves introducing additional information in order to solve an ill-posed problem or to prevent overfitting.
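The scenario above can be sketched numerically; the dimensions, weights, and noise level below are hypothetical, chosen only to make the roles of response, regressors, and expectational error concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: N training samples, M regressors per sample.
N, M = 200, 3
w_true = np.array([1.5, -2.0, 0.7])   # fixed but unknown parameter vector w

X = rng.normal(size=(N, M))           # regressors x_i (independent variables)
eps = rng.normal(scale=0.5, size=N)   # additive expectational error
d = X @ w_true + eps                  # response d_i = w^T x_i + eps_i
```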

Classes of regression models:
 Linear: The dependence of the response on the regressors is defined by a linear function.
 Nonlinear: The dependence of the response on the regressors is defined by a nonlinear function.

Methods for training the linear regression model:

Bayesian Theory:
 Used to derive the maximum a posteriori (MAP) estimate of the vector that parameterizes a linear regression model.

Method of Least Squares:
 This is the oldest parameter estimation procedure.
 It was introduced by Gauss in the early part of the 19th century.

 The model order is the common dimension of the regressor vector and the fixed weight vector.

The problem of interest here is stated as follows:
 Given the joint statistics of the regressor X and the corresponding response D, estimate the unknown parameter vector w.
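A minimal numerical sketch of this estimation problem (all values hypothetical): draw regressors, form responses from a known w plus error, then recover w by least squares via the normal equations.

```python
import numpy as np

rng = np.random.default_rng(0)

w_true = np.array([1.5, -2.0, 0.7])               # the "unknown" parameter vector
X = rng.normal(size=(200, 3))                     # regressors
d = X @ w_true + rng.normal(scale=0.1, size=200)  # responses with additive error

# Least-squares estimate: minimize ||d - X w||^2 by solving
# the normal equations (X^T X) w = X^T d.
w_hat = np.linalg.solve(X.T @ X, X.T @ d)
```

With 200 samples and small noise, w_hat lands close to w_true.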

Maximum a posteriori (MAP) estimation is more profound than maximum-likelihood (ML) estimation because:
 MAP exploits all the conceivable information about the parameter vector w.
 ML relies solely on the observation model (d, x) and may therefore lead to a nonunique solution. To enforce uniqueness and stability, the prior has to be incorporated into the formulation of the estimator.

Assumptions considered in parameter estimation process:

Samples are statistically independent and identically distributed (i.i.d.).
 Statistical independence means the covariance matrix has no off-diagonal entries: the cross-covariances are zero.
 Identical distribution means the remaining values, on the diagonal, are all equal.
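This can be checked numerically (a sketch, with arbitrary sample counts): for i.i.d. components the sample covariance matrix comes out close to diagonal, with equal values on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)

# 100,000 samples of a 3-dimensional vector whose components are
# independent and identically distributed (each N(0, 1)).
samples = rng.normal(size=(100_000, 3))
C = np.cov(samples, rowvar=False)

# Off-diagonal entries (cross-covariances) are near zero;
# diagonal entries (variances) are all near 1, i.e. equal.
off_diag = C - np.diag(np.diag(C))
```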

Gaussianity.
 The environment responsible for the generation of the training samples is Gaussian distributed.

Stationarity.
 The environment is stationary, which means that the parameter vector w is fixed, but unknown, throughout the N trials of the experiment.

 [pp. 103–106] Finding the MAP estimate of the weight vector.
 In improving the stability of the maximum-likelihood estimator through the use of regularization (i.e. the incorporation of prior knowledge), the resulting maximum a posteriori estimator becomes biased.
 In short, we have a tradeoff between stability and bias.
 The ordinary least-squares cost is defined as the squared expectational errors summed over the N experimental trials on the environment.
 When the regularization parameter λ = 0, we have complete confidence in the observation model exemplified by the training samples. At the other extreme, λ = ∞, we have no confidence in the observation model.
 The regularized least-squares (RLS) solution is obtained by minimizing the regularized cost function with respect to the parameter vector w.
 This solution is identical to the MAP estimate solution.
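A sketch of the regularized least-squares solution w = (XᵀX + λI)⁻¹ Xᵀd (data and λ values hypothetical): setting λ = 0 recovers the ordinary least-squares estimate, while a larger λ expresses less confidence in the observations and shrinks w toward the zero-mean prior.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
d = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.1, size=200)

def rls(X, d, lam):
    """Regularized least squares: minimize ||d - X w||^2 + lam * ||w||^2."""
    M = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ d)

w_ols = rls(X, d, lam=0.0)    # complete confidence in the observation model
w_reg = rls(X, d, lam=10.0)   # prior pulls the estimate toward zero
```

The shrinkage is the bias bought in exchange for stability: w_reg always has smaller norm than w_ols.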
 The representation of a stochastic process by a linear model may be used for synthesis or analysis.

In synthesis, we generate a desired time series by assigning a formulated set of values to the parameters of the model and feeding it with white noise (of zero mean and prescribed variance). The model obtained this way is called a generative model.
 Note: I bet that synthesis is similar to simulation; is that right or wrong?!
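Synthesis can be sketched with the simplest such generative model, a first-order autoregressive (AR(1)) process; the coefficient 0.8 and the unit noise variance are assumed values, not from the book:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthesis: fix the model parameter and drive the linear model
# with zero-mean, unit-variance white noise.
a, N = 0.8, 1000
noise = rng.normal(size=N)

x = np.zeros(N)
for n in range(1, N):
    x[n] = a * x[n - 1] + noise[n]   # generative model for the time series

# Analysis runs the other way: given only x, least squares
# recovers the parameter from the series itself.
a_hat = np.dot(x[1:], x[:-1]) / np.dot(x[:-1], x[:-1])
```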
 In analysis, we estimate the parameters of the model by processing a given time series of finite length, using the Bayesian approach or the regularized method of least squares.
 In estimation, we need to pick the most suitable model to process the data. This is called the model selection problem.
 The Minimum-Description-Length (MDL) principle is a model selection method pioneered by Rissanen.

MDL was inspired from:

Kolmogorov Complexity Theory. Kolmogorov defined complexity as follows:
 “The algorithmic complexity of a data sequence is the length of the shortest binary computer program that prints out the sequence and then halts.”
 Regularity itself may be identified with the “ability to compress”.
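Compressibility as a proxy for regularity can be seen directly with a general-purpose compressor (a rough illustration only — Kolmogorov complexity itself is uncomputable):

```python
import os
import zlib

regular = b"ab" * 5000            # a highly regular 10,000-byte sequence
random_like = os.urandom(10_000)  # 10,000 bytes with no exploitable pattern

# The regular sequence compresses to a tiny fraction of its length;
# the patternless one barely compresses at all.
len_regular = len(zlib.compress(regular))
len_random = len(zlib.compress(random_like))
```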

 MDL states: “Given a set of hypotheses and a data sequence d, we should try to find the particular hypothesis, or some combination of hypotheses in the set, that compresses the data sequence d the most”.

The MDL principle has many versions; one of the oldest and most simplistic is called the simplistic two-part code MDL principle for probabilistic modeling.
 By “simplistic” we mean that the codelengths under consideration are not determined in an optimal fashion.
 The terms “code” and “codelengths” pertain to the process of encoding the data sequence in the shortest or least redundant manner.
 [p. 111] In MDL we are interested in identifying the model that best explains an unknown environment responsible for generating the training sample {(x_i, d_i)}, where x_i is the stimulus and d_i is the corresponding response.
 The above description is called the model-order selection problem.
 [p. 111] Description of the MDL selection process.

Attributes of MDL Principle:
 The MDL principle implements a precise form of Occam’s razor, which states a preference for simple theories: “Accept the simplest explanation that fits the data”.
 The MDL principle is a consistent model selection estimator in the sense that it converges to the true model order as the sample size increases.
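A sketch of the simplistic two-part code idea applied to model-order selection. The codelength formula used here — (k+1)/2 · ln N bits-worth for the parameters plus N/2 · ln(RSS/N) for the data given the fitted model — is one common approximation, assumed for illustration; exact formulas vary across MDL versions. The data come from a cubic, so the principle should prefer order 3 over both underfit and heavily overfit polynomials:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a cubic (true order 3) observed through Gaussian noise.
N = 200
t = np.linspace(-1, 1, N)
d = 1.0 - 2.0 * t + 0.5 * t**3 + rng.normal(scale=0.1, size=N)

def mdl_score(k):
    """Two-part codelength for a degree-k polynomial hypothesis:
    the first part codes the k+1 fitted parameters, the second part
    codes the data given the model, via the residual sum of squares."""
    coeffs = np.polyfit(t, d, deg=k)
    rss = np.sum((np.polyval(coeffs, t) - d) ** 2)
    return 0.5 * (k + 1) * np.log(N) + 0.5 * N * np.log(rss / N)

best_k = min(range(1, 9), key=mdl_score)   # typically recovers order 3
```

The parameter-cost term is what stops the score from always preferring the highest degree, even though higher degrees always lower the RSS.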

Questions:

[p. 99] What is meant by a stationary environment?
 The statistics of the system don’t change with time.

[p. 100] Why, in the joint statistics, do we need the correlation matrix of the regressors and the variance of D?
 To exploit the second-order dependency of the response D on the regressors.

[p. 100] Why are the means of X and D assumed to be zero?
 To normalize the data and put them in the same space.

[p. 101] What is meant by p(w, d|x) and p(w|d, x)?
 p(w, d|x): the joint probability of w and d given x, where x is given and w and d are unknown.
 p(w|d, x): the conditional probability of w given d and x, where both d and x are known.
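 These two quantities are tied together by Bayes’ rule; writing it out (a standard identity, not the book’s exact notation):

```latex
% Bayes' rule linking the joint and conditional densities:
p(w \mid d, x) \;=\; \frac{p(w, d \mid x)}{p(d \mid x)}
\;\propto\; \underbrace{p(d \mid w, x)}_{\text{likelihood}}\;
            \underbrace{p(w)}_{\text{prior}}
```

 The MAP estimate maximizes the left-hand side; ML maximizes only the likelihood factor, which is why the prior is exactly what separates the two estimators.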
 [p. 101] Why have we made the series of these equations?
 [p. 101] How is equation 2.5 valid?
 [p. 102] What is meant by equations 2.10 and 2.11?

[p. 103] What is meant by identically distributed? A real-life example.
 The samples follow the same type of distribution.

[p. 103] What is meant by a single-shot realization?
 An instantiation value.
 [p. 103] Why does the expectational error have zero mean? (Assumption #2)

[p. 104] Equation 2.19: why is there no division by the number of elements?
 Because the data are already normalized.

[p. 107] Why did the ½ appear?
 To simplify the calculus: the ½ cancels the factor of 2 produced when differentiating the squared error, without changing the minimizing w.
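 Concretely (a standard calculation, not specific to the book’s notation):

```latex
% The 1/2 cancels against the exponent when differentiating:
\mathcal{E}(w) = \tfrac{1}{2} \sum_{i=1}^{N} \bigl(d_i - w^{\mathsf T} x_i\bigr)^2
\quad\Rightarrow\quad
\frac{\partial \mathcal{E}}{\partial w}
  = -\sum_{i=1}^{N} \bigl(d_i - w^{\mathsf T} x_i\bigr)\, x_i
```

 Scaling the cost by a constant does not move its minimizer, so the ½ is purely for convenience.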
 [p. 107] How is OLSE the same as MLE?
 Because they have the same cost function.

[p. 107] What is the problem with “having a distinct possibility of obtaining a solution that lacks uniqueness and stability”?
 The estimation is based only on the observed data.

[p. 107] How is “this solution identical to the MAP estimate solution”?!
 Because they have the same cost function.
 [p. 110] Is MDL similar to MLE?!
 [p. 110] What do we mean by “p should be encoded” (last paragraph)?
 [p. 111] What is Occam’s razor? (in short)
