How does one find outliers and influential observations?
How about prediction intervals and confidence intervals?
https://www.sagepub.com/sites/default/files/upm-binaries/21121_Chapter_15.pdf [Textbook chapter]
https://icml.cc/2012/papers/80.pdf [ICML 2012: Fast approximation of matrix coherence and statistical leverage]
Firstly, we have to derive the hat matrix for GLMs. We find that:
H = W^{1/2} X (X^T W X)^{-1} X^T W^{1/2}
Also notice the sum of the leverage scores is equal to the number of parameters in the model: trace(H) = p.
[BE CAREFUL - NOT ALWAYS!!! E.g. in Ridge Regression the trace is the effective degrees of freedom, which is less than p.]
However
be careful! The TRUE hat matrix is NOT the formula above, but rather:
|
I've confirmed the norm between them is off by around p - rank(H).
However, since we only want the leverage scores, we use the cyclic property of the trace; the diagonal in either case is:
h_i = w_i x_i^T (X^T W X)^{-1} x_i
Furthermore, via the Cholesky Decomposition X^T W X = L L^T, we notice that:
h_i = w_i ||L^{-1} x_i||^2
All in all, this optimized routine computes both variance(beta) and leverage(X) optimally in around:
np^2 + (1/3)p^3 + (1/2)p^3 + p^2 + np^2 + np + n FLOPS,
or around:
(5/6)p^3 + (2n + 1)p^2 + np + n FLOPS,
and peak memory consumption of around:
(n/B)p + p^2 + n + p SPACE,
where B is the batch size.
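As a concrete illustration, here is a minimal NumPy/SciPy sketch of this routine; the function name and the assumption that the IRLS working weights w and dispersion phi are already available are mine:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def leverage_and_variance(X, w, phi=1.0):
    """Leverage scores h_i and Var(beta_hat) for a GLM via one Cholesky factorization.

    X   : (n, p) design matrix
    w   : (n,)   IRLS working weights
    phi : dispersion parameter
    """
    XtWX = X.T @ (w[:, None] * X)                 # p x p Gram matrix, ~ n p^2 FLOPs
    L = cholesky(XtWX, lower=True)                # X^T W X = L L^T, ~ p^3 / 3 FLOPs
    Z = solve_triangular(L, X.T, lower=True).T    # row i holds L^{-1} x_i, ~ n p^2 FLOPs
    h = w * np.einsum("ij,ij->i", Z, Z)           # h_i = w_i * x_i^T (X^T W X)^{-1} x_i
    Linv = solve_triangular(L, np.eye(L.shape[0]), lower=True)
    var_beta = phi * Linv.T @ Linv                # phi * (X^T W X)^{-1}
    return h, var_beta
```

The triangular solves avoid ever forming the explicit inverse or the full n x n hat matrix.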
Now, estimating the leverage scores (the diagonal of the hat matrix) is a pain. One would clearly be crazy to compute the full n x n hat matrix only to extract its diagonal.
Instead, sketching comes to the rescue again!
h~_i = ||w_i^{1/2} x_i^T R^{-1} S_2||^2, where S_1 (W^{1/2} X) = Q R and S_2 is a p x k matrix with entries drawn from N(0, 1/k).
Also notice the maximum possible leverage score is 1 [leverage scores are between 0 and 1], so the estimates can be clipped to this range.
Also, notice the 1/sqrt(k) scaling of the normal distribution sketch.
Now, the sketching matrix S_1 can be a count sketch matrix. S_2 should NOT be a count sketch matrix, since it is applied through the triangular solve, whose output is dense anyway.
Notice the error rate. That means in general, if the required sketch size approaches the number of observations, you would rather use the original matrix.
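Here is a rough sketch of such a sketched leverage estimator, in the spirit of the ICML 2012 paper linked above; the sketch sizes (4p buckets for S_1, k = 16 columns for S_2) are illustrative assumptions, not prescriptions:

```python
import numpy as np
from scipy.linalg import solve_triangular

def sketched_leverage(X, w, s1_size=None, s2_size=16, seed=0):
    """Approximate GLM leverage scores by sketching A = W^{1/2} X."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    A = np.sqrt(w)[:, None] * X
    m = s1_size or 4 * p                            # S1 sketch size (illustrative)

    # S1: CountSketch -- hash each row of A into one of m buckets with a random sign.
    buckets = rng.integers(0, m, size=n)
    signs = rng.choice([-1.0, 1.0], size=n)
    SA = np.zeros((m, p))
    np.add.at(SA, buckets, signs[:, None] * A)      # S1 A in O(nnz(A)) time

    # R from a QR of the small sketched matrix (p x p, upper triangular).
    R = np.linalg.qr(SA, mode="r")

    # S2: dense Gaussian sketch with entries N(0, 1/k) -- note the scaling.
    k = s2_size
    S2 = rng.standard_normal((p, k)) / np.sqrt(k)

    # Omega = A R^{-1} S2 via a triangular solve; leverages ~ squared row norms,
    # clipped to [0, 1] since true leverage scores live in that range.
    Omega = A @ solve_triangular(R, S2, lower=False)
    return np.minimum(np.einsum("ij,ij->i", Omega, Omega), 1.0)
```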
Pearson residuals are the per-observation components of the Pearson goodness-of-fit statistic, i.e. X^2 = sum_i r_i^2 where:
r_i = (y_i - mu_i) / sqrt(V(mu_i))
Where we have for the exponential family:
| Name | Range of y | Variance V(mu) | Pearson residual r_i |
| --- | --- | --- | --- |
| Poisson | 0, 1, 2, ... | mu | (y - mu) / sqrt(mu) |
| Quasi Poisson | 0, 1, 2, ... | phi * mu | (y - mu) / sqrt(phi * mu) |
| Gamma | y > 0 | mu^2 | (y - mu) / mu |
| Bernoulli | 0, 1 | mu (1 - mu) | (y - mu) / sqrt(mu (1 - mu)) |
| Inverse Gaussian | y > 0 | mu^3 | (y - mu) / mu^{3/2} |
| Negative Binomial | 0, 1, 2, ... | mu + mu^2 / k | (y - mu) / sqrt(mu + mu^2 / k) |
Standardized Pearson Residuals then correct for the leverage scores:
[These are also called internally studentized residuals]
r_i^{SP} = (y_i - mu_i) / sqrt(phi * V(mu_i) * (1 - h_i))
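A small sketch of the (standardized) Pearson residuals, assuming the variance function V and the leverages h are supplied by the caller:

```python
import numpy as np

def pearson_residuals(y, mu, V, phi=1.0, h=None):
    """Pearson residuals (y - mu) / sqrt(V(mu)); if leverages h are given,
    return the standardized (internally studentized) version instead."""
    r = (y - mu) / np.sqrt(V(mu))
    if h is None:
        return r
    return r / np.sqrt(phi * (1.0 - h))

# e.g. Poisson: V = lambda mu: mu;  Bernoulli: V = lambda mu: mu * (1.0 - mu)
```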
For practical models:
[Table: ranges and standardized Pearson residuals for the Poisson, Quasi Poisson, Linear, Log Linear, ReLU Linear, CLogLog Logistic and Logistic models.]
The PRESS residuals showcase the difference between the true value and the prediction we would get if the observation were excluded from the model.
These can be proved by using the Sherman-Morrison inverse identity:
e_{(i)} = y_i - x_i^T beta_{(i)} = e_i / (1 - h_i)
However, in outlier analysis one generally considers the externally studentized residuals, where the variance is re-estimated with each observation deleted in turn.
The degrees of freedom are n - p - 1; minus 1 because one datapoint is deleted each time:
s_{(i)}^2 = (1 / (n - p - 1)) * sum_{j != i} (y_j - x_j^T beta_{(i)})^2
Now to simplify this, we utilize the PRESS residuals:
[Gray part skips steps which are a bit tedious to showcase 😊]
(n - p - 1) s_{(i)}^2 = (n - p) s^2 - e_i^2 / (1 - h_i)
Then the externally studentized residuals utilise this new excluded variance:
t_i = e_i / (s_{(i)} sqrt(1 - h_i))
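A minimal sketch of the externally studentized residuals using the deleted-variance shortcut above, so no refitting is needed (the name and the assumption that the OLS residuals e and leverages h are precomputed are mine):

```python
import numpy as np

def externally_studentized(e, h, n, p):
    """Externally studentized residuals for OLS without n refits,
    via the deleted-variance identity above (e = residuals, h = leverages)."""
    s2 = np.sum(e**2) / (n - p)                                # usual MSE
    s2_del = ((n - p) * s2 - e**2 / (1.0 - h)) / (n - p - 1)   # s_(i)^2
    return e / np.sqrt(s2_del * (1.0 - h))
```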
However, in GLMs we cannot resort to this OLS estimate.
Rather, we first remember the Deviance Residuals:
r_i^D = sign(y_i - mu_i) sqrt(d_i), where d_i is observation i's contribution to the deviance.
The standardized deviance residuals are then:
r_i^{SD} = r_i^D / sqrt(phi (1 - h_i))
Then, we get the Williams approximation to externally studentized residuals:
r_i^* = sign(y_i - mu_i) sqrt((1 - h_i) (r_i^{SD})^2 + h_i (r_i^{SP})^2)
Which we find that:
|
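A tiny sketch of the Williams approximation above, assuming the standardized Pearson and deviance residuals have already been computed:

```python
import numpy as np

def williams_residuals(r_pearson_std, r_deviance_std, h, y, mu):
    """Williams' approximation to externally studentized residuals in a GLM,
    combining the standardized deviance and standardized Pearson residuals."""
    return np.sign(y - mu) * np.sqrt(
        (1.0 - h) * r_deviance_std**2 + h * r_pearson_std**2
    )
```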
For practical models:
[Table: Williams approximations for the practical models listed above (Poisson, Quasi Poisson, Linear, Log Linear, ReLU Linear, CLogLog Logistic and Logistic).]
Popular methods to detect influential observations or possible outliers include Cook's Distance and DFFITS.
D_i = (r_i^2 / p) * (h_i / (1 - h_i)), where r_i is the internally studentized residual.
Cook's Distance is essentially the total change in the fitted values, scaled by the MSE, if an observation is removed.
It shows, in standard-deviation units, how much the fit changes when that observation is deleted.
D_i = (beta_hat - beta_hat_{(i)})^T X^T X (beta_hat - beta_hat_{(i)}) / (p s^2) = sum_j (yhat_j - yhat_{j(i)})^2 / (p s^2)
First notice by using the Sherman-Morrison inverse formula that:
beta_hat - beta_hat_{(i)} = (X^T X)^{-1} x_i e_i / (1 - h_i)
We then expand:
D_i = e_i^2 h_i / (p s^2 (1 - h_i)^2)
Generally, a cutoff of 4/n is reasonable. Likewise, one can use the 50th percentile of the F(p, n - p) distribution, though for large n this converges to roughly 1.
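A short sketch of Cook's Distance with both cutoffs discussed above (the 4/n rule of thumb and the F(p, n - p) median); the names are mine:

```python
import numpy as np
from scipy.stats import f

def cooks_distance(e, h, p, n):
    """Cook's Distance from OLS residuals e and leverages h, with both cutoffs."""
    s2 = np.sum(e**2) / (n - p)
    D = (e**2 * h) / (p * s2 * (1.0 - h) ** 2)
    flag_simple = D > 4.0 / n                   # rule-of-thumb cutoff
    flag_f = D > f.ppf(0.5, p, n - p)           # 50th percentile of F(p, n - p)
    return D, flag_simple, flag_f
```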
On the other hand, DFFITS is a similar measure to Cook's Distance.
DFFITS instead uses the externally studentized residual, which some prefer over Cook's Distance.
DFFITS_i = t_i sqrt(h_i / (1 - h_i))
In fact, DFFITS is closely related to Cook's Distance:
D_i = DFFITS_i^2 * s_{(i)}^2 / (p s^2) ≈ DFFITS_i^2 / p
The cutoff value for DFFITS is:
|DFFITS_i| > 2 sqrt(p / n)
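And a matching sketch for DFFITS with its 2 sqrt(p/n) cutoff:

```python
import numpy as np

def dffits(t_ext, h, p, n):
    """DFFITS from externally studentized residuals t_ext and leverages h."""
    vals = t_ext * np.sqrt(h / (1.0 - h))
    return vals, np.abs(vals) > 2.0 * np.sqrt(p / n)   # cutoff 2 * sqrt(p / n)
```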
Finally to plot and showcase the possible outliers, one can use
for both axes:
|
However, in practice the identification of outliers is more complicated than using DFFITS, since DFFITS can suggest too many outlier candidates. Rather, we use the Studentized Residuals vs Leverage plot.
The cutoff is a studentized residual of 3 (ie the residual is 3 standard deviations away from the mean).
Likewise the leverage must be at least 2 times the average leverage (p/n):
Flag observation i if |t_i| >= 3 and h_i >= 2p/n.
However, we also need to capture wildly off results. So, we also include anything above 5 standard deviations:
Flag observation i if |t_i| >= 5, regardless of leverage.
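Putting the two rules together, a small sketch of the flagging logic (thresholds exactly as stated above; the helper name is mine):

```python
import numpy as np

def flag_outliers(t_ext, h, p, n):
    """Flag points using the studentized-residual vs leverage rules above."""
    high_residual = np.abs(t_ext) >= 3.0        # at least 3 standard deviations
    high_leverage = h >= 2.0 * p / n            # at least 2x the average leverage p/n
    extreme = np.abs(t_ext) >= 5.0              # wildly off, flagged unconditionally
    return (high_residual & high_leverage) | extreme
```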
Confidence and Prediction Intervals for both new data and old data are super important.
The trick is to utilise the leverage scores. Remember that:
Var(beta_hat) = phi (X^T W X)^{-1}
This essentially means that:
Var(x_i^T beta_hat) = phi x_i^T (X^T W X)^{-1} x_i
Notice how this is NOT the leverage score!!!!!! (It is missing the weight factor: h_i = w_i x_i^T (X^T W X)^{-1} x_i.)
And then for GLMs, via the inverse link function (the activation function):
mu_hat_i = g^{-1}(x_i^T beta_hat)
Then, we have the upper (97.5%) and lower (2.5%) confidence intervals:
g^{-1}( x_i^T beta_hat ± t_{0.975, n-p} sqrt(phi x_i^T (X^T W X)^{-1} x_i) )
Be careful! It's the upper (97.5%) and lower (2.5%) confidence intervals! Not 95%.
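A sketch of this confidence-interval computation, assuming the Cholesky factor L of X^T W X from earlier is reused; the function name and arguments are my own:

```python
import numpy as np
from scipy.linalg import solve_triangular
from scipy.stats import norm, t as student_t

def glm_confidence_interval(X, L, beta, inv_link, phi=1.0, df=None, level=0.95):
    """Confidence interval for the GLM mean via the linear predictor.

    L is the lower Cholesky factor of X^T W X, so Var(eta_i) = phi * ||L^{-1} x_i||^2.
    Uses the normal critical value when the dispersion is fixed at 1,
    otherwise Student t with df degrees of freedom.
    """
    eta = X @ beta
    Z = solve_triangular(L, X.T, lower=True).T       # rows hold L^{-1} x_i
    se = np.sqrt(phi * np.einsum("ij,ij->i", Z, Z))  # SE of the linear predictor
    q = 1.0 - (1.0 - level) / 2.0                    # 0.975, i.e. the 97.5% point
    crit = norm.ppf(q) if phi == 1.0 else student_t.ppf(q, df)
    return inv_link(eta - crit * se), inv_link(eta + crit * se)

# e.g. for a Poisson log-link model: glm_confidence_interval(X, L, beta, np.exp)
```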
For new data points though, we have to estimate this leverage-like factor. We can do this via:
[Where A is the new data matrix, psi is the new predicted value and z is the new weight]
h_* = z a^T (X^T W X)^{-1} a, for each new row a of A.
Remember though that previously we performed the Cholesky Decomposition on the variance matrix, X^T W X = L L^T.
So, we can speed things up by noticing:
[Notice we exclude the new weights]
a^T (X^T W X)^{-1} a = ||L^{-1} a||^2, i.e. one triangular solve per new row.
For Prediction Intervals, we shall not prove why, but there is in fact a +1 factor inside the leverage square root:
y_hat_* ± t_{0.975, n-p} sqrt(phi (1 + h_*))
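And a matching sketch of the prediction interval with the +1 factor, again reusing the Cholesky factor and excluding the new weights (an identity-link style sketch, not the exact routine above):

```python
import numpy as np
from scipy.linalg import solve_triangular
from scipy.stats import norm, t as student_t

def prediction_interval(A_new, L, beta, phi, df=None, level=0.95):
    """Prediction interval for new rows A_new:
    eta_hat +/- crit * sqrt(phi * (1 + h_new)), with h_new = ||L^{-1} a_i||^2."""
    eta = A_new @ beta
    Z = solve_triangular(L, A_new.T, lower=True).T
    h_new = np.einsum("ij,ij->i", Z, Z)             # new-point leverage (no new weight)
    q = 1.0 - (1.0 - level) / 2.0
    crit = norm.ppf(q) if phi == 1.0 else student_t.ppf(q, df)
    half = crit * np.sqrt(phi * (1.0 + h_new))
    return eta - half, eta + half
```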
Notice if the dispersion parameter is 1 (Poisson, Bernoulli
models), then the normal distribution is
used instead of the Student T distribution for the critical
values.
Copyright Daniel Han 2024. Check out Unsloth!