
Comments (23)

mattansb commented on July 1, 2024

Okay, I cooked up this example that shows the lack of agreement between univariate, multivariable, and model-based methods. I'm sure if I played with this longer, I could make them overlap less. But maybe this is enough?

Not sure how to build a legend here that isn't confusing 🤷‍♂️

library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.3.2
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)
library(performance)
#> Warning: package 'performance' was built under R version 4.3.2

update_geom_defaults("point", aes(size = 3))

theme_set(
  theme_bw()  
)

# Data --------------------------------------------------------------------

data <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                         11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
                         21, 22, 23, 24, 25, 26, 27, 28, 29, 60), 
                   y = c(-2, 0, 2, 6, 5, 7, 30, 8, 9, 10,
                         11, 13, 14, 13, 15, 16, 17, 17, 19, 18, 
                         21, 23, 21, 24, 24, 26, 27, 30, 27, 61))


# Outlier detection -------------------------------------------------------

# Univariate methods
data$univ_outlier <- check_outliers(data, method = c("zscore"))

# Multivariate methods
data$multiv_outlier <- check_outliers(data[,1:2], method = c("mahalanobis"))

# Model-specific methods
model <- lm(y ~ x, data = data)

data$model_outlier <- check_outliers(model, method = "cook")

# Plot ---------------------------------------

data
#>     x  y univ_outlier multiv_outlier model_outlier
#> 1   1 -2        FALSE          FALSE         FALSE
#> 2   2  0        FALSE          FALSE         FALSE
#> 3   3  2        FALSE          FALSE         FALSE
#> 4   4  6        FALSE          FALSE         FALSE
#> 5   5  5        FALSE          FALSE         FALSE
#> 6   6  7        FALSE          FALSE         FALSE
#> 7   7 30        FALSE           TRUE          TRUE
#> 8   8  8        FALSE          FALSE         FALSE
#> 9   9  9        FALSE          FALSE         FALSE
#> 10 10 10        FALSE          FALSE         FALSE
#> 11 11 11        FALSE          FALSE         FALSE
#> 12 12 13        FALSE          FALSE         FALSE
#> 13 13 14        FALSE          FALSE         FALSE
#> 14 14 13        FALSE          FALSE         FALSE
#> 15 15 15        FALSE          FALSE         FALSE
#> 16 16 16        FALSE          FALSE         FALSE
#> 17 17 17        FALSE          FALSE         FALSE
#> 18 18 17        FALSE          FALSE         FALSE
#> 19 19 19        FALSE          FALSE         FALSE
#> 20 20 18        FALSE          FALSE         FALSE
#> 21 21 21        FALSE          FALSE         FALSE
#> 22 22 23        FALSE          FALSE         FALSE
#> 23 23 21        FALSE          FALSE         FALSE
#> 24 24 24        FALSE          FALSE         FALSE
#> 25 25 24        FALSE          FALSE         FALSE
#> 26 26 26        FALSE          FALSE         FALSE
#> 27 27 27        FALSE          FALSE         FALSE
#> 28 28 30        FALSE          FALSE         FALSE
#> 29 29 27        FALSE          FALSE         FALSE
#> 30 60 61         TRUE           TRUE         FALSE

data <- data |> 
  mutate(
    any_outlier = interaction(model_outlier, multiv_outlier, univ_outlier)
  )

b <- coef(model)

ol_name <- "Outlier Type"
ol_labels <- c("(Not)", "Multivariable and Model", "Multivariable and Univariable")

ggplot(data, aes(x, y)) + 
  geom_abline(intercept = b[1], slope = b[2],
              linewidth = 1, color = "royalblue") + 
  geom_point(aes(color = any_outlier, shape = any_outlier)) + 
  scale_shape(ol_name, labels = ol_labels) + 
  scale_color_discrete(ol_name, labels = ol_labels)

Created on 2023-12-19 with reprex v2.0.2

mattansb commented on July 1, 2024

Yet, despite the existence of established recommendations and guidelines, many researchers still do not treat outliers in a consistent manner, or do so using inappropriate strategies

This doesn't mean that any method is wrong, per se. I might be biased, but (as I made clear in my first pass on the draft) all these methods should be treated as merely suggestive since, objectively, there generally isn't a ground truth (which is also why I personally prefer non-automated, knowledge-based outlier inspection/rejection).

Thus, different methods can be judged by their usefulness for doing ... something:

  • Univariate methods are often good at detecting non-representative values or data-coding errors.
  • Multivariate methods are also good at detecting non-representative values in a joint-distribution sense.
  • Model-based methods are good at detecting values that might unduly bias model inference.

But of course, the data is the data: with truly heavy-tailed distributions, especially in small samples, all of these methods can end up falsely flagging values that are actually representative (when flagging non-representative values is, IMO, the whole point of outlier detection).

Here is a random sample from a true DGP of $y \sim Cauchy(x, 1)$ in which all methods flag the same observation.

Code and plot of the example:
library(performance)
#> Warning: package 'performance' was built under R version 4.3.2
library(ggplot2)

update_geom_defaults("point", aes(size = 3))

theme_set(
  theme_bw()  
)

set.seed(42)
data <- tibble::tibble(
  x = rnorm(30),
  y = x + rcauchy(30)
)

# Outlier detection -------------------------------------------------------

# Univariate methods
data$univ_outlier <- check_outliers(data, method = c("zscore"))

# Multivariate methods
data$multiv_outlier <- check_outliers(data[,1:2], method = c("mahalanobis"))

# Model-specific methods
model <- lm(y ~ x, data = data)

data$model_outlier <- check_outliers(model, method = "cook")

data <- data |> 
  dplyr::mutate(
    any_outlier = interaction(model_outlier, multiv_outlier, univ_outlier)
  )


b <- coef(model)

ol_name <- "Outlier Type"
ol_labels <- c("(Not)", "Multivariable and Univariable and Model")

ggplot(data, aes(x, y)) + 
  geom_abline(intercept = b[1], slope = b[2],
              linewidth = 1, color = "royalblue") + 
  geom_point(aes(color = any_outlier, shape = any_outlier)) + 
  scale_shape(ol_name, labels = ol_labels) + 
  scale_color_discrete(ol_name, labels = ol_labels)

Created on 2023-12-20 with reprex v2.0.2

So maybe we can have a paragraph about this general idea (the points above), that applying outlier detection methods automatically, without thinking about their usefulness and what they're designed for, is the bad practice. We can then add my figure or your figure to illustrate the point. I think this will also correspond well with the first paragraph of the "Handling Outliers" section.

WDYT?

DominiqueMakowski commented on July 1, 2024

I'm quite swamped right now, but I can look into finding a better example than the questionnaire one. For the Bayesian question, I'm not sure what the reviewer is talking about; I need to read this Ciccione (2023) first. I'll add that to my to-do list.

DominiqueMakowski commented on July 1, 2024

Maybe it is the "aggregate" term that is confusing. Maybe we could rename it to something like "Highest deviation per participant", because it's not really aggregating but rather showing the most extreme value.

rempsyc commented on July 1, 2024

Ok, congrats all, we've managed to address almost all issues raised by the reviewers 🥳 The only things left are the two points assigned to Dom 😛 We'll be able to resubmit as soon as Dom gets to them.

rempsyc commented on July 1, 2024

Just sent you the email with the Google Doc link again ;)

DominiqueMakowski commented on July 1, 2024

I wrote something for the second issue, but the first one might require adding a more general paragraph on regularization, if I'm understanding correctly (cf. my comment in the answers Google Doc).

rempsyc commented on July 1, 2024

Congrats team, we've addressed all points 😙 (thanks Dom for this last sprint!). @strengejacke, would you like to review the response to reviewers? With your blessing (and perhaps a review of the paper as well), I can then submit on our behalf 🤓

rempsyc commented on July 1, 2024

As you will notice from the email transcription above, we cannot resubmit as a LaTeX file, and will indeed have to move to a word-processor file, marking changes in a blue font rather than with tracked changes. I will send the link to the Google Doc by email, but we can decide to communicate here if desired.

rempsyc commented on July 1, 2024

@bwiernik, our footnote 5 on the Cook method's default threshold reads:

Our default threshold for the Cook method is defined by stats::qf(0.5, ncol(x), nrow(x) - ncol(x)), which again is an approximation of the critical value for p < .001 consistent with the thresholds of our other methods.

Reviewer 2 writes,

Footnote 5: please explain why you use here the value 0.5, this is not clear to me.

I believe you were the one to suggest this threshold. What would you suggest adding to the footnote to answer the reviewer's concern?
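
For reference, here is roughly what that quantity works out to for a small model like the ones in our examples (just a sketch, assuming ncol(x) = 2, i.e. intercept plus one predictor, and nrow(x) = 30). qf(0.5, df1, df2) is simply the median of an F(df1, df2) distribution:

# Median of the F(p, n - p) distribution, i.e. the default Cook threshold
# for an intercept + slope model fitted to 30 observations
n <- 30
p <- 2
stats::qf(0.5, p, n - p)  # ~ 0.71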

rempsyc commented on July 1, 2024

@strengejacke and @IndrajeetPatil, is there anything from the checklist you would like to tackle/get assigned?

rempsyc commented on July 1, 2024

@DominiqueMakowski, Reviewer 2 writes:

I do not find the random questionnaire example convincing, please look out for a better example in section 2.2.

We currently have:

However, in many scenarios, variables of a data set are not independent, and an abnormal observation will impact multiple dimensions. For instance, a participant giving random answers to a questionnaire. In this case, computing the z score for each of the questions might not lead to satisfactory results. Instead, one might want to look at these variables together.

One common approach for this is to compute multivariate distance metrics such as the Mahalanobis distance.
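
(In case it helps when reworking that passage, here is a minimal base-R sketch of the idea, using made-up data where one respondent answers "at random"; purely hypothetical, not the paper's example.)

# 30 respondents answer two strongly correlated items; a 31st answers at random.
# Each of their two answers is unremarkable on its own (|z| ~ 1.5), but the pair
# contradicts the correlation structure, so the Mahalanobis distance flags it.
set.seed(123)
q1 <- rnorm(30)
q2 <- q1 + rnorm(30, sd = 0.3)
answers <- rbind(cbind(q1, q2), c(1.5, -1.5))
md <- mahalanobis(answers, center = colMeans(answers), cov = cov(answers))
which.max(md)                    # the 31st ("random") respondent
md[31] > qchisq(0.999, df = 2)   # TRUE with these simulated values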

Looking back in the commit history, you were the one to add this example, so I am assigning this point to you.

rempsyc commented on July 1, 2024

@mattansb, Reviewer 2 writes,

Lines 97-99: What do you mean by t-tests being multivariate? If I consider a one-sample t-test, what is not univariate there? Also, I find the word multivariable weird, should it not be multivariate?

We have:

However, univariate methods can give false positives since t tests and correlations, ultimately, are also models/multivariable statistics. They are in this sense more limited, but we show them nonetheless for educational purposes.

This was based on an early comment from you:

<!-- MSB: t-tests and correlations are model/multivariable statistics, so univariate outlier methods might give false-positives... -->

So I am assigning this point to you.
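
(A side note that might help with the rewording: the "correlations are models" part is easy to demonstrate, since the t statistic of a Pearson correlation test is exactly the t statistic of the slope in the corresponding simple regression. A quick sketch with simulated data:)

# Simulated data, only to illustrate that a correlation "is" a two-variable model
set.seed(1)
x <- rnorm(20)
y <- x + rnorm(20)
unname(cor.test(x, y)$statistic)                  # t statistic of the correlation
summary(lm(y ~ x))$coefficients["x", "t value"]   # identical t statistic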

mattansb commented on July 1, 2024

First, I would invite the authors to extend a little bit their introduction in order to underline the problematic ways researchers currently deal with outliers. For example, the authors could briefly introduce a "made-up" or real example of a dataset for which different types of outliers are identified according to different methods and/or the different possibilities in which they could be treated.

This is a great idea. Does anyone have a (raw/uncleaned) cross-sectional dataset they're willing to share? We can build this up and use this also in the examples in the check_outliers() docs.

rempsyc commented on July 1, 2024

Does anyone have a (raw/uncleaned) cross-sectional dataset they're willing to share? We can build this up and use this also in the examples in the check_outliers() docs.

I have a couple of open raw data sets on OSF, but most are experimental rather than cross-sectional, so I'm not sure they would be suitable for what you had in mind. We could take one of those if there are no better suggestions...

  1. Data set 1 (experimental) (paper)
  2. Data set 2 (experimental) (paper)
  3. Data set 3 (experimental) (paper)
  4. Data sets 4-5 and 6 (experimental) (paper)
  5. Data set 7 (cross-sectional)
  6. Data set 8 (cross-sectional)

strengejacke commented on July 1, 2024

Do we already have a response letter document?

mattansb commented on July 1, 2024

I actually see now that my example is very similar to Figure 4 in the paper. @rempsyc perhaps we can just use that example (or some variation of it)? I can't actually find the code...

rempsyc commented on July 1, 2024

Do we already have a response letter document?

We do now! I just sent it by email :)

rempsyc commented on July 1, 2024

I actually see now that my example is very similar to Figure 4 in the paper. @rempsyc perhaps we can just use that example (or some variation of it)? I can't actually find the code...

The code is actually just above Figure 4, on the previous page (in the paper and the Google Doc), and it is just 4 lines of code. Because our example was about height and weight, I used a base R dataset that had precisely those variables and just added artificial outliers. That said, although your code is longer, your figure is prettier because of the legend and geom shapes.

One issue I have with this reviewer's comment is that, as you point out, we already do this comparison in the relevant section (Cook’s Distance vs. MCD), after explaining the methods. I feel like going into an extensive method comparison at the very beginning before having introduced the methods would be a bit out of order.

I guess he just wants an example of a clearly wrong but common approach to outlier detection. I think it would be mostly to support our assertion that researchers treat outliers with incorrect strategies:

Yet, despite the existence of established recommendations and guidelines, many researchers still do not treat outliers in a consistent manner, or do so using inappropriate strategies

So we could give an example of a researcher who uses the common ±3 SD rule, and show how it identifies an outlier when it shouldn't and misses an actual outlier.
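
Something like this, maybe (made-up reaction-time data, and only a sketch of the "missed an actual outlier" half; with n = 10, a single extreme value can never exceed |z| = (n - 1)/sqrt(n) ≈ 2.85, because it inflates the very SD used to compute its own z score):

rt <- c(510, 490, 505, 520, 480, 495, 515, 500, 485, 2000)
z_classic <- abs(rt - mean(rt)) / sd(rt)
z_robust  <- abs(rt - median(rt)) / mad(rt)
any(z_classic > 3)  # FALSE: the +/- 3 SD rule misses the obvious outlier
any(z_robust > 3)   # TRUE: a MAD-based robust z flags it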

But how much overlap should there be with the height and weight example? Should we swap their places? Should we only use code without a figure? If we do swap them and include the figure, perhaps in the Cook’s Distance vs. MCD section we could simply refer back to the example from the intro? I started a short paragraph draft in the paper to get us thinking.

rempsyc commented on July 1, 2024

Woooaw, @mattansb your new Figure 1 in the paper is amazing!!!! Should be in a textbook! But this outlet is good too ;)

The caption is long but very good I think... It is quite detailed for something coming in the third paragraph of the paper (with all the thresholds etc.), but at the same time I think it sets up the rest of the paper, and this is exactly what Reviewer 1 asked for.

I think I first wrote the paper with the Leys/Lakens papers in mind, which have strong titles like "Do not use standard deviation around the mean, use absolute deviation around the median" and which include statements such as (in the abstract) "this method is problematic."

Now, we might decide to tone down the paper to clarify that no method is wrong per se, and instead invite researchers to be more mindful of the method they select.

So maybe we can have a paragraph about this general idea (the points above), that applying outlier detection methods automatically, without thinking about their usefulness and what they're designed for, is the bad practice. We can then add my figure or your figure to illustrate the point. I think this will also correspond well with the first paragraph of the "Handling Outliers" section.

I thought we already kind of did this, but after rereading the paper, it seems we don't! I think it is important that you capture all (or most) of your thoughts/feelings about outliers in this paper since it might become a reference, so let's do it. If you want, we could make this its own section (you suggest placing it before the Handling Outliers section), and you could even include your Cauchy code example (if you find it useful). You will see that, for now, I've added a temporary section in the paper called "Are Some Methods “Wrong”?"; feel free to improve it :)

rempsyc commented on July 1, 2024

@DominiqueMakowski do you think you'll be able to tackle Reviewer 1's comment about Bayesian stats soon? I'm hoping to resubmit the paper by the end of January. Let me know your timeline and if you think this could be possible.

rempsyc commented on July 1, 2024

Reviewer 2 comments,

In Figure 1 I find it weird to see an aggregate score, please explain this better.

Here's my attempt to explain the aggregate score as seen on the figure (now Figure 2):

Note. The distance represents an aggregate score for variables mpg, cyl, disp, and hp. In this case, the aggregate score represents a given participant’s (1-34) highest robust z score among the tested variables. The resulting unique value (representing one of mpg, cyl, disp, or hp for that participant) is then rescaled to a range of 0 to 1 by dividing by the value of the participant with the highest score.
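
In code terms, it is roughly this (a sketch for our own reference, not necessarily the exact data or code behind the figure, and assuming a MAD-based robust z):

# Per-column robust z = (x - median) / MAD; the "aggregate score" is each row's
# largest absolute robust z, rescaled to 0-1 by dividing by the maximum.
d <- mtcars[, c("mpg", "cyl", "disp", "hp")]
rz <- abs(scale(d, center = apply(d, 2, median), scale = apply(d, 2, mad)))
aggregate_score <- apply(rz, 1, max)
aggregate_score <- aggregate_score / max(aggregate_score)
head(sort(aggregate_score, decreasing = TRUE), 3)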

DominiqueMakowski commented on July 1, 2024

Well done! Can you confirm where the latest version is so that I can take a stab at it?
