Title: | Tests for Detecting Irregular Digit Patterns |
---|---|
Description: | Provides statistical tests and support functions for detecting irregular digit patterns in numerical data. The package includes tools for extracting digits at various locations in a number, tests for repeated values, and (Bayesian) tests of digit distributions. |
Authors: | Koen Derks [aut, cre] |
Maintainer: | Koen Derks <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.1.2 |
Built: | 2024-10-31 02:48:52 UTC |
Source: | https://github.com/koenderks/digittests |
digitTests is an R package providing tests for detecting irregular data patterns.
The package and its analyses are also implemented with a graphical user interface in the Audit module of JASP, a free and open-source statistical software program.
Koen Derks (maintainer, author) | <[email protected]> |
Please use the citation provided by R when citing this package.
A BibTeX entry is available from citation("digitTests").
Useful links:
The issue page to submit a bug report or feature request.
# Load the digitTests package
library(digitTests)

# Example 1: Benford's Law
data('sinoForest')
distr.test(sinoForest$value, check = 'first', reference = 'benford')

# Example 2: Repeated Values
data('sanitizer')
rv.test(sanitizer$value, check = 'lasttwo', method = 'af', B = 1000)
This function extracts and performs a Bayesian test of the distribution of (leading) digits in a vector against a reference distribution. By default, the distribution of leading digits is checked against Benford's law.
distr.btest(x, check = 'first', reference = 'benford', alpha = NULL, BF10 = TRUE, log = FALSE)
x |
a numeric vector. |
check |
location of the digits to analyze. Can be |
reference |
a character string giving the reference distribution for the digits, or a vector of probabilities for each digit. Can be |
alpha |
a numeric vector containing the prior parameters for the Dirichlet distribution on the digit categories. |
BF10 |
logical. Whether to compute the Bayes factor in favor of the alternative hypothesis (BF10) or the null hypothesis (BF01). |
log |
logical. Whether to return the logarithm of the Bayes factor. |
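Benford's law is defined as $p(d) = \log_{10}(1 + 1/d)$ for a leading digit $d$. The uniform distribution is defined as $p(d) = 1/k$, where $k$ is the number of possible digit values (for example, $k = 9$ for the first digits 1 to 9).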
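As a quick illustration of the two reference distributions for first digits, the following base R sketch computes both probability vectors. It does not use the package itself, and the variable names are chosen for this example only.

```r
# First-digit probabilities under Benford's law: p(d) = log10(1 + 1/d)
digits <- 1:9
benford <- log10(1 + 1 / digits)

# Uniform reference distribution over the same nine digits
uniform <- rep(1 / 9, 9)

# Show both distributions side by side, rounded for readability
round(data.frame(digit = digits, benford = benford, uniform = uniform), 3)
```

Either vector has the shape expected by a custom `reference` argument, as in the `rep(1/9, 9)` call in the examples.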
The Bayes factor $BF_{10}$ quantifies how much more likely the data are to be observed under $H_1$: the digits are not distributed according to the reference distribution than under $H_0$: the digits are distributed according to the reference distribution. Therefore, $BF_{10}$ can be interpreted as the relative support in the observed data for $H_1$ versus $H_0$. If $BF_{10}$ is 1, there is no preference for either $H_1$ or $H_0$. If $BF_{10}$ is larger than 1, $H_1$ is preferred. If $BF_{10}$ is between 0 and 1, $H_0$ is preferred. The Bayes factor is calculated using the Savage-Dickey density ratio.
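To make the Savage-Dickey idea concrete, here is a minimal base R sketch for digit counts with a Dirichlet prior. It divides the prior density at the reference distribution by the posterior density at the same point. The helpers `log_mbeta` and `log_bf10` are hypothetical names, the all-ones default prior is an assumption, and this is not necessarily how the package computes its result.

```r
# Log of the multivariate beta function B(a) = prod(gamma(a)) / gamma(sum(a))
log_mbeta <- function(a) sum(lgamma(a)) - lgamma(sum(a))

# Hypothetical helper: log BF10 via the Savage-Dickey density ratio.
# counts: observed digit counts; theta0: reference probabilities (H0);
# alpha: Dirichlet prior parameters (all ones is an assumption made here).
log_bf10 <- function(counts, theta0, alpha = rep(1, length(counts))) {
  log_prior_at_null <- sum((alpha - 1) * log(theta0)) - log_mbeta(alpha)
  log_post_at_null <- sum((alpha + counts - 1) * log(theta0)) -
    log_mbeta(alpha + counts)
  # BF10 = prior density at theta0 / posterior density at theta0
  log_prior_at_null - log_post_at_null
}

# Simulated first digits that follow Benford's law (illustrative data only)
benford <- log10(1 + 1 / (1:9))
set.seed(1)
counts <- tabulate(sample(1:9, 200, replace = TRUE, prob = benford), nbins = 9)

bf10 <- exp(log_bf10(counts, benford))  # support for H1 over H0
bf01 <- 1 / bf10                        # support for H0 over H1
```

Working on the log scale avoids numerical overflow for large counts, which is presumably why the function offers a `log` argument.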
An object of class dt.distr containing:
observed |
the observed counts. |
expected |
the expected counts under the null hypothesis. |
n |
the number of observations in x. |
statistic |
the value of the chi-squared test statistic. |
parameter |
the degrees of freedom of the approximate chi-squared distribution of the test statistic. |
p.value |
the p-value for the test. |
check |
checked digits. |
digits |
vector of digits. |
reference |
reference distribution |
data.name |
a character string giving the name(s) of the data. |
Koen Derks, [email protected]
Benford, F. (1938). The law of anomalous numbers. In Proceedings of the American Philosophical Society, 551-572.
set.seed(1)
x <- rnorm(100)

# Bayesian digit analysis against Benford's law
distr.btest(x, check = 'first', reference = 'benford')

# Bayesian digit analysis against Benford's law, custom prior
distr.btest(x, check = 'first', reference = 'benford', alpha = 9:1)

# Bayesian digit analysis against custom distribution
distr.btest(x, check = 'last', reference = rep(1/9, 9))
This function extracts and performs a test of the distribution of (leading) digits in a vector against a reference distribution. By default, the distribution of leading digits is checked against Benford's law.
distr.test(x, check = 'first', reference = 'benford')
x |
a numeric vector. |
check |
location of the digits to analyze. Can be |
reference |
a character string giving the reference distribution for the digits, or a vector of probabilities for each digit. Can be |
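Benford's law is defined as $p(d) = \log_{10}(1 + 1/d)$ for a leading digit $d$. The uniform distribution is defined as $p(d) = 1/k$, where $k$ is the number of possible digit values (for example, $k = 9$ for the first digits 1 to 9).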
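The returned statistic, degrees of freedom and p-value correspond to a chi-squared goodness-of-fit test. A minimal base R sketch of that calculation follows, using simulated first digits rather than the package's own extraction. The data and variable names are illustrative assumptions.

```r
# Reference probabilities for first digits under Benford's law
benford <- log10(1 + 1 / (1:9))

# Simulated first digits (illustrative data, drawn from Benford's law)
set.seed(1)
first_digits <- sample(1:9, 200, replace = TRUE, prob = benford)

# Observed and expected counts per digit
observed <- tabulate(first_digits, nbins = 9)
expected <- benford * sum(observed)

# Pearson chi-squared statistic with 9 - 1 = 8 degrees of freedom
statistic <- sum((observed - expected)^2 / expected)
parameter <- length(observed) - 1
p_value <- pchisq(statistic, df = parameter, lower.tail = FALSE)
```

The same numbers can be obtained from `chisq.test(observed, p = benford)` in base R.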
An object of class dt.distr containing:
observed |
the observed counts. |
expected |
the expected counts under the null hypothesis. |
n |
the number of observations in x. |
statistic |
the value of the chi-squared test statistic. |
parameter |
the degrees of freedom of the approximate chi-squared distribution of the test statistic. |
p.value |
the p-value for the test. |
check |
checked digits. |
digits |
vector of digits. |
reference |
reference distribution |
data.name |
a character string giving the name(s) of the data. |
Koen Derks, [email protected]
Benford, F. (1938). The law of anomalous numbers. In Proceedings of the American Philosophical Society, 551-572.
set.seed(1)
x <- rnorm(100)

# Digit analysis against Benford's law
distr.test(x, check = 'first', reference = 'benford')

# Digit analysis against custom distribution
distr.test(x, check = 'last', reference = rep(1/9, 9))
Methods defined for objects returned from the distr.test, distr.btest, and rv.test functions.
## S3 method for class 'dt.distr'
print(x, digits = getOption("digits"), ...)

## S3 method for class 'dt.rv'
print(x, digits = getOption("digits"), ...)

## S3 method for class 'dt.distr'
plot(x, ...)

## S3 method for class 'dt.rv'
plot(x, ...)
x |
an object of class dt.distr or dt.rv. |
digits |
the number of digits to round to. |
... |
further arguments, currently ignored. |
The print methods simply print and return nothing.
This function extracts the first (and optionally second) or last digits in a vector.
extract_digits(x, check = 'first', include.zero = FALSE)
x |
a numeric vector. |
check |
location of the digits to extract. Can be |
include.zero |
logical. Whether to include the digit zero in the output. |
A vector of first (and optionally second) or last digits.
Koen Derks, [email protected]
set.seed(1)
x <- rnorm(100)

# Extract first digits (without zero)
extract_digits(x, check = 'first')

# Extract last digits (including zero)
extract_digits(x, check = 'last', include.zero = TRUE)
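For readers who want to see what leading-digit extraction involves, the base R sketch below reads the first one or two significant digits from a number's scientific notation. The helper `leading_digits` is a hypothetical name for this example and is not the package's implementation. In particular, how it handles zeros and single-digit numbers is an assumption.

```r
# Hypothetical helper: leading significant digit(s) read from scientific notation
leading_digits <- function(x, two = FALSE) {
  # Drop zeros and non-finite values, which have no leading digit; ignore the sign
  x <- abs(x[is.finite(x) & x != 0])
  # Scientific notation puts the first digit at position 1, the second at position 3
  s <- formatC(x, format = "e", digits = 15)
  d <- if (two) paste0(substr(s, 1, 1), substr(s, 3, 3)) else substr(s, 1, 1)
  as.integer(d)
}

leading_digits(c(123.4, 0.0315, -8700))              # first digits
leading_digits(c(123.4, 0.0315, -8700), two = TRUE)  # first two digits
```

Reading digits from a formatted string avoids the floating-point pitfalls of dividing by powers of ten.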
This function analyzes the frequency with which values are repeated within a set of numbers. Unlike Benford's law and its generalizations, this approach examines the entire number at once, not only the first or last digit.
rv.test(x, check = 'last', method = 'af', B = 2000)
x |
a numeric vector of values from which the digits should be analyzed. |
check |
which digits to shuffle during the procedure. Can be |
method |
which property of the data is calculated. Defaults to 'af' (average frequency). |
B |
how many samples to use in the bootstrapping procedure. |
To determine whether the data show an excessive amount of bunching, the null hypothesis that x does not contain an unexpected amount of repeated values is tested against the alternative hypothesis that x has more repeated values than expected. The statistic can either be the average frequency ($AF = \sum_i f_i^2 / \sum_i f_i$, where $f_i$ is the frequency of the $i$-th distinct value) of the data or the entropy ($E = -\sum_i p_i \log(p_i)$, with $p_i = f_i / n$) of the data. Average frequency and entropy are highly correlated, but the average frequency is often more interpretable. For example, an average frequency of 2.5 means that, on average, your observations contain a value that appears 2.5 times in the data set.

To quantify what is expected, this test requires the assumption that the integer portions of the numbers are not associated with their decimal portions.
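Both statistics can be computed by hand from a frequency table. The base R sketch below uses a small set of made-up values, with $f_i$ taken as the frequency of each distinct value. The variable names are chosen for this example.

```r
# Made-up values for illustration: 1.25 appears 3 times, 3.10 twice, 7.80 once
x <- c(1.25, 1.25, 3.10, 1.25, 3.10, 7.80)

f <- as.numeric(table(x))  # frequency of each distinct value
n <- sum(f)                # number of observations

# Average frequency: sum of squared frequencies over the number of observations
af <- sum(f^2) / sum(f)

# Entropy of the relative frequencies
p <- f / n
entropy <- -sum(p * log(p))

# Equivalent view of AF: average, over observations, of how often
# each observation's own value occurs in the data
af_alt <- mean(ave(x, x, FUN = length))
```

The second form of the average frequency matches the interpretation given above: each observation contributes the number of times its value appears.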
An object of class dt.rv containing:
x |
input data. |
frequencies |
frequencies of observations in x. |
samples |
vector of simulated samples. |
integers |
counts for extracted integers. |
decimals |
counts for extracted decimals. |
n |
the number of observations in x. |
statistic |
the value of the average frequency or entropy statistic. |
p.value |
the p-value for the test. |
cor.test |
correlation test for the integer portions of the numbers versus the decimal portions of the numbers. |
method |
method used. |
check |
checked digits. |
data.name |
a character string giving the name(s) of the data. |
Koen Derks, [email protected]
Simonsohn, U. (2019, May 25). Number-Bunching: A New Tool for Forensic Data Analysis. Retrieved from https://datacolada.org/77.
set.seed(1)
x <- rnorm(50)

# Repeated values analysis shuffling last digit
rv.test(x, check = 'last', method = 'af', B = 2000)
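The resampling idea behind the test can be sketched in base R, under stated assumptions: the data are non-negative and recorded to two decimals, the whole decimal portion is shuffled across the integer portions, and the p-value is the share of shuffled data sets whose average frequency is at least the observed one. The helpers `avg_freq` and `bunching_pvalue` are hypothetical names, and the package's own procedure (including what `check` shuffles) may differ.

```r
# Hypothetical helper: average frequency of the values in x
avg_freq <- function(x) {
  f <- as.numeric(table(x))
  sum(f^2) / sum(f)
}

# Hypothetical helper: resampling p-value for excessive repeated values.
# Assumes non-negative data recorded to two decimals.
bunching_pvalue <- function(x, B = 2000) {
  x <- round(x, 2)
  int_part <- floor(x)
  dec_part <- round(x - int_part, 2)
  observed <- avg_freq(x)
  # Shuffle decimal portions across integer portions B times
  simulated <- replicate(B, avg_freq(round(int_part + sample(dec_part), 2)))
  # Share of shuffled data sets at least as bunched as the observed data
  mean(simulated >= observed)
}

# Simulated data with deliberate bunching: one value repeated 20 times
set.seed(1)
x <- c(rep(12.34, 20), round(runif(30, 0, 100), 2))
p_value <- bunching_pvalue(x, B = 500)
```

Shuffling only makes sense if integer and decimal portions are unrelated in clean data, which is the assumption the correlation test in the returned object is meant to check.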
Data from a study on factory workers' use of hand sanitizer. Sanitizer use was measured to a 100th of a gram.
data(sanitizer)
A data frame with 1600 rows and 1 variable.
[Retracted] Li, M., Sun, Y., & Chen, H. (2019). The decoy effect as a nudge: Boosting hand hygiene with a worse option. Psychological Science, 30, 139–149.
data(sanitizer)
Financial statement numbers from Sino Forest Corporation's 2010 report.
data(sinoForest)
A data frame with 772 rows and 1 variable.
Nigrini, M. J. (2012). Benford's Law: Applications for Forensic Accounting, Auditing, and Fraud Detection. Wiley and Sons: New Jersey.
data(sinoForest)