dbreg – dbreg

Run a regression on a database backend

Description

Leverages the power of databases to run regressions on very large datasets, which may not fit into R’s memory. Various acceleration strategies allow for highly efficient computation, while robust standard errors are computed from sufficient statistics.

Usage

dbreg(
  fml,
  conn = NULL,
  table = NULL,
  data = NULL,
  path = NULL,
  vcov = c("iid", "hc1"),
  strategy = c("auto", "compress", "moments", "demean", "within", "mundlak"),
  compress_ratio = NULL,
  compress_nmax = 1e+06,
  cluster = NULL,
  ssc = c("full", "nested"),
  sql_only = FALSE,
  data_only = FALSE,
  drop_missings = TRUE,
  verbose = getOption("dbreg.verbose", FALSE),
  ...
)

Arguments

fml A formula representing the relation to be estimated. Fixed effects should be included after a pipe, e.g fml = y ~ x1 + x2 | fe1 + f2. Interaction terms are supported using standard R syntax (: for interactions, * for main effects plus interaction). Transformations and literals are not yet supported.

conn Database connection, e.g. created with dbConnect. Can be either persistent (disk-backed) or ephemeral (in-memory). If no connection is provided, then an ephemeral duckdb connection will be created automatically and closed before the function exits. Note that a persistent (disk-backed) database connection is required for larger-than-RAM datasets in order to take advantage of out-of-core functionality like streaming (where supported).

table, data, path

Mutually exclusive arguments for specifying the data table (object) to be queried. In order of precedence:

table: Character string giving the name of the data table in an existing (open) database connection.
data: R dataframe that can be copied over to conn as a temporary table for querying via the DuckDB query engine. Ignored if table is provided.
path: Character string giving a path to the data file(s) on disk, which will be read into conn. Internally, this string is passed to the FROM query statement, so could (should) include file globbing for Hive-partitioned datasets, e.g. “mydata/**/.*parquet”. For more precision, however, it is recommended to pass the desired database reader function as part of this string, e.g. “read_parquet(’mydata/**/*.parquet’)“ for DuckDB; note the use of single quotes. Ignored if either table or data is provided.

vcov Character string or formula denoting the desired type of variance- covariance correction / standard errors. Options are “iid” (default), “hc1” (heteroskedasticity-consistent), or a one-sided formula like ~cluster_var for cluster-robust standard errors. Note that “hc1” and clustered SEs require a second pass over the data unless strategy = “compress” to construct the residuals.

strategy Character string indicating the preferred acceleration strategy. The default “auto” will pick an optimal strategy based on internal heuristics. Users can also override with one of the following explicit strategies: “compress”, “demean” (alias: “within”), “mundlak”, or “moments”. See the Acceleration Strategies section below for details.

compress_ratio, compress_nmax

Numeric(s). Parameters that help to determine the acceleration strategy under the default “auto” option.

compress_ratio defines the compression ratio threshold, i.e. numeric in the range [0,1] defining the minimum acceptable compressed versus the original data size. Default value of NULL means that the threshold will be automatically determined based on some internal heuristic (e.g., 0.01 for models without fixed effects).
compress_nmax defines the maximum allowable size (in rows) of the compressed dataset that can be serialized into R. Pays heed to the idea that big data serialization can be costly (esp. for remote databases), even if we have achieved good compression on top of the original dataset. Default value is 1e6 (i.e., a million rows).

See the Acceleration Strategies section below for further details.

cluster Optional. Provides an alternative way to specify cluster-robust standard errors (i.e., instead of vcov = ~cluster_var). Either a one-sided formula (e.g., ~firm) or character string giving the variable name. Only single-variable clustering is currently supported.

ssc Character string controlling the small-sample correction for clustered standard errors. Options are “full” (default) or “nested”. With “full”, all parameters (including fixed effect dummies) are counted in K for the CR1 correction. With “nested”, fixed effects that are nested within the cluster variable are excluded from K, matching the default behavior of fixest::feols. Only applies to “compress” and “demean” strategies (Mundlak uses explicit group mean regressors, not FE dummies). This distinction only matters for small samples. For large datasets (dbreg’s target use case), the difference is negligible and hence we default to the simple “full” option.

sql_only Logical indicating whether only the underlying compression SQL query should be returned (i.e., no computation will be performed). Default is FALSE.

data_only Logical indicating whether only the compressed dataset should be returned (i.e., no regression is run). Default is FALSE.

drop_missings Logical indicating whether incomplete cases (i.e., rows where any of the dependent, independent or FE variables are missing) should be dropped. The default is TRUE, according with standard regression software. It is strongly recommended not to change this value unless you are absolutely sure that your data have no missings and you wish to skip some internal checks. (Even then, it probably isn’t worth it.)

verbose Logical. Print auto strategy and progress messages to the console? Defaults to FALSE. This can be overridden for a single call by supplying verbose = TRUE, or set globally via options(dbreg.verbose = TRUE).

… Additional arguments. Currently ignored, except to handle superseded arguments for backwards compatibility.

Value

A list of class "dbreg" containing various slots, including a table of coefficients (which the associated print method will display).

Acceleration Strategies

dbreg offers four primary acceleration strategies for estimating regression results from simplified data representations. Below we use the shorthand Y (outcome), X (explanatory variables), and FE (fixed effects) for exposition purposes:

“compress”: compresses the data via a GROUP BY operation (using X and the FE as groups), before running weighted least squares on this much smaller dataset:

\(\hat{\beta} = (X_c' W X_c)^{-1} X_c' W Y_c\)

where \(W = \text{diag}(n_g)\) are the group frequencies. This procedure follows Wong et al. (2021).
“moments”: computes sufficient statistics (\(X'X, X'y\)) directly via SQL aggregation, returning a single-row result. This solves the standard OLS normal equations \(\hat{\beta} = (X'X)^{-1}X'y\). Limited to cases without FE.
“demean” (alias “within”): subtracts group-level means from both Y and X before computing sufficient statistics (per the “moments” strategy). For example, given unit \(i\) and time \(t\) FE, we apply double demeaning:

\(\ddot{Y}_{it} = \beta \ddot{X}_{it} + \varepsilon_{it}\)

where \(\ddot{X} = X - \bar{X}_i - \bar{X}_t + \bar{X}\). This (single-pass) within transformation is algebraically equivalent to the fixed effects projection—i.e., Frisch-Waugh-Lovell partialling out—in the presence of a single FE. It is also identical for the two-way FE (TWFE) case if your panel is balanced. For unbalanced two-way panels, however, the double demeaning strategy is not algebraically equivalent to the fixed effects projection and therefore does not recover the exact TWFE coefficients. Moreover, note that this “demean” strategy permits at most two FE.
“mundlak”: a generalized Mundlak (1978), or correlated random effects (CRE) estimator that regresses Y on X plus group means of X:

\(Y_{it} = \alpha + \beta X_{it} + \gamma \bar{X}_i + \varepsilon_{it} \quad \text{(one-way)}\)

\(Y_{it} = \alpha + \beta X_{it} + \gamma \bar{X}_{i} + \delta \bar{X}_{t} + \varepsilon_{it} \quad \text{(two-way, etc.)}\)

Unlike “demean”, Y is not transformed, so predictions are on the original scale. Supports any number of FE and works correctly for any panel structure (balanced or unbalanced). However, note that CRE is a different model from FE: while coefficients are asymptotically equivalent under certain assumptions, they will generally differ in finite samples.

The relative efficiency of each of these strategies depends on the size and structure of the data, as well as the number of unique regressors and FE. For (quote unquote) "standard" cases, the “compress” strategy can yield remarkable performance gains and should justifiably be viewed as a good default. However, the compression approach tends to be less efficient for true panels (repeated cross-sections over time), where N >> T. In such cases, it can be more efficient to use a demeaning strategy that first controls for (e.g. subtracts) group means, before computing sufficient statistics on the aggregated data. The reason for this is that time and unit FE are typically high dimensional, but covariate averages are not; see Arkhangelsky & Imbens (2024).

However, the demeaning approaches invite tradeoffs of their own. For example, the double demeaning transformation of the “demean” strategy does not obtain exact TWFE results in unbalanced panels, and it is also limited to at most two FE. Conversely, the “mundlak” (CRE) strategy obtains consistent coefficients regardless of panel structure and FE count, but at the "cost" of recovering a different estimand. (It is a different model to TWFE, after all.) See Wooldridge (2025) for an extended discussion of these issues.

Users should weigh these tradeoffs when choosing their acceleration strategy. Summarising, we can provide a few guiding principles. “compress” is a good default that guarantees the "exact" FE estimates and is usually very efficient (barring data I/O costs and high FE dimensionality). “mundlak” is another efficient alternative provided that the CRE estimand is acceptable (don’t be alarmed if your coefficients are not identical). Finally, the “demean” and “moments” strategies are great for particular use cases (i.e., balanced panels and cases without FE, respectively).

If this all sounds like too much to think about, don’t fret. The good news is that dbreg can do a lot (all?) of the deciding for you. Specifically, it will invoke an “auto” heuristic behind the scenes if a user does not provide an explicit acceleration strategy. Working through the heuristic logic does impose some additional overhead, but this should be negligible in most cases (certainly compared to the overall time savings). The “auto” heuristic is as follows:

IF no FE AND (any continuous regressor OR poor compression ratio OR too big compressed data) THEN “moments”.
ELSE IF 1 FE AND (poor compression ratio OR too big compressed data) THEN “demean”.
ELSE IF 2 FE AND (poor compression ratio OR too big compressed data):
- IF balanced panel THEN “demean”.
- ELSE error (exact TWFE infeasible; user must explicitly choose “compress” or “mundlak”).
ELSE THEN “compress”.

Tip: set dbreg(…, verbose = TRUE) to print information about the auto strategy decision criteria.

References

Arkhangelsky, D. & Imbens, G. (2024) Fixed Effects and the Generalized Mundlak Estimator. The Review of Economic Studies, 91(5), pp. 2545–2571. Available: https://doi.org/10.1093/restud/rdad089

Mundlak, Y. (1978) On the Pooling of Time Series and Cross Section Data. Econometrica, 46(1), pp. 69–85. Available: https://doi.org/10.2307/1913646

Wong, J., Forsell, E., Lewis, R., Mao, T., & Wardrop, M. (2021). You Only Compress Once: Optimal Data Compression for Estimating Linear Models. arXiv preprint arXiv:2102.11297. Available: https://doi.org/10.48550/arXiv.2102.11297

Wooldridge, J.M. (2025) Two-way fixed effects, the two-way mundlak regression, and difference-in-differences estimators. Empirical Economics, 69, pp. 2545–2587. Available: https://doi.org/10.1007/s00181-025-02807-z

Examples

library("dbreg")

## In-memory data ----

# We can pass in-memory R data frames to an ephemeral DuckDB connection via
# the `data` argument. This is convenient for small(er) datasets and demos.

# Default "compress" strategy reduces the data to 4 rows before running OLS
dbreg(weight ~ Diet, data = ChickWeight)

Compressed OLS estimation, Dep. Var.: weight 
Observations.: 578 (original) | 4 (compressed) 
Standard Errors: IID 
            Estimate Std. Error  t value   Pr(>|t|)    
(Intercept) 102.6455    4.67395 21.96116  < 2.2e-16 ***
Diet2        19.9712    7.86744  2.53846 1.1397e-02 *  
Diet3        40.3045    7.86744  5.12296 4.1139e-07 ***
Diet4        32.6173    7.91046  4.12331 4.2864e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 69.1                Adj. R2: 0.048530

# Compare with lm
summary(lm(weight ~ Diet, data = ChickWeight))$coefficients

             Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 102.64545   4.673954 21.961161 4.712762e-78
Diet2        19.97121   7.867437  2.538465 1.139708e-02
Diet3        40.30455   7.867437  5.122958 4.113938e-07
Diet4        32.61726   7.910461  4.123307 4.286352e-05

# Add "fixed effects" after a `|` 
dbreg(weight ~ Time | Diet, data = ChickWeight)

Compressed OLS estimation, Dep. Var.: weight 
Observations.: 578 (original) | 48 (compressed) 
Standard Errors: IID 
     Estimate Std. Error t value  Pr(>|t|)    
Time  8.75049   0.221805 39.4512 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 35.8                Adj. R2: 0.743522

# "robust" SEs can also be computed using a sufficient statistics approach
dbreg(weight ~ Time | Diet, data = ChickWeight, vcov = "hc1")

Compressed OLS estimation, Dep. Var.: weight 
Observations.: 578 (original) | 48 (compressed) 
Standard Errors: Heteroskedasticity-robust 
     Estimate Std. Error t value  Pr(>|t|)    
Time  8.75049   0.261676 33.4402 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 35.8                Adj. R2: 0.743522

dbreg(weight ~ Time | Diet, data = ChickWeight, vcov = ~Chick)

Compressed OLS estimation, Dep. Var.: weight 
Observations.: 578 (original) | 48 (compressed) 
Standard Errors: Clustered (50 clusters) 
     Estimate Std. Error t value  Pr(>|t|)    
Time  8.75049   0.527463 16.5898 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 35.8                Adj. R2: 0.743522

# Different acceleration strategies + specifications
dbreg(weight ~ Time | Diet, data = ChickWeight, strategy = "demean")

Demeaned OLS estimation, Dep. Var.: weight 
Observations.: 578 
Standard Errors: IID 
     Estimate Std. Error t value  Pr(>|t|)    
Time  8.75049   0.221805 39.4512 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 35.8                Adj. R2: 0.729032

dbreg(weight ~ Time | Diet, data = ChickWeight, strategy = "mundlak")

One-way Mundlak OLS estimation, Dep. Var.: weight 
Observations.: 578 
Standard Errors: IID 
     Estimate Std. Error t value  Pr(>|t|)    
Time  8.75049    0.22765 38.4383 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 36.8                Adj. R2: 0.729827

dbreg(weight ~ Time | Diet + Chick, data = ChickWeight, strategy = "mundlak") # two-way Mundlak

Two-way Mundlak OLS estimation, Dep. Var.: weight 
Observations.: 578 
Standard Errors: IID 
     Estimate Std. Error t value  Pr(>|t|)    
Time  8.71519   0.229744 37.9344 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 36.8                Adj. R2: 0.729953

dbreg(weight ~ Time, data = ChickWeight, strategy = "moments") # no FEs

Moments-based OLS estimation, Dep. Var.: weight 
Observations.: 578 
Standard Errors: IID 
            Estimate Std. Error  t value  Pr(>|t|)    
(Intercept) 27.46743   3.036464  9.04586 < 2.2e-16 ***
Time         8.80304   0.239700 36.72524 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 38.8                Adj. R2: 0.700220

# etc.

# Interactions: does the effect of Time vary by Diet?
# (Diet main effects are collinear with Chick FE, so these drop out)
dbreg(weight ~ Time * Diet | Chick, data = ChickWeight)

Demeaned OLS estimation, Dep. Var.: weight 
Observations.: 578 
Standard Errors: IID 
           Estimate Std. Error  t value   Pr(>|t|)    
Time        6.69062   0.259809 25.75212  < 2.2e-16 ***
Time:Diet2  1.91851   0.429357  4.46834 9.6656e-06 ***
Time:Diet3  4.73225   0.429357 11.02170  < 2.2e-16 ***
Time:Diet4  2.96531   0.435005  6.81674 2.5702e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 24.2                Adj. R2: 0.844227
3 variables were removed because of collinearity (Diet2, Diet3, Diet4)

#
## DBI connection ----

# For persistent databases or more control, use the `conn` + `table` args.
# Again, we use DuckDB below but any other DBI-supported backend should work
# too (e.g., odbc, bigrquery, noctua (AWS Athena),  etc.) See:
# https://r-dbi.org/backends/

library(DBI)
con = dbConnect(duckdb::duckdb())
dbWriteTable(con, "cw", as.data.frame(ChickWeight))

dbreg(weight ~ Time | Diet, conn = con, table = "cw")

Compressed OLS estimation, Dep. Var.: weight 
Observations.: 578 (original) | 48 (compressed) 
Standard Errors: IID 
     Estimate Std. Error t value  Pr(>|t|)    
Time  8.75049   0.221805 39.4512 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 35.8                Adj. R2: 0.743522

# Tip: Rather than creating or writing (temp) tables, use CREATE VIEW to
# define subsets or computed columns without materializing data. This is more
# efficient and especially useful for filtering or adding variables.
dbExecute(
  con,
  "
  CREATE VIEW cw1 AS
  SELECT *
  FROM cw
  WHERE Diet = 1
  "
)

[1] 0

dbreg(weight ~ Time | Chick, conn = con, table = "cw1")

Demeaned OLS estimation, Dep. Var.: weight 
Observations.: 220 
Standard Errors: IID 
     Estimate Std. Error t value  Pr(>|t|)    
Time  6.69062   0.246299 27.1646 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 22.9                Adj. R2: 0.766255

#
## Path to file ----
#
# For file-based data (e.g., parquet), use the path argument.

tmp = tempfile(fileext = ".parquet")
dbExecute(con, sprintf("COPY cw TO '%s' (FORMAT PARQUET)", tmp))

[1] 578

dbreg(weight ~ Time | Diet, path = tmp)

Compressed OLS estimation, Dep. Var.: weight 
Observations.: 578 (original) | 48 (compressed) 
Standard Errors: IID 
     Estimate Std. Error t value  Pr(>|t|)    
Time  8.75049   0.221805 39.4512 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 35.8                Adj. R2: 0.743522

# Cleanup
dbDisconnect(con)
unlink(tmp)

#
## Big dataset ----

# For a more compelling and appropriate dbreg use-case, i.e. regression on a
# big (~180 million row) dataset of Hive-partioned parquet files, see the
# package website:
# https://grantmcdermott.com/dbreg/