diff-diff/diff_diff/spillover.py at main · igerber/diff-diff

History

3000 lines (2750 loc) · 136 KB

Raw

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207

208

209

210

211

212

213

214

215

216

217

218

219

220

221

222

223

224

225

226

227

228

229

230

231

232

233

234

235

236

237

238

239

240

241

242

243

244

245

246

247

248

249

250

251

252

253

254

255

256

257

258

259

260

261

262

263

264

265

266

267

268

269

270

271

272

273

274

275

276

277

278

279

280

281

282

283

284

285

286

287

288

289

290

291

292

293

294

295

296

297

298

299

300

301

302

303

304

305

306

307

308

309

310

311

312

313

314

315

316

317

318

319

320

321

322

323

324

325

326

327

328

329

330

331

332

333

334

335

336

337

338

339

340

341

342

343

344

345

346

347

348

349

350

351

352

353

354

355

356

357

358

359

360

361

362

363

364

365

366

367

368

369

370

371

372

373

374

375

376

377

378

379

380

381

382

383

384

385

386

387

388

389

390

391

392

393

394

395

396

397

398

399

400

401

402

403

404

405

406

407

408

409

410

411

412

413

414

415

416

417

418

419

420

421

422

423

424

425

426

427

428

429

430

431

432

433

434

435

436

437

438

439

440

441

442

443

444

445

446

447

448

449

450

451

452

453

454

455

456

457

458

459

460

461

462

463

464

465

466

467

468

469

470

471

472

473

474

475

476

477

478

479

480

481

482

483

484

485

486

487

488

489

490

491

492

493

494

495

496

497

498

499

500

501

502

503

504

505

506

507

508

509

510

511

512

513

514

515

516

517

518

519

520

521

522

523

524

525

526

527

528

529

530

531

532

533

534

535

536

537

538

539

540

541

542

543

544

545

546

547

548

549

550

551

552

553

554

555

556

557

558

559

560

561

562

563

564

565

566

567

568

569

570

571

572

573

574

575

576

577

578

579

580

581

582

583

584

585

586

587

588

589

590

591

592

593

594

595

596

597

598

599

600

601

602

603

604

605

606

607

608

609

610

611

612

613

614

615

616

617

618

619

620

621

622

623

624

625

626

627

628

629

630

631

632

633

634

635

636

637

638

639

640

641

642

643

644

645

646

647

648

649

650

651

652

653

654

655

656

657

658

659

660

661

662

663

664

665

666

667

668

669

670

671

672

673

674

675

676

677

678

679

680

681

682

683

684

685

686

687

688

689

690

691

692

693

694

695

696

697

698

699

700

701

702

703

704

705

706

707

708

709

710

711

712

713

714

715

716

717

718

719

720

721

722

723

724

725

726

727

728

729

730

731

732

733

734

735

736

737

738

739

740

741

742

743

744

745

746

747

748

749

750

751

752

753

754

755

756

757

758

759

760

761

762

763

764

765

766

767

768

769

770

771

772

773

774

775

776

777

778

779

780

781

782

783

784

785

786

787

788

789

790

791

792

793

794

795

796

797

798

799

800

801

802

803

804

805

806

807

808

809

810

811

812

813

814

815

816

817

818

819

820

821

822

823

824

825

826

827

828

829

830

831

832

833

834

835

836

837

838

839

840

841

842

843

844

845

846

847

848

849

850

851

852

853

854

855

856

857

858

859

860

861

862

863

864

865

866

867

868

869

870

871

872

873

874

875

876

877

878

879

880

881

882

883

884

885

886

887

888

889

890

891

892

893

894

895

896

897

898

899

900

901

902

903

904

905

906

907

908

909

910

911

912

913

914

915

916

917

918

919

920

921

922

923

924

925

926

927

928

929

930

931

932

933

934

935

936

937

938

939

940

941

942

943

944

945

946

947

948

949

950

951

952

953

954

955

956

957

958

959

960

961

962

963

964

965

966

967

968

969

970

971

972

973

974

975

976

977

978

979

980

981

982

983

984

985

986

987

988

989

990

991

992

993

994

995

996

997

998

999

1000

"""

SpilloverDiD — Butts (2021) ring-indicator spillover-aware DiD.

Augments a two-stage Gardner (2022) DiD with ring-indicator covariates that

identify the spillover effect on near-control units alongside the direct

effect on treated units. Handles both panel non-staggered and Section 5

staggered timing in a single estimator.

References

----------

Butts, K. (2023). Difference-in-Differences with Spatial Spillovers.

arXiv:2105.03737v3 (originally posted 2021).

Gardner, J. (2022). Two-stage differences in differences. arXiv:2207.05943.

Notes

-----

The paper's notation in Equation 5/6 is ``(1 - D_it) * Ring_{ij}`` with

``S_i`` unit-static. Reading that literally under a two-way fixed effects

specification yields a rank-deficient design (``(1 - D_it) * S_i = S_i -

D_it``; ``S_i`` is absorbed by ``mu_i``, leaving ``-D_it``). The paper

defines ``S_it = S_i * 1{t >= t_treat}`` (page 12, just above Equation 5)

and Section 5's Table 2 makes the time-varying form explicit

(``S^k_{it}``, ``Ring^k_{it,j}``). This implementation uses the

time-varying form, which is the spec that supports the paper's

identification argument (Proposition 2.3 + Section 3.1 subsample logic).

"""

import warnings

from typing import Any, Callable, Dict, List, Literal, Optional, Tuple, Union

import numpy as np

import pandas as pd

from scipy import sparse

from diff_diff.conley import (

_CONLEY_EARTH_RADIUS_KM,

_CONLEY_SPARSE_N_THRESHOLD,

_haversine_km,

_validate_callable_metric_result,

)

from diff_diff.linalg import solve_ols

from diff_diff.results import SpilloverDiDResults

from diff_diff.two_stage import _compute_gmm_corrected_meat

from diff_diff.utils import safe_inference

# Type alias mirroring diff_diff.conley.ConleyMetric so callers can supply

# any of the built-in identifiers or a user callable returning a pairwise

# distance matrix.

SpilloverMetric = Union[

Literal["haversine", "euclidean"],

Callable[[np.ndarray, np.ndarray], np.ndarray],

]

# =============================================================================

# Ring construction helpers (Step 1)

# =============================================================================

def _haversine_km_pairwise(

coords_a: np.ndarray,

coords_b: np.ndarray,

) -> np.ndarray:

"""Vectorized pairwise great-circle distance (km) between two coord sets.

Parameters

----------

coords_a : ndarray of shape (n_a, 2)

``(lat, lon)`` in DEGREES for the first set of points.

coords_b : ndarray of shape (n_b, 2)

``(lat, lon)`` in DEGREES for the second set of points.

Returns

-------

ndarray of shape (n_a, n_b)

Great-circle distances in km. Matches the ``_haversine_km`` Earth

radius convention (6371.01 km, mirroring R ``conleyreg``).

"""

lat_a = coords_a[:, 0][:, None]

lon_a = coords_a[:, 1][:, None]

lat_b = coords_b[:, 0][None, :]

lon_b = coords_b[:, 1][None, :]

return _haversine_km(lat_a, lon_a, lat_b, lon_b)

def _euclidean_pairwise(

coords_a: np.ndarray,

coords_b: np.ndarray,

) -> np.ndarray:

"""Vectorized pairwise Euclidean distance between two coord sets.

Coordinates are treated as planar; no unit conversion. Matches the

``_pairwise_distance_matrix`` Euclidean branch of ``conley.py``.

"""

diffs = coords_a[:, None, :] - coords_b[None, :, :]

return np.sqrt(np.einsum("ijk,ijk->ij", diffs, diffs))

def _apply_callable_metric_pairwise(

metric: Callable[[np.ndarray, np.ndarray], np.ndarray],

coords_a: np.ndarray,

coords_b: np.ndarray,

) -> np.ndarray:

"""Apply a user-supplied callable metric to two coord sets.

Unlike :func:`_validate_callable_metric_result` which checks square

``(n, n)`` symmetry on a single coord set, this helper accepts a

rectangular ``(n_a, n_b)`` result. The validator is therefore relaxed:

we only require finiteness, non-negativity, and correct shape. The

zero-diagonal / symmetry checks apply only when the same coord set is

passed on both sides; ring-construction usage passes a treated-only

subset on side B, so the diagonal of the rectangular result is not

meaningful.

"""

result = metric(coords_a, coords_b)

arr = np.asarray(result, dtype=np.float64)

expected_shape = (coords_a.shape[0], coords_b.shape[0])

if arr.shape != expected_shape:

raise ValueError(

"conley_metric callable returned shape "

f"{arr.shape} for pairwise ring distance, expected {expected_shape}."

)

if not np.isfinite(arr).all():

raise ValueError(

"conley_metric callable returned non-finite entries for pairwise "

"ring distance; all distances must be finite."

)

if (arr < 0.0).any():

raise ValueError(

"conley_metric callable returned negative entries for pairwise "

"ring distance; all distances must be non-negative."

)

return arr

def _pairwise_ring_distances(

coords_units: np.ndarray,

coords_treated: np.ndarray,

metric: SpilloverMetric,

) -> np.ndarray:

"""Compute (n_units, n_treated) pairwise distances under the chosen metric."""

if callable(metric):

return _apply_callable_metric_pairwise(metric, coords_units, coords_treated)

if metric == "haversine":

return _haversine_km_pairwise(coords_units, coords_treated)

if metric == "euclidean":

return _euclidean_pairwise(coords_units, coords_treated)

raise ValueError(

f"Unknown conley_metric: {metric!r}. Expected 'haversine', 'euclidean', "

"or a callable returning a pairwise distance matrix."

)

def _compute_nearest_treated_distance_static(

data: pd.DataFrame,

unit: str,

coords: Tuple[str, str],

metric: SpilloverMetric,

treated_unit_ids: np.ndarray,

cutoff_km: Optional[float] = None,

) -> Tuple[np.ndarray, np.ndarray]:

"""Return per-unit nearest-treated distance for the non-staggered case.

The set of treated units is fixed (ever-treated), so distances are

unit-level constants and don't vary across periods. Caller broadcasts

to per-row when assembling ring covariates.

Parameters

----------

data : pd.DataFrame

Panel data with one row per (unit, period). Used to extract

per-unit coords via :meth:`DataFrame.drop_duplicates` on ``unit``.

unit : str

Column name of the unit identifier.

coords : tuple of (str, str)

``(lat_col, lon_col)``.

metric : "haversine" | "euclidean" | callable

Distance metric. For ``"haversine"``, ``coords`` is interpreted as

``(lat, lon)`` in degrees. For ``"euclidean"``, ``coords`` is

planar. Callable receives two ``(n, 2)`` arrays and must return an

``(n_a, n_b)`` finite non-negative distance matrix.

treated_unit_ids : ndarray

IDs of ever-treated units (used as side B of pairwise distance).

cutoff_km : float, optional

If set and ``len(unit_index) > _CONLEY_SPARSE_N_THRESHOLD``, the

sparse cKDTree path is used to find treated neighbors within

``cutoff_km`` per unit; otherwise the dense (n_units × n_treated)

matrix is built. Units with no treated neighbor within ``cutoff_km``

receive ``d_i = inf`` (they fall outside any ring and into the

far-away control group, identical to dense-path behavior with

infinite distance to the nearest reached treated unit).

Returns

-------

d_i : ndarray of shape (n_unique_units,)

``d_i = min_{k in treated_unit_ids} d(i, k)`` per unique unit.

unit_index : ndarray of shape (n_unique_units,)

Unit identifiers in the same order as ``d_i``.

"""

unit_coords_df = (

data[[unit, coords[0], coords[1]]]

.drop_duplicates(subset=[unit])

.set_index(unit)

.sort_index()

)

unit_index = np.asarray(unit_coords_df.index.values)

all_coords = np.asarray(unit_coords_df[[coords[0], coords[1]]].values, dtype=np.float64)

treated_set = set(treated_unit_ids.tolist())

treated_mask = np.array([uid in treated_set for uid in unit_index], dtype=bool)

treated_coords = all_coords[treated_mask]

if treated_coords.shape[0] == 0:

raise ValueError(

"_compute_nearest_treated_distance_static: no treated units present "

"in `data` matching `treated_unit_ids`."

)

n_units = all_coords.shape[0]

is_builtin_metric = metric in ("haversine", "euclidean")

if cutoff_km is not None and n_units > _CONLEY_SPARSE_N_THRESHOLD and is_builtin_metric:

d_i = _compute_nearest_treated_distance_sparse(

all_coords=all_coords,

treated_coords=treated_coords,

metric=metric, # type: ignore[arg-type]

cutoff_km=float(cutoff_km),

)

else:

# Dense path: full pairwise matrix, then row-min.

dists = _pairwise_ring_distances(all_coords, treated_coords, metric)

d_i = dists.min(axis=1)

return d_i.astype(np.float64), unit_index

def _compute_nearest_treated_distance_sparse(

all_coords: np.ndarray,

treated_coords: np.ndarray,

metric: Literal["haversine", "euclidean"],

cutoff_km: float,

) -> np.ndarray:

"""Sparse cKDTree path for nearest-treated-distance computation.

Used when ``n_units > _CONLEY_SPARSE_N_THRESHOLD`` AND the metric is a

built-in string. The tree is built on the treated subset (small) and

queried with each unit row. Units with no treated neighbor inside

``cutoff_km`` get ``d_i = inf``, which places them in the far-away

control group on the downstream ring-membership step.

For haversine: lat/lon are projected to 3-D unit-sphere Cartesian

coordinates; the chord-distance query radius is

``2 * sin(arc / (2 * R_earth))`` with arc clamped at ``pi * R_earth``

so cutoffs beyond a hemisphere don't shrink. Exact great-circle

distances are then recomputed via :func:`_haversine_km` for the in-

range matches and the per-row minimum is taken.

For euclidean: planar L2 directly in cKDTree.

Parameters

----------

all_coords : ndarray of shape (n_units, 2)

Coordinates for all units.

treated_coords : ndarray of shape (n_treated, 2)

Coordinates for ever-treated units.

metric : 'haversine' or 'euclidean'

Built-in metric only; callables fall back to the dense path.

cutoff_km : float

Maximum considered distance. Units beyond this get ``d_i = inf``.

Returns

-------

ndarray of shape (n_units,)

Nearest-treated distance per unit (inf when no neighbor in range).

"""

# Imported lazily to mirror conley.py's lazy-scipy pattern and keep

# module import cheap when the sparse path isn't exercised.

from scipy.spatial import cKDTree # noqa: WPS433 (deferred import)

n_units = all_coords.shape[0]

if metric == "haversine":

# Project lat/lon (degrees) to 3-D unit-sphere Cartesian.

lat_rad_all = np.radians(all_coords[:, 0])

lon_rad_all = np.radians(all_coords[:, 1])

unit_xyz = np.column_stack(

[

np.cos(lat_rad_all) * np.cos(lon_rad_all),

np.cos(lat_rad_all) * np.sin(lon_rad_all),

np.sin(lat_rad_all),

]

)

lat_rad_tr = np.radians(treated_coords[:, 0])

lon_rad_tr = np.radians(treated_coords[:, 1])

tree_xyz = np.column_stack(

[

np.cos(lat_rad_tr) * np.cos(lon_rad_tr),

np.cos(lat_rad_tr) * np.sin(lon_rad_tr),

np.sin(lat_rad_tr),

]

)

# Chord-distance radius for the query; clamp arc at pi (a half-revolution)

# so cutoffs > pi * R_earth do not shrink chord radius below the true reach.

arc_radians = min(cutoff_km / _CONLEY_EARTH_RADIUS_KM, np.pi)

query_r = 2.0 * np.sin(arc_radians / 2.0)

query_r *= 1.0 + 1e-12 # numerical safety margin

tree = cKDTree(tree_xyz)

# Query in chord space, recompute exact great-circle distance for matches.

neighbors = tree.query_ball_point(unit_xyz, r=query_r, p=2.0)

d_i = np.full(n_units, np.inf, dtype=np.float64)

for i, idxs in enumerate(neighbors):

if not idxs:

continue

# Exact great-circle distance for the in-range treated neighbors.

arr_idxs = np.asarray(idxs, dtype=np.intp)

d_subset = _haversine_km(

all_coords[i, 0],

all_coords[i, 1],

treated_coords[arr_idxs, 0],

treated_coords[arr_idxs, 1],

)

d_i[i] = float(d_subset.min())

return d_i

# Euclidean: cKDTree handles directly.

tree = cKDTree(treated_coords)

d_i = np.full(n_units, np.inf, dtype=np.float64)

neighbors = tree.query_ball_point(all_coords, r=cutoff_km, p=2.0)

for i, idxs in enumerate(neighbors):

if not idxs:

continue

arr_idxs = np.asarray(idxs, dtype=np.intp)

d_subset = _euclidean_pairwise(all_coords[i : i + 1], treated_coords[arr_idxs])

d_i[i] = float(d_subset.min())

return d_i

def _compute_nearest_treated_distance_staggered(

data: pd.DataFrame,

unit: str,

time: str,

coords: Tuple[str, str],

metric: SpilloverMetric,

first_treat_by_unit: Dict[Any, Any],

d_bar: Optional[float] = None,

) -> Tuple[np.ndarray, np.ndarray, np.ndarray, Optional[np.ndarray]]:

"""Return per-row nearest-treated distance for the staggered case.

For each (unit, period) observation, find the minimum distance to any

unit that is treated BY THE END of that period (``first_treat_k <=

t``). Ring membership in the staggered case is therefore unit-time

varying.

Parameters

----------

data : pd.DataFrame

Panel data (one row per unit-period).

unit : str

Unit identifier column name.

time : str

Time period column name.

coords : tuple of (str, str)

``(lat_col, lon_col)``.

metric : "haversine" | "euclidean" | callable

Distance metric.

first_treat_by_unit : dict

Mapping from unit identifier to onset time (or ``np.inf`` for

never-treated). Generated by :func:`_extract_treatment_onsets`.

d_bar : float, optional

When supplied, the function additionally computes the per-row

**spillover-trigger onset** (earliest cohort onset whose treated

units fall within ``d_bar`` of unit ``i``) reusing the cohort

loop. Used by :func:`_compute_event_time_per_row` to avoid a

duplicate cohort pass on the event-study path

(PR #456 R6 performance fix).

Notes

-----

The staggered helper currently always uses dense pairwise distance per

cohort. A sparse cKDTree branch (mirroring the static helper) is queued

as a follow-up — see TODO.md.

Returns

-------

d_it : ndarray of shape (n_rows,)

Per-row nearest-treated distance, with ``inf`` for rows where no

unit has been treated yet by time t (early periods).

row_unit : ndarray of shape (n_rows,)

Aligned unit identifier per row (for downstream broadcasting).

row_time : ndarray of shape (n_rows,)

Aligned time identifier per row.

trigger_onset_per_row : ndarray of shape (n_rows,) or None

``None`` when ``d_bar`` is None. Otherwise: per-row earliest

cohort onset whose treated units fall within ``d_bar`` of the

row's unit, broadcast from per-unit. NaN for rows whose unit is

never within ``d_bar`` of any cohort.

"""

unit_coords_df = (

data[[unit, coords[0], coords[1]]].drop_duplicates(subset=[unit]).set_index(unit)

)

unit_index = np.asarray(unit_coords_df.index.values)

all_coords = np.asarray(unit_coords_df[[coords[0], coords[1]]].values, dtype=np.float64)

unit_to_pos = {uid: pos for pos, uid in enumerate(unit_index)}

row_unit = np.asarray(data[unit].values)

row_time = np.asarray(data[time].values)

n_rows = len(row_unit)

d_it = np.full(n_rows, np.inf, dtype=np.float64)

trigger_onset_per_unit_pos: Optional[np.ndarray] = (

np.full(len(unit_index), np.nan, dtype=np.float64) if d_bar is not None else None

)

# Determine the cohort onset times that exist in the data (excluding never-treated).

unique_onsets = sorted({ft for ft in first_treat_by_unit.values() if np.isfinite(ft)})

if not unique_onsets:

# Degenerate: no treated units. Caller should have rejected this

# in `_validate_spillover_inputs`, but defensively return inf.

return d_it, row_unit, row_time, None

# Row's unit position. Invariant across cohort iterations — compute

# once outside the loop.

row_pos = np.array([unit_to_pos[uid] for uid in row_unit], dtype=np.intp)

# For each unique onset time, compute (n_units, n_treated_by_then) pairwise

# distances ONCE, then assign to rows whose t >= that onset (carrying forward

# the minimum across cohorts).

for onset in unique_onsets:

treated_by_onset_ids = [uid for uid, ft in first_treat_by_unit.items() if ft <= onset]

treated_positions = np.array(

[unit_to_pos[uid] for uid in treated_by_onset_ids if uid in unit_to_pos],

dtype=np.intp,

)

if treated_positions.size == 0:

continue

treated_coords = all_coords[treated_positions]

# Compute per-unit nearest distance to this cohort's treated set.

dists_to_cohort = _pairwise_ring_distances(all_coords, treated_coords, metric).min(axis=1)

# Update rows whose period t >= onset: take min of current d_it and the

# newly-available cohort distance.

affected_rows = row_time >= onset

if not affected_rows.any():

continue

row_cohort_dist = dists_to_cohort[row_pos]

# Only update rows where this cohort's distance is smaller than the

# current d_it (carries the running minimum across cohorts).

update_mask = affected_rows & (row_cohort_dist < d_it)

d_it[update_mask] = row_cohort_dist[update_mask]

# Reuse this same cohort distance computation for the per-unit

# spillover-trigger onset when d_bar is supplied. The trigger is

# the FIRST cohort whose treated units fall within d_bar of unit

# i — once locked it persists for later cohort iterations. Using

# cumulative-treated distances here is fine: if a unit is in

# range of cohort c1, dists_to_cohort at onset=c1 already detects

# it; later iterations with extra treated units only shrink the

# distance, never grow it back above d_bar.

if trigger_onset_per_unit_pos is not None:

in_range_for_cohort = dists_to_cohort <= d_bar

not_yet_triggered = np.isnan(trigger_onset_per_unit_pos)

trigger_onset_per_unit_pos[in_range_for_cohort & not_yet_triggered] = onset

# Broadcast per-unit trigger to rows when computed.

if trigger_onset_per_unit_pos is not None:

trigger_onset_per_row = trigger_onset_per_unit_pos[row_pos]

else:

trigger_onset_per_row = None

return d_it, row_unit, row_time, trigger_onset_per_row

def _compute_event_time_per_row(

data: pd.DataFrame,

unit: str,

row_unit: np.ndarray,

row_time: np.ndarray,

effective_onsets: Dict[Any, float],

coords: Tuple[str, str],

metric: SpilloverMetric,

d_bar: float,

precomputed_trigger_onset_per_row: Optional[np.ndarray] = None,

) -> Tuple[np.ndarray, np.ndarray]:

"""Compute two event-time clocks per row for Wave C event-study mode.

Butts (2021) Section 5 / Table 2 uses one symbol ``K_it`` but operationally

there are TWO event-time clocks — one for the direct-effect series and one

for the spillover-exposure series. This helper returns both.

- ``K_direct[r] = row_time[r] - effective_onsets[row_unit[r]]`` for rows of

ever-treated units (any t, including pre-treatment k < 0 for placebo

coefficients). NaN for never-treated units.

- ``K_spill[r] = row_time[r] - trigger_onset[row_unit[r]]`` for rows where

the spillover-trigger cohort has activated by ``row_time[r]``. NaN

otherwise. ``trigger_onset[i]`` is the EARLIEST effective onset among

cohorts whose treated units fall within ``d_bar`` of unit ``i``.

Cohort onsets are iterated in ascending order so the trigger is the first

cohort that puts unit ``i`` in any ring — matching the running-min logic

used by :func:`_compute_nearest_treated_distance_staggered` for ``d_it``.

Parameters

----------

data : pd.DataFrame

Panel data; used to extract one (lat, lon) coordinate per unit.

unit, coords, metric, d_bar

Mirror :func:`_compute_nearest_treated_distance_staggered`.

row_unit, row_time : ndarray of shape (n_rows,)

Per-row identifiers (anticipation-adjusted onsets are baked into

``effective_onsets``; row_time is the raw period).

effective_onsets : dict

Mapping from unit identifier to anticipation-shifted first_treat

(``first_treat - anticipation``). ``np.inf`` for never-treated units.

Returns

-------

K_direct : ndarray of shape (n_rows,), float64 with NaN where undefined.

K_spill : ndarray of shape (n_rows,), float64 with NaN where undefined.

Notes

-----

PR #456 R6 performance fix: when ``precomputed_trigger_onset_per_row``

is supplied (as :func:`_compute_nearest_treated_distance_staggered`

now optionally returns when called with ``d_bar=...``), the cohort

loop is skipped — K_spill is derived directly from the precomputed

trigger. The fallback (compute trigger inline) is kept for unit-test

callers and other code paths that don't have access to the staggered

distance helper's output.

"""

n_rows = len(row_unit)

row_time_arr = np.asarray(row_time, dtype=np.float64)

# K_direct: per-row, derived from row_unit -> own effective_onset.

K_direct = np.full(n_rows, np.nan, dtype=np.float64)

own_onsets = np.array([effective_onsets.get(uid, np.inf) for uid in row_unit], dtype=np.float64)

direct_defined = np.isfinite(own_onsets)

K_direct[direct_defined] = row_time_arr[direct_defined] - own_onsets[direct_defined]

if precomputed_trigger_onset_per_row is not None:

# Fast path: reuse trigger onsets already computed by the staggered

# distance helper. Avoids a duplicate cohort loop.

row_trigger = np.asarray(precomputed_trigger_onset_per_row, dtype=np.float64)

K_spill = np.full(n_rows, np.nan, dtype=np.float64)

triggered = np.isfinite(row_trigger)

post_trigger = triggered & (row_time_arr >= row_trigger)

K_spill[post_trigger] = row_time_arr[post_trigger] - row_trigger[post_trigger]

return K_direct, K_spill

# Fallback path (test callers, etc.): compute trigger inline via own

# cohort loop. trigger_onset[i] = first effective_onset among cohorts

# whose treated units have d(i, treated_in_cohort) <= d_bar.

unit_coords_df = (

data[[unit, coords[0], coords[1]]].drop_duplicates(subset=[unit]).set_index(unit)

)

unit_index = np.asarray(unit_coords_df.index.values)

all_coords = np.asarray(unit_coords_df[[coords[0], coords[1]]].values, dtype=np.float64)

unit_to_pos = {uid: pos for pos, uid in enumerate(unit_index)}

unique_onsets = sorted({eff_ft for eff_ft in effective_onsets.values() if np.isfinite(eff_ft)})

trigger_onset_per_unit_pos = np.full(len(unit_index), np.nan, dtype=np.float64)

for onset in unique_onsets:

# Units treated AT THIS ONSET (not by/before; we want the cohort's

# own treated set so we can compute the per-onset distance front).

treated_at_onset_ids = [uid for uid, ft in effective_onsets.items() if ft == onset]

treated_positions = np.array(

[unit_to_pos[uid] for uid in treated_at_onset_ids if uid in unit_to_pos],

dtype=np.intp,

)

if treated_positions.size == 0:

continue

treated_coords = all_coords[treated_positions]

dists_to_cohort = _pairwise_ring_distances(all_coords, treated_coords, metric).min(axis=1)

in_range_for_cohort = dists_to_cohort <= d_bar

not_yet_triggered = np.isnan(trigger_onset_per_unit_pos)

trigger_onset_per_unit_pos[in_range_for_cohort & not_yet_triggered] = onset

# Broadcast trigger onset to rows; K_spill = t - trigger when t >= trigger.

row_pos = np.array([unit_to_pos.get(uid, -1) for uid in row_unit], dtype=np.intp)

K_spill = np.full(n_rows, np.nan, dtype=np.float64)

valid_pos = row_pos >= 0

row_trigger = np.where(

valid_pos, trigger_onset_per_unit_pos[np.where(valid_pos, row_pos, 0)], np.nan

)

triggered = np.isfinite(row_trigger)

post_trigger = triggered & (row_time_arr >= row_trigger)

K_spill[post_trigger] = row_time_arr[post_trigger] - row_trigger[post_trigger]

return K_direct, K_spill

def _apply_horizon_binning(

K_arr: np.ndarray,

horizon_max: Optional[int],

) -> np.ndarray:

"""Clip per-row event-time values into ``[-horizon_max, +horizon_max]`` bins.

Wave C event-study path uses bin-into-endpoint-pools semantics: rows with

event-time ``k < -H`` aggregate into a single ``k = -H`` dummy; rows with

``k > +H`` aggregate into a single ``k = +H`` dummy. No observations are

dropped (cf. TwoStageDiD's ``horizon_max`` which filters rows).

NaN values in ``K_arr`` propagate through (``np.clip`` preserves NaN by

default). Omega_0 / never-treated rows carry NaN K values, which cause

``1{K_binned = k}`` to evaluate False at every k — so they contribute 0

to all event-time dummies (correct identification: those rows enter

stage 1 only, not the event-time decomposition).

Parameters

----------

K_arr : ndarray

Per-row event-time values. NaN entries are passed through unchanged.

horizon_max : int or None

Bin width; if ``None``, no clipping (used for auto-detect path where

``H = max(|K_it|)`` provides the natural bound).

Returns

-------

ndarray of same shape and dtype as input, with NaN-preserving clamp applied.

"""

if horizon_max is None:

return K_arr.astype(np.float64, copy=False)

if not isinstance(horizon_max, (int, np.integer)) or horizon_max < 0:

raise ValueError(

f"horizon_max must be a non-negative integer or None; "

f"got {horizon_max!r} (type {type(horizon_max).__name__})."

)

return np.clip(K_arr.astype(np.float64, copy=False), -float(horizon_max), float(horizon_max))

def _build_event_study_design(

D_it: np.ndarray,

ring_masks: np.ndarray,

ring_labels: List[str],

K_direct_binned: np.ndarray,

K_spill_binned: np.ndarray,

event_time_grid: List[int],

ref_period: int,

) -> Tuple[

np.ndarray,

List[str],

List[Tuple[str, Optional[str], int]],

np.ndarray,

"""Build per-event-time × ring stage-2 design matrix for Wave C event-study.

The design has two series of dummies:

- **Direct effect**: ``D^k_{it} := 1{K_direct_{it} = k AND row is ever-treated}``

for each ``k ∈ event_time_grid \\ {ref_period}``. NaN entries in

``K_direct_binned`` (never-treated rows) cause the indicator to evaluate

False, naturally yielding zero contribution.

- **Spillover**: ``Ring^k_{it,j} := (1 - D_it) * ring_masks[:, j] * 1{K_spill_{it} = k}``

for each ring ``j`` and each ``k ∈ event_time_grid \\ {ref_period}``.

All-zero columns are pre-filtered (one summary warning instead of many),

but the FULL rectangular grid of (series, ring, k) tuples is also returned

so downstream code can emit the MultiIndex ``spillover_effects`` schema

with NaN coefficients for empty cells (per Wave C plan: rectangular).

Parameters

----------

D_it : ndarray of shape (n_rows,), float

Per-row binary indicator (treated AND post-treatment).

ring_masks : ndarray of shape (n_rows, K), bool

Per-row ring-membership indicators (from :func:`_build_ring_indicators`).

ring_labels : list of K strings

Human-readable labels for each ring band.

K_direct_binned, K_spill_binned : ndarray of shape (n_rows,), float64

Per-row event-time clocks (NaN where undefined). Already passed through

:func:`_apply_horizon_binning` if applicable.

event_time_grid : list of int

The full event-time bin set (e.g. ``[-3, -2, -1, 0, 1, 2, 3]`` for

``horizon_max=3``). Reference period is dropped from this list inside

the helper.

ref_period : int

The event-time integer to drop from BOTH series.

Returns

-------

X_2 : ndarray of shape (n_rows, n_kept_cols)

Stage-2 design matrix (only non-empty columns kept).

kept_col_names : list of str

Column labels matching X_2 columns. Convention: ``"D^k=+0"``,

``"_spillover_[0, 50)^k=-2"``, with signed integer suffix.

kept_col_meta : list of (series, ring_label_or_None, k)

Tuple metadata per kept column (``series ∈ {"direct", "spillover"}``).

rectangular_grid : list of (series, ring_label_or_None, k)

FULL grid of (series, ring, k) entries including those dropped because

the column was all zeros. Used for rectangular MultiIndex emission.

Order matches the design layout (direct first, then per-ring spillover).

n_obs_per_col : ndarray of shape (n_kept_cols,), int64

Count of rows with a non-zero contribution to each kept column.

"""

if not isinstance(ref_period, (int, np.integer)):

raise TypeError(

f"ref_period must be an integer; got {ref_period!r} "

f"(type {type(ref_period).__name__})."

)

K = ring_masks.shape[1]

if len(ring_labels) != K:

raise ValueError(

f"ring_labels length ({len(ring_labels)}) must match number of " f"rings ({K})."

)

# The grid of event-times to emit dummies for, with the reference dropped.

k_grid = [int(k) for k in event_time_grid if int(k) != int(ref_period)]

one_minus_D = 1.0 - D_it.astype(np.float64)

ring_masks_f = ring_masks.astype(np.float64)

K_direct_f = np.asarray(K_direct_binned, dtype=np.float64)

K_spill_f = np.asarray(K_spill_binned, dtype=np.float64)

def _signed(k: int) -> str:

return f"{k:+d}"

# Build candidate columns in canonical order:

# 1) all direct-effect dummies, ascending k

# 2) per-ring spillover dummies (ascending ring, ascending k within)

candidate_cols: List[Tuple[str, Optional[str], int, np.ndarray]] = []

rectangular_grid: List[Tuple[str, Optional[str], int]] = []

for k in k_grid:

# Direct-effect dummy: D_i (implicit via NaN-on-never-treated) * 1{K_direct = k}.

col = (K_direct_f == float(k)).astype(np.float64)

candidate_cols.append(("direct", None, k, col))

rectangular_grid.append(("direct", None, k))

for j in range(K):

ring_lab = ring_labels[j]

for k in k_grid:

# Spillover dummy: (1 - D_it) * Ring_j * 1{K_spill = k}.

col = one_minus_D * ring_masks_f[:, j] * (K_spill_f == float(k)).astype(np.float64)

candidate_cols.append(("spillover", ring_lab, k, col))

rectangular_grid.append(("spillover", ring_lab, k))

# Pre-filter all-zero columns to keep solve_ols's rank-deficient warning

# noise low. Track the kept set.

kept_indices: List[int] = []

kept_cols_list: List[np.ndarray] = []

kept_col_names: List[str] = []

kept_col_meta: List[Tuple[str, Optional[str], int]] = []

n_obs_list: List[int] = []

n_dropped = 0

for idx, (series, ring_lab, k, col) in enumerate(candidate_cols):

n_nonzero = int(np.count_nonzero(col))

if n_nonzero == 0:

n_dropped += 1

continue

kept_indices.append(idx)

kept_cols_list.append(col)

if series == "direct":

kept_col_names.append(f"D^k={_signed(k)}")

else:

kept_col_names.append(f"_spillover_{ring_lab}^k={_signed(k)}")

kept_col_meta.append((series, ring_lab, k))

n_obs_list.append(n_nonzero)

if n_dropped > 0:

warnings.warn(

f"SpilloverDiD event-study: {n_dropped} of "

f"{len(candidate_cols)} stage-2 design column(s) were "

"all-zero (no rows contribute) and dropped before fitting. "

"Empty (series, ring, event_time) cells appear in the result "

"with coef=NaN and n_obs=0 (rectangular schema). To shrink the "

"emitted grid, reduce horizon_max or use horizon_max=None for "

"auto-detection.",

UserWarning,

stacklevel=2,

)

if not kept_cols_list:

# All columns dropped — degenerate. Return empty design; caller

# handles via downstream df_resid check + safe_inference NaN

# propagation.

X_2 = np.zeros((len(D_it), 0), dtype=np.float64)

else:

X_2 = np.column_stack(kept_cols_list)

n_obs_per_col = np.asarray(n_obs_list, dtype=np.int64)

return X_2, kept_col_names, kept_col_meta, rectangular_grid, n_obs_per_col

def _extract_event_study_results(

coef: np.ndarray,

vcov: Optional[np.ndarray],

col_names_all: List[str],

kept_col_meta: List[Tuple[str, Optional[str], int]],

rectangular_grid: List[Tuple[str, Optional[str], int]],

n_obs_per_col: np.ndarray,

ref_period: int,

df_resid: int,

alpha: float,

ring_labels: List[str],

) -> Tuple[

float,

Tuple[float, float],

Optional[pd.DataFrame],

Optional[Dict[int, Dict[str, Any]]],

Dict[str, float],

"""Extract per-event-time inference and the share-weighted scalar ``att``.

Builds three output surfaces from a single stage-2 fit:

- ``att_dynamic`` : per-event-time direct-effect DataFrame indexed by ``k``.

Includes the reference period row with ``coef=0.0, se=0.0, n_obs=0``.

Rectangular emission across the full direct-effect event-time grid.

- ``spillover_effects`` : MultiIndex ``(ring_label, event_time)`` DataFrame

with the same columns. Rectangular over the full spillover grid.

- ``event_study_effects`` : TwoStageDiD-compatible alias matching

``two_stage.py:1355-1389`` schema (``conf_int`` as ``(low, high)`` tuple,

reference period as ``(0.0, 0.0)``).

Scalar ``att`` uses sample-share weighting on post-treatment ``tau_k``

with SE from linear-combination inference on the corresponding vcov

submatrix.

"""

# Per-coefficient inference dict keyed by (series, ring_label, k).

per_coef: Dict[Tuple[str, Optional[str], int], Dict[str, Any]] = {}

for i, (series, ring_label, k) in enumerate(kept_col_meta):

coef_i = float(coef[i]) if np.isfinite(coef[i]) else float("nan")

if vcov is not None and np.isfinite(vcov[i, i]):

se_i = float(np.sqrt(max(vcov[i, i], 0.0)))

else:

se_i = float("nan")

t_i, p_i, ci_i = safe_inference(coef_i, se_i, alpha=alpha, df=df_resid)

per_coef[(series, ring_label, k)] = {

"coef": coef_i,

"se": se_i,

"t_stat": t_i,

"p_value": p_i,

"ci_low": ci_i[0],

"ci_high": ci_i[1],

"n_obs": int(n_obs_per_col[i]),

}

direct_k_set = sorted({k for (s, _, k) in rectangular_grid if s == "direct"})

spillover_k_set = sorted({k for (s, _, k) in rectangular_grid if s == "spillover"})

# Build att_dynamic: rectangular over direct event-time grid + reference row.

all_direct_ks = sorted(set(direct_k_set) | {ref_period})

direct_rows: List[Dict[str, Any]] = []

for k in all_direct_ks:

if k == ref_period:

direct_rows.append(

{

"k": k,

"coef": 0.0,

"se": 0.0,

"t_stat": float("nan"),

"p_value": float("nan"),

"ci_low": 0.0,

"ci_high": 0.0,

"n_obs": 0,

}

)

elif ("direct", None, k) in per_coef:

r = per_coef[("direct", None, k)]

direct_rows.append({"k": k, **r})

else:

direct_rows.append(

{

"k": k,

"coef": float("nan"),

"se": float("nan"),

"t_stat": float("nan"),

"p_value": float("nan"),

"ci_low": float("nan"),

"ci_high": float("nan"),

"n_obs": 0,

}

)

att_dynamic_df = pd.DataFrame(direct_rows).set_index("k").sort_index() if direct_rows else None

# Build spillover_effects: rectangular over (ring_label, k) grid.

# PR #456 R1 fix (P3): the spillover grid must INCLUDE the reference

# period row per ring. The pre-filter in _build_event_study_design drops

# `ref_period` from the fitted column set, but the rectangular schema

# for spillover must still emit (ring, ref_period) with `coef=0.0,

# se=0.0, n_obs=0` for symmetry with the direct-effect series (which

# emits its reference row at k=ref_period). Without this, consumers

# iterating `[-H, ..., +H]` would hit a missing (ring, ref_period)

# slice — the registry promises rectangular emission over the full

# event-time grid.

all_spillover_ks = sorted(set(spillover_k_set) | {ref_period})

spillover_rows: List[Dict[str, Any]] = []

for ring_lab in ring_labels:

for k in all_spillover_ks:

if k == ref_period:

# Reference-period spillover row: 0-anchored (mirrors direct).

spillover_rows.append(

{

"ring": ring_lab,

"k": k,

"coef": 0.0,

"se": 0.0,

"t_stat": float("nan"),

"p_value": float("nan"),

"ci_low": 0.0,

"ci_high": 0.0,

"n_obs": 0,

}

)

continue

key = ("spillover", ring_lab, k)

if key in per_coef:

r = per_coef[key]

spillover_rows.append({"ring": ring_lab, "k": k, **r})

else:

spillover_rows.append(

{

"ring": ring_lab,

"k": k,

"coef": float("nan"),

"se": float("nan"),

"t_stat": float("nan"),

"p_value": float("nan"),

"ci_low": float("nan"),

"ci_high": float("nan"),

"n_obs": 0,

}

)

spillover_df = (

pd.DataFrame(spillover_rows).set_index(["ring", "k"]).sort_index()

if spillover_rows

else None

)

# Build event_study_effects dict (TwoStageDiD-compatible).

event_study_effects: Dict[int, Dict[str, Any]] = {}

for k in all_direct_ks:

if k == ref_period:

event_study_effects[k] = {

"effect": 0.0,

"se": 0.0,

"n_obs": 0,

"t_stat": float("nan"),

"p_value": float("nan"),

"conf_int": (0.0, 0.0),

}

elif ("direct", None, k) in per_coef:

r = per_coef[("direct", None, k)]

event_study_effects[k] = {

"effect": r["coef"],

"se": r["se"],

"n_obs": r["n_obs"],

"t_stat": r["t_stat"],

"p_value": r["p_value"],

"conf_int": (r["ci_low"], r["ci_high"]),

}

else:

event_study_effects[k] = {

"effect": float("nan"),

"se": float("nan"),

"n_obs": 0,

"t_stat": float("nan"),

"p_value": float("nan"),

"conf_int": (float("nan"), float("nan")),

}

# Scalar att via sample-share-weighted average over post-treatment direct

# coefficients (k >= 0). SE via linear-combination on the vcov submatrix

# of those kept columns.

# Fail-closed contract (PR #456 R1 fix): if ANY post-treatment direct

# coefficient is NaN (solve_ols dropped the column as rank-deficient),

# the aggregate is structurally unidentified. Set att = NaN with a

# warning rather than silently zeroing the dropped column's contribution

# via np.nansum (which would change the point estimate without

# renormalizing weights). Matches the library-wide

# `feedback_no_silent_failures` invariant.

post_direct_indices = [

i for i, (s, _, k) in enumerate(kept_col_meta) if s == "direct" and k >= 0

]

if post_direct_indices and vcov is not None:

n_obs_post = np.array([n_obs_per_col[i] for i in post_direct_indices], dtype=np.float64)

total_post_obs = n_obs_post.sum()

coefs_post = np.array([coef[i] for i in post_direct_indices], dtype=np.float64)

has_nan_post = bool(np.any(~np.isfinite(coefs_post)))

if has_nan_post:

warnings.warn(

"SpilloverDiD event-study: scalar `att` is NaN because at "

"least one post-treatment direct-effect coefficient was "

"dropped as rank-deficient (or otherwise non-finite). The "

"aggregate is unidentified under this design; inspect "

"`att_dynamic` for the per-event-time coefficients and "

"re-aggregate manually if appropriate.",

UserWarning,

stacklevel=2,

)

att = float("nan")

att_se = float("nan")

elif total_post_obs > 0:

weights = n_obs_post / total_post_obs

View remainder of file in raw view

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FilesExpand file tree

spillover.py

Latest commit

History

spillover.py

File metadata and controls