-
Notifications
You must be signed in to change notification settings - Fork 38
Expand file tree
/
Copy pathspillover.py
More file actions
3000 lines (2750 loc) · 136 KB
/
spillover.py
File metadata and controls
3000 lines (2750 loc) · 136 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
"""
SpilloverDiD — Butts (2021) ring-indicator spillover-aware DiD.
Augments a two-stage Gardner (2022) DiD with ring-indicator covariates that
identify the spillover effect on near-control units alongside the direct
effect on treated units. Handles both panel non-staggered and Section 5
staggered timing in a single estimator.
References
----------
Butts, K. (2023). Difference-in-Differences with Spatial Spillovers.
arXiv:2105.03737v3 (originally posted 2021).
Gardner, J. (2022). Two-stage differences in differences. arXiv:2207.05943.
Notes
-----
The paper's notation in Equation 5/6 is ``(1 - D_it) * Ring_{ij}`` with
``S_i`` unit-static. Reading that literally under a two-way fixed effects
specification yields a rank-deficient design (``(1 - D_it) * S_i = S_i -
D_it``; ``S_i`` is absorbed by ``mu_i``, leaving ``-D_it``). The paper
defines ``S_it = S_i * 1{t >= t_treat}`` (page 12, just above Equation 5)
and Section 5's Table 2 makes the time-varying form explicit
(``S^k_{it}``, ``Ring^k_{it,j}``). This implementation uses the
time-varying form, which is the spec that supports the paper's
identification argument (Proposition 2.3 + Section 3.1 subsample logic).
"""
import warnings
from typing import Any, Callable, Dict, List, Literal, Optional, Tuple, Union
import numpy as np
import pandas as pd
from scipy import sparse
from diff_diff.conley import (
_CONLEY_EARTH_RADIUS_KM,
_CONLEY_SPARSE_N_THRESHOLD,
_haversine_km,
_validate_callable_metric_result,
)
from diff_diff.linalg import solve_ols
from diff_diff.results import SpilloverDiDResults
from diff_diff.two_stage import _compute_gmm_corrected_meat
from diff_diff.utils import safe_inference
# Type alias mirroring diff_diff.conley.ConleyMetric so callers can supply
# any of the built-in identifiers or a user callable returning a pairwise
# distance matrix.
SpilloverMetric = Union[
Literal["haversine", "euclidean"],
Callable[[np.ndarray, np.ndarray], np.ndarray],
]
# =============================================================================
# Ring construction helpers (Step 1)
# =============================================================================
def _haversine_km_pairwise(
coords_a: np.ndarray,
coords_b: np.ndarray,
) -> np.ndarray:
"""Vectorized pairwise great-circle distance (km) between two coord sets.
Parameters
----------
coords_a : ndarray of shape (n_a, 2)
``(lat, lon)`` in DEGREES for the first set of points.
coords_b : ndarray of shape (n_b, 2)
``(lat, lon)`` in DEGREES for the second set of points.
Returns
-------
ndarray of shape (n_a, n_b)
Great-circle distances in km. Matches the ``_haversine_km`` Earth
radius convention (6371.01 km, mirroring R ``conleyreg``).
"""
lat_a = coords_a[:, 0][:, None]
lon_a = coords_a[:, 1][:, None]
lat_b = coords_b[:, 0][None, :]
lon_b = coords_b[:, 1][None, :]
return _haversine_km(lat_a, lon_a, lat_b, lon_b)
def _euclidean_pairwise(
coords_a: np.ndarray,
coords_b: np.ndarray,
) -> np.ndarray:
"""Vectorized pairwise Euclidean distance between two coord sets.
Coordinates are treated as planar; no unit conversion. Matches the
``_pairwise_distance_matrix`` Euclidean branch of ``conley.py``.
"""
diffs = coords_a[:, None, :] - coords_b[None, :, :]
return np.sqrt(np.einsum("ijk,ijk->ij", diffs, diffs))
def _apply_callable_metric_pairwise(
metric: Callable[[np.ndarray, np.ndarray], np.ndarray],
coords_a: np.ndarray,
coords_b: np.ndarray,
) -> np.ndarray:
"""Apply a user-supplied callable metric to two coord sets.
Unlike :func:`_validate_callable_metric_result` which checks square
``(n, n)`` symmetry on a single coord set, this helper accepts a
rectangular ``(n_a, n_b)`` result. The validator is therefore relaxed:
we only require finiteness, non-negativity, and correct shape. The
zero-diagonal / symmetry checks apply only when the same coord set is
passed on both sides; ring-construction usage passes a treated-only
subset on side B, so the diagonal of the rectangular result is not
meaningful.
"""
result = metric(coords_a, coords_b)
arr = np.asarray(result, dtype=np.float64)
expected_shape = (coords_a.shape[0], coords_b.shape[0])
if arr.shape != expected_shape:
raise ValueError(
"conley_metric callable returned shape "
f"{arr.shape} for pairwise ring distance, expected {expected_shape}."
)
if not np.isfinite(arr).all():
raise ValueError(
"conley_metric callable returned non-finite entries for pairwise "
"ring distance; all distances must be finite."
)
if (arr < 0.0).any():
raise ValueError(
"conley_metric callable returned negative entries for pairwise "
"ring distance; all distances must be non-negative."
)
return arr
def _pairwise_ring_distances(
coords_units: np.ndarray,
coords_treated: np.ndarray,
metric: SpilloverMetric,
) -> np.ndarray:
"""Compute (n_units, n_treated) pairwise distances under the chosen metric."""
if callable(metric):
return _apply_callable_metric_pairwise(metric, coords_units, coords_treated)
if metric == "haversine":
return _haversine_km_pairwise(coords_units, coords_treated)
if metric == "euclidean":
return _euclidean_pairwise(coords_units, coords_treated)
raise ValueError(
f"Unknown conley_metric: {metric!r}. Expected 'haversine', 'euclidean', "
"or a callable returning a pairwise distance matrix."
)
def _compute_nearest_treated_distance_static(
data: pd.DataFrame,
*,
unit: str,
coords: Tuple[str, str],
metric: SpilloverMetric,
treated_unit_ids: np.ndarray,
cutoff_km: Optional[float] = None,
) -> Tuple[np.ndarray, np.ndarray]:
"""Return per-unit nearest-treated distance for the non-staggered case.
The set of treated units is fixed (ever-treated), so distances are
unit-level constants and don't vary across periods. Caller broadcasts
to per-row when assembling ring covariates.
Parameters
----------
data : pd.DataFrame
Panel data with one row per (unit, period). Used to extract
per-unit coords via :meth:`DataFrame.drop_duplicates` on ``unit``.
unit : str
Column name of the unit identifier.
coords : tuple of (str, str)
``(lat_col, lon_col)``.
metric : "haversine" | "euclidean" | callable
Distance metric. For ``"haversine"``, ``coords`` is interpreted as
``(lat, lon)`` in degrees. For ``"euclidean"``, ``coords`` is
planar. Callable receives two ``(n, 2)`` arrays and must return an
``(n_a, n_b)`` finite non-negative distance matrix.
treated_unit_ids : ndarray
IDs of ever-treated units (used as side B of pairwise distance).
cutoff_km : float, optional
If set and ``len(unit_index) > _CONLEY_SPARSE_N_THRESHOLD``, the
sparse cKDTree path is used to find treated neighbors within
``cutoff_km`` per unit; otherwise the dense (n_units × n_treated)
matrix is built. Units with no treated neighbor within ``cutoff_km``
receive ``d_i = inf`` (they fall outside any ring and into the
far-away control group, identical to dense-path behavior with
infinite distance to the nearest reached treated unit).
Returns
-------
d_i : ndarray of shape (n_unique_units,)
``d_i = min_{k in treated_unit_ids} d(i, k)`` per unique unit.
unit_index : ndarray of shape (n_unique_units,)
Unit identifiers in the same order as ``d_i``.
"""
unit_coords_df = (
data[[unit, coords[0], coords[1]]]
.drop_duplicates(subset=[unit])
.set_index(unit)
.sort_index()
)
unit_index = np.asarray(unit_coords_df.index.values)
all_coords = np.asarray(unit_coords_df[[coords[0], coords[1]]].values, dtype=np.float64)
treated_set = set(treated_unit_ids.tolist())
treated_mask = np.array([uid in treated_set for uid in unit_index], dtype=bool)
treated_coords = all_coords[treated_mask]
if treated_coords.shape[0] == 0:
raise ValueError(
"_compute_nearest_treated_distance_static: no treated units present "
"in `data` matching `treated_unit_ids`."
)
n_units = all_coords.shape[0]
is_builtin_metric = metric in ("haversine", "euclidean")
if cutoff_km is not None and n_units > _CONLEY_SPARSE_N_THRESHOLD and is_builtin_metric:
d_i = _compute_nearest_treated_distance_sparse(
all_coords=all_coords,
treated_coords=treated_coords,
metric=metric, # type: ignore[arg-type]
cutoff_km=float(cutoff_km),
)
else:
# Dense path: full pairwise matrix, then row-min.
dists = _pairwise_ring_distances(all_coords, treated_coords, metric)
d_i = dists.min(axis=1)
return d_i.astype(np.float64), unit_index
def _compute_nearest_treated_distance_sparse(
*,
all_coords: np.ndarray,
treated_coords: np.ndarray,
metric: Literal["haversine", "euclidean"],
cutoff_km: float,
) -> np.ndarray:
"""Sparse cKDTree path for nearest-treated-distance computation.
Used when ``n_units > _CONLEY_SPARSE_N_THRESHOLD`` AND the metric is a
built-in string. The tree is built on the treated subset (small) and
queried with each unit row. Units with no treated neighbor inside
``cutoff_km`` get ``d_i = inf``, which places them in the far-away
control group on the downstream ring-membership step.
For haversine: lat/lon are projected to 3-D unit-sphere Cartesian
coordinates; the chord-distance query radius is
``2 * sin(arc / (2 * R_earth))`` with arc clamped at ``pi * R_earth``
so cutoffs beyond a hemisphere don't shrink. Exact great-circle
distances are then recomputed via :func:`_haversine_km` for the in-
range matches and the per-row minimum is taken.
For euclidean: planar L2 directly in cKDTree.
Parameters
----------
all_coords : ndarray of shape (n_units, 2)
Coordinates for all units.
treated_coords : ndarray of shape (n_treated, 2)
Coordinates for ever-treated units.
metric : 'haversine' or 'euclidean'
Built-in metric only; callables fall back to the dense path.
cutoff_km : float
Maximum considered distance. Units beyond this get ``d_i = inf``.
Returns
-------
ndarray of shape (n_units,)
Nearest-treated distance per unit (inf when no neighbor in range).
"""
# Imported lazily to mirror conley.py's lazy-scipy pattern and keep
# module import cheap when the sparse path isn't exercised.
from scipy.spatial import cKDTree # noqa: WPS433 (deferred import)
n_units = all_coords.shape[0]
if metric == "haversine":
# Project lat/lon (degrees) to 3-D unit-sphere Cartesian.
lat_rad_all = np.radians(all_coords[:, 0])
lon_rad_all = np.radians(all_coords[:, 1])
unit_xyz = np.column_stack(
[
np.cos(lat_rad_all) * np.cos(lon_rad_all),
np.cos(lat_rad_all) * np.sin(lon_rad_all),
np.sin(lat_rad_all),
]
)
lat_rad_tr = np.radians(treated_coords[:, 0])
lon_rad_tr = np.radians(treated_coords[:, 1])
tree_xyz = np.column_stack(
[
np.cos(lat_rad_tr) * np.cos(lon_rad_tr),
np.cos(lat_rad_tr) * np.sin(lon_rad_tr),
np.sin(lat_rad_tr),
]
)
# Chord-distance radius for the query; clamp arc at pi (a half-revolution)
# so cutoffs > pi * R_earth do not shrink chord radius below the true reach.
arc_radians = min(cutoff_km / _CONLEY_EARTH_RADIUS_KM, np.pi)
query_r = 2.0 * np.sin(arc_radians / 2.0)
query_r *= 1.0 + 1e-12 # numerical safety margin
tree = cKDTree(tree_xyz)
# Query in chord space, recompute exact great-circle distance for matches.
neighbors = tree.query_ball_point(unit_xyz, r=query_r, p=2.0)
d_i = np.full(n_units, np.inf, dtype=np.float64)
for i, idxs in enumerate(neighbors):
if not idxs:
continue
# Exact great-circle distance for the in-range treated neighbors.
arr_idxs = np.asarray(idxs, dtype=np.intp)
d_subset = _haversine_km(
all_coords[i, 0],
all_coords[i, 1],
treated_coords[arr_idxs, 0],
treated_coords[arr_idxs, 1],
)
d_i[i] = float(d_subset.min())
return d_i
# Euclidean: cKDTree handles directly.
tree = cKDTree(treated_coords)
d_i = np.full(n_units, np.inf, dtype=np.float64)
neighbors = tree.query_ball_point(all_coords, r=cutoff_km, p=2.0)
for i, idxs in enumerate(neighbors):
if not idxs:
continue
arr_idxs = np.asarray(idxs, dtype=np.intp)
d_subset = _euclidean_pairwise(all_coords[i : i + 1], treated_coords[arr_idxs])
d_i[i] = float(d_subset.min())
return d_i
def _compute_nearest_treated_distance_staggered(
data: pd.DataFrame,
*,
unit: str,
time: str,
coords: Tuple[str, str],
metric: SpilloverMetric,
first_treat_by_unit: Dict[Any, Any],
d_bar: Optional[float] = None,
) -> Tuple[np.ndarray, np.ndarray, np.ndarray, Optional[np.ndarray]]:
"""Return per-row nearest-treated distance for the staggered case.
For each (unit, period) observation, find the minimum distance to any
unit that is treated BY THE END of that period (``first_treat_k <=
t``). Ring membership in the staggered case is therefore unit-time
varying.
Parameters
----------
data : pd.DataFrame
Panel data (one row per unit-period).
unit : str
Unit identifier column name.
time : str
Time period column name.
coords : tuple of (str, str)
``(lat_col, lon_col)``.
metric : "haversine" | "euclidean" | callable
Distance metric.
first_treat_by_unit : dict
Mapping from unit identifier to onset time (or ``np.inf`` for
never-treated). Generated by :func:`_extract_treatment_onsets`.
d_bar : float, optional
When supplied, the function additionally computes the per-row
**spillover-trigger onset** (earliest cohort onset whose treated
units fall within ``d_bar`` of unit ``i``) reusing the cohort
loop. Used by :func:`_compute_event_time_per_row` to avoid a
duplicate cohort pass on the event-study path
(PR #456 R6 performance fix).
Notes
-----
The staggered helper currently always uses dense pairwise distance per
cohort. A sparse cKDTree branch (mirroring the static helper) is queued
as a follow-up — see TODO.md.
Returns
-------
d_it : ndarray of shape (n_rows,)
Per-row nearest-treated distance, with ``inf`` for rows where no
unit has been treated yet by time t (early periods).
row_unit : ndarray of shape (n_rows,)
Aligned unit identifier per row (for downstream broadcasting).
row_time : ndarray of shape (n_rows,)
Aligned time identifier per row.
trigger_onset_per_row : ndarray of shape (n_rows,) or None
``None`` when ``d_bar`` is None. Otherwise: per-row earliest
cohort onset whose treated units fall within ``d_bar`` of the
row's unit, broadcast from per-unit. NaN for rows whose unit is
never within ``d_bar`` of any cohort.
"""
unit_coords_df = (
data[[unit, coords[0], coords[1]]].drop_duplicates(subset=[unit]).set_index(unit)
)
unit_index = np.asarray(unit_coords_df.index.values)
all_coords = np.asarray(unit_coords_df[[coords[0], coords[1]]].values, dtype=np.float64)
unit_to_pos = {uid: pos for pos, uid in enumerate(unit_index)}
row_unit = np.asarray(data[unit].values)
row_time = np.asarray(data[time].values)
n_rows = len(row_unit)
d_it = np.full(n_rows, np.inf, dtype=np.float64)
trigger_onset_per_unit_pos: Optional[np.ndarray] = (
np.full(len(unit_index), np.nan, dtype=np.float64) if d_bar is not None else None
)
# Determine the cohort onset times that exist in the data (excluding never-treated).
unique_onsets = sorted({ft for ft in first_treat_by_unit.values() if np.isfinite(ft)})
if not unique_onsets:
# Degenerate: no treated units. Caller should have rejected this
# in `_validate_spillover_inputs`, but defensively return inf.
return d_it, row_unit, row_time, None
# Row's unit position. Invariant across cohort iterations — compute
# once outside the loop.
row_pos = np.array([unit_to_pos[uid] for uid in row_unit], dtype=np.intp)
# For each unique onset time, compute (n_units, n_treated_by_then) pairwise
# distances ONCE, then assign to rows whose t >= that onset (carrying forward
# the minimum across cohorts).
for onset in unique_onsets:
treated_by_onset_ids = [uid for uid, ft in first_treat_by_unit.items() if ft <= onset]
treated_positions = np.array(
[unit_to_pos[uid] for uid in treated_by_onset_ids if uid in unit_to_pos],
dtype=np.intp,
)
if treated_positions.size == 0:
continue
treated_coords = all_coords[treated_positions]
# Compute per-unit nearest distance to this cohort's treated set.
dists_to_cohort = _pairwise_ring_distances(all_coords, treated_coords, metric).min(axis=1)
# Update rows whose period t >= onset: take min of current d_it and the
# newly-available cohort distance.
affected_rows = row_time >= onset
if not affected_rows.any():
continue
row_cohort_dist = dists_to_cohort[row_pos]
# Only update rows where this cohort's distance is smaller than the
# current d_it (carries the running minimum across cohorts).
update_mask = affected_rows & (row_cohort_dist < d_it)
d_it[update_mask] = row_cohort_dist[update_mask]
# Reuse this same cohort distance computation for the per-unit
# spillover-trigger onset when d_bar is supplied. The trigger is
# the FIRST cohort whose treated units fall within d_bar of unit
# i — once locked it persists for later cohort iterations. Using
# cumulative-treated distances here is fine: if a unit is in
# range of cohort c1, dists_to_cohort at onset=c1 already detects
# it; later iterations with extra treated units only shrink the
# distance, never grow it back above d_bar.
if trigger_onset_per_unit_pos is not None:
in_range_for_cohort = dists_to_cohort <= d_bar
not_yet_triggered = np.isnan(trigger_onset_per_unit_pos)
trigger_onset_per_unit_pos[in_range_for_cohort & not_yet_triggered] = onset
# Broadcast per-unit trigger to rows when computed.
if trigger_onset_per_unit_pos is not None:
trigger_onset_per_row = trigger_onset_per_unit_pos[row_pos]
else:
trigger_onset_per_row = None
return d_it, row_unit, row_time, trigger_onset_per_row
def _compute_event_time_per_row(
*,
data: pd.DataFrame,
unit: str,
row_unit: np.ndarray,
row_time: np.ndarray,
effective_onsets: Dict[Any, float],
coords: Tuple[str, str],
metric: SpilloverMetric,
d_bar: float,
precomputed_trigger_onset_per_row: Optional[np.ndarray] = None,
) -> Tuple[np.ndarray, np.ndarray]:
"""Compute two event-time clocks per row for Wave C event-study mode.
Butts (2021) Section 5 / Table 2 uses one symbol ``K_it`` but operationally
there are TWO event-time clocks — one for the direct-effect series and one
for the spillover-exposure series. This helper returns both.
- ``K_direct[r] = row_time[r] - effective_onsets[row_unit[r]]`` for rows of
ever-treated units (any t, including pre-treatment k < 0 for placebo
coefficients). NaN for never-treated units.
- ``K_spill[r] = row_time[r] - trigger_onset[row_unit[r]]`` for rows where
the spillover-trigger cohort has activated by ``row_time[r]``. NaN
otherwise. ``trigger_onset[i]`` is the EARLIEST effective onset among
cohorts whose treated units fall within ``d_bar`` of unit ``i``.
Cohort onsets are iterated in ascending order so the trigger is the first
cohort that puts unit ``i`` in any ring — matching the running-min logic
used by :func:`_compute_nearest_treated_distance_staggered` for ``d_it``.
Parameters
----------
data : pd.DataFrame
Panel data; used to extract one (lat, lon) coordinate per unit.
unit, coords, metric, d_bar
Mirror :func:`_compute_nearest_treated_distance_staggered`.
row_unit, row_time : ndarray of shape (n_rows,)
Per-row identifiers (anticipation-adjusted onsets are baked into
``effective_onsets``; row_time is the raw period).
effective_onsets : dict
Mapping from unit identifier to anticipation-shifted first_treat
(``first_treat - anticipation``). ``np.inf`` for never-treated units.
Returns
-------
K_direct : ndarray of shape (n_rows,), float64 with NaN where undefined.
K_spill : ndarray of shape (n_rows,), float64 with NaN where undefined.
Notes
-----
PR #456 R6 performance fix: when ``precomputed_trigger_onset_per_row``
is supplied (as :func:`_compute_nearest_treated_distance_staggered`
now optionally returns when called with ``d_bar=...``), the cohort
loop is skipped — K_spill is derived directly from the precomputed
trigger. The fallback (compute trigger inline) is kept for unit-test
callers and other code paths that don't have access to the staggered
distance helper's output.
"""
n_rows = len(row_unit)
row_time_arr = np.asarray(row_time, dtype=np.float64)
# K_direct: per-row, derived from row_unit -> own effective_onset.
K_direct = np.full(n_rows, np.nan, dtype=np.float64)
own_onsets = np.array([effective_onsets.get(uid, np.inf) for uid in row_unit], dtype=np.float64)
direct_defined = np.isfinite(own_onsets)
K_direct[direct_defined] = row_time_arr[direct_defined] - own_onsets[direct_defined]
if precomputed_trigger_onset_per_row is not None:
# Fast path: reuse trigger onsets already computed by the staggered
# distance helper. Avoids a duplicate cohort loop.
row_trigger = np.asarray(precomputed_trigger_onset_per_row, dtype=np.float64)
K_spill = np.full(n_rows, np.nan, dtype=np.float64)
triggered = np.isfinite(row_trigger)
post_trigger = triggered & (row_time_arr >= row_trigger)
K_spill[post_trigger] = row_time_arr[post_trigger] - row_trigger[post_trigger]
return K_direct, K_spill
# Fallback path (test callers, etc.): compute trigger inline via own
# cohort loop. trigger_onset[i] = first effective_onset among cohorts
# whose treated units have d(i, treated_in_cohort) <= d_bar.
unit_coords_df = (
data[[unit, coords[0], coords[1]]].drop_duplicates(subset=[unit]).set_index(unit)
)
unit_index = np.asarray(unit_coords_df.index.values)
all_coords = np.asarray(unit_coords_df[[coords[0], coords[1]]].values, dtype=np.float64)
unit_to_pos = {uid: pos for pos, uid in enumerate(unit_index)}
unique_onsets = sorted({eff_ft for eff_ft in effective_onsets.values() if np.isfinite(eff_ft)})
trigger_onset_per_unit_pos = np.full(len(unit_index), np.nan, dtype=np.float64)
for onset in unique_onsets:
# Units treated AT THIS ONSET (not by/before; we want the cohort's
# own treated set so we can compute the per-onset distance front).
treated_at_onset_ids = [uid for uid, ft in effective_onsets.items() if ft == onset]
treated_positions = np.array(
[unit_to_pos[uid] for uid in treated_at_onset_ids if uid in unit_to_pos],
dtype=np.intp,
)
if treated_positions.size == 0:
continue
treated_coords = all_coords[treated_positions]
dists_to_cohort = _pairwise_ring_distances(all_coords, treated_coords, metric).min(axis=1)
in_range_for_cohort = dists_to_cohort <= d_bar
not_yet_triggered = np.isnan(trigger_onset_per_unit_pos)
trigger_onset_per_unit_pos[in_range_for_cohort & not_yet_triggered] = onset
# Broadcast trigger onset to rows; K_spill = t - trigger when t >= trigger.
row_pos = np.array([unit_to_pos.get(uid, -1) for uid in row_unit], dtype=np.intp)
K_spill = np.full(n_rows, np.nan, dtype=np.float64)
valid_pos = row_pos >= 0
row_trigger = np.where(
valid_pos, trigger_onset_per_unit_pos[np.where(valid_pos, row_pos, 0)], np.nan
)
triggered = np.isfinite(row_trigger)
post_trigger = triggered & (row_time_arr >= row_trigger)
K_spill[post_trigger] = row_time_arr[post_trigger] - row_trigger[post_trigger]
return K_direct, K_spill
def _apply_horizon_binning(
K_arr: np.ndarray,
horizon_max: Optional[int],
) -> np.ndarray:
"""Clip per-row event-time values into ``[-horizon_max, +horizon_max]`` bins.
Wave C event-study path uses bin-into-endpoint-pools semantics: rows with
event-time ``k < -H`` aggregate into a single ``k = -H`` dummy; rows with
``k > +H`` aggregate into a single ``k = +H`` dummy. No observations are
dropped (cf. TwoStageDiD's ``horizon_max`` which filters rows).
NaN values in ``K_arr`` propagate through (``np.clip`` preserves NaN by
default). Omega_0 / never-treated rows carry NaN K values, which cause
``1{K_binned = k}`` to evaluate False at every k — so they contribute 0
to all event-time dummies (correct identification: those rows enter
stage 1 only, not the event-time decomposition).
Parameters
----------
K_arr : ndarray
Per-row event-time values. NaN entries are passed through unchanged.
horizon_max : int or None
Bin width; if ``None``, no clipping (used for auto-detect path where
``H = max(|K_it|)`` provides the natural bound).
Returns
-------
ndarray of same shape and dtype as input, with NaN-preserving clamp applied.
"""
if horizon_max is None:
return K_arr.astype(np.float64, copy=False)
if not isinstance(horizon_max, (int, np.integer)) or horizon_max < 0:
raise ValueError(
f"horizon_max must be a non-negative integer or None; "
f"got {horizon_max!r} (type {type(horizon_max).__name__})."
)
return np.clip(K_arr.astype(np.float64, copy=False), -float(horizon_max), float(horizon_max))
def _build_event_study_design(
*,
D_it: np.ndarray,
ring_masks: np.ndarray,
ring_labels: List[str],
K_direct_binned: np.ndarray,
K_spill_binned: np.ndarray,
event_time_grid: List[int],
ref_period: int,
) -> Tuple[
np.ndarray,
List[str],
List[Tuple[str, Optional[str], int]],
List[Tuple[str, Optional[str], int]],
np.ndarray,
]:
"""Build per-event-time × ring stage-2 design matrix for Wave C event-study.
The design has two series of dummies:
- **Direct effect**: ``D^k_{it} := 1{K_direct_{it} = k AND row is ever-treated}``
for each ``k ∈ event_time_grid \\ {ref_period}``. NaN entries in
``K_direct_binned`` (never-treated rows) cause the indicator to evaluate
False, naturally yielding zero contribution.
- **Spillover**: ``Ring^k_{it,j} := (1 - D_it) * ring_masks[:, j] * 1{K_spill_{it} = k}``
for each ring ``j`` and each ``k ∈ event_time_grid \\ {ref_period}``.
All-zero columns are pre-filtered (one summary warning instead of many),
but the FULL rectangular grid of (series, ring, k) tuples is also returned
so downstream code can emit the MultiIndex ``spillover_effects`` schema
with NaN coefficients for empty cells (per Wave C plan: rectangular).
Parameters
----------
D_it : ndarray of shape (n_rows,), float
Per-row binary indicator (treated AND post-treatment).
ring_masks : ndarray of shape (n_rows, K), bool
Per-row ring-membership indicators (from :func:`_build_ring_indicators`).
ring_labels : list of K strings
Human-readable labels for each ring band.
K_direct_binned, K_spill_binned : ndarray of shape (n_rows,), float64
Per-row event-time clocks (NaN where undefined). Already passed through
:func:`_apply_horizon_binning` if applicable.
event_time_grid : list of int
The full event-time bin set (e.g. ``[-3, -2, -1, 0, 1, 2, 3]`` for
``horizon_max=3``). Reference period is dropped from this list inside
the helper.
ref_period : int
The event-time integer to drop from BOTH series.
Returns
-------
X_2 : ndarray of shape (n_rows, n_kept_cols)
Stage-2 design matrix (only non-empty columns kept).
kept_col_names : list of str
Column labels matching X_2 columns. Convention: ``"D^k=+0"``,
``"_spillover_[0, 50)^k=-2"``, with signed integer suffix.
kept_col_meta : list of (series, ring_label_or_None, k)
Tuple metadata per kept column (``series ∈ {"direct", "spillover"}``).
rectangular_grid : list of (series, ring_label_or_None, k)
FULL grid of (series, ring, k) entries including those dropped because
the column was all zeros. Used for rectangular MultiIndex emission.
Order matches the design layout (direct first, then per-ring spillover).
n_obs_per_col : ndarray of shape (n_kept_cols,), int64
Count of rows with a non-zero contribution to each kept column.
"""
if not isinstance(ref_period, (int, np.integer)):
raise TypeError(
f"ref_period must be an integer; got {ref_period!r} "
f"(type {type(ref_period).__name__})."
)
K = ring_masks.shape[1]
if len(ring_labels) != K:
raise ValueError(
f"ring_labels length ({len(ring_labels)}) must match number of " f"rings ({K})."
)
# The grid of event-times to emit dummies for, with the reference dropped.
k_grid = [int(k) for k in event_time_grid if int(k) != int(ref_period)]
one_minus_D = 1.0 - D_it.astype(np.float64)
ring_masks_f = ring_masks.astype(np.float64)
K_direct_f = np.asarray(K_direct_binned, dtype=np.float64)
K_spill_f = np.asarray(K_spill_binned, dtype=np.float64)
def _signed(k: int) -> str:
return f"{k:+d}"
# Build candidate columns in canonical order:
# 1) all direct-effect dummies, ascending k
# 2) per-ring spillover dummies (ascending ring, ascending k within)
candidate_cols: List[Tuple[str, Optional[str], int, np.ndarray]] = []
rectangular_grid: List[Tuple[str, Optional[str], int]] = []
for k in k_grid:
# Direct-effect dummy: D_i (implicit via NaN-on-never-treated) * 1{K_direct = k}.
col = (K_direct_f == float(k)).astype(np.float64)
candidate_cols.append(("direct", None, k, col))
rectangular_grid.append(("direct", None, k))
for j in range(K):
ring_lab = ring_labels[j]
for k in k_grid:
# Spillover dummy: (1 - D_it) * Ring_j * 1{K_spill = k}.
col = one_minus_D * ring_masks_f[:, j] * (K_spill_f == float(k)).astype(np.float64)
candidate_cols.append(("spillover", ring_lab, k, col))
rectangular_grid.append(("spillover", ring_lab, k))
# Pre-filter all-zero columns to keep solve_ols's rank-deficient warning
# noise low. Track the kept set.
kept_indices: List[int] = []
kept_cols_list: List[np.ndarray] = []
kept_col_names: List[str] = []
kept_col_meta: List[Tuple[str, Optional[str], int]] = []
n_obs_list: List[int] = []
n_dropped = 0
for idx, (series, ring_lab, k, col) in enumerate(candidate_cols):
n_nonzero = int(np.count_nonzero(col))
if n_nonzero == 0:
n_dropped += 1
continue
kept_indices.append(idx)
kept_cols_list.append(col)
if series == "direct":
kept_col_names.append(f"D^k={_signed(k)}")
else:
kept_col_names.append(f"_spillover_{ring_lab}^k={_signed(k)}")
kept_col_meta.append((series, ring_lab, k))
n_obs_list.append(n_nonzero)
if n_dropped > 0:
warnings.warn(
f"SpilloverDiD event-study: {n_dropped} of "
f"{len(candidate_cols)} stage-2 design column(s) were "
"all-zero (no rows contribute) and dropped before fitting. "
"Empty (series, ring, event_time) cells appear in the result "
"with coef=NaN and n_obs=0 (rectangular schema). To shrink the "
"emitted grid, reduce horizon_max or use horizon_max=None for "
"auto-detection.",
UserWarning,
stacklevel=2,
)
if not kept_cols_list:
# All columns dropped — degenerate. Return empty design; caller
# handles via downstream df_resid check + safe_inference NaN
# propagation.
X_2 = np.zeros((len(D_it), 0), dtype=np.float64)
else:
X_2 = np.column_stack(kept_cols_list)
n_obs_per_col = np.asarray(n_obs_list, dtype=np.int64)
return X_2, kept_col_names, kept_col_meta, rectangular_grid, n_obs_per_col
def _extract_event_study_results(
*,
coef: np.ndarray,
vcov: Optional[np.ndarray],
col_names_all: List[str],
kept_col_meta: List[Tuple[str, Optional[str], int]],
rectangular_grid: List[Tuple[str, Optional[str], int]],
n_obs_per_col: np.ndarray,
ref_period: int,
df_resid: int,
alpha: float,
ring_labels: List[str],
) -> Tuple[
float,
float,
float,
float,
Tuple[float, float],
Optional[pd.DataFrame],
Optional[pd.DataFrame],
Optional[Dict[int, Dict[str, Any]]],
Dict[str, float],
]:
"""Extract per-event-time inference and the share-weighted scalar ``att``.
Builds three output surfaces from a single stage-2 fit:
- ``att_dynamic`` : per-event-time direct-effect DataFrame indexed by ``k``.
Includes the reference period row with ``coef=0.0, se=0.0, n_obs=0``.
Rectangular emission across the full direct-effect event-time grid.
- ``spillover_effects`` : MultiIndex ``(ring_label, event_time)`` DataFrame
with the same columns. Rectangular over the full spillover grid.
- ``event_study_effects`` : TwoStageDiD-compatible alias matching
``two_stage.py:1355-1389`` schema (``conf_int`` as ``(low, high)`` tuple,
reference period as ``(0.0, 0.0)``).
Scalar ``att`` uses sample-share weighting on post-treatment ``tau_k``
with SE from linear-combination inference on the corresponding vcov
submatrix.
"""
# Per-coefficient inference dict keyed by (series, ring_label, k).
per_coef: Dict[Tuple[str, Optional[str], int], Dict[str, Any]] = {}
for i, (series, ring_label, k) in enumerate(kept_col_meta):
coef_i = float(coef[i]) if np.isfinite(coef[i]) else float("nan")
if vcov is not None and np.isfinite(vcov[i, i]):
se_i = float(np.sqrt(max(vcov[i, i], 0.0)))
else:
se_i = float("nan")
t_i, p_i, ci_i = safe_inference(coef_i, se_i, alpha=alpha, df=df_resid)
per_coef[(series, ring_label, k)] = {
"coef": coef_i,
"se": se_i,
"t_stat": t_i,
"p_value": p_i,
"ci_low": ci_i[0],
"ci_high": ci_i[1],
"n_obs": int(n_obs_per_col[i]),
}
direct_k_set = sorted({k for (s, _, k) in rectangular_grid if s == "direct"})
spillover_k_set = sorted({k for (s, _, k) in rectangular_grid if s == "spillover"})
# Build att_dynamic: rectangular over direct event-time grid + reference row.
all_direct_ks = sorted(set(direct_k_set) | {ref_period})
direct_rows: List[Dict[str, Any]] = []
for k in all_direct_ks:
if k == ref_period:
direct_rows.append(
{
"k": k,
"coef": 0.0,
"se": 0.0,
"t_stat": float("nan"),
"p_value": float("nan"),
"ci_low": 0.0,
"ci_high": 0.0,
"n_obs": 0,
}
)
elif ("direct", None, k) in per_coef:
r = per_coef[("direct", None, k)]
direct_rows.append({"k": k, **r})
else:
direct_rows.append(
{
"k": k,
"coef": float("nan"),
"se": float("nan"),
"t_stat": float("nan"),
"p_value": float("nan"),
"ci_low": float("nan"),
"ci_high": float("nan"),
"n_obs": 0,
}
)
att_dynamic_df = pd.DataFrame(direct_rows).set_index("k").sort_index() if direct_rows else None
# Build spillover_effects: rectangular over (ring_label, k) grid.
#
# PR #456 R1 fix (P3): the spillover grid must INCLUDE the reference
# period row per ring. The pre-filter in _build_event_study_design drops
# `ref_period` from the fitted column set, but the rectangular schema
# for spillover must still emit (ring, ref_period) with `coef=0.0,
# se=0.0, n_obs=0` for symmetry with the direct-effect series (which
# emits its reference row at k=ref_period). Without this, consumers
# iterating `[-H, ..., +H]` would hit a missing (ring, ref_period)
# slice — the registry promises rectangular emission over the full
# event-time grid.
all_spillover_ks = sorted(set(spillover_k_set) | {ref_period})
spillover_rows: List[Dict[str, Any]] = []
for ring_lab in ring_labels:
for k in all_spillover_ks:
if k == ref_period:
# Reference-period spillover row: 0-anchored (mirrors direct).
spillover_rows.append(
{
"ring": ring_lab,
"k": k,
"coef": 0.0,
"se": 0.0,
"t_stat": float("nan"),
"p_value": float("nan"),
"ci_low": 0.0,
"ci_high": 0.0,
"n_obs": 0,
}
)
continue
key = ("spillover", ring_lab, k)
if key in per_coef:
r = per_coef[key]
spillover_rows.append({"ring": ring_lab, "k": k, **r})
else:
spillover_rows.append(
{
"ring": ring_lab,
"k": k,
"coef": float("nan"),
"se": float("nan"),
"t_stat": float("nan"),
"p_value": float("nan"),
"ci_low": float("nan"),
"ci_high": float("nan"),
"n_obs": 0,
}
)
spillover_df = (
pd.DataFrame(spillover_rows).set_index(["ring", "k"]).sort_index()
if spillover_rows
else None
)
# Build event_study_effects dict (TwoStageDiD-compatible).
event_study_effects: Dict[int, Dict[str, Any]] = {}
for k in all_direct_ks:
if k == ref_period:
event_study_effects[k] = {
"effect": 0.0,
"se": 0.0,
"n_obs": 0,
"t_stat": float("nan"),
"p_value": float("nan"),
"conf_int": (0.0, 0.0),
}
elif ("direct", None, k) in per_coef:
r = per_coef[("direct", None, k)]
event_study_effects[k] = {
"effect": r["coef"],
"se": r["se"],
"n_obs": r["n_obs"],
"t_stat": r["t_stat"],
"p_value": r["p_value"],
"conf_int": (r["ci_low"], r["ci_high"]),
}
else:
event_study_effects[k] = {
"effect": float("nan"),
"se": float("nan"),
"n_obs": 0,
"t_stat": float("nan"),
"p_value": float("nan"),
"conf_int": (float("nan"), float("nan")),
}
# Scalar att via sample-share-weighted average over post-treatment direct
# coefficients (k >= 0). SE via linear-combination on the vcov submatrix
# of those kept columns.
#
# Fail-closed contract (PR #456 R1 fix): if ANY post-treatment direct
# coefficient is NaN (solve_ols dropped the column as rank-deficient),
# the aggregate is structurally unidentified. Set att = NaN with a
# warning rather than silently zeroing the dropped column's contribution
# via np.nansum (which would change the point estimate without
# renormalizing weights). Matches the library-wide
# `feedback_no_silent_failures` invariant.
post_direct_indices = [
i for i, (s, _, k) in enumerate(kept_col_meta) if s == "direct" and k >= 0
]
if post_direct_indices and vcov is not None:
n_obs_post = np.array([n_obs_per_col[i] for i in post_direct_indices], dtype=np.float64)
total_post_obs = n_obs_post.sum()
coefs_post = np.array([coef[i] for i in post_direct_indices], dtype=np.float64)
has_nan_post = bool(np.any(~np.isfinite(coefs_post)))
if has_nan_post:
warnings.warn(
"SpilloverDiD event-study: scalar `att` is NaN because at "
"least one post-treatment direct-effect coefficient was "
"dropped as rank-deficient (or otherwise non-finite). The "
"aggregate is unidentified under this design; inspect "
"`att_dynamic` for the per-event-time coefficients and "
"re-aggregate manually if appropriate.",
UserWarning,
stacklevel=2,
)
att = float("nan")
att_se = float("nan")
elif total_post_obs > 0:
weights = n_obs_post / total_post_obs