About & Data
Data Provenance — globalmilitary v1 conflict-losses tracker
Scope: BMRL-48 research deliverable. Seeds the v1 conflict-losses tracker with normalized, citable OSINT equipment-loss data. Personal / non-commercial homelab use only.
Clean-room rule: we do not scrape or mirror globalmilitary.net. Their compiled dataset is their product. We build our own schema and seed from independent, citable open OSINT (community Oryx mirrors, UCDP), with provenance on every record.
1. v1 conflict selection
| Conflict | Status | Rationale |
|---|---|---|
| Russian invasion of Ukraine (2022–present) | ✅ Included | Best-documented modern conflict for visually-confirmed equipment losses. The Oryx visual-confirmation methodology has multiple maintained, structured community mirrors (CSV/JSON), so we can ingest without touching the primary blog HTML. High coverage across all equipment categories and both belligerents. |
| Nagorno-Karabakh (2020) | ⏸ Deferred | Oryx covered it, but as a static blog post with no maintained structured community mirror. Ingesting it would mean scraping the Oryx post directly, which we avoid on clean-room/ToS grounds. Deferred until a citable structured mirror is identified or a manual hand-transcription pass is scheduled. |
| Other conflicts (Syria, Sudan, etc.) | ❌ Excluded from v1 | Sparse or non-systematic visual-confirmation datasets; would dilute data quality. |
v1 ships one conflict: Russia–Ukraine. This satisfies the "at least one conflict"
bar with the highest available data quality, and the schema + ingest path generalize to
additional conflicts by adding entries to data/raw/reference.json and
scripts/sources.json.
2. Sources
| Source | URL | Access | Coverage | Cadence | License / terms | Republish decision |
|---|---|---|---|---|---|---|
| leedrake5/Russia-Ukraine (Oryx mirror) | https://github.com/leedrake5/Russia-Ukraine | raw.githubusercontent.com CSV over HTTPS | Category-level cumulative losses, both belligerents, by status (destroyed/abandoned/captured/damaged), daily snapshots | Daily (community-maintained) | Repository: MIT (code + CSV layout). Underlying data: derived from Oryx, which is open-source visual confirmation published non-commercially. | OK to republish for non-commercial homelab use, with attribution to both the mirror and Oryx. Not for commercial use. |
| Oryx, "Attack On Europe: Documenting Equipment Losses…" (upstream) | https://www.oryxspioenkop.com/2022/02/attack-on-europe-documenting-equipment.html | Not fetched; recorded as evidence_ref only | Primary visual-confirmation list (per-unit photo/video evidence) | Continuous | Blog content; non-commercial attribution; methodology = independently verified photo/video evidence only | Cited as upstream evidence, not scraped. Attribution required on any published view. |
| UCDP (Uppsala Conflict Data Program) | https://ucdp.uu.se/ | Web / API (used here for conflict metadata only) | Conflict identity, start dates, actors | Annual + live | CC-BY 4.0 — freely republishable with attribution | OK to republish with attribution. Used to corroborate conflict/belligerent metadata in reference.json. |
| ACLED | https://acleddata.com/ | Free account → CSV | Event-level armed-conflict events (not equipment losses) | Weekly | CC-BY-NC (non-commercial) | Acceptable for homelab; not used in v1 seed (event data, not equipment losses). Candidate for a future conflict-timeline overlay. |
| WarSpotting | https://ukr.warspotting.net/ | HTML; no documented bulk export/API at evaluation time | Photo-confirmed losses, per-unit | Continuous | ToS unclear for bulk reuse | Excluded from v1 pending an explicit export path + ToS confirmation. Per-unit granularity makes it a strong future source. |
License bottom line: the v1 seed combines an MIT-licensed mirror of non-commercial Oryx visual-confirmation data with CC-BY UCDP metadata. The republish decision for the live site is non-commercial only, with source attribution visible. This matches the project's stated personal/non-commercial homelab scope.
3. Methodology
Field mapping (source → our schema)
The seed source is one pinned daily snapshot:
data/raw/leedrake5_byType_2024-11-19.csv (sha256 pinned in scripts/sources.json).
Its rows are category-level cumulative totals per country.
| Source column | Our target | Notes |
|---|---|---|
| country (Russia / Ukraine) | belligerent | Mapped via reference.json → Russian Federation (invader, RU) / Ukraine (defender, UA). |
| equipment_type (e.g. Self-Propelled Artillery) | equipment_type.name + category | Category bucket assigned by reference.json category_map. |
| destroyed / damaged / abandoned / captured | one loss_record per non-zero status | status + count. Zero-count statuses are dropped. |
| (snapshot date) | loss_record.event_period | Set to 2022-02-24/2024-11-19 (conflict start → snapshot). These are cumulative aggregates, so event_date is null. |
| (mirror URL) | loss_record.source_url | Never null (provenance gate). |
| (Oryx post) | loss_record.evidence_ref | Upstream visual-confirmation source. |
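The field mapping above can be sketched in a few lines of Python. This is a minimal illustration, not the actual code in scripts/ingest.py: the function name `row_to_loss_records`, the dict layout, and the counts in `example_row` are assumptions made for the example, and the real belligerent mapping lives in reference.json rather than in an inline dict.

```python
from typing import Iterator

STATUSES = ("destroyed", "damaged", "abandoned", "captured")

# Codes taken from the mapping table; in the project they come from reference.json.
BELLIGERENT_CODES = {"Russia": "RU", "Ukraine": "UA"}


def row_to_loss_records(row: dict, source_url: str, evidence_ref: str,
                        event_period: str) -> Iterator[dict]:
    """Yield one loss_record dict per non-zero status in a source row."""
    if not source_url:
        # Provenance gate: never emit a record without a source URL.
        raise ValueError("source_url must not be empty")
    for status in STATUSES:
        count = int(row.get(status) or 0)  # empty cell counts as zero
        if count == 0:
            continue  # zero-count statuses are dropped
        yield {
            "belligerent": BELLIGERENT_CODES[row["country"]],
            "equipment_type": row["equipment_type"],
            "status": status,
            "count": count,
            "event_date": None,  # cumulative aggregate, so no single date
            "event_period": event_period,
            "source_url": source_url,
            "evidence_ref": evidence_ref,
        }


# Hypothetical row: the counts are invented for illustration only.
example_row = {
    "country": "Russia",
    "equipment_type": "Self-Propelled Artillery",
    "destroyed": "3", "damaged": "0", "abandoned": "", "captured": "1",
}
records = list(row_to_loss_records(
    example_row,
    source_url="https://github.com/leedrake5/Russia-Ukraine",
    evidence_ref="https://www.oryxspioenkop.com/2022/02/attack-on-europe-documenting-equipment.html",
    event_period="2022-02-24/2024-11-19",
))
```

With this input, only the `destroyed` and `captured` statuses produce records; the zero and empty cells are skipped.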
Category normalization
Source category names are mapped to our fixed enum
(aircraft, armour, naval, artillery, air_defence, logistics, infantry, uav, other)
in reference.json → category_map. The two belligerents use slightly different source
labels for some buckets (e.g. Ukraine Radars And Communications Equipment vs Russia
Radars; Ukraine Naval Ships vs Russia Naval Ships and Submarines). We preserve
the source name as equipment_type.name (provenance fidelity) and unify only the
category. Any source category absent from the map is skipped with a warning — it
never silently lands in the seed.
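A rough sketch of the skip-with-warning behaviour described above, under stated assumptions: the `normalize` helper is hypothetical, and the `category_map` excerpt is illustrative. Only the `other` assignment for Jammers And Deception Systems is stated in this document; the `naval` assignments are assumed for the example.

```python
import warnings

CATEGORIES = {"aircraft", "armour", "naval", "artillery", "air_defence",
              "logistics", "infantry", "uav", "other"}

# Illustrative excerpt of a category map (partly assumed, see lead-in).
category_map = {
    "Naval Ships": "naval",
    "Naval Ships and Submarines": "naval",
    "Jammers And Deception Systems": "other",
}


def normalize(source_name: str, mapping: dict):
    """Return name + unified category, or None (with a warning) if unmapped."""
    category = mapping.get(source_name)
    if category is None:
        warnings.warn(f"unmapped source category skipped: {source_name!r}")
        return None
    if category not in CATEGORIES:
        raise ValueError(f"{category!r} is not in the fixed enum")
    # The source label is preserved verbatim; only the category is unified.
    return {"name": source_name, "category": category}


ukraine_naval = normalize("Naval Ships", category_map)
russia_naval = normalize("Naval Ships and Submarines", category_map)
```

Both belligerents' naval labels end up in the same category while keeping their distinct source names.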
Aggregate-row exclusion
The Oryx-derived byType files include roll-up rows (All Types, Losses excluding Recon Drones and Trucks…, Losses of Armoured Combat Vehicles…). These are sums, not categories,
and are excluded via reference.json → summary_rows to avoid double counting.
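To see why the exclusion matters, here is a small sketch with invented numbers. The `drop_summary_rows` helper and the "Tanks"/"Trucks" rows are hypothetical. Only the All Types label is used, because the other roll-up names are truncated in the text above.

```python
# Only "All Types" is written out in full in this document; the truncated
# roll-up labels are deliberately left out of this sketch.
SUMMARY_ROWS = {"All Types"}


def drop_summary_rows(rows, summary_rows=SUMMARY_ROWS):
    """Keep only real category rows, discarding roll-up (sum) rows."""
    return [r for r in rows if r["equipment_type"] not in summary_rows]


# Made-up counts: the roll-up row equals the sum of the two categories.
rows = [
    {"equipment_type": "All Types", "destroyed": 5},
    {"equipment_type": "Tanks", "destroyed": 3},
    {"equipment_type": "Trucks", "destroyed": 2},
]

naive_total = sum(r["destroyed"] for r in rows)                      # 10, double counted
clean_total = sum(r["destroyed"] for r in drop_summary_rows(rows))   # 5
```

Summing without the filter counts every loss twice, once in its category and once in the roll-up.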
Deduplication & determinism
scripts/ingest.py is idempotent: IDs are assigned from sorted natural keys, output rows
are sorted, and ingested_at is derived from the source snapshot_date (not wall-clock),
so two runs produce byte-identical CSVs. python3 scripts/ingest.py --check rebuilds to a
temp dir and diffs against the committed seed (CI gate against drift).
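The determinism property can be illustrated with a short sketch. It is an assumption-laden stand-in for scripts/ingest.py, not a copy of it: the `build_seed` function, the column list, and the choice of (belligerent, equipment_type, status) as the natural key are all hypothetical.

```python
import csv
import io


def build_seed(records, snapshot_date: str) -> bytes:
    """Serialize records to CSV bytes that do not depend on input order."""
    def natural_key(r):
        return (r["belligerent"], r["equipment_type"], r["status"])

    ordered = sorted(records, key=natural_key)
    buf = io.StringIO(newline="")
    # Fixed line terminator so output is identical across platforms.
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["id", "belligerent", "equipment_type",
                     "status", "count", "ingested_at"])
    for record_id, r in enumerate(ordered, start=1):
        # ingested_at comes from the snapshot date, never from the clock.
        writer.writerow([record_id, r["belligerent"], r["equipment_type"],
                         r["status"], r["count"], snapshot_date])
    return buf.getvalue().encode("utf-8")


# Invented sample records, supplied in two different orders.
sample = [
    {"belligerent": "UA", "equipment_type": "Self-Propelled Artillery",
     "status": "destroyed", "count": 2},
    {"belligerent": "RU", "equipment_type": "Self-Propelled Artillery",
     "status": "destroyed", "count": 3},
    {"belligerent": "RU", "equipment_type": "Self-Propelled Artillery",
     "status": "captured", "count": 1},
]
first = build_seed(sample, "2024-11-19")
second = build_seed(list(reversed(sample)), "2024-11-19")
identical = first == second
```

Because IDs are assigned after sorting and no wall-clock value is written, the two runs yield the same bytes, which is what a `--check`-style diff relies on.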
4. Data quality notes
- Cumulative, not per-event. v1 loss records are category-level cumulative totals as of the snapshot date, not individual dated loss events. The schema supports per-event rows (`event_date` set) for finer future sources; v1 uses `event_period`.
- Oryx counting convention. Oryx counts individual visually-confirmed losses (one photo/video per unit), which is a methodological floor, not a battle-claim total. Real losses are higher than visually confirmed losses. Never present these as total attrition.
- `damaged` is conservative. Oryx records "damaged" only with clear visual evidence; many damaged-then-repaired units are not counted.
- Date ambiguity. The cumulative snapshot has no per-record event dates; the rollup view buckets everything into the snapshot month. Per-event dating requires a per-unit source (e.g. WarSpotting), which is future work.
- Snapshot is frozen at 2024-11-19. Numbers are stale relative to live Oryx. Refresh via the documented path (§6) to advance the snapshot.
- Naval/EW edge categories map to `other` where no clean enum bucket fits (e.g. `Jammers And Deception Systems`, `Unmanned Ground Vehicles`).
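The date-ambiguity note can be made concrete with a small sketch. It assumes the `event_period` value is an ISO 8601 `start/end` interval, as in the mapping table, and that the rollup bucket is the month of the interval's end date. The helper names are hypothetical and the real rollup is a SQL view, so treat this as an illustration only.

```python
from datetime import date


def parse_event_period(period: str) -> tuple[date, date]:
    """Split a 'YYYY-MM-DD/YYYY-MM-DD' interval into start and end dates."""
    start_text, end_text = period.split("/")
    start = date.fromisoformat(start_text)
    end = date.fromisoformat(end_text)
    if end < start:
        raise ValueError("event_period ends before it starts")
    return start, end


def rollup_month(period: str) -> str:
    """Bucket a cumulative record into the month of its snapshot (end) date."""
    _, end = parse_event_period(period)
    return end.strftime("%Y-%m")


bucket = rollup_month("2022-02-24/2024-11-19")  # "2024-11"
```

Every v1 record shares the same interval, so they all land in one bucket; a per-unit source with real event dates would spread across months.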
v1 seed totals (snapshot 2024-11-19): 166 loss records, 27 equipment types, 1 conflict, 2 belligerents, ~26,255 visually-confirmed unit losses accounted for.
5. Non-commercial use note
This dataset is assembled for personal, non-commercial homelab use. The underlying
Oryx visual-confirmation data is published non-commercially; UCDP is CC-BY. Any
deployed view of this data (e.g. globalmilitary.blackmesalabs.org) must:
- Display visible attribution to Oryx (primary visual confirmation) and the leedrake5/Russia-Ukraine mirror.
- Attribute UCDP for conflict metadata.
- Carry a non-commercial-use notice and a note that figures are visually-confirmed minima, not total losses.
Do not repackage or sell this dataset. Do not scrape globalmilitary.net.
6. Refresh cadence
The seed is a pinned snapshot for reproducibility. To advance it:
```shell
# 1. Edit scripts/sources.json: bump snapshot_date + url to a newer byType file
#    (https://github.com/leedrake5/Russia-Ukraine/tree/main/data/byType),
#    and update `local` to data/raw/leedrake5_byType_<DATE>.csv
# 2. Re-fetch, verify, re-pin checksums, and rewrite the seed:
python3 scripts/ingest.py --refresh
# 3. Sanity-check & confirm idempotency:
python3 scripts/ingest.py --check
# 4. Review the diff in data/seed/ and data/raw/, then commit.
```
--refresh fetches with a browser User-Agent (several OSINT mirrors return HTTP 403
to default client user agents) and re-pins the sha256 in sources.json. Default runs
are offline: they read the committed data/raw/ snapshot and verify its checksum, so
the build is reproducible without network access.
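As a hedged sketch of the two behaviours just described, the snippet below builds a request carrying a browser-style User-Agent (without sending it) and verifies a local file against a pinned sha256. The function names, the User-Agent string, and the temp-file demo are assumptions for illustration; the actual logic in scripts/ingest.py may differ.

```python
import hashlib
import tempfile
import urllib.request
from pathlib import Path

# Assumed header value; any browser-like string would serve the same purpose.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) homelab-ingest"


def build_request(url: str) -> urllib.request.Request:
    """Prepare (but do not send) a GET request with a browser User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large snapshots need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_snapshot(path: Path, expected_sha256: str) -> None:
    """Raise if the committed snapshot no longer matches its pinned checksum."""
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise ValueError(f"checksum mismatch for {path}: got {actual}")


# Offline demo on a throwaway file standing in for a data/raw/ snapshot.
with tempfile.TemporaryDirectory() as tmp:
    snapshot = Path(tmp) / "snapshot.csv"
    snapshot.write_bytes(b"country,equipment_type,destroyed\n")
    pinned = hashlib.sha256(b"country,equipment_type,destroyed\n").hexdigest()
    verify_snapshot(snapshot, pinned)  # passes silently
    try:
        verify_snapshot(snapshot, "0" * 64)
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True
```

Keeping verification separate from fetching is what lets the default run stay offline: only the checksum comparison is needed to trust the committed file.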
Suggested cadence: monthly, or before a deploy. Oryx itself updates daily; we pin to avoid a moving dataset under the app.
File map
| Path | What |
|---|---|
| data/schema/schema.sql | SQLite DDL (4 tables + loss_rollup view) |
| data/schema/schema.json | TypeScript-compatible JSON Schema mirror |
| data/raw/leedrake5_byType_2024-11-19.csv | Pinned upstream snapshot (sha256 in sources.json) |
| data/raw/reference.json | Hand-authored conflict/belligerent metadata + category map |
| data/seed/*.csv | Generated normalized seed (conflicts, belligerents, equipment_types, loss_records) |
| scripts/sources.json | Pinned source URLs, licenses, checksums |
| scripts/ingest.py | Idempotent ingest (stdlib only; --refresh, --check) |