About & Data

Data Provenance — globalmilitary v1 conflict-losses tracker

Scope: BMRL-48 research deliverable. Seeds the v1 conflict-losses tracker with normalized, citable OSINT equipment-loss data. Personal / non-commercial homelab use only.

Clean-room rule: we do not scrape or mirror globalmilitary.net. Their compiled dataset is their product. We build our own schema and seed from independent, citable open OSINT (community Oryx mirrors, UCDP), with provenance on every record.


1. v1 conflict selection

| Conflict | Status | Rationale |
| --- | --- | --- |
| Russian invasion of Ukraine (2022–present) | ✅ Included | Best-documented modern conflict for visually-confirmed equipment losses. The Oryx visual-confirmation methodology has multiple maintained, structured community mirrors (CSV/JSON), so we can ingest without touching the primary blog HTML. High coverage across all equipment categories and both belligerents. |
| Nagorno-Karabakh (2020) | ⏸ Deferred | Oryx covered it, but as a static blog post with no maintained structured community mirror. Ingesting it would mean scraping the Oryx post directly, which we avoid on clean-room/ToS grounds. Deferred until a citable structured mirror is identified or a manual hand-transcription pass is scheduled. |
| Other conflicts (Syria, Sudan, etc.) | ❌ Excluded from v1 | Sparse or non-systematic visual-confirmation datasets; would dilute data quality. |

v1 ships one conflict: Russia–Ukraine. This satisfies the "at least one conflict" bar with the highest available data quality, and the schema + ingest path generalize to additional conflicts by adding entries to data/raw/reference.json and scripts/sources.json.


2. Sources

| Source | URL | Access | Coverage | Cadence | License / terms | Republish decision |
| --- | --- | --- | --- | --- | --- | --- |
| leedrake5/Russia-Ukraine (Oryx mirror) | https://github.com/leedrake5/Russia-Ukraine | raw.githubusercontent.com CSV over HTTPS | Category-level cumulative losses, both belligerents, by status (destroyed/abandoned/captured/damaged), daily snapshots | Daily (community-maintained) | Repository: MIT (code + CSV layout). Underlying data: derived from Oryx, which is open-source visual confirmation published non-commercially. | OK to republish for non-commercial homelab use, with attribution to both the mirror and Oryx. Not for commercial use. |
| Oryx, "Attack On Europe: Documenting Equipment Losses…" (upstream) | https://www.oryxspioenkop.com/2022/02/attack-on-europe-documenting-equipment.html | Not fetched; recorded as evidence_ref only | Primary visual-confirmation list (per-unit photo/video evidence) | Continuous | Blog content; non-commercial attribution; methodology = independently verified photo/video evidence only | Cited as upstream evidence, not scraped. Attribution required on any published view. |
| UCDP (Uppsala Conflict Data Program) | https://ucdp.uu.se/ | Web / API (used here for conflict metadata only) | Conflict identity, start dates, actors | Annual + live | CC-BY 4.0; freely republishable with attribution | OK to republish with attribution. Used to corroborate conflict/belligerent metadata in reference.json. |
| ACLED | https://acleddata.com/ | Free account → CSV | Event-level armed-conflict events (not equipment losses) | Weekly | CC-BY-NC (non-commercial) | Acceptable for homelab; not used in v1 seed (event data, not equipment losses). Candidate for a future conflict-timeline overlay. |
| WarSpotting | https://ukr.warspotting.net/ | HTML; no documented bulk export/API at evaluation time | Photo-confirmed losses, per-unit | Continuous | ToS unclear for bulk reuse | Excluded from v1 pending an explicit export path + ToS confirmation. Per-unit granularity makes it a strong future source. |

License bottom line: the v1 seed combines an MIT-licensed mirror of non-commercial Oryx visual-confirmation data with CC-BY UCDP metadata. The live site may therefore republish the data for non-commercial use only, with source attribution visible. This matches the project's stated personal/non-commercial homelab scope.


3. Methodology

Field mapping (source → our schema)

The seed source is one pinned daily snapshot: data/raw/leedrake5_byType_2024-11-19.csv (sha256 pinned in scripts/sources.json). Its rows are category-level cumulative totals per country.

| Source column | Our target | Notes |
| --- | --- | --- |
| country (Russia / Ukraine) | belligerent | Mapped via reference.json: Russian Federation (invader, RU) / Ukraine (defender, UA). |
| equipment_type (e.g. Self-Propelled Artillery) | equipment_type.name + category | Category bucket assigned by reference.json category_map. |
| destroyed / damaged / abandoned / captured | one loss_record per non-zero status | status + count. Zero-count statuses are dropped. |
| (snapshot date) | loss_record.event_period | Set to 2022-02-24/2024-11-19 (conflict start → snapshot). These are cumulative aggregates, so event_date is null. |
| (mirror URL) | loss_record.source_url | Never null (provenance gate). |
| (Oryx post) | loss_record.evidence_ref | Upstream visual-confirmation source. |
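
The mapping above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/ingest.py: the function name row_to_loss_records, the output dict keys, and the sample row values are assumptions made for the example, while the source column names, the event period, and the two URLs come from the tables in this document.

```python
from typing import Iterator

STATUSES = ("destroyed", "damaged", "abandoned", "captured")

MIRROR_URL = "https://github.com/leedrake5/Russia-Ukraine"
ORYX_URL = ("https://www.oryxspioenkop.com/2022/02/"
            "attack-on-europe-documenting-equipment.html")
EVENT_PERIOD = "2022-02-24/2024-11-19"  # conflict start / snapshot date


def row_to_loss_records(row: dict) -> Iterator[dict]:
    """Yield one loss record per non-zero status column of a source row."""
    for status in STATUSES:
        count = int(row.get(status) or 0)  # empty or missing cell counts as 0
        if count == 0:
            continue  # zero-count statuses are dropped
        yield {
            "belligerent": row["country"],
            "equipment_type": row["equipment_type"],
            "status": status,
            "count": count,
            "event_date": None,        # cumulative aggregate, no per-event date
            "event_period": EVENT_PERIOD,
            "source_url": MIRROR_URL,  # provenance gate: never null
            "evidence_ref": ORYX_URL,
        }


# Made-up row for illustration only; not real loss figures.
row = {"country": "Russia", "equipment_type": "Tanks",
       "destroyed": "10", "damaged": "0", "abandoned": "2", "captured": ""}
records = list(row_to_loss_records(row))
```

In this sketch the sample row yields two records (destroyed and abandoned); the zero and empty status cells produce nothing.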

Category normalization

Source category names are mapped to our fixed enum (aircraft, armour, naval, artillery, air_defence, logistics, infantry, uav, other) by category_map in reference.json. The two belligerents use slightly different source labels for some buckets (e.g. "Radars And Communications Equipment" for Ukraine vs "Radars" for Russia; "Naval Ships" for Ukraine vs "Naval Ships and Submarines" for Russia). We preserve the source name as equipment_type.name (provenance fidelity) and unify only the category. Any source category absent from the map is skipped with a warning; it never silently lands in the seed.
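
A minimal sketch of that lookup, assuming category_map is a flat name-to-enum mapping. The CATEGORY_MAP excerpt and the normalize helper are hypothetical; the real map lives in data/raw/reference.json and may be shaped differently.

```python
import warnings
from typing import Optional

# Hypothetical excerpt; the real category_map is in data/raw/reference.json.
CATEGORY_MAP = {
    "Self-Propelled Artillery": "artillery",
    "Naval Ships": "naval",
    "Naval Ships and Submarines": "naval",
    "Jammers And Deception Systems": "other",
}


def normalize(source_name: str) -> Optional[dict]:
    """Keep the source label as the name and unify only the category."""
    category = CATEGORY_MAP.get(source_name)
    if category is None:
        # Unmapped categories are skipped loudly rather than guessed at.
        warnings.warn(f"unmapped source category skipped: {source_name!r}")
        return None
    return {"name": source_name, "category": category}


kept = [t for t in map(normalize, ["Naval Ships", "Naval Ships and Submarines"]) if t]
```

Both labels keep their own equipment_type.name but land in the same naval bucket, which is the behaviour described above.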

Aggregate-row exclusion

The mirror's byType files include roll-up rows (All Types, Losses excluding Recon Drones and Trucks…, Losses of Armoured Combat Vehicles…). These are sums, not categories, and are excluded via summary_rows in reference.json to avoid double counting.
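
To illustrate why the exclusion matters, here is a small sketch with a made-up three-row CSV. The column layout is assumed from the field-mapping table, the numbers are invented, and SUMMARY_ROWS holds only the one roll-up label that is spelled out in full above; the real summary_rows list has more entries.

```python
import csv
import io

# Only the fully spelled-out roll-up label; the real list in reference.json is longer.
SUMMARY_ROWS = {"All Types"}

# Invented numbers, for illustration only.
SAMPLE = """country,equipment_type,destroyed,damaged,abandoned,captured
Russia,All Types,12,1,2,3
Russia,Tanks,10,1,2,3
Russia,Radars,2,0,0,0
"""

all_rows = list(csv.DictReader(io.StringIO(SAMPLE)))
rows = [r for r in all_rows if r["equipment_type"] not in SUMMARY_ROWS]

naive_total = sum(int(r["destroyed"]) for r in all_rows)  # counts the roll-up too
deduped_total = sum(int(r["destroyed"]) for r in rows)    # categories only
```

With the roll-up row left in, the destroyed total doubles (24 instead of 12 in this sample), which is the double counting the exclusion prevents.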

Deduplication & determinism

scripts/ingest.py is idempotent: IDs are assigned from sorted natural keys, output rows are sorted, and ingested_at is derived from the source snapshot_date (not wall-clock), so two runs produce byte-identical CSVs. python3 scripts/ingest.py --check rebuilds to a temp dir and diffs against the committed seed (CI gate against drift).
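
The determinism idea can be sketched as below. This is an illustration under assumptions, not the actual ingest code: the natural key (belligerent, equipment type, status), the FIELDS column order, and the build_csv helper are all hypothetical.

```python
import csv
import io

# Assumed column order for the sketch.
FIELDS = ["id", "belligerent", "equipment_type", "status", "count", "ingested_at"]


def natural_key(rec: dict) -> tuple:
    """Assumed natural key: one record per belligerent, type, and status."""
    return (rec["belligerent"], rec["equipment_type"], rec["status"])


def build_csv(records: list, snapshot_date: str) -> bytes:
    """Serialize records so that input order and wall-clock time do not matter."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    # IDs follow the sorted natural key, so they are stable across runs.
    for new_id, rec in enumerate(sorted(records, key=natural_key), start=1):
        writer.writerow({**rec, "id": new_id, "ingested_at": snapshot_date})
    return buf.getvalue().encode("utf-8")


# Made-up records, fed in two different orders.
records = [
    {"belligerent": "UA", "equipment_type": "Tanks", "status": "destroyed", "count": 3},
    {"belligerent": "RU", "equipment_type": "Tanks", "status": "captured", "count": 5},
]
first = build_csv(records, "2024-11-19")
second = build_csv(list(reversed(records)), "2024-11-19")
```

Because first and second are byte-identical, a check mode can rebuild into a temporary directory and compare bytes against the committed seed.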


4. Data quality notes

  • Cumulative, not per-event. v1 loss records are category-level cumulative totals as of the snapshot date, not individual dated loss events. The schema supports per-event rows (event_date set) for finer future sources; v1 uses event_period.
  • Oryx counting convention. Oryx counts individual visually-confirmed losses (one photo/video per unit), which is a methodological floor, not a battle-claim total. Real losses are higher than visually confirmed losses. Never present these as total attrition.
  • damaged is conservative. Oryx records "damaged" only with clear visual evidence; many damaged-then-repaired units are not counted.
  • Date ambiguity. The cumulative snapshot has no per-record event dates; the rollup view buckets everything into the snapshot month. Per-event dating requires a per-unit source (e.g. WarSpotting) — future work.
  • Snapshot is frozen at 2024-11-19. Numbers are stale relative to live Oryx. Refresh via the documented path (§6) to advance the snapshot.
  • Naval/EW edge categories map to other where no clean enum bucket fits (e.g. Jammers And Deception Systems, Unmanned Ground Vehicles).

v1 seed totals (snapshot 2024-11-19): 166 loss records, 27 equipment types, 1 conflict, 2 belligerents, ~26,255 visually-confirmed unit-losses accounted.


5. Non-commercial use note

This dataset is assembled for personal, non-commercial homelab use. The underlying Oryx visual-confirmation data is published non-commercially; UCDP is CC-BY. Any deployed view of this data (e.g. globalmilitary.blackmesalabs.org) must:

  1. Display visible attribution to Oryx (primary visual confirmation) and the leedrake5/Russia-Ukraine mirror.
  2. Attribute UCDP for conflict metadata.
  3. Carry a non-commercial-use notice and a note that figures are visually-confirmed minima, not total losses.

Do not repackage or sell this dataset. Do not scrape globalmilitary.net.


6. Refresh cadence

The seed is a pinned snapshot for reproducibility. To advance it:

# 1. Edit scripts/sources.json: bump snapshot_date + url to a newer byType file
#    (https://github.com/leedrake5/Russia-Ukraine/tree/main/data/byType),
#    and update `local` to data/raw/leedrake5_byType_<DATE>.csv
# 2. Re-fetch, verify, re-pin checksums, and rewrite the seed:
python3 scripts/ingest.py --refresh
# 3. Sanity-check & confirm idempotency:
python3 scripts/ingest.py --check
# 4. Review the diff in data/seed/ and data/raw/, then commit.

--refresh fetches with a browser User-Agent (several OSINT mirrors 403 default agents) and re-pins the sha256 in sources.json. Default runs are offline — they read the committed data/raw/ snapshot and verify its checksum, so the build is reproducible without network access.
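
A rough sketch of the two halves of that behaviour: an offline checksum check against a pinned digest, and a fetch that sends a browser-style User-Agent. The helper names, the User-Agent string, and the temporary-file demo are assumptions for illustration; the real logic and the pinned values live in scripts/ingest.py and scripts/sources.json.

```python
import hashlib
import tempfile
import urllib.request
from pathlib import Path

BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64)"  # assumed value, not the script's


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large snapshots need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pinned(path: Path, expected_sha256: str) -> None:
    """Raise if the committed snapshot no longer matches its pinned checksum."""
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise ValueError(f"{path}: sha256 {actual} != pinned {expected_sha256}")


def fetch(url: str) -> bytes:
    """Network path (refresh only): request with a browser-style User-Agent."""
    req = urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()


# Offline demo on a throwaway file; no network access needed.
with tempfile.TemporaryDirectory() as tmp:
    snapshot = Path(tmp) / "snapshot.csv"
    payload = b"country,equipment_type\n"
    snapshot.write_bytes(payload)
    pinned = hashlib.sha256(payload).hexdigest()
    verify_pinned(snapshot, pinned)  # matches, so nothing is raised
    snapshot.write_bytes(b"tampered\n")
    try:
        verify_pinned(snapshot, pinned)
        drift_detected = False
    except ValueError:
        drift_detected = True
```

Keeping the fetch separate from the verification is what lets default runs stay offline: only the refresh path ever calls the network function.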

Suggested cadence: monthly, or before a deploy. The mirror publishes daily snapshots and Oryx itself updates continuously; we pin one snapshot to avoid a moving dataset under the app.


File map

| Path | What |
| --- | --- |
| data/schema/schema.sql | SQLite DDL (4 tables + loss_rollup view) |
| data/schema/schema.json | TypeScript-compatible JSON Schema mirror |
| data/raw/leedrake5_byType_2024-11-19.csv | Pinned upstream snapshot (sha256 in sources.json) |
| data/raw/reference.json | Hand-authored conflict/belligerent metadata + category map |
| data/seed/*.csv | Generated normalized seed (conflicts, belligerents, equipment_types, loss_records) |
| scripts/sources.json | Pinned source URLs, licenses, checksums |
| scripts/ingest.py | Idempotent ingest (stdlib only; --refresh, --check) |