A transformer trained on 2.2 million soccer events learned to predict the NWSL Championship. This investigation asks: what did it actually learn — and what did the encoding apparatus decide for it before training began?
EVE-2 is a 26.2 million parameter transformer trained on nine seasons of NWSL match events. It reads sequences of passes, tackles, shots, and clearances — each discretized into eight fields, 114 tokens total — and predicts the next event. The training objective is pure next-event prediction: cross-entropy loss with self-conditioned regularization.
The results are striking. The model produces match simulations within 6% of real shot rates. Event distributions diverge from reality by only JSD 0.026. Applied to the 2025 Championship Final — a match it never trained on — it reversed the market prediction, correctly favoring Gotham FC over the higher-seeded Washington Spirit. The actual scoreline (0–1 Gotham) was the model's second most probable outcome.
This investigation raises three questions — not as dismissals, but as productive problems worth keeping alive. Each section that follows makes one of these problems interactive.
Tom writes that the model "never asks why — it only asks what comes next." But multi-head self-attention computes far more than next-token statistics. Each of the six heads in each of the 16 layers can learn a different relational structure — possession rhythm, spatial flow, pressing signatures. The model's championship prediction suggests it is encoding something deeper than the paper's philosophical frame admits.
Flusser taught us that the apparatus programs the possible outputs before the operator touches it. EVE-2's tokenization — 41 event types, 10×10 grid, 9 time buckets — is an enormous act of human curation. The conceptual work of deciding what counts as an "event" is done before gradient descent begins. How much of the model's knowledge was given to it by the encoding, and how much was computed?
With 114 tokens, 128-event context windows, and a domain where 55.8% of events are passes, the effective state space is vastly smaller than the theoretical maximum. A 26.2M parameter model looking at 2.2M events from a highly regular domain — is it generalizing, or is it a sophisticated lookup table? The game state blindness (49.7% probe accuracy) suggests the internal representations may be shallower than the architecture permits.
Tom McMillan
Model and simulator: sim.nwslnotebook.com
An NWSL soccer match generates roughly 1,600 observable events: passes, tackles, shots, clearances, and so on. Data providers record and label these events in granular detail. This interested me. So I trained a transformer on 2.2 million NWSL match events. A transformer is a type of neural network that learns sequential structure — the same architecture that underlies large language models such as OpenAI's GPT. I named my model EVE-2, short for "Event".
A transformer learns by predicting the next event from the previous ones. It never asks why — it only asks what comes next. And yet from that purely local, purely statistical question, something that looks like understanding of the game emerges. That understanding is the main subject of this paper.
This paper proceeds in four parts. The first describes what the event stream contains and what it cannot. The second describes the choices made to extract knowledge from that data. The third measures how far that knowledge extends. The fourth applies the model to the 2025 NWSL Championship Final and locates the boundary in a specific match, a specific moment, and a specific player. The boundary is known. The boundary is a finding.
The event-stream analytics lineage this paper builds on was established by Decroos et al., who introduced SPADL — a unified language for representing soccer actions — and VAEP, a framework for valuing those actions by their impact on match outcomes. EVE-2 extends this lineage from measurement to generation: rather than valuing what happened, it generates what might happen next.
The closest prior work in generative soccer modeling is TacticAI, a generative model developed by DeepMind and Liverpool FC that produces tactical recommendations for corner kicks. TacticAI operates on tracking data — the positions of all players at high frequency — and is designed for a specific set piece situation. EVE-2 operates on event streams across full matches without tracking data. The two systems occupy different data regimes and ask different questions: TacticAI asks what a team should do next; EVE-2 asks what a team is likely to do next.
The philosophical framing of what a model can and cannot know from its training data draws on Foltz-Smith.
EVE-2 was trained on nine seasons of NWSL match events. This section describes that data — what it contains, how it was prepared, and why it comes from this league alone.
Opta records every on-ball action in an NWSL match as a structured event. Each event has a type, a location on the pitch, an outcome, a timestamp, and a set of qualifiers that describe additional context — whether a pass was a long ball, whether a shot was a big chance, which body part was used. A typical match contains approximately 1,628 events. The full corpus spans 1,381 matches across ten seasons, 2016 through 2025, totaling 2,248,532 events.
There are 41 distinct event types. Passes are the most common, accounting for 55.8% of all events. Goals account for 0.17%. Thirteen non-gameplay event types — match lifecycle records such as start, end, and formation change — are removed during preprocessing. What remains is a complete sequential record of every on-ball action in every match.
The raw Opta feed is enriched with SPADL features before training. SPADL contributes five additional fields per event — action type, result, body part, and the spatial endpoint coordinates — in a standardized schema. SPADL coverage is 75.2%: not every Opta event maps to a SPADL action. Events without a SPADL match receive a null token.
These eight fields are embedded into a single 384-dimensional vector. The model sees a window of 128 such vectors — the last 128 events — and predicts the next event's eight fields simultaneously.
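As a concrete sketch of this encoding step, the snippet below assumes a hypothetical field layout — the paper fixes 41 event types, a 10×10 grid, and 9 time buckets, but the remaining field sizes here are my guesses, chosen only so the shared vocabulary totals 114 tokens — and sums per-field embeddings into one vector:

```python
import random

# Hypothetical field layout. Event types (41), x/y grid bins (10 each), and
# time buckets (9) are from the paper; the other four sizes are assumptions
# chosen so the shared vocabulary totals 114 tokens.
FIELD_SIZES = {
    "event_type": 41, "x_bin": 10, "y_bin": 10, "time_bucket": 9,
    "outcome": 2, "team_side": 2, "spadl_type": 30, "body_part": 10,
}
OFFSETS, VOCAB = {}, 0
for name, size in FIELD_SIZES.items():
    OFFSETS[name] = VOCAB   # each field gets its own slice of the vocabulary
    VOCAB += size

D_MODEL = 384
random.seed(0)
EMBED = [[random.gauss(0.0, 0.02) for _ in range(D_MODEL)] for _ in range(VOCAB)]

def encode_event(event):
    """Sum the eight per-field embedding rows into one 384-d vector."""
    vec = [0.0] * D_MODEL
    for name, value in event.items():
        row = EMBED[OFFSETS[name] + value]
        vec = [v + r for v, r in zip(vec, row)]
    return vec

completed_pass = {"event_type": 0, "x_bin": 4, "y_bin": 5, "time_bucket": 2,
                  "outcome": 1, "team_side": 0, "spadl_type": 0, "body_part": 1}
vec = encode_event(completed_pass)
```

The model then reads a window of 128 such vectors and predicts all eight fields of event 129.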
The data is split by season. Seasons 2016 through 2022 form the training set: 1,611,422 events across approximately 808 matches. The 2023 season is validation: 315,645 events across approximately 176 matches. The 2024 season is the test set: 296,679 events across approximately 223 matches. The 2025 season is fully held out: 24,786 events across approximately 174 matches, including the championship final.
Splits are by season with no within-season mixing. This avoids leakage from within-season correlation — no event from a season the model trained on appears in the test set. The model has never seen a single 2025 match.
One competition is excluded: the NWSL x Liga MXF Summer Cup, a non-standard competition run in 2022. All regular season, playoff, challenge cup, and fall series matches are included.
EVE-2 is trained on NWSL data exclusively. Not adapted from men's leagues. Not pre-trained on a larger multi-league corpus and fine-tuned. The NWSL only.
This is a commitment, not a constraint. The NWSL has its own tactical culture, its own tempo, its own truth about how the game unfolds. A model trained on it learns that specificity. Diluting the corpus with data from other competitions would produce a more general model and a less honest one.
The standard counterargument is that pre-training on a larger corpus and fine-tuning on NWSL would give the model both broader representations and league-specific outputs. This is a reasonable position. It is also untested. I chose the NWSL-only path first because the specificity of the training data is itself a question worth answering: is nine seasons of a single league sufficient? The results in Section 4 suggest it is.
Cross-league transfer remains open.
Every choice in this section produced a different model. The architecture, the loss function, the optimizer, the checkpoint selector—change any one and a different simulator emerges with different knowledge and different failures. I trace those choices in order, because each one's conclusion is the next one's premise.
A match is a sequence. What happens next depends on what just happened—not only the immediately preceding event but the full recent history of play. A clearance from the back line creates different possibilities than a pass into the attacking third. A team that just conceded behaves differently than one that just scored.
Transformers are designed for exactly this structure. The attention mechanism reads the full context window simultaneously and learns which prior events matter for predicting the next one. This is not possible with a Markov model, which sees only the immediately preceding state, or with a simple feedforward network, which has no sequential memory at all.
EVE-2 is a 26.2M parameter transformer with 16 layers, 384 dimensions, 6 attention heads, and a context window of 128 events. It reads the last 128 events and predicts the next one. The architecture defines what representations are possible. Everything else in this section operates within the space it creates.
A transformer trained on soccer events must be told what it means to get a prediction wrong. The simplest answer is cross-entropy on the next event: penalize the model for assigning low probability to whatever actually happened. This is the standard objective for autoregressive sequence models.
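A minimal sketch of that objective, summing per-field cross-entropies across the predicted fields (the summation is an assumption; the paper does not publish its exact combination rule):

```python
import math

def cross_entropy(probs, target):
    """Penalty for assigning low probability to what actually happened."""
    return -math.log(probs[target])

def next_event_loss(field_probs, target_fields):
    """Total loss over the fields of the next event (summation assumed)."""
    return sum(cross_entropy(p, t) for p, t in zip(field_probs, target_fields))

# Toy example with two fields: the model puts 0.7 on the true event type
# and 0.4 on the true outcome.
field_probs = [[0.7, 0.2, 0.1], [0.6, 0.4]]
loss = next_event_loss(field_probs, [0, 1])  # -ln(0.7) - ln(0.4)
```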
It is not sufficient. Early EVE models trained on cross-entropy alone produced possession fragmentation — locally plausible events that accumulated into globally unrealistic matches. The model learned to predict individual events accurately while generating sequences that looked nothing like soccer. The loss function had no concept of a possession, a phase of play, or a match.
The breakthrough was SC4c, a self-conditioned training regime that adds sequence-level regularization on top of cross-entropy. SC4c penalizes possession fragmentation directly: during training, the model periodically generates short rollouts from its own predictions and is penalized when those rollouts produce unrealistic possession statistics. (SC4c is an unpublished training regime developed during the EVE research program.) The loss function now expresses something closer to what a realistic match looks like.
The loss function defines what it means to be wrong. Change it, and you redefine the geometry of improvement — and get a different model with different knowledge.
Given an architecture and a loss function, the optimizer defines how the model searches the landscape they create. Different optimizers encode different assumptions about what local information is worth attending to. The choice is not neutral.
EVE-2 uses a Muon hybrid optimizer: AdamW for embedding and output layers, Muon for the transformer core. (Muon is an unpublished optimizer developed by Kosson et al. No formal citation is currently available.) AdamW makes per-parameter adaptive updates — each weight gets its own sense of proportion. Muon approximates the natural gradient, making updates that respect the geometry of the weight matrix manifold rather than treating each parameter independently.
The difference is empirical as well as theoretical. Switching from AdamW-only to the Muon hybrid improved possession length from 2.2 to 2.4 events per possession on identical data and architecture. The optimizer changed what the model learned, not just how fast it learned it.
The same data, the same loss function, the same architecture — a different optimizer produces a different model with different knowledge.
You cannot predict where gradient descent will converge without running it. The loss landscape of a 26.2M parameter transformer is irreducibly complex — no analytical formula can tell you which checkpoint will produce the best simulator. You must train and observe.
This creates a selection problem. Training runs for 6 epochs, producing 6 checkpoints. The naive criterion is validation loss: keep the checkpoint that predicts held-out events most accurately. That criterion is wrong for a simulator. A checkpoint with better validation accuracy can produce worse full-match simulations, because the errors it makes compound differently under autoregressive generation. One-step accuracy and rollout quality diverge.
During development, changing only the checkpoint selector — holding architecture, loss, data, and optimizer identical — turned an apparently weak seed into a viable simulator. The weights did not change. The criterion for choosing among them did.
EVE-2's checkpoint is selected by rollout quality: the model is run for H=200 events across 20 held-out matches, and the checkpoint with the best possession dynamics and event distribution is kept. Rollout quality, not validation loss, tells you where you actually ended up.
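The selection rule can be sketched as scoring each checkpoint's rollouts against real match statistics and keeping the minimum. The statistics and weighting below are illustrative, not the paper's exact criterion:

```python
def rollout_score(sim, real):
    """Lower is better: distance between simulated and real match statistics.
    The choice of statistics and weights here is illustrative only."""
    return abs(sim["possession_len"] - real["possession_len"]) + sim["event_jsd"]

real = {"possession_len": 2.59}  # real NWSL average, from the paper
checkpoints = {
    "epoch_3": {"possession_len": 2.20, "event_jsd": 0.060},
    "epoch_5": {"possession_len": 2.77, "event_jsd": 0.026},
    "epoch_6": {"possession_len": 3.40, "event_jsd": 0.050},
}
best = min(checkpoints, key=lambda name: rollout_score(checkpoints[name], real))
# Picks the checkpoint with the most realistic rollouts, even if another
# checkpoint had lower one-step validation loss.
```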
Some things the model cannot learn from the loss landscape alone. Two structural interventions compensate for what the objective cannot express.
The first is the budget controller. A 128-event context window captures several minutes of play but cannot see the full match. It does not know how many shots have been taken, how many goals have been scored, or how much time remains. Without this information, locally plausible events accumulate into globally impossible matches — too many shots, unrealistic late-game scoring bursts. The budget controller maintains running match-level statistics and conditions the model's output on them. When the shot budget is exhausted, shot-like events are suppressed. When the goal budget is spent, scoring opportunities are dampened. This is not learned. It is imposed.
The second is a set of three decode-time corrections applied at inference without modifying trained weights. Team side calibration adjusts the probability of possession-switching events to match observed transition rates — the model's team side head cannot condition on the event type it just generated, so tackles do not always switch possession as they should. Shot rate correction suppresses shot-like events globally; without it the model generates approximately twice the real NWSL shot rate. Budget-aware decode dampens late-game scoring when the statistical budget is exhausted.
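A sketch of the two shot-related corrections, expressed as logit adjustments. The damping constant and the mechanism are assumptions on my part — the paper specifies only the behavior (roughly halving the uncorrected shot rate, and suppressing shots once the budget is spent):

```python
import math

def correct_shot_logit(logits, shots_taken, shot_budget, damp=0.5):
    """Decode-time shot corrections, sketched as logit adjustments.
    `damp` and the suppression constant are assumptions, not published values."""
    out = dict(logits)
    if "shot" in out:
        out["shot"] += math.log(damp)   # global shot-rate correction (~halves shots)
        if shots_taken >= shot_budget:
            out["shot"] -= 10.0         # budget-aware suppression when budget is spent
    return out

logits = {"pass": 2.0, "shot": 1.0, "clearance": 0.5}
late_game = correct_shot_logit(logits, shots_taken=23, shot_budget=22)
# The shot logit drops sharply; pass and clearance are untouched.
```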
Each correction names a place where the loss landscape has no valley. The model cannot learn what the objective does not demand. These are not patches — they are the precise locations where next-event prediction meets the boundary of what it can express, and targets for the next version of the model.
Given those commitments — here is what emerged. The results fall into two categories: places where the model works, and places where it stops. The stops are of two kinds — limits of the data, which no model can overcome, and limits of the model, which future work can fix. That distinction is the finding.
The central question for a simulator is whether its output belongs to the same statistical world as the real thing. Three measures address this directly.
Shot volume: EVE-2 generates 21.4 shots per match against a real NWSL average of 22.8 — a ratio of 0.94x (Table above). This figure includes all three decode-time corrections described in Section 3.5. Without the shot rate correction, the model generates approximately 47 shots per match.
Event distribution: the Jensen-Shannon Divergence between simulated and real event distributions is 0.026 at full-match scale. JSD measures how different two distributions are on a scale from 0 (identical) to 1 (nothing in common). At 0.026, the mix of events in a simulated match is close to the mix in a real one.
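JSD is straightforward to compute from two event-type distributions; a stdlib sketch with toy numbers (not the paper's data):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: 0 for identical distributions, 1 for disjoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy event mixes (pass, shot, clearance, other) -- not the paper's real data.
real_mix = [0.558, 0.020, 0.050, 0.372]
sim_mix  = [0.570, 0.018, 0.045, 0.367]
divergence = jsd(real_mix, sim_mix)  # small: the two mixes are nearly identical
```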
Possession length: simulated matches average 2.77 events per uninterrupted possession against a real NWSL average of 2.59, a 7% overshoot. This is a consequence of the information ceiling described below: the model under-predicts possession loss, so possessions run slightly long.
21.4 shots per match. JSD 0.026. These are recognizable as NWSL soccer.
Most autoregressive models degrade over long sequences. Small errors change the context, which shifts subsequent predictions further from the training distribution, which compounds into output that looks less and less like the real thing. EVE-2 does not do this.
JSD improves monotonically from 0.053 at H=200 to 0.026 at H=1500 (Table above). The model becomes more realistic as it generates more events, not less. This holds across all four horizons tested.
Two candidate explanations exist. Soccer's sequential grammar — possession alternation, natural phase transitions, the rhythm of attack and defense — may create a self-correcting dynamic: an over-representation of one event type creates context that suppresses it in subsequent predictions, pulling the distribution back toward reality. The budget controller may also contribute, stabilizing long-run shot and goal statistics. I cannot currently distinguish between these two explanations.
What is clear is the practical consequence: EVE-2 is most reliable at the scale it is designed for. A 200-event snippet is noisier and less representative than a full match. The simulator should be run to completion.
EVE-2 does not know the score.
To test this, I extracted the model's final-layer representations from 46 test matches, selecting sequences that begin 10 events after the first goal is scored. I then trained a linear probe — a logistic regression classifier — to predict whether the team in possession was trailing or leading from those representations.
The probe achieved 49.7% accuracy, indistinguishable from chance on a binary task. The model's internal representations contain zero information about game state.
This is surprising. The model receives goal-count features as explicit inputs. It knows, in some mechanical sense, that goals have been scored. But across 16 layers of self-attention, the transformer never composes those inputs into a representation that distinguishes trailing from leading. The information enters the model and dissolves.
The explanation is gradient pressure. The model is trained to predict the next event, and predicting the next event does not require knowing the score — the vast majority of events are passes regardless of the scoreline. There is no training signal demanding that the model encode game state, so it never learns to.
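A linear probe of this kind is a small logistic regression fit on frozen representations. The sketch below uses synthetic one-dimensional features to show both regimes: representations that encode the label probe far above chance, while label-free noise probes out near 50%, as EVE-2's final layer did. (The gradient-descent fit is a stdlib stand-in for the scikit-learn classifier one would normally use.)

```python
import math, random

def train_probe(X, y, lr=0.5, steps=2000):
    """Minimal logistic-regression probe fit by batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(steps):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))
            for j, xj in enumerate(xi):
                gw[j] += (p - yi) * xj / n
            gb += (p - yi) / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

def accuracy(w, b, X, y):
    hits = sum((sum(wj * xj for wj, xj in zip(w, xi)) + b > 0) == bool(yi)
               for xi, yi in zip(X, y))
    return hits / len(y)

random.seed(0)
y = [i % 2 for i in range(100)]  # synthetic labels: 0 = leading, 1 = trailing
# Representations that encode the label separate cleanly...
X_info = [[random.gauss(2.0 * yi - 1.0, 0.5)] for yi in y]
acc_info = accuracy(*train_probe(X_info, y), X_info, y)
# ...while label-free noise probes out near chance.
X_noise = [[random.gauss(0.0, 1.0)] for _ in y]
acc_noise = accuracy(*train_probe(X_noise, y), X_noise, y)
```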
The behavioral consequences are measurable (Table above). In real matches, trailing teams play 0.36 pitch bins higher — they push forward. EVE-2 captures 0.04 bins of this shift. Trailing-team possession length is inverted: real trailing teams sustain longer possessions under attacking pressure, but EVE-2 gives them shorter ones. The model does not merely fail to capture tactical asymmetry — in one dimension, it gets the direction wrong.
Possession loss is the weakest event class: F1 of 0.144 and recall of 10.1%. Nine times in ten, the model fails to predict that a team will lose the ball.
This is not a modeling failure. It is an information problem. A possession ends because a defender closes the passing lane, a player's first touch is heavy, a press is triggered by a specific off-ball movement. None of this appears in the event stream. The stream records that a possession ended. It does not record why.
Across the EVE research program, possession loss recall did not improve meaningfully as model capacity increased. Larger models, more parameters, more training — the ceiling held. This is the signature of an information limit rather than a capacity limit: more model cannot recover signal that is not in the data.
My estimate is that possession loss recall from event sequences alone is unlikely to exceed 20–25%. That estimate is grounded in empirical observation across the research program, not a formal bound. Tracking data — the positions of all 22 players at 25 frames per second — would make the causal structure of possession loss visible. That is a different data regime and a future project.
The low recall cascades. Simulated matches have artificially clean passing sequences. Pass completion is inflated. Possessions run slightly long. These are not independent failures — they are downstream consequences of a single upstream limit in the data.
The model exhibits a regression-to-center bias in spatial predictions. Events in the attacking third are predicted approximately 1.3 bins too far from goal. Events in the defensive third are pulled forward. The model has learned that most events occur in midfield — which is true — and when uncertain, it defaults toward the center of the pitch (Table above).
This is different in kind from the findings in Sections above and above. Those limits are properties of the data — no training objective can recover signal that the event stream does not contain. This limit belongs to the model. Cross-entropy treats all wrong spatial bins as equally wrong: predicting bin 6 when the answer is bin 9 receives the same penalty as predicting bin 8. There is no gradient pressure to stay close. An ordinal spatial loss — one that penalizes distance, not just incorrectness — would fix this.
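One possible form of such a loss is the expected absolute bin error under the predicted distribution — an illustration, not the paper's proposal — and it separates exactly the cases cross-entropy conflates:

```python
import math

def cross_entropy(probs, target):
    return -math.log(probs[target])

def ordinal_loss(probs, target):
    """Expected absolute bin error under the predicted distribution.
    One possible ordinal form -- an illustration, not the paper's design."""
    return sum(p * abs(i - target) for i, p in enumerate(probs))

target = 9  # true x-bin, deep in the attacking third
near_miss = [0, 0, 0, 0, 0, 0, 0.0, 0.0, 0.6, 0.4]  # most mass on bin 8
far_miss  = [0, 0, 0, 0, 0, 0, 0.6, 0.0, 0.0, 0.4]  # most mass on bin 6

# Cross-entropy sees only the 0.4 on the true bin: identical penalties.
ce_near, ce_far = cross_entropy(near_miss, target), cross_entropy(far_miss, target)
# The ordinal loss penalizes distance: the far miss costs three times as much.
ord_near, ord_far = ordinal_loss(near_miss, target), ordinal_loss(far_miss, target)
```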
Two properties of the bias are worth noting. Home and away spatial errors are symmetric: 0.616 versus 0.614 mean absolute error. The bias does not create an asymmetry between teams, so comparative simulations are not distorted by which side is assigned to which team. And the bias is uniform in direction — events are pulled toward midfield regardless of which team generates them — which means its effect on counterfactual comparisons is partially self-canceling.
The spatial bias is a known artifact with a known fix. It is on the EVE-3 target list.
The 2025 NWSL Championship Final is the right example for this paper. It is the highest-stakes match of the season. It was entirely held out — EVE-2 has never seen a single 2025 match during training or validation. And it was decided by a single goal in the 79th minute, scored by a player whose presence on the pitch at that moment was the consequence of an injury the event stream cannot record.
This section applies EVE-2 to that match and asks four questions. Did the model's pre-match prediction contradict prevailing wisdom, and was it right? What did 500 simulations from kickoff capture and miss about how the match actually unfolded? What did 200 simulations from the moment before the decisive goal reveal? And what does any of this tell us about the nature of what EVE-2 knows?
Washington entered the final as the No. 2 seed. Gotham entered as the No. 8 — the first eighth seed in NWSL history to reach the championship. Previews framed the match accordingly. ESPN described Gotham as "playing the role of Cinderella". Pre-match probability estimates gave Washington a 38% chance of winning in regulation and Gotham 33%.
EVE-2 assigned Gotham a 40.8% win probability and Washington 26.2%, a gap of 14.6 percentage points (Table above). The model had no access to seeding, narrative, or public sentiment. It saw only what the 2025 season event streams showed it about how each team played.
Gotham won 1–0.
The model was right. More precisely: a model that asks only "what comes next?" — trained on event sequences, with no concept of playoff seeding or underdog narratives — identified the stronger team more accurately than the aggregate of informed observers. Why? Because event sequences contain signal that narrative-driven prediction discards. Gotham's 2025 season, read as a sequence of 1,628 events per match, looked like a stronger team than Washington's. The model saw that. The market, anchored to seeding and story, did not.
Five hundred simulations from kickoff produce a distribution of outcomes, not a single prediction. The question is whether that distribution resembles the match that actually occurred.
Several features were captured well. The model predicted a low-scoring, tight match: goals per match averaged 1.65 in simulation, well below the NWSL season average of 2.62. The most common simulated scoreline was 0–0 at 19.4%, followed by 0–1 Gotham at 16.4%. Gotham outshoots Washington in simulation: 10.8 shots per match versus 8.5. Simulated possession split: 58% Gotham. Actual possession: 56%. The model correctly identified the character of the match before it was played (Table above).
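The aggregation pattern behind numbers like these is simple: run many rollouts, count scorelines, normalize. The sketch below substitutes a toy goal sampler for the actual model; only the pattern is the point.

```python
import random
from collections import Counter

def simulate_match(rng):
    """Stand-in for one full EVE-2 rollout. A toy binomial goal sampler tuned
    to a low-scoring final; the means are illustrative, not the model's."""
    goals = lambda mean: sum(rng.random() < mean / 10 for _ in range(10))
    return goals(0.70), goals(0.95)  # (Washington goals, Gotham goals)

rng = random.Random(42)
scorelines = Counter(simulate_match(rng) for _ in range(500))
dist = {score: n / 500 for score, n in scorelines.most_common()}

# Marginal probabilities read straight off the scoreline distribution.
p_gotham_win = sum(p for (w, g), p in dist.items() if g > w)
p_draw = sum(p for (w, g), p in dist.items() if g == w)
```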
Two features were missed. The model cannot represent game state — as established in Section 4 — which means it cannot simulate how either team would respond to falling behind or pushing for a late winner. The territorial compression identified in Section 4 means the model's spatial profile of both teams is pulled toward midfield, flattening the distinction between Gotham's defensive shape and Washington's attacking structure.
Both failures are consistent with what we already know about the model. They are not new findings. They are the known limits of Section 4, visible in a specific match.
At 78:45 the game stops. A foul. An injury pause. A free kick not yet taken. The score is 0–0. EVE-2's context window at that moment contains the last 128 events: Gotham with 56% possession, pressing in the attacking third, both teams down to one remaining shot in the budget.
Two hundred simulations from this seed (Table above). The results:
The actual result falls in the 4% tail: a Gotham goal following the free kick. Rose Lavelle's goal, minute 79.
Two readings of this number are defensible.
The first: the model failed. It assigned 94% probability to no goal because it could not see what was about to happen. Hal Hershfelt had left the pitch injured, receiving treatment on the sideline before returning visibly compromised. Her mark was open. The event stream records a 15-second stoppage — indistinguishable from any other dead ball. The model had no way to know the defensive shape had been disrupted. On this reading, the 4% is a measurement of the model's blindness.
The second: the model was correct. A 4% probability is not a zero probability. The model said this situation produces a goal roughly 6% of the time. It happened. That is not a failure — that is a calibrated probability doing exactly what a calibrated probability should do. The model cannot predict which situations produce goals. It can only say that this kind of situation, in this kind of match, produces one about 6% of the time. On that reading, the 4% is a measure of rarity.
Both readings are true. The model could not see the injury. The goal was also genuinely unusual. These are not competing explanations — they are the same fact described from two directions.
What do these results tell us about the model, the method, and what it means to know something about a soccer match?
EVE-2 favored Gotham. It predicted a low-scoring, tight match. It correctly identified the actual scoreline as the second most probable outcome. It assigned the decisive moment a 4% probability — low, but not zero, and pointing in the right direction. These are not accidents. They are the product of a specific training process: nine seasons of NWSL event sequences, processed through a specific architecture, minimizing a specific loss function, selected by a specific checkpoint criterion.
As Foltz-Smith writes: "the knowledge encoded in a neural network is not found but computed. The distinction matters. 'Found' implies the answer pre-exists, waiting in parameter space like ore in a mountain. 'Computed' implies it is brought into existence by the process of descent. The model's knowledge is constituted by the trajectory".
EVE-2's knowledge of the NWSL is constituted by its trajectory. Change the data, the architecture, the loss function, or the random seed, and you get a different model with different knowledge — one that might have agreed with the market, or disagreed in the opposite direction. The prediction is contingent. It is not objective truth. It is one path through one loss landscape, trained on one league, selected by one criterion.
And yet it was right. Not because it is a better oracle than the market. Because it attends to different signal. The market aggregates narrative, seeding, public sentiment, and expert opinion. EVE-2 aggregates only the sequential structure of how each team played, event by event, across an entire season. That structure contains information the market discarded. The model found it — or more precisely, computed it — because nothing in its training demanded that it ignore it.
The places where it failed are equally instructive. It could not see Hal Hershfelt's injury. It could not represent how Gotham would press differently when trailing, or how Washington would respond when pushed back. These are not engineering failures. They are the information ceiling and the game state blindness of Section 4, made human and specific. The model sees the grammar of how NWSL matches unfold. It does not see why they unfold that way.
That is what event-stream simulation can and cannot do. The boundary is not arbitrary. It is the boundary between outcomes and causes — between what the data records and what the data cannot contain. EVE-2 operates well within that boundary. The 2025 championship shows both how well, and exactly where the boundary lies.
A soccer match generates roughly 1,600 observable events. This paper asked how much a transformer trained on those events can learn about the game — and where that learning ends.
The answer has two parts.
EVE-2 learns a great deal. It produces match simulations within 6% of real shot rates, event distributions within JSD 0.026 of reality, and stability that improves rather than degrades over a full match. Applied to the 2025 NWSL Championship Final, it favored the correct winner against the prevailing expectation, predicted the correct character of the match, and assigned the decisive moment a calibrated probability that correctly identified it as unusual.
And it stops at a precise boundary. The game state probe achieved 49.7% accuracy, chance level, because next-event prediction provides insufficient gradient pressure to encode the scoreline. Possession loss recall of 10.1% reflects an information ceiling inherent to event data: a possession ends because of causal factors the event stream does not record. The spatial regression bias is a fixable artifact. The game state blindness and the possession loss ceiling are not. They are properties of the data, not the model.
The boundary between what EVE-2 knows and does not know is not a line between success and failure. It is a line between outcomes and causes — between what the event stream records and what it cannot contain. Within that boundary, the model operates well. Beyond it, no amount of architectural or objective improvement will help. Tracking data is required.
The model is not a prediction engine. It is a measurement of how much a soccer match can be learned from its own record. That turns out to be more than the market knew, and less than the game contains.
EVE-2 is available as a public simulator at https://sim.nwslnotebook.com.
Event data is sourced from Opta via the Perform API. NWSL historical data and league context drew on nwslsoccer.com and the nwslR package. The SPADL representation and socceraction library were essential data infrastructure. The transformer architecture and training recipe were developed with reference to Andrej Karpathy's nanochat repository. Model development, data pipeline construction, and evaluation infrastructure were implemented with substantial assistance from Claude Code (Anthropic) and Codex (OpenAI), operating as autonomous coding agents under the author's direction. The philosophical framing draws on Foltz-Smith.