What’s the deal with RL and forecasting?
Prediction is difficult, especially about the future. It requires understanding the world as it is now, and causal reasoning about how it will change over time. It is very tempting as a machine learning task for many reasons:
It has a virtually unlimited performance ceiling: humans perform far worse than what is in principle achievable.
Unlike other tasks that satisfy the above (e.g. “good writing”), there is ultimately very clear ground truth data (what has occurred vs what has not occurred), so it’s possible to sanity check model performance.
It is an extremely general task: it is possible to construct forecasting datasets that cover basically any domain of human interest.
Superhuman prediction on many domains is ridiculously easy to convert into money. There is no business you need to build. You just bet on a prediction market. I don’t know any other machine learning task with this property.
Recently the AI field has shifted from training on broad data sources to targeted training on RL environments as a key driver of progress. The key bottleneck to RL on most tasks is proper evaluation; once you get evaluation sorted for a task and solve a few other ancillary issues, RL will work.
During my PhD, I spent quite a bit of time thinking about forecasting evaluation. As I want to get deeper into RL, it’s time to look into papers doing RL and forecasting.
Outcome-based Reinforcement Learning to Predict the Future
The work of a human forecaster is best modeled as a reasoning + tool call chain: the forecaster thinks through the question, searches for relevant information, writes code to compute value X, searches for additional information, does reasoning again, and outputs the final prediction. This is not unlike what is optimized for by recent LLM releases; see for instance the Kimi K2 release docs.
For a concrete example, consider the question “Will OpenAI release GPT-6 by March 2026?” An LLM forecaster emulating a human forecaster would reason roughly as follows (a minimal code sketch of this loop follows the example):
Initial reasoning: We need to consider (1) time since GPT-5’s release and typical release cycles; (2) OpenAI’s recent public statements about development priorities
Tool call: Search OpenAI’s release history and recent announcements
More reasoning: Wait. Maybe we also consider competitive pressure from other labs.
Tool call: Search when Google, Anthropic, X.ai are planning major releases
Final reasoning: Weigh all factors; by OpenAI’s cadence and statements it is unlikely, but also consider the fast release cycles of other labs; so let’s say 10%
Output: 10%
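To make this concrete, here is a minimal sketch of such a reason-and-search loop. The `llm` (a chat model call) and `search` (a time-restricted search tool) callables are hypothetical stand-ins, not the interface of any paper discussed below; the important detail is that the search tool must only return documents published before the question’s cutoff date.

```python
# Minimal sketch of a reason -> search -> reason -> predict loop.
# `llm` (a chat model call) and `search` (a time-restricted search tool)
# are hypothetical stand-ins, not any paper's actual interface.

def forecast(question: str, cutoff_date: str, llm, search, max_steps: int = 6) -> float:
    """Return a probability in [0, 1] for a YES/NO forecasting question."""
    transcript = [
        {"role": "system", "content": (
            f"You are a forecaster. Today is {cutoff_date}. Think step by step. "
            "To look something up, reply with SEARCH: <query>. "
            "When done, reply with PROBABILITY: <number between 0 and 1>.")},
        {"role": "user", "content": question},
    ]
    for _ in range(max_steps):
        reply = llm(transcript)                      # one reasoning step
        transcript.append({"role": "assistant", "content": reply})
        if reply.strip().startswith("SEARCH:"):
            query = reply.split("SEARCH:", 1)[1].strip()
            # Crucial detail: only return documents published before cutoff_date,
            # otherwise information about the outcome leaks into the context.
            docs = search(query, before=cutoff_date)
            transcript.append({"role": "user", "content": "RESULTS:\n" + "\n".join(docs)})
        elif "PROBABILITY:" in reply:
            return float(reply.split("PROBABILITY:", 1)[1].strip())
    return 0.5  # give up with an uninformative prediction
```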
This type of reasoning is difficult to train into a model, for two reasons. The first reason is a technical difficulty:
to have labels, we need to assume we are in the past, and predict the present;
search engine queries usually leak some information about the present. This is why multiple papers described below use repositories of frozen search results instead of online search. It’s possible to resolve this in online search, but it’s not straightforward to do so, because there is adversarial pressure coming from the model to exploit any temporal leakage in search. The models are very good at reward hacking!
The second reason is conceptual: search and reasoning will be difficult to optimize jointly because credit assignment in forecasting is difficult. It’s not automatically clear whether the final prediction is wrong because the search did not retrieve the right evidence, or because the model reasoned badly about the evidence. 1
There is a simplification of this process that is much easier to optimize via RL: factorize the task into retrieval and reasoning. The retrieval step can be judged on its own merits (did we retrieve all relevant information?); and the reasoning step can be directly optimized using RL in a very similar setup as we do for non-forecasting quantitative reasoning tasks.
This paper takes this approach: they download the relevant articles before prediction and do not optimize search at all. This is a reasonable approach for an RL-first startup to take, given that other people in forecasting are focusing on search.
They train a 14B Qwen model with reasoning distilled from DeepSeek-R1. The model takes as input a TRUE/FALSE forecasting question and a set of news articles retrieved by their system, and tries to predict the final answer.
The paper spends considerable time discussing hyperparameter optimization and GRPO. I thank them for doing this, but I think their exact results are not very important, given that the research community is iterating fast on getting better general RL algorithms for LLMs and it seems unlikely that forecasting will require special treatment from the algorithmic standpoint. The data / environment design, on the other hand, is where forecasting is quite unique.
They claim that their model would produce 10% gains on Polymarket. I would not read too much into this result: all models trade profitably in their setup.
Note that even the baseline DeepSeek-R1-Distill-Qwen-14B earns close to $40. I have talked to this model and it is not a smart model.
In the absence of other relevant papers on arXiv, I turn to ICLR 2026 papers under review. (Disclaimer: the papers below are anonymous submissions to ICLR 2026. ICLR is a top-tier conference in machine learning with a very transparent review process: all papers and reviews are public immediately, but the author names are anonymized until the paper is accepted. I am not an official reviewer for any of the papers below.)
Scaling Open-Ended Reasoning to Predict the Future
To the best of my knowledge, this is the only paper submitted to ICLR 2026 that does RL for forecasting.
They again start from an 8B Qwen with reasoning distilled from DeepSeek-R1. Differently from the Outcome-based RL paper, they first do reasoning distillation on 10’000 forecasting questions, taking traces from Grok-3-mini. This helps the model learn to reason.
The training data pipeline is different from the Outcome-based RL paper: instead of scraping forecasting questions, they do synthetic data! Generating forecasting questions from news articles (pretending someone is asking a question from the past) has been explored before. This is a whole new can of worms and you need a lot of care not to implicitly leak information from the future into the past. They do a whole bunch of ad-hoc filtering steps to avoid leakage.
Instead of pre-crawling news for every question, they use the CommonCrawl News corpus, which provides static, monthly snapshots of easily reachable parts of the web, with the exception of many websites that have started opting out of being crawled since LLMs became popular.
Regarding retrieval, they do something similar to the Outcome-based RL paper: just provide some chunks of text from the CommonCrawl corpus before the model starts reasoning. As their corpus is offline, they can’t use a search engine, so they resort to matching text with an embedding model. The performance gains are large up to 5 retrieved chunks but don’t increase further; this makes me think larger gains are possible by doing this step properly.
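For intuition, the retrieval step might look roughly like the sketch below: embed the question and the candidate news chunks, then hand the top-k most similar chunks to the model. The `embed` function is a hypothetical stand-in for whatever embedding model they use; the actual pipeline surely differs in chunking and filtering details.

```python
import numpy as np

# Hypothetical sketch of the retrieval step: embed the question and the news
# chunks, then pass the top-k most similar chunks to the model before it
# starts reasoning. `embed` stands in for whatever embedding model is used.

def top_k_chunks(question: str, chunks: list[str], embed, k: int = 5) -> list[str]:
    q = embed([question])[0]                      # shape (d,)
    c = embed(chunks)                             # shape (n, d)
    # cosine similarity between the question and every chunk
    sims = c @ q / (np.linalg.norm(c, axis=1) * np.linalg.norm(q) + 1e-8)
    return [chunks[i] for i in np.argsort(-sims)[:k]]
```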
The reward function is again some Brier score variant, but this time it works on multi-class predictions instead of just on YES/NO questions. They report different results based on which exact Brier score variant is used, so perhaps this is worth ablating as a hyperparameter in future RL work.
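For reference, here is the plain multi-class Brier score, and one (assumed) way of turning it into a reward to maximize; the papers’ exact variants (normalization, rescaling) differ, which is precisely the point above.

```python
import numpy as np

# Plain multi-class Brier score (lower is better): squared distance between the
# predicted class probabilities and the one-hot outcome. Turning it into
# "reward = 1 - Brier" is an assumption here, not necessarily what either paper does.

def brier_score(probs: np.ndarray, outcome: int) -> float:
    onehot = np.zeros_like(probs)
    onehot[outcome] = 1.0
    return float(np.sum((probs - onehot) ** 2))

def reward(probs: np.ndarray, outcome: int) -> float:
    return 1.0 - brier_score(probs, outcome)   # lies in [-1, 1] for a valid distribution
```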
Highly technical note: I now see two forecasting papers using the same modification to GRPO: they compute the rollout advantage as “reward - mean” instead of “(reward - mean) / stddev”. I must confess that I do not know why this normalization was ever used in the first place. If I am supposed to think of the mean group reward as analogous to a policy gradient baseline, then just subtracting without normalizing yields an unbiased gradient estimator. The original DeepSeekMath paper is not clear on this point.
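In code, the two variants differ by a single division; `normalize_std=True` corresponds to the original GRPO recipe, and the forecasting papers drop it:

```python
import numpy as np

# The two rollout-advantage variants, computed over one group of G rollouts
# that answer the same question. normalize_std=True is the original GRPO
# recipe; the forecasting papers reportedly drop the division.

def grpo_advantages(rewards: np.ndarray, normalize_std: bool = True) -> np.ndarray:
    adv = rewards - rewards.mean()
    if normalize_std:
        adv = adv / (rewards.std() + 1e-6)
    return adv
```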
Forecasting with LLMs: A Dataset for Rapid Backtesting Without Temporal Contamination
This paper does exactly the thing I was hinting is needed: they scrape unresolved questions from Kalshi, save live web search results for each unresolved question at the time of scraping, summarize the search results, and package it all as a (question, frozen context, resolution) dataset once the question is resolved. For each question, the frozen context is small enough to fit into the model’s effective context window, so there is no need to train the model to retrieve information at all. As of now they have over 3000 resolved questions, which should be enough to get some signal out of RL experiments.
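As a rough illustration (this is my guess at a record layout, not the paper’s actual schema), each example could look like:

```python
# My guess at a record layout for such a dataset (not the paper's actual schema):
example = {
    "question": "Will <event X> happen before 2026-03-01?",
    "scraped_at": "2025-11-02",        # when the live web search was frozen
    "context": [                        # summarized search results, small enough
        "summary of search result 1",   # to fit in the model's context window
        "summary of search result 2",
    ],
    "resolution": True,                 # filled in once Kalshi resolves the question
}
```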
I like this paper more than the reviewers apparently do. The execution could be a bit better; why summarize (and why with gpt-4o-mini)? Why not handpick the good context? Why are the Bing search snapshots opinionated about the outcome? But ultimately I think the idea of separating optimization and evaluation of forecasting reasoning from retrieval is a good one, and I haven’t seen a dataset before this paper that makes it easy for RL practitioners to just train a forecaster in a day. I hope they release the data soon.
Bench to the Future: A Pastcasting Benchmark for Forecasting Agents
Somehow I had missed this FutureSearch release until I read the Rapid Backtesting paper. They accompany each forecasting question with a relevant subset of a web crawl, obtained by “executing an intelligent web crawl that attempts to exhaustively search over the avenues a forecaster might take when researching a question”, and package the results into a nice environment that emulates the model having access to a search engine.
The retrieval step is still non-trivial, as there are thousands of pages saved for every forecasting question, and not all of them can be fed into the model at once. But this kind of dataset is the best of both worlds:
If you wanted to separate retrieval and reasoning, it seems straightforward to use their data to construct a dataset similar to the one in the Rapid Backtesting paper, to any desired degree of granularity.
If you want to optimize for retrieval too, using their environment seems much better than using the CommonCrawl dumps as in the Scaling Open-Ended Reasoning paper.
All of this holds as long as the models’ training cutoff date is before these events came to pass. Literally the only issue I find with this dataset is that it is small: 300 questions might be enough to evaluate a forecaster, but not to train one. Also, the data is not to be released publicly. Well.
Going over these papers gave me a much better overview of where RL and forecasting are now. I foresee one major research direction that I don’t see solved adequately in these papers: where to get more data?
Synthetic forecasting questions
Prediction market platforms have on the order of 100k meaningful YES/NO questions in total; and this overestimates a lot, because many of those are very correlated to each other (’Will club X win the Champions League?’). 2
This is way too little data to train on, hence the need to create synthetic data. There are two ways to do this:
Backward (from ground truth): generate questions based on reference events that happen, ask as if we were predicting the future
Forward (rejection sampling on ground truth): generate predictive questions without any reference events, discard questions that we cannot resolve in the present
Let’s first discuss the backward approach.
To create synthetic questions, you need a steady stream of “events” from reality. Scraping news sources seems like a natural way of doing this, as they are usually trustworthy on factual matters.

The main advantage of the backward approach is very reliable ground truth labels (literally stated by the news article). Even GPT-4-level models could create questions about these events.
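A hypothetical generation prompt for the backward approach could look like the sketch below; the papers’ actual prompts and leakage filters are more involved, and the date handling is exactly where leakage tends to creep in.

```python
# Hypothetical generation prompt for the backward approach: turn a dated news
# article into a question that could plausibly have been asked before the event.
# The papers' actual prompts and filtering steps are more involved than this.
QUESTION_GEN_PROMPT = """You are given a news article published on {date}.
Write one YES/NO forecasting question that a careful reader could have asked
one month before this date, and whose answer is unambiguously settled by the article.
Do not reference the article or any information from after the question date.

Article:
{article}

Question:"""
```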
Now, an astute reader would notice: we have text of articles that serve as a ground truth reference to resolve a forecasting question. Why are we even creating forecasting questions in the first place? Why not just... train on the text of the article via next-token prediction?
To answer this, it is useful to compare this to a prototypical RL task: training an LLM to solve math problems using the final answer as a reward signal. Translated to this setting, I believe this is the same as asking “why is RL on the final answer better than finetuning on the final answer?” The answer is that a single forward pass is not enough. We need to train the model to think through the problem and produce a reasoning trace that arrives at the final answer.
Of course, finetuning on existing reasoning traces is usually more efficient than RL training for the outcome. If we had reasoning traces of causal events of reality, we would not need to create synthetic data at all. The whole deal with forecasting is that we unfortunately don’t have such traces and need to train a model to figure them out.
An alternative to the above is the forward approach: just generate questions as if a person asked them in the past, and resolve them using Deep Research or similar LLM agents.
The main issue with this is that we might misresolve questions. Figuring out the state of the world in the present is not an easy task for either humans or LLMs. While questions like “What will the lowest trade of Bitcoin on 1 Jan 2026 be?” are easy to resolve, non-quantitative questions like “Military conflict between the US and Venezuela in 2026?” are not, even when there are many LLMs available.
It is difficult to know a priori which forecasting questions we won’t be able to resolve, so many synthetically generated questions remain unresolved. This biases the dataset towards only containing questions that are easy to resolve using online LLMs; this is similar to the selection issue in the backward approach. Even worse, the online LLM we use for resolution might misresolve the question, introducing label noise. Another, more subtle issue is that we cannot generate questions that were plausibly posed before the model’s training cutoff date.
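Schematically, the forward approach is rejection sampling on resolvability; everything in the sketch below (`generate_questions`, `resolve`, the `confident` flag) is hypothetical, and the hard part is making the resolver trustworthy.

```python
# Schematic view of the forward approach as rejection sampling on resolvability.
# `generate_questions` and `resolve` (e.g. a Deep-Research-style agent) are
# hypothetical; questions the resolver cannot settle confidently are discarded.

def build_forward_dataset(topics, generate_questions, resolve, n_per_topic=10):
    dataset = []
    for topic in topics:
        for question in generate_questions(topic, n=n_per_topic):
            verdict = resolve(question)
            if verdict.confident:                 # drop unresolvable questions
                dataset.append({"question": question, "label": verdict.answer})
    return dataset
```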
For future papers, I feel more optimistic about the forward approach. Why? Because all the issues in the forward approach are “skill issues” of today’s LLMs, and future LLMs will get better at writing more precise question descriptions and at resolving ambiguous questions correctly. The backward approach, on the other hand, is limited in scope by taking only trustworthy sources reporting on world events as ground truth; and it’s very hard not to induce leakage this way.
1. Here is a free research idea: retroactive post-mortem analysis of where the forecast went wrong (or, in case it was correct, whether it got lucky) can help with credit assignment. This might not be compute-efficient for RL in general compared to just training on more data; but in forecasting we are uniquely data-limited compared to other domains.
2. Those readers who have scraped certain platforms for questions will know that multiple-choice questions get modeled as multiple YES/NO questions behind the scenes. For comparison, the original 14B Qwen-R1 model used in the Outcome-based RL paper was distilled from DeepSeek R1 on 800k math problems.

