Financial SOTA Agent Survey for Crypto-Alpha-Bench
Purpose: map current financial SOTA agents / alpha-mining systems before committing to Crypto-Alpha-Bench.
Updated: 2026-05-19
Core question: What existing agents and benchmarks already cover alpha auto search, and what gap remains for Crypto-Alpha-Bench?
0. Executive Verdict
Yes: before positioning Crypto-Alpha-Bench, we must survey current financial SOTA agents and alpha-mining benchmarks.
The takeaway is not "nobody has done this." It is:
Existing work already covers formula alpha generation, factor evaluation, iterative LLM search, multi-agent quant R&D, and trading-decision agents. Crypto-Alpha-Bench must therefore position itself as an executable crypto mid-frequency alpha benchmark with fixed data, cost tiers, fill/tradability constraints, DSR/PBO, compute control, synthetic ground truth, and an optional human expert baseline.
The biggest direct threats:
- AlphaBench: already claims the first systematic benchmark for LLMs in formulaic alpha factor mining.
- RD-Agent(Q): already claims full-stack multi-agent quant R&D with factor-model joint optimization.
- QuantaAlpha / AlphaAgent / Alpha Jungle: already cover LLM-driven formula alpha search with increasingly sophisticated exploration and regularization.
- Hubble / FactorMiner / CogAlpha / Beyond Prompting: 2025-2026 work pushes toward safer sandboxed generation, memory-driven self-evolution, code-level evolution, and autonomous factor investing.
The clean gap:
Most existing systems optimize formula alpha quality or trading-decision performance. They do not jointly enforce crypto-specific executable-alpha constraints: public crypto perp data, cost tiers, fill/tradability gates, compute-controlled search, multiple-testing correction, synthetic ground truth, and a human discretionary baseline.
1. Taxonomy: What Counts As "Financial Agent SOTA"?
Use four buckets. Do not mix them.
| Bucket | What It Optimizes | Typical Output | Direct Threat To Crypto-Alpha-Bench? |
|---|---|---|---|
| A. Alpha Mining Agents | Discover formulaic alphas / factors | alpha expression, factor pool | High |
| B. Full Quant R&D Agents | Automate research workflow: hypothesis → code → backtest → feedback | factor + model + report | High |
| C. Trading Decision Agents | Decide buy/sell/hold or portfolio action from multimodal context | trading action | Medium |
| D. Benchmark / Evaluation Frameworks | Standardize tasks and metrics | leaderboard / protocol | High |
Crypto-Alpha-Bench should mostly compare against A/B/D, while using C as neighboring context.
2. Benchmark Gap Matrix
This is the single most important table.
| System | Task | Search Unit | Verifier | Data / Market | Cost / Fill | Multiple Testing | Compute Control | Main Gap For Crypto-Alpha-Bench |
|---|---|---|---|---|---|---|---|---|
| AlphaBench | LLM formulaic alpha mining benchmark | formula expression | Qlib backtest + factor metrics | CSI300 + SP500, 2020-2025 | not the central axis | not central | analyzes model/settings, but not a strict compute-budget benchmark | no crypto perp executable-alpha protocol |
| RD-Agent(Q) | full quant R&D automation | factor + model code | real-market backtests + feedback stage | stock markets | not the central axis | not emphasized | MAB scheduling, but benchmark compute control not the core | strong agent; weak public benchmark substrate |
| QuantaAlpha | LLM-driven evolutionary alpha mining | mining trajectory + factor | backtest / IC / return metrics | CSI300 + transfer to CSI500/SP500 | not central | not central | trajectory reuse, but not fixed budget protocol | trajectory search is useful baseline, not full benchmark |
| AlphaAgent | decay-resistant LLM alpha mining | formula expression | market evaluation across regimes | CSI500 + SP500 | not central | regularizes decay but not DSR/PBO | not core | useful regularization baseline |
| Alpha Jungle | LLM + MCTS formula mining | formula expression / tree | backtest feedback | stock market data | not central | not central | MCTS budget implicit | search baseline, not benchmark protocol |
| AlphaSAGE | GFlowNet alpha mining | expression graph / factor portfolio | multi-faceted reward | stock alpha mining | not central | not central | GFlowNet exploration budget | strong non-LLM alpha-search baseline |
| Hubble | safe/reproducible LLM alpha discovery | operator tree | AST sandbox + cross-sectional metrics | U.S. equities | turnover included, but fill/cost not core | OOS/HAC evidence, not DSR/PBO benchmark | fixed rounds/candidate counts | closest to safe verifier framing, still equity/formula-centric |
| FactorMiner | self-evolving alpha discovery | formulaic factor + memory | modular evaluation tools | multi-asset datasets | not central | redundancy/correlation control, not DSR/PBO | lightweight iterative loop | strong memory baseline, not crypto executable benchmark |
| CogAlpha | LLM-driven code evolution | executable alpha code | evolutionary fitness feedback | A-share equities | not central | robustness/generalization claimed, not benchmark protocol | evolutionary budget implicit | useful code-evolution baseline |
| Beyond Prompting | autonomous systematic factor investing | interpretable signal set | OOS validation + economic rationale | U.S. equities | not central | data-snooping addressed qualitatively | not core | strong autonomous-agent story, but not public benchmark substrate |
| Alpha-GPT | human-AI interactive alpha mining | human idea -> alpha | alpha mining experiments / competition evaluation | WorldQuant-style alpha context | not central | not central | interactive workflow, not controlled budget | useful human-in-loop baseline |
| TradingAgents | collaborative trading-decision agents | buy/sell/trade decision | portfolio returns, Sharpe, MDD | stocks | not central | not central | not core | trading chatbot/team, not alpha benchmark |
| QuantAgent | short-horizon / HFT-style LLM trading | structured-signal decision | predictive accuracy + trading metrics | 9 instruments incl. BTC/Nasdaq futures | not rigorous fill sim | not central | not core | relevant to crypto HFT framing, but not formula alpha mining |
| FinMem | memory-augmented LLM trading | investment decision | stock trading performance | stocks | not central | not central | not core | memory architecture reference, not benchmark |
| FinAgent | multimodal foundation trading agent | trading action | 6 financial metrics | stocks + crypto datasets | not central | not central | not core | multimodal trading baseline, not alpha search |
| AlphaEval | alpha evaluation framework | generated alpha | 5D backtest-free evaluation | formula alphas | cost/tradability not central | not central in abstract | not core | evaluation backbone, not benchmark substrate |
Result:
The benchmark claim must not be "first alpha benchmark." It should be "first executable crypto mid-frequency alpha-search benchmark with cost/fill/statistical rigor."
3. Direct Competitors / Must-Read Systems
3.1 AlphaBench
Sources:
- Project: https://alphabench.cc/
- OpenReview: https://openreview.net/forum?id=d97Q8r7ZKZ
What It Is
AlphaBench is an ICLR 2026 benchmark for evaluating LLMs in Formulaic Alpha Factor Mining (FAFM). It covers three core tasks:
- factor generation;
- factor evaluation;
- iterative factor searching.
The project page says its toolchain includes 1,857 instructions, an FFO execution engine, and Qlib-based backtesting. It evaluates generation settings and search paradigms such as Chain-of-Experience, Tree-of-Thought, and Evolutionary Algorithms.
Architecture / Task Design
| Component | Details |
|---|---|
| Search unit | formulaic alpha factor expression |
| Generator | LLMs under different prompting/reasoning/search settings |
| Verifier | executable formula engine + Qlib backtesting |
| Tasks | Text2Alpha, Directional Mining, FactorEval, CoE, ToT, EA |
| Dataset | CSI300 and SP500, 2020-2025 according to project page |
| Metrics | IC, RankIC, robustness, win rate, skewness, reliability, stability, semantic alignment |
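IC and RankIC from the metrics row are the simplest to pin down. The sketch below is a minimal stdlib version for one cross-section, assuming IC means the Pearson correlation between factor values and forward returns and RankIC means the same correlation on ranks. The function names and sample numbers are illustrative and are not taken from AlphaBench's toolchain.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined for a constant series
    return cov / math.sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def ic(factor, fwd_ret):
    """Information coefficient for one cross-section."""
    return pearson(factor, fwd_ret)


def rank_ic(factor, fwd_ret):
    """Rank IC (Spearman correlation) for one cross-section."""
    return pearson(ranks(factor), ranks(fwd_ret))


# Hypothetical cross-section of five assets.
factor = [0.1, 0.4, -0.2, 0.9, 0.3]
fwd_ret = [0.01, 0.02, -0.01, 0.03, 0.50]
print(ic(factor, fwd_ret), rank_ic(factor, fwd_ret))
```

The last asset has an outlier return, which is why the two numbers differ: RankIC only sees the ordering.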
What It Solves
AlphaBench is the strongest existing answer to:
"Can LLMs generate, evaluate, and search formulaic alphas?"
It gives a standardized FAFM benchmark and should be treated as the closest benchmark sibling.
What It Does Not Solve For Us
Crypto-Alpha-Bench should differ on:
- crypto perpetual futures instead of stock FAFM only;
- mid-frequency executable alpha, not just formula validity / stock factor quality;
- explicit transaction-cost tiers;
- fill/tradability gate;
- DSR/PBO / multiple-testing discipline;
- compute-controlled search budgets;
- synthetic ground-truth task;
- optional human expert discretionary baseline.
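To make the cost-tier item concrete, here is a minimal sketch of reporting one strategy under three tiers at once. The tier names follow this document; the fee and slippage values, `CostTier`, and `report_all_tiers` are hypothetical placeholders, not a proposed calibration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostTier:
    name: str
    fee_bps: float       # exchange fee per unit of turnover, in basis points
    slippage_bps: float  # assumed slippage per unit of turnover, in basis points


# Hypothetical tier values for illustration only.
TIERS = [
    CostTier("optimistic", fee_bps=2.0, slippage_bps=0.5),
    CostTier("realistic", fee_bps=4.0, slippage_bps=2.0),
    CostTier("pessimistic", fee_bps=5.0, slippage_bps=5.0),
]


def net_returns(gross, turnover, tier):
    """Subtract linear costs (turnover times cost rate) from gross returns."""
    cost_rate = (tier.fee_bps + tier.slippage_bps) / 1e4
    return [g - t * cost_rate for g, t in zip(gross, turnover)]


def report_all_tiers(gross, turnover):
    """Total net return under every tier, so no submission can cherry-pick one."""
    return {tier.name: sum(net_returns(gross, turnover, tier)) for tier in TIERS}


# Hypothetical per-bar gross returns and turnover (fraction of capital traded).
gross = [0.001, 0.002, -0.0005]
turnover = [1.0, 0.5, 2.0]
report = report_all_tiers(gross, turnover)
print(report)
```

In this toy example the same signal is profitable in the first two tiers and loses money in the third, which is the kind of result a single-tier report would hide.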
How To Position Against It
Say:
AlphaBench is the closest existing benchmark for LLM formula alpha mining. Crypto-Alpha-Bench should adopt its task decomposition, but extend the benchmark target from formula quality to executable crypto mid-frequency alpha under cost, fill, and statistical constraints.
Do not say:
"There is no alpha benchmark."
That is no longer defensible.
3.2 RD-Agent(Q)
Sources:
- Microsoft Research: https://www.microsoft.com/en-us/research/publication/rd-agent-quant-a-multi-agent-framework-for-data-centric-factors-and-model-joint-optimization/
- arXiv: https://arxiv.org/abs/2505.15155
- Docs: https://rdagent.readthedocs.io/en/latest/scens/quant_agent_fin.html
What It Is
RD-Agent(Q) is a data-centric multi-agent framework for automated quant R&D. It targets factor-model joint optimization, not only formula alpha mining.
From the Microsoft abstract, it decomposes quant research into:
- a Research stage that sets goal-aligned prompts, formulates hypotheses from domain priors, and maps them to tasks;
- a Development stage using a code-generation agent, Co-STEER, to implement task-specific code;
- a Feedback stage that evaluates real-market backtests and informs later iterations;
- a multi-armed bandit scheduler for adaptive direction selection.
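For intuition on the scheduler item, the sketch below shows a generic UCB1 bandit choosing between two research directions. This is a textbook algorithm used as a stand-in, not RD-Agent(Q)'s actual scheduler, whose details are not described in this survey. The arm names and success rates are hypothetical.

```python
import math
import random


class UCB1Scheduler:
    """Generic UCB1 bandit over named research directions."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {a: 0 for a in self.arms}
        self.totals = {a: 0.0 for a in self.arms}

    def select(self):
        # Pull every arm once before using confidence bounds.
        for a in self.arms:
            if self.counts[a] == 0:
                return a
        t = sum(self.counts.values())

        def score(a):
            mean = self.totals[a] / self.counts[a]
            bonus = math.sqrt(2 * math.log(t) / self.counts[a])
            return mean + bonus

        return max(self.arms, key=score)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.totals[arm] += reward


# Hypothetical setup: each direction improves the backtest with some probability.
rng = random.Random(0)
success_rate = {"factor": 0.6, "model": 0.4}
sched = UCB1Scheduler(success_rate)
for _ in range(500):
    arm = sched.select()
    reward = 1.0 if rng.random() < success_rate[arm] else 0.0
    sched.update(arm, reward)
print(sched.counts)
```

For a benchmark, the relevant point is that adaptive allocation like this spends evaluations unevenly, so the total evaluation count has to be capped and reported.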
Architecture / Task Design
| Component | Details |
|---|---|
| Search unit | factor ideas, model innovations, code |
| Generator | multi-agent hypothesis and code-generation workflow |
| Verifier | executable code + real-market backtests |
| Feedback | analysis unit + MAB scheduler |
| Knowledge grounding | domain priors / knowledge forest / prior outcomes |
| Claim | up to 2x annualized return vs classical factor libraries with fewer factors, according to abstract |
What It Solves
RD-Agent(Q) is the strongest "full quant R&D automation" baseline.
It addresses:
- hypothesis generation;
- code implementation;
- backtest verification;
- factor-model co-design;
- adaptive allocation of research effort.
What It Does Not Solve For Us
It is not primarily a public benchmark protocol:
- no crypto-specific executable benchmark;
- no explicit public cost-tier/fill/tradability protocol as core contribution;
- no central DSR/PBO benchmark claim;
- no human expert discretionary baseline;
- no fixed leaderboard framing.
How To Position Against It
Say:
RD-Agent(Q) is a model of what the research agent could become. Crypto-Alpha-Bench is the evaluation substrate such an agent would need in crypto mid-frequency settings.
In other words:
- RD-Agent(Q) = strong agent architecture.
- Crypto-Alpha-Bench = hard verifier / benchmark environment.
3.3 QuantaAlpha
Source:
- arXiv: https://arxiv.org/abs/2602.07085
What It Is
QuantaAlpha is an evolutionary LLM-driven alpha mining framework. It treats each end-to-end mining run as a trajectory and improves factors using trajectory-level mutation and crossover.
According to the abstract, it:
- localizes suboptimal steps for targeted revision;
- recombines high-reward trajectory segments;
- enforces semantic consistency across hypothesis, factor expression, and executable code;
- constrains complexity and redundancy to reduce crowding.
Architecture / Task Design
| Component | Details |
|---|---|
| Search unit | trajectory of a mining run, not just a formula |
| Generator | LLM-driven factor generation + evolutionary mutation/crossover |
| Verifier | IC / ARR / MDD / backtest-style metrics |
| Data | CSI300, with transfer claims to CSI500 and S&P500 according to abstract |
| Novelty | reuse validated experience at trajectory level |
Why It Matters
This is the natural next step after AlphaAgent / Alpha Jungle:
- AlphaAgent regularizes individual factor generation.
- Alpha Jungle improves tree search.
- QuantaAlpha improves whole search trajectories.
For your benchmark, QuantaAlpha should be a strong search-agent baseline.
Gap
Still not enough for Crypto-Alpha-Bench:
- equity-centric;
- not a public executable crypto benchmark;
- transaction cost / fill / capacity not central;
- multiple-testing discipline not central;
- compute-control not the main object.
How To Use It
In Crypto-Alpha-Bench:
- include a QuantaAlpha-style trajectory search baseline if implementation is available;
- require it to report compute budget and DSR/PBO;
- compare its trajectory reuse against simpler EA/ToT/CoE search under equal budget.
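"Equal budget" only means something if the harness enforces it. A minimal sketch follows, assuming the budget is counted in candidate evaluations, LLM tokens, and wall-clock seconds. `SearchBudget`, `run_search`, and the `propose`/`evaluate` callables are hypothetical names, not an existing API.

```python
import time


class BudgetExceeded(Exception):
    """Raised when a search method tries to spend past its allowance."""


class SearchBudget:
    def __init__(self, max_evals, max_tokens, max_seconds, clock=time.monotonic):
        self.max_evals = max_evals
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self._clock = clock
        self._start = clock()
        self.evals = 0
        self.tokens = 0

    def charge(self, evals=0, tokens=0):
        """Record spending, or refuse if any limit would be crossed."""
        elapsed = self._clock() - self._start
        if (self.evals + evals > self.max_evals
                or self.tokens + tokens > self.max_tokens
                or elapsed > self.max_seconds):
            raise BudgetExceeded
        self.evals += evals
        self.tokens += tokens


def run_search(propose, evaluate, budget):
    """Call propose/evaluate until the budget refuses another charge.

    propose() returns (candidate, tokens_used); evaluate(candidate) returns a score.
    A proposal that cannot be paid for is discarded without being evaluated.
    """
    best = None
    while True:
        candidate, tokens_used = propose()
        try:
            budget.charge(evals=1, tokens=tokens_used)
        except BudgetExceeded:
            return best
        score = evaluate(candidate)
        if best is None or score > best[1]:
            best = (candidate, score)


# Toy usage: candidates are integers, each proposal "costs" 100 tokens.
stream = iter(range(100))
demo_budget = SearchBudget(max_evals=5, max_tokens=10_000, max_seconds=60.0)
demo_best = run_search(lambda: (next(stream), 100), lambda x: -abs(x - 3), demo_budget)
print(demo_best, demo_budget.evals, demo_budget.tokens)
```

Every baseline (random, EA, ToT, trajectory search) would run through the same `run_search` wrapper, so differences in results cannot come from one method quietly evaluating more candidates.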
3.4 AlphaAgent
Source:
- arXiv: https://arxiv.org/abs/2502.16789
What It Is
AlphaAgent is an autonomous LLM-driven alpha mining framework focused on alpha decay resistance.
The abstract highlights three mechanisms:
- AST-based originality enforcement against existing alphas;
- LLM-evaluated hypothesis-factor semantic alignment;
- AST-based complexity control to prevent over-engineered formulas.
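The two AST-based mechanisms can be approximated in a few lines with Python's own `ast` module. This is a sketch under stated assumptions: formulas are written as Python-parsable expressions, complexity is the count of expression nodes, and originality is one minus the Jaccard similarity of subtree sets. AlphaAgent's actual metrics may differ, and the thresholds below are hypothetical.

```python
import ast


def expr_subtrees(formula):
    """Canonical dump of every expression subtree in a formula string."""
    tree = ast.parse(formula, mode="eval")
    return [ast.dump(node) for node in ast.walk(tree.body)
            if isinstance(node, ast.expr)]


def complexity(formula):
    """Number of expression nodes (calls, operators' operands, names, constants)."""
    return len(expr_subtrees(formula))


def similarity(formula_a, formula_b):
    """Jaccard similarity between the two formulas' subtree sets."""
    a, b = set(expr_subtrees(formula_a)), set(expr_subtrees(formula_b))
    return len(a & b) / len(a | b)


def admit(formula, existing, max_nodes=30, max_similarity=0.8):
    """Accept a candidate only if it is small enough and not a near-copy."""
    if complexity(formula) > max_nodes:
        return False
    return all(similarity(formula, old) <= max_similarity for old in existing)


pool = ["rank(close / open)"]
print(complexity("rank(close / open)"))                              # 5 nodes
print(similarity("rank(close / open)", "rank(close / open) - volume"))
print(admit("rank(close / open)", pool))                             # exact duplicate
```

An ablation over these gates (originality only, complexity only, both) maps directly onto the regularization axes discussed later in this section.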
Architecture / Task Design
| Component | Details |
|---|---|
| Search unit | formulaic alpha expression |
| Generator | LLM agent |
| Regularizer | originality, semantic alignment, complexity |
| Verifier | market performance across CSI500 and S&P500 settings |
| Main goal | reduce alpha decay / crowding / homogeneity |
Why It Matters
AlphaAgent is the best baseline for:
"Can LLMs generate less crowded, more interpretable, decay-resistant formula alphas?"
It is directly relevant to your Cognition Base / Red Queen idea.
Gap
AlphaAgent regularizes factor generation, but it does not solve:
- benchmark fixed dataset/protocol;
- crypto execution;
- fill/tradability constraints;
- strict DSR/PBO;
- cost-tier reporting.
How To Use It
Crypto-Alpha-Bench should include AlphaAgent-style regularization as one baseline axis:
- no regularization;
- AST originality only;
- semantic alignment only;
- complexity control only;
- all combined.
Then test which survives crypto mid-frequency execution costs.
3.5 Navigating the Alpha Jungle
Source:
- arXiv: https://arxiv.org/abs/2505.11122
What It Is
This paper integrates LLMs with Monte Carlo Tree Search for formulaic factor mining.
The abstract's key mechanisms:
- LLM generates/refines symbolic alpha formulas;
- MCTS explores the formula search space;
- quantitative backtest feedback guides search;
- frequent subtree avoidance improves diversity and avoids homogenization.
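Frequent subtree avoidance is easy to prototype. The sketch below assumes formulas are nested tuples of the form `(operator, child, ...)`, counts how many formulas in the pool contain each operator subtree, and scores a candidate by the share of its subtrees that are already common. The paper's exact rule is not described here, so treat `frequent_subtrees` and `overlap_penalty` as illustrative.

```python
from collections import Counter


def iter_subtrees(node):
    """Yield every operator subtree (tuple) of a nested-tuple formula."""
    if isinstance(node, tuple):
        yield node
        for child in node[1:]:
            yield from iter_subtrees(child)


def frequent_subtrees(pool, min_count):
    """Subtrees that appear in at least min_count formulas of the pool."""
    counts = Counter()
    for formula in pool:
        counts.update(set(iter_subtrees(formula)))  # count once per formula
    return {s for s, c in counts.items() if c >= min_count}


def overlap_penalty(candidate, frequent):
    """Fraction of the candidate's subtrees that are already frequent."""
    subs = list(iter_subtrees(candidate))
    if not subs:
        return 0.0
    return sum(s in frequent for s in subs) / len(subs)


# Hypothetical pool of already-discovered formulas.
pool = [
    ("rank", ("div", "close", "open")),
    ("neg", ("div", "close", "open")),
    ("rank", ("ts_mean", "volume", 5)),
]
frequent = frequent_subtrees(pool, min_count=2)
candidate = ("sub", ("div", "close", "open"), ("ts_std", "close", 10))
print(frequent, overlap_penalty(candidate, frequent))
```

A search procedure could subtract this penalty from the backtest reward or refuse to expand nodes above a threshold; which of the two the benchmark baseline uses should be stated explicitly.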
Architecture / Task Design
| Component | Details |
|---|---|
| Search unit | symbolic formula tree |
| Generator | LLM prior |
| Search | MCTS |
| Verifier | backtest feedback |
| Anti-collapse | frequent subtree avoidance |
Why It Matters
This is the most direct "AlphaProof-style" migration into alpha mining:
- LLM prior;
- tree search;
- external verifier feedback.
Gap
It still lacks:
- unified benchmark protocol;
- crypto execution/fill/cost emphasis;
- formal multiple-testing reporting;
- compute-budget fairness across search algorithms.
How To Use It
Crypto-Alpha-Bench should include an LLM+MCTS baseline:
- same primitive set;
- same expression grammar;
- same expression budget;
- same cost-tier evaluation;
- compare against random, EA, GFlowNet, QuantaAlpha-style trajectory search.
3.6 AlphaSAGE
Source:
- arXiv: https://arxiv.org/abs/2509.25055
What It Is
AlphaSAGE uses GFlowNets for structure-aware alpha mining. It addresses three problems in RL alpha generation:
- sparse rewards at formula completion;
- inadequate sequential representation of expression structure;
- single-mode optimization, which conflicts with the need for diverse non-correlated alphas.
The abstract lists three innovations:
- RGCN-based structure-aware encoder;
- GFlowNet generation;
- dense multi-faceted reward.
Architecture / Task Design
| Component | Details |
|---|---|
| Search unit | expression graph / formula structure |
| Generator | GFlowNet |
| Encoder | RGCN structure-aware encoder |
| Objective | diverse high-reward modes |
| Verifier | alpha quality rewards |
Why It Matters
For Crypto-Alpha-Bench, AlphaSAGE is important because not all strong alpha search baselines are LLM agents.
It is a strong baseline for:
diverse factor portfolio generation.
Gap
It does not address:
- LLM agent workflow;
- executable crypto costs/fill;
- benchmark protocol;
- DSR/PBO;
- human expert baseline.
How To Use It
Include a GFlowNet baseline in the benchmark after v0:
- if it beats LLM search under equal compute, that is a strong negative result against LLM-first alpha mining;
- if LLM+GFlowNet hybrid wins, that motivates Cognition Base + structure-aware exploration.
3.7 Chain-of-Alpha
Source:
- arXiv: https://arxiv.org/abs/2508.06312
Important status:
- The arXiv page says this paper has been withdrawn / removed by administrators due to license-right issues.
What It Claimed
The abstract describes a dual-chain architecture:
- Factor Generation Chain;
- Factor Optimization Chain;
- iterative generate-evaluate-refine loop using market data, backtest feedback, and prior optimization knowledge.
How To Treat It
Do not cite this as a stable SOTA claim in a professor-facing deck.
Use it only as:
evidence that the field is converging on iterative LLM alpha search loops.
Better stable substitutes:
- AlphaBench;
- AlphaAgent;
- Alpha Jungle;
- QuantaAlpha.
3.8 Hubble
Source:
- arXiv: https://arxiv.org/abs/2604.09601
What It Is
Hubble is a 2026 LLM-driven agentic framework for safe, diverse, and reproducible alpha factor discovery.
Its core design is highly relevant to your verifier thesis:
- constrain generation with a domain-specific operator language;
- execute formulas through an AST sandbox instead of arbitrary code;
- use dual-channel RAG and family-aware selection;
- score candidates through a deterministic cross-sectional pipeline;
- feed back top formulas and structured diagnostics into later rounds.
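The "AST sandbox instead of arbitrary code" idea can be illustrated with a whitelist evaluator: parse the formula, walk the tree, and refuse any node that is not explicitly allowed, without ever calling `eval`. This is a minimal sketch over scalar fields, not Hubble's operator language; the operator set, `FUNCS`, and `UnsafeFormula` are assumptions made for the example.

```python
import ast
import math
import operator

BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
# Hypothetical whitelist of callable operators.
FUNCS = {"abs": abs, "log": math.log}


class UnsafeFormula(ValueError):
    """Raised when a formula uses syntax outside the whitelist."""


def evaluate(formula, fields):
    """Evaluate a formula over named scalar fields using only whitelisted nodes."""

    def visit(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return float(node.value)
        if isinstance(node, ast.Name) and node.id in fields:
            return fields[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in BIN_OPS:
            return BIN_OPS[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -visit(node.operand)
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in FUNCS and not node.keywords):
            return FUNCS[node.func.id](*[visit(arg) for arg in node.args])
        raise UnsafeFormula(f"disallowed syntax: {ast.dump(node)}")

    return visit(ast.parse(formula, mode="eval").body)


bar = {"open": 100.0, "close": 110.0}
print(evaluate("(close - open) / open", bar))
try:
    evaluate("__import__('os').system('ls')", bar)
except UnsafeFormula as exc:
    print("rejected:", type(exc).__name__)
```

Attribute access, imports, comprehensions, and unknown names all fall through to the final `raise`, so the set of things a generated formula can do is exactly the whitelist. Malformed input still raises `SyntaxError` from `ast.parse`, which a harness would count as an invalid candidate.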
The current arXiv abstract reports a U.S. equity universe of roughly 500 stocks, 104 valid candidates across three rounds, zero runtime crashes, and held-out validation from 2025-06-01 to 2026-03-13.
Why It Matters
Hubble weakens any naive claim that "existing LLM alpha systems do not care about safety or reproducibility." It explicitly cares about:
- executable safety;
- formula validity;
- diversity;
- interpretability;
- post-hoc diagnostics;
- held-out validation.
This is close to the language you want for Crypto-Alpha-Bench.
Gap For Crypto-Alpha-Bench
Hubble is still primarily:
- equity/formula factor discovery;
- daily/cross-sectional rather than crypto mid-frequency;
- not centered on fill simulation, transaction-cost tiers, capacity, DSR/PBO, or leaderboard-style benchmark control.
How To Use It
Use Hubble as a safe generation baseline:
"Hubble shows the right direction for safe LLM factor generation. Crypto-Alpha-Bench asks whether such safe generated factors survive executable crypto trading constraints."
3.9 FactorMiner
Source:
- arXiv: https://arxiv.org/abs/2602.14670
What It Is
FactorMiner is a self-evolving alpha discovery agent built around skills and experience memory.
The key loop is the Ralph Loop:
- retrieve prior experience;
- generate factor candidates;
- evaluate candidates through modular tools;
- distill successful patterns and failure constraints back into memory.
The abstract emphasizes the "Correlation Red Sea" problem: as a factor library grows, new factors become increasingly redundant. FactorMiner tries to reduce redundant search using accumulated memory and modular evaluation tools.
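The redundancy problem can be made concrete with a simple admission rule: a new factor enters the library only if its absolute correlation with every existing factor stays below a threshold. This is a generic sketch, not FactorMiner's mechanism; `admit`, the 0.7 threshold, and the sample series are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def admit(candidate, library, max_abs_corr=0.7):
    """Return (accepted, name_of_most_correlated_existing_factor)."""
    if len(set(candidate)) < 2:
        return False, None  # a constant series carries no signal
    worst_name, worst_corr = None, 0.0
    for name, values in library.items():
        corr = abs(pearson(candidate, values))
        if corr > worst_corr:
            worst_name, worst_corr = name, corr
    return worst_corr <= max_abs_corr, worst_name


# Hypothetical library with one momentum-like factor.
library = {"mom": [1.0, 2.0, 3.0, 4.0, 5.0]}
print(admit([2.0, 4.0, 6.0, 8.0, 10.0], library))   # scaled copy of "mom"
print(admit([1.0, -1.0, 1.0, -1.0, 1.0], library))  # uncorrelated with "mom"
```

As the library grows, the acceptance rate of this gate falls, which is one measurable way to track how crowded the search space has become.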
Why It Matters
This is highly aligned with your Cognition Base idea.
FactorMiner is basically saying:
Alpha discovery needs not just better prompts, but memory over prior trials and systematic distillation of what worked and failed.
That overlaps with your hypothesis that a replication-aware financial Cognition Base may be the hidden variable behind compute-scaled discovery.
Gap For Crypto-Alpha-Bench
FactorMiner is not enough by itself because:
- it optimizes discovery workflow, not public benchmark infrastructure;
- cost/fill/tradability are not the central contribution;
- DSR/PBO are not the main evaluation object;
- crypto perpetual microstructure is not the target substrate.
How To Use It
Use FactorMiner as:
- a memory-augmented alpha-search baseline;
- a literature anchor for your Cognition Base RQ;
- a reason to make your benchmark record full search trajectories, not just final formulas.
3.10 CogAlpha
Source:
- arXiv: https://arxiv.org/abs/2511.18850
What It Is
CogAlpha, introduced in "Cognitive Alpha Mining via LLM-Driven Code-Based Evolution", combines:
- code-level alpha representation;
- LLM-driven reasoning;
- evolutionary mutation and recombination;
- financial feedback over generated alpha candidates.
The paper positions formula-only and neural approaches as too narrow, opaque, redundant, or economically ungrounded, then argues for broader structured exploration using LLMs as adaptive cognitive agents.
Why It Matters
CogAlpha is a direct competitor if Crypto-Alpha-Bench includes code-producing agents.
It moves beyond "LLM writes formula" toward:
- executable code as the search unit;
- evolutionary refinement;
- interpretability through readable strategy logic;
- broader search-space coverage.
Gap For Crypto-Alpha-Bench
The gap is again benchmark discipline:
- A-share equity setting rather than crypto perpetuals;
- no fixed public crypto task protocol;
- no explicit fill/cost/capacity standard;
- no compute-controlled leaderboard;
- no central multiple-testing correction claim.
How To Use It
Treat CogAlpha as a code-evolution baseline. It is especially useful if your benchmark allows agents to submit runnable strategy code rather than only formula expressions.
3.11 Beyond Prompting
Source:
- arXiv: https://arxiv.org/abs/2603.14288
What It Is
The paper "Beyond Prompting: An Autonomous Framework for Systematic Factor Investing via Agentic AI" develops a self-directed agentic framework for systematic factor investing.
The abstract highlights:
- autonomous formulation of interpretable trading signals;
- out-of-sample validation;
- economic rationale requirements;
- U.S. equity long-short portfolios;
- reported annualized Sharpe ratio of 3.11 and return of 59.53%.
Why It Matters
This is important for professor Q&A because it may sound very close to "agentic quant research."
It explicitly tries to move from manual prompting to a self-directed engine.
Gap For Crypto-Alpha-Bench
The main weakness for your agenda is that it is not a benchmark environment:
- it presents an autonomous factor-investing framework;
- it does not define a reusable crypto alpha-search leaderboard;
- transaction-cost/fill/capacity modeling is not the core public protocol;
- it does not solve cross-agent comparability.
How To Use It
Position it as an agent architecture competitor, not a benchmark competitor.
Say:
"Autonomous factor-investing agents are arriving. That makes a hard, public verifier even more urgent."
3.12 Alpha-GPT
Sources:
- ACL Anthology: https://aclanthology.org/2025.emnlp-demos.14/
- arXiv: https://arxiv.org/abs/2308.00016
What It Is
Alpha-GPT is a human-AI interactive alpha mining system, published at EMNLP 2025 System Demonstrations.
It introduces a workflow where the human quant researcher supplies or iterates on ideas, and the LLM framework turns those ideas into candidate alphas. The ACL abstract reports that Alpha-GPT ranked top-10 among over 41,000 teams in the WorldQuant International Quant Championship.
Why It Matters
This is the best reference for your human-expert-in-the-loop angle.
It does not claim that the agent should replace the quant researcher. Instead, it frames LLMs as a way to implement and expand human alpha hypotheses.
That is close to your revised plan:
compare LLM auto-search against discretionary human experts, then study hybrid loops.
Gap For Crypto-Alpha-Bench
Alpha-GPT is:
- interactive rather than benchmark-controlled;
- not crypto-specific;
- not centered on cost/fill/statistical correction;
- hard to compare under fixed compute budgets because human interaction is part of the loop.
How To Use It
Use Alpha-GPT to justify a human-in-loop track:
- autonomous track: agent gets dataset + API + budget;
- assisted track: human expert can steer agent;
- human-only track: discretionary researcher baseline.
This gives the benchmark a richer and more realistic comparison.
4. Trading Decision Agents / Neighboring Systems
These are not direct formula-alpha-search baselines, but they matter because professors may ask whether your benchmark is really about trading agents rather than alpha mining.
4.1 TradingAgents
Source:
- arXiv: https://arxiv.org/abs/2412.20138
What It Is
TradingAgents is a multi-agent LLM trading framework inspired by a trading firm. The abstract lists specialized roles:
- fundamental analysts;
- sentiment analysts;
- technical analysts;
- bull/bear researchers;
- risk management team;
- traders with varied risk profiles.
Why It Matters
TradingAgents is a strong example of:
collaborative LLM trading-decision workflow.
It is useful for RQ2 / open-world agent safety and for human-readable decision processes.
Why It Is Not Your Main Baseline
It outputs trading decisions, not formulaic alpha factors or benchmarkable alpha-search trajectories.
For Crypto-Alpha-Bench:
- include as neighboring context;
- not a primary formula-mining baseline.
4.2 QuantAgent
Source:
- arXiv: https://arxiv.org/abs/2509.09995
What It Is
QuantAgent is a price-driven multi-agent LLM framework for short-horizon / HFT-style trading. The abstract says it decomposes trading into four agents:
- Indicator;
- Pattern;
- Trend;
- Risk.
It targets structured short-horizon signals rather than long-horizon text/fundamental reasoning.
Why It Matters
QuantAgent is the closest neighboring system to your crypto mid-frequency / microstructure framing.
It explicitly criticizes long-horizon LLM trading agents as ill-suited for high-speed precision-critical trading, which overlaps with your concern.
Gap
But according to the abstract:
- evaluation focuses on predictive accuracy and trading metrics across 1-hour and 4-hour intervals;
- it is not an alpha-search benchmark;
- fill / queue / adverse selection / DSR/PBO are not central;
- "HFT" is used broadly; it does not solve sub-second execution realism.
How To Use It
Use QuantAgent as a neighboring baseline / framing contrast:
QuantAgent says structured signals matter for short-horizon trading. Crypto-Alpha-Bench turns that into a fixed benchmark with cost/fill/statistical constraints.
4.3 FinMem
Source:
- arXiv: https://arxiv.org/abs/2311.13743
What It Is
FinMem is a memory-enhanced LLM trading agent. The abstract lists three modules:
- Profiling;
- layered Memory;
- Decision-making.
It aims to imitate aspects of human trader cognition and improve stock trading outcomes.
Why It Matters
FinMem is relevant to the human-expert-in-the-loop revision:
- layered memory;
- trader-like cognitive structure;
- self-evolving professional knowledge;
- real-time tuning.
Gap
It is not primarily:
- formula alpha search;
- benchmark infrastructure;
- crypto execution/fill evaluation;
- multiple-testing-corrected alpha discovery.
How To Use It
Use it as a baseline for:
- memory architecture;
- tacit-knowledge extraction;
- discretionary-agent imitation.
Do not use it as a core Crypto-Alpha-Bench alpha-mining baseline.
4.4 FinAgent
Source:
- arXiv: https://arxiv.org/abs/2402.18485
What It Is
FinAgent is a multimodal foundation agent for financial trading. The abstract describes:
- tool augmentation;
- multimodal market intelligence over numerical, textual, and visual data;
- dual-level reflection;
- diversified memory retrieval;
- reasoning for actions;
- integration of trading strategies and expert insights.
Why It Matters
FinAgent is the strongest representative of:
multimodal generalist financial trading agents.
It helps position the human-expert-in-loop and multimodal context-window direction.
Gap
FinAgent is broad; Crypto-Alpha-Bench should be narrower:
- fixed crypto mid-frequency data;
- executable-alpha metrics;
- cost and fill realism;
- factor-search protocol;
- statistical rigor.
4.5 FinRobot
Source:
- arXiv: https://arxiv.org/abs/2405.14767
What It Is
FinRobot is an open-source AI agent platform for financial applications using LLMs. It is a platform rather than a single alpha-mining model.
The abstract describes layers:
- Financial AI Agents;
- Financial LLM Algorithms;
- LLMOps and DataOps;
- Multi-source LLM foundation models.
Why It Matters
FinRobot is useful as:
- platform reference;
- LLMOps/DataOps architecture reference;
- not a direct competitor benchmark.
Gap
It does not define the executable crypto alpha benchmark you need.
5. Evaluation Frameworks
5.1 AlphaEval
Source:
- arXiv: https://arxiv.org/abs/2508.13174
What It Is
AlphaEval is a comprehensive and efficient evaluation framework for formula alpha mining.
Its abstract emphasizes:
- backtesting is expensive and sequential;
- single metrics are incomplete;
- five-dimensional evaluation:
  - predictive power;
  - stability;
  - robustness to market perturbations;
  - financial logic;
  - diversity.
How It Relates To AlphaBench
- AlphaBench = benchmark for LLM capabilities in formulaic alpha mining.
- AlphaEval = evaluation framework for generated formula alphas.
How It Relates To Crypto-Alpha-Bench
Crypto-Alpha-Bench can use AlphaEval-like dimensions, then add:
- fixed public crypto data;
- explicit cost tiers;
- fill/tradability gates;
- compute-control;
- synthetic ground truth;
- DSR/PBO;
- human expert baseline.
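Of the additions above, DSR is the one most often left vague. The sketch below follows the Deflated Sharpe Ratio of Bailey and López de Prado as commonly stated: the observed per-period Sharpe is compared with the expected maximum Sharpe across `n_trials` candidates under the null, with a correction for skewness and kurtosis. Population moments are used throughout, and `sr_std` (the standard deviation of Sharpe estimates across trials) is an input the harness would have to supply; both are assumptions of this sketch.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_NORMAL = NormalDist()


def expected_max_sharpe(sr_std, n_trials):
    """Approximate expected maximum Sharpe under the null across n_trials."""
    if n_trials < 2:
        return 0.0  # a single trial needs no deflation
    a = _NORMAL.inv_cdf(1 - 1 / n_trials)
    b = _NORMAL.inv_cdf(1 - 1 / (n_trials * math.e))
    return sr_std * ((1 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe(returns, sr_std, n_trials):
    """Probability that the true Sharpe exceeds the multiple-testing benchmark."""
    t = len(returns)
    if t < 2:
        raise ValueError("need at least two returns")
    mean = sum(returns) / t
    dev = [r - mean for r in returns]
    m2 = sum(d ** 2 for d in dev) / t
    if m2 == 0:
        raise ValueError("returns have zero variance")
    skew = (sum(d ** 3 for d in dev) / t) / m2 ** 1.5
    kurt = (sum(d ** 4 for d in dev) / t) / m2 ** 2
    sr = mean / math.sqrt(m2)  # per-period, not annualized
    sr0 = expected_max_sharpe(sr_std, n_trials)
    denom = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return _NORMAL.cdf((sr - sr0) * math.sqrt(t - 1) / denom)


# Hypothetical per-period returns of the best strategy found by a search.
rets = [0.02, -0.01, 0.015, -0.005, 0.01, 0.0, 0.02, -0.015]
print(deflated_sharpe(rets, sr_std=0.1, n_trials=10))
print(deflated_sharpe(rets, sr_std=0.1, n_trials=100))
```

The same return series scores lower when it was selected from 100 candidates than from 10, which is why the number of candidate evaluations has to be logged by the compute-control layer rather than self-reported.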
Soundbite
AlphaEval gives the evaluation axes; Crypto-Alpha-Bench gives the executable benchmark substrate.
6. What Crypto-Alpha-Bench Should Claim
6.1 Claims To Avoid
Avoid:
- "first alpha mining benchmark";
- "first LLM financial agent benchmark";
- "first alpha auto search framework";
- "first financial trading agent benchmark."
Those are too broad and likely false after AlphaBench / RD-Agent(Q) / AlphaEval.
6.2 Defensible Claim
Use:
Crypto-Alpha-Bench is a benchmark for executable crypto mid-frequency alpha search, designed to evaluate not only formula quality but cost-adjusted, fill-aware, statistically corrected, compute-controlled discovery.
Even sharper:
Existing benchmarks test whether LLMs can generate alpha formulas. Crypto-Alpha-Bench tests whether alpha-search systems can discover tradable crypto alphas under realistic execution and statistical constraints.
6.3 Minimum Differentiators
Do not launch Crypto-Alpha-Bench without these:
- Fixed crypto perp dataset
  - e.g. Binance USD-M top-N perps;
  - versioned manifest;
  - gap handling protocol.
- Three cost tiers
  - optimistic;
  - realistic;
  - pessimistic;
  - all leaderboard submissions report all three.
- Fill / tradability gate
  - spread;
  - top-of-book notional;
  - depth;
  - adverse selection proxy;
  - partial-fill handling.
- Statistical rigor
  - DSR;
  - PBO / CSCV;
  - null-search baseline.
- Compute-control
  - token budget;
  - wall-clock;
  - GPU/CPU tier;
  - number of candidate evaluations.
- Synthetic ground truth
  - known alpha process;
  - known regime shift;
  - known execution-cost sensitivity.
- Reference baselines
  - random;
  - gplearn / GP;
  - AlphaBench-style CoE/ToT/EA;
  - AlphaAgent regularized LLM;
  - LLM+MCTS;
  - GFlowNet / AlphaSAGE-style;
  - your M8.6 tradability baseline.
Optional but distinctive:
- Human expert discretionary baseline
  - only if trader cooperation is realistic.
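As one illustration of the fill / tradability gate, the sketch below rejects orders when the quoted spread is too wide and otherwise caps the fill at a fraction of top-of-book notional, which gives a crude form of partial-fill handling. Depth beyond the top level and adverse selection are not modeled. `BookSnapshot`, `GateConfig`, and both thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BookSnapshot:
    bid: float
    ask: float
    bid_size: float  # quantity resting at the best bid
    ask_size: float  # quantity resting at the best ask


@dataclass(frozen=True)
class GateConfig:
    max_spread_bps: float = 5.0     # hypothetical threshold
    max_participation: float = 0.1  # share of top-of-book notional we may take


def gated_fill(side, order_notional, book, cfg):
    """Notional allowed to fill; 0.0 when the gate rejects the order."""
    mid = (book.bid + book.ask) / 2
    spread_bps = (book.ask - book.bid) / mid * 1e4
    if spread_bps > cfg.max_spread_bps:
        return 0.0
    if side == "buy":
        top_notional = book.ask * book.ask_size
    elif side == "sell":
        top_notional = book.bid * book.bid_size
    else:
        raise ValueError(f"unknown side: {side!r}")
    return min(order_notional, cfg.max_participation * top_notional)


cfg = GateConfig()
tight = BookSnapshot(bid=100.00, ask=100.02, bid_size=50.0, ask_size=40.0)
wide = BookSnapshot(bid=100.00, ask=101.00, bid_size=50.0, ask_size=40.0)
print(gated_fill("buy", 1000.0, tight, cfg))  # capped by participation limit
print(gated_fill("buy", 100.0, tight, cfg))   # small order fills in full
print(gated_fill("buy", 1000.0, wide, cfg))   # spread too wide, rejected
```

A backtest would apply this per bar and trade only the returned notional, so a factor that looks strong on illiquid perps loses most of its position size before costs are even charged.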
7. Recommended Survey Workplan
7.1 Stage 1 · Close Reading Priority
Read in this order:
- AlphaBench
- RD-Agent(Q)
- Hubble
- FactorMiner
- QuantaAlpha
- CogAlpha
- AlphaAgent
- Navigating the Alpha Jungle
- AlphaSAGE
- AlphaEval
- Beyond Prompting
- Alpha-GPT
- QuantAgent
- TradingAgents
- FinMem / FinAgent
7.2 Stage 2 · Extraction Template
For each paper/system:
Title:
Year / venue:
Task:
Search unit:
Generator:
Verifier:
Feedback granularity:
Data:
Cost model:
Statistical rigor:
Compute control:
Reproducibility:
Main reported result:
Main weakness:
What Crypto-Alpha-Bench should borrow:
What Crypto-Alpha-Bench should beat:
7.3 Stage 3 · Benchmark Design Decision
After reading:
- if AlphaBench already covers enough of formula mining, narrow Crypto-Alpha-Bench to executable crypto alphas;
- if RD-Agent(Q) is too strong as a system, avoid competing as "agent architecture"; compete as "hard evaluation environment";
- if QuantaAlpha / AlphaAgent / MCTS outperform simple baselines, include them as reference baselines rather than reinvent them.
8. Updated HKU Talk Adjustment
Add one slide before "The Field Has No ImageNet Moment":
Slide: Existing SOTA Is Close, But Not The Same
| Existing Work | Covers | Missing For My Goal |
|---|---|---|
| AlphaBench | LLM formula alpha benchmark | executable crypto cost/fill/statistics |
| RD-Agent(Q) | full quant R&D agent | benchmark substrate |
| Hubble / FactorMiner / CogAlpha | safe generation, memory, code evolution | common executable crypto protocol |
| AlphaAgent / QuantaAlpha / Alpha Jungle | search methods | fixed benchmark discipline |
| Alpha-GPT | human-AI interactive alpha mining | controlled human/agent comparison |
| TradingAgents / QuantAgent | trading decisions | alpha-search benchmark |
| AlphaEval | evaluation dimensions | fixed benchmark + tradability |
Then say:
"So the claim is not that nothing exists. The claim is that these systems stop before the executable crypto-alpha layer I need."
This makes the proposal much more robust.
9. Final Positioning
Best one-sentence positioning:
Crypto-Alpha-Bench is not another LLM alpha-mining agent. It is the executable verifier and benchmark layer that current alpha-mining agents lack.
Best professor-facing version:
"I want to benchmark not just whether an agent can produce a plausible factor, but whether the factor survives costs, fills, time-slice instability, multiple testing, and compute-controlled comparison in crypto markets."
10. Source Index
- AlphaBench project: https://alphabench.cc/
- AlphaBench OpenReview: https://openreview.net/forum?id=d97Q8r7ZKZ
- RD-Agent(Q) Microsoft Research: https://www.microsoft.com/en-us/research/publication/rd-agent-quant-a-multi-agent-framework-for-data-centric-factors-and-model-joint-optimization/
- RD-Agent(Q) arXiv: https://arxiv.org/abs/2505.15155
- QuantaAlpha: https://arxiv.org/abs/2602.07085
- AlphaAgent: https://arxiv.org/abs/2502.16789
- Navigating the Alpha Jungle: https://arxiv.org/abs/2505.11122
- AlphaSAGE: https://arxiv.org/abs/2509.25055
- Chain-of-Alpha: https://arxiv.org/abs/2508.06312
- Hubble: https://arxiv.org/abs/2604.09601
- FactorMiner: https://arxiv.org/abs/2602.14670
- CogAlpha: https://arxiv.org/abs/2511.18850
- Beyond Prompting: https://arxiv.org/abs/2603.14288
- Alpha-GPT ACL Anthology: https://aclanthology.org/2025.emnlp-demos.14/
- Alpha-GPT arXiv: https://arxiv.org/abs/2308.00016
- TradingAgents: https://arxiv.org/abs/2412.20138
- QuantAgent: https://arxiv.org/abs/2509.09995
- FinMem: https://arxiv.org/abs/2311.13743
- FinAgent: https://arxiv.org/abs/2402.18485
- FinRobot: https://arxiv.org/abs/2405.14767
- AlphaEval: https://arxiv.org/abs/2508.13174