
Financial SOTA Agent Survey for Crypto-Alpha-Bench

Purpose: map current financial SOTA agents / alpha-mining systems before committing to Crypto-Alpha-Bench.

Updated: 2026-05-19

Core question: What existing agents and benchmarks already cover alpha auto search, and what gap remains for Crypto-Alpha-Bench?


0. Executive Verdict

Yes: before positioning Crypto-Alpha-Bench, we must survey current financial SOTA agents and alpha-mining benchmarks.

The important conclusion is not "nobody has done this." It is:

Existing work already covers formula alpha generation, factor evaluation, iterative LLM search, multi-agent quant R&D, and trading-decision agents. Crypto-Alpha-Bench must therefore position itself as an executable crypto mid-frequency alpha benchmark with fixed data, cost tiers, fill/tradability constraints, DSR/PBO, compute control, synthetic ground truth, and optional human expert baseline.

The biggest direct threats:

  1. AlphaBench: already claims the first systematic benchmark for LLMs in formulaic alpha factor mining.
  2. RD-Agent(Q): already claims full-stack multi-agent quant R&D with factor-model joint optimization.
  3. QuantaAlpha / AlphaAgent / Alpha Jungle: already cover LLM-driven formula alpha search with increasingly sophisticated exploration and regularization.
  4. Hubble / FactorMiner / CogAlpha / Beyond Prompting: 2025-2026 work pushes toward safer sandboxed generation, memory-driven self-evolution, code-level evolution, and autonomous factor investing.

The clean gap:

Most existing systems optimize formula alpha quality or trading-decision performance. They do not jointly enforce crypto-specific executable-alpha constraints: public crypto perp data, cost tiers, fill/tradability gates, compute-controlled search, multiple-testing correction, synthetic ground truth, and human discretionary baseline.


1. Taxonomy: What Counts As "Financial Agent SOTA"?

Use four buckets. Do not mix them.

Bucket | What It Optimizes | Typical Output | Direct Threat To Crypto-Alpha-Bench?
A. Alpha Mining Agents | Discover formulaic alphas / factors | alpha expression, factor pool | High
B. Full Quant R&D Agents | Automate research workflow: hypothesis → code → backtest → feedback | factor + model + report | High
C. Trading Decision Agents | Decide buy/sell/hold or portfolio action from multimodal context | trading action | Medium
D. Benchmark / Evaluation Frameworks | Standardize tasks and metrics | leaderboard / protocol | High

Crypto-Alpha-Bench should mostly compare against A/B/D, while using C as neighboring context.


2. Benchmark Gap Matrix

This is the single most important table.

System | Task | Search Unit | Verifier | Data / Market | Cost / Fill | Multiple Testing | Compute Control | Main Gap For Crypto-Alpha-Bench
AlphaBench | LLM formulaic alpha mining benchmark | formula expression | Qlib backtest + factor metrics | CSI300 + SP500, 2020-2025 | not the central axis | not central | analyzes model/settings, but not a strict compute-budget benchmark | no crypto perp executable-alpha protocol
RD-Agent(Q) | full quant R&D automation | factor + model code | real-market backtests + feedback stage | stock markets | not the central axis | not emphasized | MAB scheduling, but benchmark compute control not the core | strong agent; weak public benchmark substrate
QuantaAlpha | LLM-driven evolutionary alpha mining | mining trajectory + factor | backtest / IC / return metrics | CSI300 + transfer to CSI500/SP500 | not central | not central | trajectory reuse, but not fixed budget protocol | trajectory search is useful baseline, not full benchmark
AlphaAgent | decay-resistant LLM alpha mining | formula expression | market evaluation across regimes | CSI500 + SP500 | not central | regularizes decay but not DSR/PBO | not core | useful regularization baseline
Alpha Jungle | LLM + MCTS formula mining | formula expression / tree | backtest feedback | stock market data | not central | not central | MCTS budget implicit | search baseline, not benchmark protocol
AlphaSAGE | GFlowNet alpha mining | expression graph / factor portfolio | multi-faceted reward | stock alpha mining | not central | not central | GFlowNet exploration budget | strong non-LLM alpha-search baseline
Hubble | safe/reproducible LLM alpha discovery | operator tree | AST sandbox + cross-sectional metrics | U.S. equities | turnover included, but fill/cost not core | OOS/HAC evidence, not DSR/PBO benchmark | fixed rounds/candidate counts | closest to safe verifier framing, still equity/formula-centric
FactorMiner | self-evolving alpha discovery | formulaic factor + memory | modular evaluation tools | multi-asset datasets | not central | redundancy/correlation control, not DSR/PBO | lightweight iterative loop | strong memory baseline, not crypto executable benchmark
CogAlpha | LLM-driven code evolution | executable alpha code | evolutionary fitness feedback | A-share equities | not central | robustness/generalization claimed, not benchmark protocol | evolutionary budget implicit | useful code-evolution baseline
Beyond Prompting | autonomous systematic factor investing | interpretable signal set | OOS validation + economic rationale | U.S. equities | not central | data-snooping addressed qualitatively | not core | strong autonomous-agent story, but not public benchmark substrate
Alpha-GPT | human-AI interactive alpha mining | human idea -> alpha | alpha mining experiments / competition evaluation | WorldQuant-style alpha context | not central | not central | interactive workflow, not controlled budget | useful human-in-loop baseline
TradingAgents | collaborative trading-decision agents | buy/sell/trade decision | portfolio returns, Sharpe, MDD | stocks | not central | not central | not core | trading chatbot/team, not alpha benchmark
QuantAgent | short-horizon / HFT-style LLM trading | structured-signal decision | predictive accuracy + trading metrics | 9 instruments incl. BTC/Nasdaq futures | not rigorous fill sim | not central | not core | relevant to crypto HFT framing, but not formula alpha mining
FinMem | memory-augmented LLM trading | investment decision | stock trading performance | stocks | not central | not central | not core | memory architecture reference, not benchmark
FinAgent | multimodal foundation trading agent | trading action | 6 financial metrics | stocks + crypto datasets | not central | not central | not core | multimodal trading baseline, not alpha search
AlphaEval | alpha evaluation framework | generated alpha | 5D backtest-free evaluation | formula alphas | cost/tradability not central | not central in abstract | not core | evaluation backbone, not benchmark substrate

Result:

The benchmark claim must not be "first alpha benchmark." It should be "first executable crypto mid-frequency alpha-search benchmark with cost/fill/statistical rigor."


3. Direct Competitors / Must-Read Systems

3.1 AlphaBench

Sources:

  • Project: https://alphabench.cc/
  • OpenReview: https://openreview.net/forum?id=d97Q8r7ZKZ

What It Is

AlphaBench is an ICLR 2026 benchmark for evaluating LLMs in Formulaic Alpha Factor Mining (FAFM). It covers three core tasks:

  1. factor generation;
  2. factor evaluation;
  3. iterative factor searching.

The project page says its toolchain includes 1,857 instructions, an FFO execution engine, and Qlib-based backtesting. It evaluates generation settings and search paradigms such as Chain-of-Experience, Tree-of-Thought, and Evolutionary Algorithms.

Architecture / Task Design

Component | Details
Search unit | formulaic alpha factor expression
Generator | LLMs under different prompting/reasoning/search settings
Verifier | executable formula engine + Qlib backtesting
Tasks | Text2Alpha, Directional Mining, FactorEval, CoE, ToT, EA
Dataset | CSI300 and SP500, 2020-2025 according to project page
Metrics | IC, RankIC, robustness, win rate, skewness, reliability, stability, semantic alignment
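
As a reference point for the metrics row above: IC is conventionally the cross-sectional Pearson correlation between factor values and next-period returns, and RankIC is the same correlation computed on ranks (Spearman). The following is a minimal sketch of those two definitions, not AlphaBench's own implementation; the function names and the five-asset sample data are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def ic(factor, fwd_ret):
    """Information coefficient: Pearson correlation across assets."""
    return pearson(factor, fwd_ret)


def rank_ic(factor, fwd_ret):
    """Rank IC: Pearson correlation of the ranks (Spearman)."""
    return pearson(average_ranks(factor), average_ranks(fwd_ret))


# Hypothetical one-period cross-section of five assets.
factor = [0.1, 0.4, 0.2, 0.9, 0.5]
fwd_ret = [0.00, 0.01, 0.005, 0.10, 0.02]
print(ic(factor, fwd_ret), rank_ic(factor, fwd_ret))
```

In this sample the factor orders the assets exactly as the returns do, so RankIC is 1 while IC is lower because the relationship is not linear. A benchmark would average these per-period values over time.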

What It Solves

AlphaBench is the strongest existing answer to:

"Can LLMs generate, evaluate, and search formulaic alphas?"

It gives a standardized FAFM benchmark and should be treated as the closest benchmark sibling.

What It Does Not Solve For Us

Crypto-Alpha-Bench should differ on:

  • crypto perpetual futures instead of stock FAFM only;
  • mid-frequency executable alpha, not just formula validity / stock factor quality;
  • explicit transaction-cost tiers;
  • fill/tradability gate;
  • DSR/PBO / multiple-testing discipline;
  • compute-controlled search budgets;
  • synthetic ground-truth task;
  • optional human expert discretionary baseline.
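
For concreteness, the DSR item above refers to the deflated Sharpe ratio of Bailey and López de Prado: the observed Sharpe ratio is tested against the maximum Sharpe one would expect from N skill-less trials, with a correction for non-normal returns. The sketch below follows that published formula; the function and parameter names are hypothetical, and inputs are assumed to be per-period (non-annualized) values.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(n_trials, sr_std):
    """Expected maximum Sharpe among n_trials skill-less trials.

    sr_std is the standard deviation of Sharpe estimates across trials.
    Requires n_trials >= 2 (the inverse CDF is undefined at 0).
    """
    z = NormalDist()
    return sr_std * (
        (1 - EULER_GAMMA) * z.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * z.inv_cdf(1 - 1 / (n_trials * e))
    )


def deflated_sharpe(sr, n_obs, skew, kurt, n_trials, sr_std):
    """Probability that the true Sharpe exceeds the selection-bias benchmark.

    sr: observed per-period Sharpe of the selected strategy
    n_obs: number of return observations
    skew, kurt: sample skewness and (non-excess) kurtosis; normal = 0 and 3
    """
    sr0 = expected_max_sharpe(n_trials, sr_std)
    denom = sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return NormalDist().cdf((sr - sr0) * sqrt(n_obs - 1) / denom)


# Same observed Sharpe, but found after 10 vs. 1,000 candidate evaluations.
few = deflated_sharpe(0.1, 1000, 0.0, 3.0, n_trials=10, sr_std=0.05)
many = deflated_sharpe(0.1, 1000, 0.0, 3.0, n_trials=1000, sr_std=0.05)
print(few, many)
```

The point for the benchmark is that the number of candidate evaluations enters the statistic directly, which is why search budgets must be logged for every submission.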

How To Position Against It

Say:

AlphaBench is the closest existing benchmark for LLM formula alpha mining. Crypto-Alpha-Bench should adopt its task decomposition, but extend the benchmark target from formula quality to executable crypto mid-frequency alpha under cost, fill, and statistical constraints.

Do not say:

"There is no alpha benchmark."

That is no longer defensible.


3.2 RD-Agent(Q)

Sources:

  • Microsoft Research: https://www.microsoft.com/en-us/research/publication/rd-agent-quant-a-multi-agent-framework-for-data-centric-factors-and-model-joint-optimization/
  • arXiv: https://arxiv.org/abs/2505.15155
  • Docs: https://rdagent.readthedocs.io/en/latest/scens/quant_agent_fin.html

What It Is

RD-Agent(Q) is a data-centric multi-agent framework for automated quant R&D. It targets factor-model joint optimization, not only formula alpha mining.

From the Microsoft abstract, it decomposes quant research into:

  • a Research stage that sets goal-aligned prompts, formulates hypotheses from domain priors, and maps them to tasks;
  • a Development stage using a code-generation agent, Co-STEER, to implement task-specific code;
  • a Feedback stage that evaluates real-market backtests and informs later iterations;
  • a multi-armed bandit scheduler for adaptive direction selection.
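
To illustrate the idea behind the scheduler bullet, here is a generic UCB1 bandit that allocates iterations between two research directions. This is an assumption-laden sketch, not RD-Agent(Q)'s actual scheduler (which may use a different bandit algorithm); the class name, the arm names, and the fixed rewards are all hypothetical.

```python
import math


class UCB1Scheduler:
    """Generic UCB1 bandit over named research directions."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {a: 0 for a in self.arms}
        self.mean_reward = {a: 0.0 for a in self.arms}

    def select(self):
        # Try every arm once before using confidence bounds.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        total = sum(self.counts.values())

        def upper_bound(arm):
            bonus = math.sqrt(2 * math.log(total) / self.counts[arm])
            return self.mean_reward[arm] + bonus

        return max(self.arms, key=upper_bound)

    def update(self, arm, reward):
        # Incremental running mean of observed rewards for this arm.
        self.counts[arm] += 1
        self.mean_reward[arm] += (reward - self.mean_reward[arm]) / self.counts[arm]


# Hypothetical fixed rewards, e.g. backtest improvement per iteration.
fake_reward = {"factor": 0.8, "model": 0.2}
scheduler = UCB1Scheduler(["factor", "model"])
for _ in range(50):
    arm = scheduler.select()
    scheduler.update(arm, fake_reward[arm])
print(scheduler.counts)
```

In a real loop the reward would come from the Feedback stage rather than a fixed table; the scheduler then shifts effort toward whichever direction has been paying off while still occasionally revisiting the other.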

Architecture / Task Design

Component | Details
Search unit | factor ideas, model innovations, code
Generator | multi-agent hypothesis and code-generation workflow
Verifier | executable code + real-market backtests
Feedback | analysis unit + MAB scheduler
Knowledge grounding | domain priors / knowledge forest / prior outcomes
Claim | up to 2x annualized return vs classical factor libraries with fewer factors, according to abstract

What It Solves

RD-Agent(Q) is the strongest "full quant R&D automation" baseline.

It addresses:

  • hypothesis generation;
  • code implementation;
  • backtest verification;
  • factor-model co-design;
  • adaptive allocation of research effort.

What It Does Not Solve For Us

It is not primarily a public benchmark protocol:

  • no crypto-specific executable benchmark;
  • no explicit public cost-tier/fill/tradability protocol as core contribution;
  • no central DSR/PBO benchmark claim;
  • no human expert discretionary baseline;
  • no fixed leaderboard framing.

How To Position Against It

Say:

RD-Agent(Q) is a model of what the research agent could become. Crypto-Alpha-Bench is the evaluation substrate such an agent would need in crypto mid-frequency settings.

In other words:

  • RD-Agent(Q) = strong agent architecture.
  • Crypto-Alpha-Bench = hard verifier / benchmark environment.

3.3 QuantaAlpha

Source:

  • arXiv: https://arxiv.org/abs/2602.07085

What It Is

QuantaAlpha is an evolutionary LLM-driven alpha mining framework. It treats each end-to-end mining run as a trajectory and improves factors using trajectory-level mutation and crossover.

According to the abstract, it:

  • localizes suboptimal steps for targeted revision;
  • recombines high-reward trajectory segments;
  • enforces semantic consistency across hypothesis, factor expression, and executable code;
  • constrains complexity and redundancy to reduce crowding.

Architecture / Task Design

Component | Details
Search unit | trajectory of a mining run, not just a formula
Generator | LLM-driven factor generation + evolutionary mutation/crossover
Verifier | IC / ARR / MDD / backtest-style metrics
Data | CSI300, with transfer claims to CSI500 and S&P500 according to abstract
Novelty | reuse validated experience at trajectory level

Why It Matters

This is the natural next step after AlphaAgent / Alpha Jungle:

  • AlphaAgent regularizes individual factor generation.
  • Alpha Jungle improves tree search.
  • QuantaAlpha improves whole search trajectories.

For your benchmark, QuantaAlpha should be a strong search-agent baseline.

Gap

Still not enough for Crypto-Alpha-Bench:

  • equity-centric;
  • not a public executable crypto benchmark;
  • transaction cost / fill / capacity not central;
  • multiple-testing discipline not central;
  • compute-control not the main object.

How To Use It

In Crypto-Alpha-Bench:

  • include a QuantaAlpha-style trajectory search baseline if implementation is available;
  • require it to report compute budget and DSR/PBO;
  • compare its trajectory reuse against simpler EA/ToT/CoE search under equal budget.

3.4 AlphaAgent

Source:

  • arXiv: https://arxiv.org/abs/2502.16789

What It Is

AlphaAgent is an autonomous LLM-driven alpha mining framework focused on alpha decay resistance.

The abstract highlights three mechanisms:

  1. AST-based originality enforcement against existing alphas;
  2. LLM-evaluated hypothesis-factor semantic alignment;
  3. AST-based complexity control to prevent over-engineered formulas.
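
The third mechanism can be pictured with a small check over a formula's syntax tree. The sketch below is not AlphaAgent's code: it assumes formulas are written in a Python-parsable syntax, and the operator names (`rank`, `ts_mean`), the helper names, and the thresholds are hypothetical.

```python
import ast


def _operands(node):
    """Child expressions of an operator node; names and constants are leaves."""
    if isinstance(node, ast.BinOp):
        return [node.left, node.right]
    if isinstance(node, ast.UnaryOp):
        return [node.operand]
    if isinstance(node, ast.Call):
        return list(node.args)
    return []


def depth(node):
    """Nesting depth of the expression tree (a bare leaf has depth 1)."""
    return 1 + max((depth(child) for child in _operands(node)), default=0)


def n_operators(node):
    """Number of operator applications (calls, binary and unary operators)."""
    own = 1 if isinstance(node, (ast.BinOp, ast.UnaryOp, ast.Call)) else 0
    return own + sum(n_operators(child) for child in _operands(node))


def within_complexity(expr, max_depth=4, max_ops=6):
    """True if the formula stays under both (hypothetical) thresholds."""
    root = ast.parse(expr, mode="eval").body
    return depth(root) <= max_depth and n_operators(root) <= max_ops


simple = "rank(ts_mean(close, 5)) - rank(volume)"
nested = "rank(rank(rank(rank(close))))"
print(within_complexity(simple), within_complexity(nested))
```

The same tree representation could also support the originality check, for example by comparing subtrees against an existing alpha library, though that is not shown here.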

Architecture / Task Design

Component | Details
Search unit | formulaic alpha expression
Generator | LLM agent
Regularizer | originality, semantic alignment, complexity
Verifier | market performance across CSI500 and S&P500 settings
Main goal | reduce alpha decay / crowding / homogeneity

Why It Matters

AlphaAgent is the best baseline for:

"Can LLMs generate less crowded, more interpretable, decay-resistant formula alphas?"

It is directly relevant to your Cognition Base / Red Queen idea.

Gap

AlphaAgent regularizes factor generation, but it does not solve:

  • benchmark fixed dataset/protocol;
  • crypto execution;
  • fill/tradability constraints;
  • strict DSR/PBO;
  • cost-tier reporting.

How To Use It

Crypto-Alpha-Bench should include AlphaAgent-style regularization as one baseline axis:

  • no regularization;
  • AST originality only;
  • semantic alignment only;
  • complexity control only;
  • all combined.

Then test which survives crypto mid-frequency execution costs.


3.5 Navigating the Alpha Jungle

Source:

  • arXiv: https://arxiv.org/abs/2505.11122

What It Is

This paper integrates LLMs with Monte Carlo Tree Search for formulaic factor mining.

The abstract's key mechanisms:

  • LLM generates/refines symbolic alpha formulas;
  • MCTS explores the formula search space;
  • quantitative backtest feedback guides search;
  • frequent subtree avoidance improves diversity and avoids homogenization.

Architecture / Task Design

Component | Details
Search unit | symbolic formula tree
Generator | LLM prior
Search | MCTS
Verifier | backtest feedback
Anti-collapse | frequent subtree avoidance

Why It Matters

This is the most direct "AlphaProof-style" migration into alpha mining:

  • LLM prior;
  • tree search;
  • external verifier feedback.

Gap

It still lacks:

  • unified benchmark protocol;
  • crypto execution/fill/cost emphasis;
  • formal multiple-testing reporting;
  • compute-budget fairness across search algorithms.

How To Use It

Crypto-Alpha-Bench should include an LLM+MCTS baseline:

  • same primitive set;
  • same expression grammar;
  • same expression budget;
  • same cost-tier evaluation;
  • compare against random, EA, GFlowNet, QuantaAlpha-style trajectory search.

3.6 AlphaSAGE

Source:

  • arXiv: https://arxiv.org/abs/2509.25055

What It Is

AlphaSAGE uses GFlowNets for structure-aware alpha mining. It addresses three problems in RL alpha generation:

  • sparse rewards at formula completion;
  • inadequate sequential representation of expression structure;
  • single-mode optimization, which conflicts with the need for diverse non-correlated alphas.

The abstract lists three innovations:

  • RGCN-based structure-aware encoder;
  • GFlowNet generation;
  • dense multi-faceted reward.

Architecture / Task Design

Component | Details
Search unit | expression graph / formula structure
Generator | GFlowNet
Encoder | RGCN structure-aware encoder
Objective | diverse high-reward modes
Verifier | alpha quality rewards

Why It Matters

For Crypto-Alpha-Bench, AlphaSAGE is important because not all strong alpha search baselines are LLM agents.

It is a strong baseline for:

diverse factor portfolio generation.

Gap

It does not address:

  • LLM agent workflow;
  • executable crypto costs/fill;
  • benchmark protocol;
  • DSR/PBO;
  • human expert baseline.

How To Use It

Include a GFlowNet baseline in the benchmark after v0:

  • if it beats LLM search under equal compute, that is a strong negative result against LLM-first alpha mining;
  • if LLM+GFlowNet hybrid wins, that motivates Cognition Base + structure-aware exploration.

3.7 Chain-of-Alpha

Source:

  • arXiv: https://arxiv.org/abs/2508.06312

Important status:

  • The arXiv page says this paper has been withdrawn / removed by administrators due to license-right issues.

What It Claimed

The abstract describes a dual-chain architecture:

  • Factor Generation Chain;
  • Factor Optimization Chain;
  • iterative generate-evaluate-refine loop using market data, backtest feedback, and prior optimization knowledge.

How To Treat It

Do not cite this as a stable SOTA claim in a professor-facing deck.

Use it only as:

evidence that the field is converging on iterative LLM alpha search loops.

Better stable substitutes:

  • AlphaBench;
  • AlphaAgent;
  • Alpha Jungle;
  • QuantaAlpha.

3.8 Hubble

Source:

  • arXiv: https://arxiv.org/abs/2604.09601

What It Is

Hubble is a 2026 LLM-driven agentic framework for safe, diverse, and reproducible alpha factor discovery.

Its core design is highly relevant to your verifier thesis:

  • constrain generation with a domain-specific operator language;
  • execute formulas through an AST sandbox instead of arbitrary code;
  • use dual-channel RAG and family-aware selection;
  • score candidates through a deterministic cross-sectional pipeline;
  • feed back top formulas and structured diagnostics into later rounds.
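
The "AST sandbox instead of arbitrary code" idea can be sketched as a whitelist interpreter: the formula is parsed but never passed to `eval`, and any syntax outside a small operator language is rejected. This is an illustrative sketch only, not Hubble's implementation; the operator set, the `UnsafeFormula` exception, and the sample data are hypothetical.

```python
import ast
import operator

BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _rank(xs):
    """Cross-sectional rank scaled to (0, 1]; ties broken by position."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0.0] * len(xs)
    for r, i in enumerate(order):
        out[i] = (r + 1) / len(xs)
    return out


FUNCS = {"rank": _rank}  # hypothetical whitelisted operators


class UnsafeFormula(ValueError):
    """Raised when a formula uses syntax outside the whitelist."""


def evaluate(expr, data):
    """Evaluate a formula element-wise over per-asset lists in `data`."""
    n = len(next(iter(data.values())))

    def ev(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return [float(node.value)] * n  # broadcast the constant
        if isinstance(node, ast.Name) and node.id in data:
            return list(data[node.id])
        if isinstance(node, ast.BinOp) and type(node.op) in BIN_OPS:
            fn = BIN_OPS[type(node.op)]
            return [fn(a, b) for a, b in zip(ev(node.left), ev(node.right))]
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in FUNCS and not node.keywords):
            return FUNCS[node.func.id](*[ev(arg) for arg in node.args])
        raise UnsafeFormula(f"disallowed syntax: {type(node).__name__}")

    return ev(ast.parse(expr, mode="eval").body)


data = {"close": [3.0, 1.0, 2.0], "volume": [10.0, 30.0, 20.0]}
print(evaluate("rank(close) - rank(volume)", data))
```

Because attribute access, imports, and unknown names all fall through to the final `raise`, a generated string such as `__import__('os').system('ls')` is rejected without ever being executed. A production sandbox would also need guards this sketch omits, such as division by zero and resource limits.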

The current arXiv abstract reports a U.S. equity universe of roughly 500 stocks, 104 valid candidates across three rounds, zero runtime crashes, and held-out validation from 2025-06-01 to 2026-03-13.

Why It Matters

Hubble weakens any naive claim that "existing LLM alpha systems do not care about safety or reproducibility." It explicitly cares about:

  • executable safety;
  • formula validity;
  • diversity;
  • interpretability;
  • post-hoc diagnostics;
  • held-out validation.

This is close to the language you want for Crypto-Alpha-Bench.

Gap For Crypto-Alpha-Bench

Hubble is still primarily:

  • equity/formula factor discovery;
  • daily/cross-sectional rather than crypto mid-frequency;
  • not centered on fill simulation, transaction-cost tiers, capacity, DSR/PBO, or leaderboard-style benchmark control.

How To Use It

Use Hubble as a safe generation baseline:

"Hubble shows the right direction for safe LLM factor generation. Crypto-Alpha-Bench asks whether such safe generated factors survive executable crypto trading constraints."


3.9 FactorMiner

Source:

  • arXiv: https://arxiv.org/abs/2602.14670

What It Is

FactorMiner is a self-evolving alpha discovery agent built around skills and experience memory.

The key loop is the Ralph Loop:

  1. retrieve prior experience;
  2. generate factor candidates;
  3. evaluate candidates through modular tools;
  4. distill successful patterns and failure constraints back into memory.

The abstract emphasizes the "Correlation Red Sea" problem: as a factor library grows, new factors become increasingly redundant. FactorMiner tries to reduce redundant search using accumulated memory and modular evaluation tools.

Why It Matters

This is highly aligned with your Cognition Base idea.

FactorMiner is basically saying:

Alpha discovery needs not just better prompts, but memory over prior trials and systematic distillation of what worked and failed.

That overlaps with your hypothesis that a replication-aware financial Cognition Base may be the hidden variable behind compute-scaled discovery.

Gap For Crypto-Alpha-Bench

FactorMiner is not enough by itself because:

  • it optimizes discovery workflow, not public benchmark infrastructure;
  • cost/fill/tradability are not the central contribution;
  • DSR/PBO are not the main evaluation object;
  • crypto perpetual microstructure is not the target substrate.

How To Use It

Use FactorMiner as:

  • a memory-augmented alpha-search baseline;
  • a literature anchor for your Cognition Base RQ;
  • a reason to make your benchmark record full search trajectories, not just final formulas.

3.10 CogAlpha

Source:

  • arXiv: https://arxiv.org/abs/2511.18850

What It Is

CogAlpha, from the paper "Cognitive Alpha Mining via LLM-Driven Code-Based Evolution," combines:

  • code-level alpha representation;
  • LLM-driven reasoning;
  • evolutionary mutation and recombination;
  • financial feedback over generated alpha candidates.

The paper positions formula-only and neural approaches as too narrow, opaque, redundant, or economically ungrounded, then argues for broader structured exploration using LLMs as adaptive cognitive agents.

Why It Matters

CogAlpha is a direct competitor if Crypto-Alpha-Bench includes code-producing agents.

It moves beyond "LLM writes formula" toward:

  • executable code as the search unit;
  • evolutionary refinement;
  • interpretability through readable strategy logic;
  • broader search-space coverage.

Gap For Crypto-Alpha-Bench

The gap is again benchmark discipline:

  • A-share equity setting rather than crypto perpetuals;
  • no fixed public crypto task protocol;
  • no explicit fill/cost/capacity standard;
  • no compute-controlled leaderboard;
  • no central multiple-testing correction claim.

How To Use It

Treat CogAlpha as a code-evolution baseline. It is especially useful if your benchmark allows agents to submit runnable strategy code rather than only formula expressions.


3.11 Beyond Prompting

Source:

  • arXiv: https://arxiv.org/abs/2603.14288

What It Is

The paper "Beyond Prompting: An Autonomous Framework for Systematic Factor Investing via Agentic AI" develops a self-directed agentic framework for systematic factor investing.

The abstract highlights:

  • autonomous formulation of interpretable trading signals;
  • out-of-sample validation;
  • economic rationale requirements;
  • U.S. equity long-short portfolios;
  • reported annualized Sharpe ratio of 3.11 and return of 59.53%.

Why It Matters

This is important for professor Q&A because it may sound very close to "agentic quant research."

It explicitly tries to move from manual prompting to a self-directed engine.

Gap For Crypto-Alpha-Bench

The main weakness for your agenda is that it is not a benchmark environment:

  • it presents an autonomous factor-investing framework;
  • it does not define a reusable crypto alpha-search leaderboard;
  • transaction-cost/fill/capacity modeling is not the core public protocol;
  • it does not solve cross-agent comparability.

How To Use It

Position it as an agent architecture competitor, not a benchmark competitor.

Say:

"Autonomous factor-investing agents are arriving. That makes a hard, public verifier even more urgent."


3.12 Alpha-GPT

Sources:

  • ACL Anthology: https://aclanthology.org/2025.emnlp-demos.14/
  • arXiv: https://arxiv.org/abs/2308.00016

What It Is

Alpha-GPT is a human-AI interactive alpha mining system, published at EMNLP 2025 System Demonstrations.

It introduces a workflow where the human quant researcher supplies or iterates on ideas, and the LLM framework turns those ideas into candidate alphas. The ACL abstract reports that Alpha-GPT ranked top-10 among over 41,000 teams in the WorldQuant International Quant Championship.

Why It Matters

This is the best reference for your human-expert-in-the-loop angle.

It does not claim that the agent should replace the quant researcher. Instead, it frames LLMs as a way to implement and expand human alpha hypotheses.

That is close to your revised plan:

compare LLM auto-search against discretionary human experts, then study hybrid loops.

Gap For Crypto-Alpha-Bench

Alpha-GPT is:

  • interactive rather than benchmark-controlled;
  • not crypto-specific;
  • not centered on cost/fill/statistical correction;
  • hard to compare under fixed compute budgets because human interaction is part of the loop.

How To Use It

Use Alpha-GPT to justify a human-in-loop track:

  • autonomous track: agent gets dataset + API + budget;
  • assisted track: human expert can steer agent;
  • human-only track: discretionary researcher baseline.

This gives the benchmark a richer and more realistic comparison.


4. Trading Decision Agents / Neighboring Systems

These are not direct formula-alpha-search baselines, but they matter because professors may ask whether your benchmark is really about trading agents rather than alpha mining.

4.1 TradingAgents

Source:

  • arXiv: https://arxiv.org/abs/2412.20138

What It Is

TradingAgents is a multi-agent LLM trading framework inspired by a trading firm. The abstract lists specialized roles:

  • fundamental analysts;
  • sentiment analysts;
  • technical analysts;
  • bull/bear researchers;
  • risk management team;
  • traders with varied risk profiles.

Why It Matters

TradingAgents is a strong example of:

collaborative LLM trading-decision workflow.

It is useful for RQ2 / open-world agent safety and human-readable decision process.

Why It Is Not Your Main Baseline

It outputs trading decisions, not formulaic alpha factors or benchmarkable alpha-search trajectories.

For Crypto-Alpha-Bench:

  • include as neighboring context;
  • not a primary formula-mining baseline.

4.2 QuantAgent

Source:

  • arXiv: https://arxiv.org/abs/2509.09995

What It Is

QuantAgent is a price-driven multi-agent LLM framework for short-horizon / HFT-style trading. The abstract says it decomposes trading into four agents:

  • Indicator;
  • Pattern;
  • Trend;
  • Risk.

It targets structured short-horizon signals rather than long-horizon text/fundamental reasoning.

Why It Matters

QuantAgent is the closest neighboring system to your crypto mid-frequency / microstructure framing.

It explicitly criticizes long-horizon LLM trading agents as ill-suited for high-speed precision-critical trading, which overlaps with your concern.

Gap

But according to the abstract:

  • evaluation focuses on predictive accuracy and trading metrics across 1-hour and 4-hour intervals;
  • it is not an alpha-search benchmark;
  • fill / queue / adverse selection / DSR/PBO are not central;
  • "HFT" is used broadly; it does not solve sub-second execution realism.

How To Use It

Use QuantAgent as a neighboring baseline / framing contrast:

QuantAgent says structured signals matter for short-horizon trading. Crypto-Alpha-Bench turns that into a fixed benchmark with cost/fill/statistical constraints.


4.3 FinMem

Source:

  • arXiv: https://arxiv.org/abs/2311.13743

What It Is

FinMem is a memory-enhanced LLM trading agent. The abstract lists three modules:

  • Profiling;
  • layered Memory;
  • Decision-making.

It aims to imitate aspects of human trader cognition and improve stock trading outcomes.

Why It Matters

FinMem is relevant to the human-expert-in-the-loop revision:

  • layered memory;
  • trader-like cognitive structure;
  • self-evolving professional knowledge;
  • real-time tuning.

Gap

It is not primarily:

  • formula alpha search;
  • benchmark infrastructure;
  • crypto execution/fill evaluation;
  • multiple-testing-corrected alpha discovery.

How To Use It

Use it as a baseline for:

  • memory architecture;
  • tacit-knowledge extraction;
  • discretionary-agent imitation.

Do not use it as a core Crypto-Alpha-Bench alpha-mining baseline.


4.4 FinAgent

Source:

  • arXiv: https://arxiv.org/abs/2402.18485

What It Is

FinAgent is a multimodal foundation agent for financial trading. The abstract describes:

  • tool augmentation;
  • multimodal market intelligence over numerical, textual, and visual data;
  • dual-level reflection;
  • diversified memory retrieval;
  • reasoning for actions;
  • integration of trading strategies and expert insights.

Why It Matters

FinAgent is the strongest representative of:

multimodal generalist financial trading agents.

It helps position the human-expert-in-loop and multimodal context-window direction.

Gap

FinAgent is broad; Crypto-Alpha-Bench should be narrower:

  • fixed crypto mid-frequency data;
  • executable-alpha metrics;
  • cost and fill realism;
  • factor-search protocol;
  • statistical rigor.

4.5 FinRobot

Source:

  • arXiv: https://arxiv.org/abs/2405.14767

What It Is

FinRobot is an open-source AI agent platform for financial applications using LLMs. It is a platform rather than a single alpha-mining model.

The abstract describes layers:

  • Financial AI Agents;
  • Financial LLM Algorithms;
  • LLMOps and DataOps;
  • Multi-source LLM foundation models.

Why It Matters

FinRobot is useful as:

  • platform reference;
  • LLMOps/DataOps architecture reference;
  • not a direct competitor benchmark.

Gap

It does not define the executable crypto alpha benchmark you need.


5. Evaluation Frameworks

5.1 AlphaEval

Source:

  • arXiv: https://arxiv.org/abs/2508.13174

What It Is

AlphaEval is a comprehensive and efficient evaluation framework for formula alpha mining.

Its abstract emphasizes:

  • backtesting is expensive and sequential;
  • single metrics are incomplete;
  • five-dimensional evaluation:
      • predictive power;
      • stability;
      • robustness to market perturbations;
      • financial logic;
      • diversity.

How It Relates To AlphaBench

  • AlphaBench = benchmark for LLM capabilities in formulaic alpha mining.
  • AlphaEval = evaluation framework for generated formula alphas.

How It Relates To Crypto-Alpha-Bench

Crypto-Alpha-Bench can use AlphaEval-like dimensions, then add:

  • fixed public crypto data;
  • explicit cost tiers;
  • fill/tradability gates;
  • compute-control;
  • synthetic ground truth;
  • DSR/PBO;
  • human expert baseline.

Soundbite

AlphaEval gives the evaluation axes; Crypto-Alpha-Bench gives the executable benchmark substrate.


6. What Crypto-Alpha-Bench Should Claim

6.1 Claims To Avoid

Avoid:

  • "first alpha mining benchmark";
  • "first LLM financial agent benchmark";
  • "first alpha auto search framework";
  • "first financial trading agent benchmark."

Those are too broad and likely false after AlphaBench / RD-Agent(Q) / AlphaEval.

6.2 Defensible Claim

Use:

Crypto-Alpha-Bench is a benchmark for executable crypto mid-frequency alpha search, designed to evaluate not only formula quality but cost-adjusted, fill-aware, statistically corrected, compute-controlled discovery.

Even sharper:

Existing benchmarks test whether LLMs can generate alpha formulas. Crypto-Alpha-Bench tests whether alpha-search systems can discover tradable crypto alphas under realistic execution and statistical constraints.

6.3 Minimum Differentiators

Do not launch Crypto-Alpha-Bench without these:

  1. Fixed crypto perp dataset
     • e.g. Binance USD-M top-N perps;
     • versioned manifest;
     • gap handling protocol.

  2. Three cost tiers
     • optimistic;
     • realistic;
     • pessimistic;
     • all leaderboard submissions report all three.

  3. Fill / tradability gate
     • spread;
     • top-of-book notional;
     • depth;
     • adverse selection proxy;
     • partial-fill handling.

  4. Statistical rigor
     • DSR;
     • PBO / CSCV;
     • null-search baseline.

  5. Compute-control
     • token budget;
     • wall-clock;
     • GPU/CPU tier;
     • number of candidate evaluations.

  6. Synthetic ground truth
     • known alpha process;
     • known regime shift;
     • known execution-cost sensitivity.

  7. Reference baselines
     • random;
     • gplearn / GP;
     • AlphaBench-style CoE/ToT/EA;
     • AlphaAgent regularized LLM;
     • LLM+MCTS;
     • GFlowNet / AlphaSAGE-style;
     • your M8.6 tradability baseline.
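
The "three cost tiers" requirement above can be made concrete with a small reporting helper: the same gross return series is charged turnover-proportional costs at each tier, and all three net figures are reported together. This is a sketch under stated assumptions; the basis-point values, the linear cost model, and the function names are hypothetical placeholders, not a proposed calibration.

```python
# Hypothetical per-unit-turnover costs in basis points (fees + slippage).
COST_TIERS_BPS = {"optimistic": 2.0, "realistic": 5.0, "pessimistic": 10.0}


def net_returns(gross_returns, turnover, cost_bps):
    """Per-period net return under a linear cost of cost_bps per unit turnover."""
    rate = cost_bps / 10_000
    return [g - t * rate for g, t in zip(gross_returns, turnover)]


def report_all_tiers(gross_returns, turnover):
    """Cumulative (summed) net return under every tier, keyed by tier name."""
    return {
        tier: sum(net_returns(gross_returns, turnover, bps))
        for tier, bps in COST_TIERS_BPS.items()
    }


# Hypothetical three-period strategy: gross returns and traded notional
# as a fraction of capital in each period.
gross = [0.002, -0.001, 0.003]
turnover = [1.0, 0.5, 2.0]
report = report_all_tiers(gross, turnover)
print(report)
```

Requiring all three numbers in every submission makes cost sensitivity visible: a high-turnover alpha that looks strong at the optimistic tier can shrink toward zero at the pessimistic one.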

Optional but distinctive:

  1. Human expert discretionary baseline
     • only if trader cooperation is realistic.

7.1 Stage 1 · Close Reading Priority

Read in this order:

  1. AlphaBench
  2. RD-Agent(Q)
  3. Hubble
  4. FactorMiner
  5. QuantaAlpha
  6. CogAlpha
  7. AlphaAgent
  8. Navigating the Alpha Jungle
  9. AlphaSAGE
  10. AlphaEval
  11. Beyond Prompting
  12. Alpha-GPT
  13. QuantAgent
  14. TradingAgents
  15. FinMem / FinAgent

7.2 Stage 2 · Extraction Template

For each paper/system:

Title:
Year / venue:
Task:
Search unit:
Generator:
Verifier:
Feedback granularity:
Data:
Cost model:
Statistical rigor:
Compute control:
Reproducibility:
Main reported result:
Main weakness:
What Crypto-Alpha-Bench should borrow:
What Crypto-Alpha-Bench should beat:

7.3 Stage 3 · Benchmark Design Decision

After reading:

  • if AlphaBench already covers enough of formula mining, narrow Crypto-Alpha-Bench to executable crypto alphas;
  • if RD-Agent(Q) is too strong as a system, avoid competing as "agent architecture"; compete as "hard evaluation environment";
  • if QuantaAlpha / AlphaAgent / MCTS outperform simple baselines, include them as reference baselines rather than reinvent them.

8. Updated HKU Talk Adjustment

Add one slide before "The Field Has No ImageNet Moment":

Slide: Existing SOTA Is Close, But Not The Same

Existing Work | Covers | Missing For My Goal
AlphaBench | LLM formula alpha benchmark | executable crypto cost/fill/statistics
RD-Agent(Q) | full quant R&D agent | benchmark substrate
Hubble / FactorMiner / CogAlpha | safe generation, memory, code evolution | common executable crypto protocol
AlphaAgent / QuantaAlpha / Alpha Jungle | search methods | fixed benchmark discipline
Alpha-GPT | human-AI interactive alpha mining | controlled human/agent comparison
TradingAgents / QuantAgent | trading decisions | alpha-search benchmark
AlphaEval | evaluation dimensions | fixed benchmark + tradability

Then say:

"So the claim is not that nothing exists. The claim is that these systems stop before the executable crypto-alpha layer I need."

This makes the proposal much more robust.


9. Final Positioning

Best one-sentence positioning:

Crypto-Alpha-Bench is not another LLM alpha-mining agent. It is the executable verifier and benchmark layer that current alpha-mining agents lack.

Best professor-facing version:

"I want to benchmark not just whether an agent can produce a plausible factor, but whether the factor survives costs, fills, time-slice instability, multiple testing, and compute-controlled comparison in crypto markets."


10. Source Index

  • AlphaBench project: https://alphabench.cc/
  • AlphaBench OpenReview: https://openreview.net/forum?id=d97Q8r7ZKZ
  • RD-Agent(Q) Microsoft Research: https://www.microsoft.com/en-us/research/publication/rd-agent-quant-a-multi-agent-framework-for-data-centric-factors-and-model-joint-optimization/
  • RD-Agent(Q) arXiv: https://arxiv.org/abs/2505.15155
  • QuantaAlpha: https://arxiv.org/abs/2602.07085
  • AlphaAgent: https://arxiv.org/abs/2502.16789
  • Navigating the Alpha Jungle: https://arxiv.org/abs/2505.11122
  • AlphaSAGE: https://arxiv.org/abs/2509.25055
  • Chain-of-Alpha: https://arxiv.org/abs/2508.06312
  • Hubble: https://arxiv.org/abs/2604.09601
  • FactorMiner: https://arxiv.org/abs/2602.14670
  • CogAlpha: https://arxiv.org/abs/2511.18850
  • Beyond Prompting: https://arxiv.org/abs/2603.14288
  • Alpha-GPT ACL Anthology: https://aclanthology.org/2025.emnlp-demos.14/
  • Alpha-GPT arXiv: https://arxiv.org/abs/2308.00016
  • TradingAgents: https://arxiv.org/abs/2412.20138
  • QuantAgent: https://arxiv.org/abs/2509.09995
  • FinMem: https://arxiv.org/abs/2311.13743
  • FinAgent: https://arxiv.org/abs/2402.18485
  • FinRobot: https://arxiv.org/abs/2405.14767
  • AlphaEval: https://arxiv.org/abs/2508.13174