Financial SOTA Agent Survey for Crypto-Alpha-Bench¶

Purpose: map current financial SOTA agents / alpha-mining systems before committing to Crypto-Alpha-Bench.

Updated: 2026-05-19

Core question: What existing agents and benchmarks already cover alpha auto search, and what gap remains for Crypto-Alpha-Bench?

0. Executive Verdict¶

Yes: before positioning Crypto-Alpha-Bench, we must survey current financial SOTA agents and alpha-mining benchmarks.

The important conclusion is not "nobody has done this." The important conclusion is:

Existing work already covers formula alpha generation, factor evaluation, iterative LLM search, multi-agent quant R&D, and trading-decision agents. Crypto-Alpha-Bench must therefore position itself as an executable crypto mid-frequency alpha benchmark with fixed data, cost tiers, fill/tradability constraints, DSR/PBO, compute control, synthetic ground truth, and an optional human expert baseline.

The biggest direct threats:

AlphaBench: already claims to be the first systematic benchmark for LLMs in formulaic alpha factor mining.
RD-Agent(Q): already claims full-stack multi-agent quant R&D with factor-model joint optimization.
QuantaAlpha / AlphaAgent / Alpha Jungle: already cover LLM-driven formula alpha search with increasingly sophisticated exploration and regularization.
Hubble / FactorMiner / CogAlpha / Beyond Prompting: 2025-2026 work pushes toward safer sandboxed generation, memory-driven self-evolution, code-level evolution, and autonomous factor investing.

The clean gap:

Most existing systems optimize formula alpha quality or trading-decision performance. They do not jointly enforce crypto-specific executable-alpha constraints: public crypto perp data, cost tiers, fill/tradability gates, compute-controlled search, multiple-testing correction, synthetic ground truth, and a human discretionary baseline.

1. Taxonomy: What Counts As "Financial Agent SOTA"?¶

Use four buckets. Do not mix them.

Bucket	What It Optimizes	Typical Output	Direct Threat To Crypto-Alpha-Bench?
A. Alpha Mining Agents	Discover formulaic alphas / factors	alpha expression, factor pool	High
B. Full Quant R&D Agents	Automate research workflow: hypothesis → code → backtest → feedback	factor + model + report	High
C. Trading Decision Agents	Decide buy/sell/hold or portfolio action from multimodal context	trading action	Medium
D. Benchmark / Evaluation Frameworks	Standardize tasks and metrics	leaderboard / protocol	High

Crypto-Alpha-Bench should mostly compare against A/B/D, while using C as neighboring context.

2. Benchmark Gap Matrix¶

This is the single most important table.

System	Task	Search Unit	Verifier	Data / Market	Cost / Fill	Multiple Testing	Compute Control	Main Gap For Crypto-Alpha-Bench
AlphaBench	LLM formulaic alpha mining benchmark	formula expression	Qlib backtest + factor metrics	CSI300 + SP500, 2020-2025	not the central axis	not central	analyzes model/settings, but not a strict compute-budget benchmark	no crypto perp executable-alpha protocol
RD-Agent(Q)	full quant R&D automation	factor + model code	real-market backtests + feedback stage	stock markets	not the central axis	not emphasized	MAB scheduling, but benchmark compute control not the core	strong agent; weak public benchmark substrate
QuantaAlpha	LLM-driven evolutionary alpha mining	mining trajectory + factor	backtest / IC / return metrics	CSI300 + transfer to CSI500/SP500	not central	not central	trajectory reuse, but not fixed budget protocol	trajectory search is useful baseline, not full benchmark
AlphaAgent	decay-resistant LLM alpha mining	formula expression	market evaluation across regimes	CSI500 + SP500	not central	regularizes decay but not DSR/PBO	not core	useful regularization baseline
Alpha Jungle	LLM + MCTS formula mining	formula expression / tree	backtest feedback	stock market data	not central	not central	MCTS budget implicit	search baseline, not benchmark protocol
AlphaSAGE	GFlowNet alpha mining	expression graph / factor portfolio	multi-faceted reward	stock alpha mining	not central	not central	GFlowNet exploration budget	strong non-LLM alpha-search baseline
Hubble	safe/reproducible LLM alpha discovery	operator tree	AST sandbox + cross-sectional metrics	U.S. equities	turnover included, but fill/cost not core	OOS/HAC evidence, not DSR/PBO benchmark	fixed rounds/candidate counts	closest to safe verifier framing, still equity/formula-centric
FactorMiner	self-evolving alpha discovery	formulaic factor + memory	modular evaluation tools	multi-asset datasets	not central	redundancy/correlation control, not DSR/PBO	lightweight iterative loop	strong memory baseline, not crypto executable benchmark
CogAlpha	LLM-driven code evolution	executable alpha code	evolutionary fitness feedback	A-share equities	not central	robustness/generalization claimed, not benchmark protocol	evolutionary budget implicit	useful code-evolution baseline
Beyond Prompting	autonomous systematic factor investing	interpretable signal set	OOS validation + economic rationale	U.S. equities	not central	data-snooping addressed qualitatively	not core	strong autonomous-agent story, but not public benchmark substrate
Alpha-GPT	human-AI interactive alpha mining	human idea -> alpha	alpha mining experiments / competition evaluation	WorldQuant-style alpha context	not central	not central	interactive workflow, not controlled budget	useful human-in-loop baseline
TradingAgents	collaborative trading-decision agents	buy/sell/trade decision	portfolio returns, Sharpe, MDD	stocks	not central	not central	not core	trading chatbot/team, not alpha benchmark
QuantAgent	short-horizon / HFT-style LLM trading	structured-signal decision	predictive accuracy + trading metrics	9 instruments incl. BTC/Nasdaq futures	not rigorous fill sim	not central	not core	relevant to crypto HFT framing, but not formula alpha mining
FinMem	memory-augmented LLM trading	investment decision	stock trading performance	stocks	not central	not central	not core	memory architecture reference, not benchmark
FinAgent	multimodal foundation trading agent	trading action	6 financial metrics	stocks + crypto datasets	not central	not central	not core	multimodal trading baseline, not alpha search
AlphaEval	alpha evaluation framework	generated alpha	5D backtest-free evaluation	formula alphas	cost/tradability not central	not central in abstract	not core	evaluation backbone, not benchmark substrate

Result:

The benchmark claim must not be "first alpha benchmark." It should be "first executable crypto mid-frequency alpha-search benchmark with cost/fill/statistical rigor."

3. Direct Competitors / Must-Read Systems¶

3.1 AlphaBench¶

Sources:

Project: https://alphabench.cc/
OpenReview: https://openreview.net/forum?id=d97Q8r7ZKZ

What It Is¶

AlphaBench is an ICLR 2026 benchmark for evaluating LLMs in Formulaic Alpha Factor Mining (FAFM). It covers three core tasks:

factor generation;
factor evaluation;
iterative factor searching.

The project page says its toolchain includes 1,857 instructions, an FFO execution engine, and Qlib-based backtesting. It evaluates generation settings and search paradigms such as Chain-of-Experience, Tree-of-Thought, and Evolutionary Algorithms.

Architecture / Task Design¶

Component	Details
Search unit	formulaic alpha factor expression
Generator	LLMs under different prompting/reasoning/search settings
Verifier	executable formula engine + Qlib backtesting
Tasks	Text2Alpha, Directional Mining, FactorEval, CoE, ToT, EA
Dataset	CSI300 and SP500, 2020-2025 according to project page
Metrics	IC, RankIC, robustness, win rate, skewness, reliability, stability, semantic alignment
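A minimal sketch of the two headline metrics in the table, IC and RankIC, for a single cross-section. It assumes IC is the Pearson correlation between factor values and forward returns, and RankIC is the same correlation computed on ranks. The function names and toy data are illustrative and are not taken from AlphaBench's toolchain.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def ranks(values):
    """1-based average ranks, so tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def ic(factor, fwd_returns):
    return pearson(factor, fwd_returns)

def rank_ic(factor, fwd_returns):
    return pearson(ranks(factor), ranks(fwd_returns))

# Toy cross-section: forward returns are monotone but nonlinear in the factor.
factor = [1.0, 2.0, 3.0, 4.0, 5.0]
fwd = [1.0, 8.0, 27.0, 64.0, 125.0]
print(round(ic(factor, fwd), 3), round(rank_ic(factor, fwd), 3))
```

On this toy data RankIC is higher than IC because the relationship is monotone but not linear, which is one reason to report both.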

What It Solves¶

AlphaBench is the strongest existing answer to:

"Can LLMs generate, evaluate, and search formulaic alphas?"

It gives a standardized FAFM benchmark and should be treated as the closest benchmark sibling.

What It Does Not Solve For Us¶

Crypto-Alpha-Bench should differ on:

crypto perpetual futures instead of stock FAFM only;
mid-frequency executable alpha, not just formula validity / stock factor quality;
explicit transaction-cost tiers;
fill/tradability gate;
DSR/PBO / multiple-testing discipline;
compute-controlled search budgets;
synthetic ground-truth task;
optional human expert discretionary baseline.
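To make the DSR item concrete, here is a sketch of the Deflated Sharpe Ratio as commonly stated in the Bailey and López de Prado formulation: the observed Sharpe is tested against the expected maximum Sharpe of `n_trials` skill-less searches. All inputs are per-period (non-annualized). The argument names and example numbers are assumptions for illustration, not a Crypto-Alpha-Bench specification.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329

def expected_max_sharpe(n_trials, sr_std):
    """Expected best Sharpe among n_trials trials with no true skill."""
    if n_trials < 2:
        return 0.0  # a single trial needs no deflation
    nd = NormalDist()
    return sr_std * (
        (1 - EULER_GAMMA) * nd.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * nd.inv_cdf(1 - 1 / (n_trials * math.e))
    )

def deflated_sharpe(sr, n_obs, n_trials, sr_std, skew=0.0, kurt=3.0):
    """Probability that the true Sharpe exceeds the multiple-testing hurdle.

    sr      observed per-period Sharpe of the selected strategy
    n_obs   number of return observations
    sr_std  standard deviation of Sharpe estimates across the trials
    skew, kurt  skewness and (non-excess) kurtosis of the returns
    """
    sr0 = expected_max_sharpe(n_trials, sr_std)
    denom = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    z = (sr - sr0) * math.sqrt(n_obs - 1) / denom
    return NormalDist().cdf(z)

# Same observed Sharpe, judged as a single trial versus the best of 1,000 trials.
dsr_single = deflated_sharpe(sr=0.10, n_obs=1000, n_trials=1, sr_std=0.05)
dsr_many = deflated_sharpe(sr=0.10, n_obs=1000, n_trials=1000, sr_std=0.05)
print(dsr_single > dsr_many)
```

The point for the benchmark is that the number of candidate evaluations must be logged, because the same Sharpe is far less convincing after a large search.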

How To Position Against It¶

Say:

AlphaBench is the closest existing benchmark for LLM formula alpha mining. Crypto-Alpha-Bench should adopt its task decomposition, but extend the benchmark target from formula quality to executable crypto mid-frequency alpha under cost, fill, and statistical constraints.

Do not say:

"There is no alpha benchmark."

That is no longer defensible.

3.2 RD-Agent(Q)¶

Sources:

Microsoft Research: https://www.microsoft.com/en-us/research/publication/rd-agent-quant-a-multi-agent-framework-for-data-centric-factors-and-model-joint-optimization/
arXiv: https://arxiv.org/abs/2505.15155
Docs: https://rdagent.readthedocs.io/en/latest/scens/quant_agent_fin.html

What It Is¶

RD-Agent(Q) is a data-centric multi-agent framework for automated quant R&D. It targets factor-model joint optimization, not only formula alpha mining.

From the Microsoft abstract, it decomposes quant research into:

a Research stage that sets goal-aligned prompts, formulates hypotheses from domain priors, and maps them to tasks;
a Development stage using a code-generation agent, Co-STEER, to implement task-specific code;
a Feedback stage that evaluates real-market backtests and informs later iterations;
a multi-armed bandit scheduler for adaptive direction selection.
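The abstract names a multi-armed bandit scheduler, but its details are not reproduced here. As a stand-in, the sketch below uses plain UCB1 to choose between two hypothetical research directions. The arm names, the fixed rewards, and the choice of UCB1 itself are assumptions, not RD-Agent(Q)'s actual scheduler.

```python
import math

class UCB1Scheduler:
    """Pick the research direction with the best optimistic reward estimate."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {a: 0 for a in self.arms}
        self.totals = {a: 0.0 for a in self.arms}

    def select(self):
        # Try every arm once before trusting any estimate.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        t = sum(self.counts.values())

        def score(arm):
            mean = self.totals[arm] / self.counts[arm]
            bonus = math.sqrt(2 * math.log(t) / self.counts[arm])
            return mean + bonus

        return max(self.arms, key=score)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.totals[arm] += reward

# Hypothetical fixed payoffs per direction, e.g. backtest improvement per iteration.
payoff = {"factor": 0.8, "model": 0.3}
sched = UCB1Scheduler(payoff)
for _ in range(200):
    arm = sched.select()
    sched.update(arm, payoff[arm])
print(sched.counts)
```

The weaker direction still receives occasional pulls through the exploration bonus, which is the behaviour an adaptive direction selector needs.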

Architecture / Task Design¶

Component	Details
Search unit	factor ideas, model innovations, code
Generator	multi-agent hypothesis and code-generation workflow
Verifier	executable code + real-market backtests
Feedback	analysis unit + MAB scheduler
Knowledge grounding	domain priors / knowledge forest / prior outcomes
Claim	up to 2x annualized return vs classical factor libraries with fewer factors, according to abstract

What It Solves¶

RD-Agent(Q) is the strongest "full quant R&D automation" baseline.

It addresses:

hypothesis generation;
code implementation;
backtest verification;
factor-model co-design;
adaptive allocation of research effort.

What It Does Not Solve For Us¶

It is not primarily a public benchmark protocol:

no crypto-specific executable benchmark;
no explicit public cost-tier/fill/tradability protocol as core contribution;
no central DSR/PBO benchmark claim;
no human expert discretionary baseline;
no fixed leaderboard framing.

How To Position Against It¶

Say:

RD-Agent(Q) is a model of what the research agent could become. Crypto-Alpha-Bench is the evaluation substrate such an agent would need in crypto mid-frequency settings.

In other words:

RD-Agent(Q) = strong agent architecture.
Crypto-Alpha-Bench = hard verifier / benchmark environment.

3.3 QuantaAlpha¶

Source:

arXiv: https://arxiv.org/abs/2602.07085

What It Is¶

QuantaAlpha is an evolutionary LLM-driven alpha mining framework. It treats each end-to-end mining run as a trajectory and improves factors using trajectory-level mutation and crossover.

According to the abstract, it:

localizes suboptimal steps for targeted revision;
recombines high-reward trajectory segments;
enforces semantic consistency across hypothesis, factor expression, and executable code;
constrains complexity and redundancy to reduce crowding.

Architecture / Task Design¶

Component	Details
Search unit	trajectory of a mining run, not just a formula
Generator	LLM-driven factor generation + evolutionary mutation/crossover
Verifier	IC / ARR / MDD / backtest-style metrics
Data	CSI300, with transfer claims to CSI500 and S&P500 according to abstract
Novelty	reuse validated experience at trajectory level

Why It Matters¶

This is the natural next step after AlphaAgent / Alpha Jungle:

AlphaAgent regularizes individual factor generation.
Alpha Jungle improves tree search.
QuantaAlpha improves whole search trajectories.

For your benchmark, QuantaAlpha should be a strong search-agent baseline.

Gap¶

Still not enough for Crypto-Alpha-Bench:

equity-centric;
not a public executable crypto benchmark;
transaction cost / fill / capacity not central;
multiple-testing discipline not central;
compute-control not the main object.

How To Use It¶

In Crypto-Alpha-Bench:

include a QuantaAlpha-style trajectory search baseline if implementation is available;
require it to report compute budget and DSR/PBO;
compare its trajectory reuse against simpler EA/ToT/CoE search under equal budget.

3.4 AlphaAgent¶

Source:

arXiv: https://arxiv.org/abs/2502.16789

What It Is¶

AlphaAgent is an autonomous LLM-driven alpha mining framework focused on alpha decay resistance.

The abstract highlights three mechanisms:

AST-based originality enforcement against existing alphas;
LLM-evaluated hypothesis-factor semantic alignment;
AST-based complexity control to prevent over-engineered formulas.
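The two AST-based mechanisms can be sketched with Python's own `ast` module, assuming formulas are written in a Python-like call syntax. Node count stands in for complexity, and Jaccard overlap of subtree dumps stands in for originality. The thresholds, operator names, and both proxies are illustrative assumptions rather than AlphaAgent's definitions.

```python
import ast

def expr_nodes(expr):
    """All expression nodes (calls, names, constants, binary ops) of a formula."""
    tree = ast.parse(expr, mode="eval").body
    return [n for n in ast.walk(tree) if isinstance(n, ast.expr)]

def complexity(expr):
    """Complexity proxy: number of expression nodes in the AST."""
    return len(expr_nodes(expr))

def similarity(expr_a, expr_b):
    """Jaccard overlap between the sets of subtree dumps of two formulas."""
    a = {ast.dump(n) for n in expr_nodes(expr_a)}
    b = {ast.dump(n) for n in expr_nodes(expr_b)}
    return len(a & b) / len(a | b)

def passes_gates(candidate, library, max_nodes=12, max_similarity=0.7):
    """Reject over-engineered formulas and near-copies of known alphas."""
    if complexity(candidate) > max_nodes:
        return False
    return all(similarity(candidate, known) <= max_similarity for known in library)

library = ["rank(ts_mean(close, 5))"]
copy_of_known = "rank(ts_mean(close, 5))"
fresh = "ts_std(volume, 10) / ts_mean(volume, 20)"
print(passes_gates(copy_of_known, library), passes_gates(fresh, library))
```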

Architecture / Task Design¶

Component	Details
Search unit	formulaic alpha expression
Generator	LLM agent
Regularizer	originality, semantic alignment, complexity
Verifier	market performance across CSI500 and S&P500 settings
Main goal	reduce alpha decay / crowding / homogeneity

Why It Matters¶

AlphaAgent is the best baseline for:

"Can LLMs generate less crowded, more interpretable, decay-resistant formula alphas?"

It is directly relevant to your Cognition Base / Red Queen idea.

Gap¶

AlphaAgent regularizes factor generation, but it does not solve:

a fixed benchmark dataset/protocol;
crypto execution;
fill/tradability constraints;
strict DSR/PBO;
cost-tier reporting.
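For the PBO half of the DSR/PBO item, here is a simplified sketch of the probability of backtest overfitting via combinatorially symmetric cross-validation (CSCV). It assumes performance is additive across time blocks, so in-sample and out-of-sample scores are sums of per-block returns rather than Sharpe ratios recomputed on concatenated blocks. That simplification, the input layout, and the names are all assumptions.

```python
import math
from itertools import combinations

def pbo(block_perf):
    """Share of IS/OOS splits where the in-sample winner ranks at or below the OOS median.

    block_perf[s][k] is the performance of strategy k in time block s.
    The number of blocks should be even so the splits are symmetric.
    """
    n_blocks, n_strats = len(block_perf), len(block_perf[0])
    logits = []
    for is_blocks in combinations(range(n_blocks), n_blocks // 2):
        oos_blocks = [s for s in range(n_blocks) if s not in is_blocks]
        is_perf = [sum(block_perf[s][k] for s in is_blocks) for k in range(n_strats)]
        oos_perf = [sum(block_perf[s][k] for s in oos_blocks) for k in range(n_strats)]
        best = max(range(n_strats), key=is_perf.__getitem__)
        rank = sum(v <= oos_perf[best] for v in oos_perf)  # 1 = worst, n_strats = best
        w = rank / (n_strats + 1)                          # relative rank in (0, 1)
        logits.append(math.log(w / (1 - w)))
    return sum(l <= 0 for l in logits) / len(logits)

# Strategy 0 wins in every block: selection in-sample carries over out-of-sample.
dominant = [[1, 0, 0]] * 4
# Each strategy gets lucky in exactly one block and loses elsewhere.
lucky = [[3 if s == k else -1 for k in range(4)] for s in range(4)]
print(pbo(dominant), pbo(lucky))
```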

How To Use It¶

Crypto-Alpha-Bench should include AlphaAgent-style regularization as one baseline axis:

no regularization;
AST originality only;
semantic alignment only;
complexity control only;
all combined.

Then test which survives crypto mid-frequency execution costs.

3.5 Navigating the Alpha Jungle¶

Source:

arXiv: https://arxiv.org/abs/2505.11122

What It Is¶

This paper integrates LLMs with Monte Carlo Tree Search for formulaic factor mining.

The abstract's key mechanisms:

LLM generates/refines symbolic alpha formulas;
MCTS explores the formula search space;
quantitative backtest feedback guides search;
frequent subtree avoidance improves diversity and avoids homogenization.
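The last mechanism can be illustrated in a few lines. The sketch below represents formulas as nested tuples, counts how many pool formulas contain each subtree, and blocks candidates that reuse a frequent one. The tuple encoding, the hard ban, and the `min_count` threshold are assumptions. The paper's own procedure may steer generation more softly.

```python
from collections import Counter

def subtrees(tree):
    """Yield every operator subtree of a formula given as nested tuples."""
    if isinstance(tree, tuple):
        yield tree
        for child in tree[1:]:
            yield from subtrees(child)

def frequent_subtrees(pool, min_count=2):
    """Subtrees that occur in at least min_count formulas of the pool."""
    counts = Counter()
    for formula in pool:
        counts.update(set(subtrees(formula)))  # count each subtree once per formula
    return {t for t, c in counts.items() if c >= min_count}

def is_allowed(candidate, banned):
    """A candidate is allowed only if it reuses no banned subtree."""
    return not any(t in banned for t in subtrees(candidate))

pool = [
    ("rank", ("ts_mean", "close", 5)),
    ("neg", ("ts_mean", "close", 5)),
    ("rank", ("ts_std", "volume", 10)),
]
banned = frequent_subtrees(pool)
reuses_motif = ("div", ("ts_mean", "close", 5), ("ts_std", "volume", 10))
novel = ("rank", ("ts_std", "close", 20))
print(is_allowed(reuses_motif, banned), is_allowed(novel, banned))
```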

Architecture / Task Design¶

Component	Details
Search unit	symbolic formula tree
Generator	LLM prior
Search	MCTS
Verifier	backtest feedback
Anti-collapse	frequent subtree avoidance

Why It Matters¶

This is the most direct "AlphaProof-style" migration into alpha mining:

LLM prior;
tree search;
external verifier feedback.

Gap¶

It still lacks:

unified benchmark protocol;
crypto execution/fill/cost emphasis;
formal multiple-testing reporting;
compute-budget fairness across search algorithms.

How To Use It¶

Crypto-Alpha-Bench should include an LLM+MCTS baseline:

same primitive set;
same expression grammar;
same expression budget;
same cost-tier evaluation;
compare against random, EA, GFlowNet, QuantaAlpha-style trajectory search.
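Equal expression budgets are easiest to enforce at the evaluator rather than inside each search algorithm. A minimal sketch, assuming a budget measured in candidate evaluations: every searcher receives the same wrapped scoring function and is stopped when the budget runs out. The class names, the toy objective, and the random baseline are hypothetical.

```python
import random

class BudgetExceeded(Exception):
    """Raised when a searcher asks for more evaluations than it was granted."""

class BudgetedEvaluator:
    def __init__(self, score_fn, max_evals):
        self.score_fn = score_fn
        self.max_evals = max_evals
        self.used = 0

    def __call__(self, candidate):
        if self.used >= self.max_evals:
            raise BudgetExceeded
        self.used += 1
        return self.score_fn(candidate)

def random_search(evaluate, propose):
    """Null baseline: keep proposing until the evaluator refuses."""
    best = None
    try:
        while True:
            cand = propose()
            score = evaluate(cand)
            if best is None or score > best[1]:
                best = (cand, score)
    except BudgetExceeded:
        return best

rng = random.Random(0)
evaluator = BudgetedEvaluator(lambda x: -(x - 3) ** 2, max_evals=50)  # toy objective
best = random_search(evaluator, lambda: rng.uniform(0, 10))
print(evaluator.used, best)
```

Any other searcher (EA, MCTS, GFlowNet) can be dropped in place of `random_search` as long as it only scores candidates through the same evaluator.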

3.6 AlphaSAGE¶

Source:

arXiv: https://arxiv.org/abs/2509.25055

What It Is¶

AlphaSAGE uses GFlowNets for structure-aware alpha mining. It addresses three problems in RL alpha generation:

sparse rewards at formula completion;
inadequate sequential representation of expression structure;
single-mode optimization, which conflicts with the need for diverse non-correlated alphas.

The abstract lists three innovations:

RGCN-based structure-aware encoder;
GFlowNet generation;
dense multi-faceted reward.

Architecture / Task Design¶

Component	Details
Search unit	expression graph / formula structure
Generator	GFlowNet
Encoder	RGCN structure-aware encoder
Objective	diverse high-reward modes
Verifier	alpha quality rewards

Why It Matters¶

For Crypto-Alpha-Bench, AlphaSAGE is important because not all strong alpha search baselines are LLM agents.

It is a strong baseline for:

diverse factor portfolio generation.

Gap¶

It does not address:

LLM agent workflow;
executable crypto costs/fill;
benchmark protocol;
DSR/PBO;
human expert baseline.

How To Use It¶

Include a GFlowNet baseline in the benchmark after v0:

if it beats LLM search under equal compute, that is a strong negative result against LLM-first alpha mining;
if LLM+GFlowNet hybrid wins, that motivates Cognition Base + structure-aware exploration.

3.7 Chain-of-Alpha¶

Source:

arXiv: https://arxiv.org/abs/2508.06312

Important status:

The arXiv page says this paper has been withdrawn / removed by administrators due to license-right issues.

What It Claimed¶

The abstract describes a dual-chain architecture:

Factor Generation Chain;
Factor Optimization Chain;
iterative generate-evaluate-refine loop using market data, backtest feedback, and prior optimization knowledge.

How To Treat It¶

Do not cite this as a stable SOTA claim in a professor-facing deck.

Use it only as:

evidence that the field is converging on iterative LLM alpha search loops.

Better stable substitutes:

AlphaBench;
AlphaAgent;
Alpha Jungle;
QuantaAlpha.

3.8 Hubble¶

Source:

arXiv: https://arxiv.org/abs/2604.09601

What It Is¶

Hubble is a 2026 LLM-driven agentic framework for safe, diverse, and reproducible alpha factor discovery.

Its core design is highly relevant to your verifier thesis:

constrain generation with a domain-specific operator language;
execute formulas through an AST sandbox instead of arbitrary code;
use dual-channel RAG and family-aware selection;
score candidates through a deterministic cross-sectional pipeline;
feed back top formulas and structured diagnostics into later rounds.
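The sandbox idea in the second bullet can be sketched as an evaluator that walks the parsed AST and refuses anything outside a whitelist, instead of calling `eval` on generated text. This is an illustration using Python's `ast` module. The operator set and field names are assumptions and do not reproduce Hubble's operator language.

```python
import ast
import operator

BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
FUNCS = {"abs": abs, "max": max, "min": min}  # hypothetical operator whitelist

def safe_eval(expr, env):
    """Evaluate a formula by walking its AST; non-whitelisted syntax raises ValueError."""

    def visit(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name) and node.id in env:
            return env[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in BIN_OPS:
            return BIN_OPS[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -visit(node.operand)
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in FUNCS and not node.keywords):
            return FUNCS[node.func.id](*(visit(a) for a in node.args))
        raise ValueError(f"disallowed syntax: {type(node).__name__}")

    return visit(ast.parse(expr, mode="eval").body)

bar = {"open": 100.0, "high": 110.0, "low": 95.0, "close": 105.0}
value = safe_eval("(close - open) / max(high - low, 1)", bar)

try:
    safe_eval("__import__('os').system('ls')", bar)
    blocked = False
except ValueError:
    blocked = True
print(value, blocked)
```

Attribute access, imports, and unknown names never reach execution, so a malformed or hostile candidate fails as a rejected formula rather than a runtime incident. Division by zero is not handled in this sketch.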

The current arXiv abstract reports a U.S. equity universe of roughly 500 stocks, 104 valid candidates across three rounds, zero runtime crashes, and held-out validation from 2025-06-01 to 2026-03-13.

Why It Matters¶

Hubble weakens any naive claim that "existing LLM alpha systems do not care about safety or reproducibility." It explicitly cares about:

executable safety;
formula validity;
diversity;
interpretability;
post-hoc diagnostics;
held-out validation.

This is close to the language you want for Crypto-Alpha-Bench.

Gap For Crypto-Alpha-Bench¶

Hubble is still primarily:

equity/formula factor discovery;
daily/cross-sectional rather than crypto mid-frequency;
not centered on fill simulation, transaction-cost tiers, capacity, DSR/PBO, or leaderboard-style benchmark control.

How To Use It¶

Use Hubble as a safe generation baseline:

"Hubble shows the right direction for safe LLM factor generation. Crypto-Alpha-Bench asks whether such safe generated factors survive executable crypto trading constraints."

3.9 FactorMiner¶

Source:

arXiv: https://arxiv.org/abs/2602.14670

What It Is¶

FactorMiner is a self-evolving alpha discovery agent built around skills and experience memory.

The key loop is the Ralph Loop:

retrieve prior experience;
generate factor candidates;
evaluate candidates through modular tools;
distill successful patterns and failure constraints back into memory.

The abstract emphasizes the "Correlation Red Sea" problem: as a factor library grows, new factors become increasingly redundant. FactorMiner tries to reduce redundant search using accumulated memory and modular evaluation tools.
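The redundancy problem is easy to demonstrate. A minimal sketch, assuming redundancy is measured as absolute Pearson correlation between factor value series and that a new factor is admitted only if it stays below a threshold against everything already kept. The threshold, the factor names, and the greedy order are assumptions, not FactorMiner's procedure.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def build_library(candidates, max_abs_corr=0.7):
    """Greedily admit factors that are not too correlated with any kept factor."""
    library = []
    for name, values in candidates:
        if all(abs(pearson(values, kept)) <= max_abs_corr for _, kept in library):
            library.append((name, values))
    return [name for name, _ in library]

candidates = [
    ("momentum", [1.0, 2.0, 3.0, 4.0, 5.0]),
    ("momentum_copy", [2.1, 4.0, 6.2, 8.1, 9.9]),   # nearly a rescaled duplicate
    ("reversal_shape", [1.0, -1.0, 0.0, -1.0, 1.0]),
]
kept = build_library(candidates)
print(kept)
```

As the library grows, each new candidate must clear more pairwise checks, so the acceptance rate of naive generation falls. That is the cost a memory of past rejections is meant to avoid.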

Why It Matters¶

This is highly aligned with your Cognition Base idea.

FactorMiner's core argument is:

Alpha discovery needs not just better prompts, but memory over prior trials and systematic distillation of what worked and failed.

That overlaps with your hypothesis that a replication-aware financial Cognition Base may be the hidden variable behind compute-scaled discovery.

Gap For Crypto-Alpha-Bench¶

FactorMiner is not enough by itself because:

it optimizes discovery workflow, not public benchmark infrastructure;
cost/fill/tradability are not the central contribution;
DSR/PBO are not the main evaluation object;
crypto perpetual microstructure is not the target substrate.

How To Use It¶

Use FactorMiner as:

a memory-augmented alpha-search baseline;
a literature anchor for your Cognition Base RQ;
a reason to make your benchmark record full search trajectories, not just final formulas.

3.10 CogAlpha¶

Source:

arXiv: https://arxiv.org/abs/2511.18850

What It Is¶

CogAlpha, introduced in "Cognitive Alpha Mining via LLM-Driven Code-Based Evolution," combines:

code-level alpha representation;
LLM-driven reasoning;
evolutionary mutation and recombination;
financial feedback over generated alpha candidates.

The paper positions formula-only and neural approaches as too narrow, opaque, redundant, or economically ungrounded, then argues for broader structured exploration using LLMs as adaptive cognitive agents.

Why It Matters¶

CogAlpha is a direct competitor if Crypto-Alpha-Bench includes code-producing agents.

It moves beyond "LLM writes formula" toward:

executable code as the search unit;
evolutionary refinement;
interpretability through readable strategy logic;
broader search-space coverage.

Gap For Crypto-Alpha-Bench¶

The gap is again benchmark discipline:

A-share equity setting rather than crypto perpetuals;
no fixed public crypto task protocol;
no explicit fill/cost/capacity standard;
no compute-controlled leaderboard;
no central multiple-testing correction claim.

How To Use It¶

Treat CogAlpha as a code-evolution baseline. It is especially useful if your benchmark allows agents to submit runnable strategy code rather than only formula expressions.

3.11 Beyond Prompting¶

Source:

arXiv: https://arxiv.org/abs/2603.14288

What It Is¶

"Beyond Prompting: An Autonomous Framework for Systematic Factor Investing via Agentic AI" develops a self-directed agentic framework for systematic factor investing.

The abstract highlights:

autonomous formulation of interpretable trading signals;
out-of-sample validation;
economic rationale requirements;
U.S. equity long-short portfolios;
reported annualized Sharpe ratio of 3.11 and return of 59.53%.

Why It Matters¶

This is important for professor Q&A because it may sound very close to "agentic quant research."

It explicitly tries to move from manual prompting to a self-directed engine.

Gap For Crypto-Alpha-Bench¶

The main weakness for your agenda is that it is not a benchmark environment:

it presents an autonomous factor-investing framework;
it does not define a reusable crypto alpha-search leaderboard;
transaction-cost/fill/capacity modeling is not the core public protocol;
it does not solve cross-agent comparability.

How To Use It¶

Position it as an agent architecture competitor, not a benchmark competitor.

Say:

"Autonomous factor-investing agents are arriving. That makes a hard, public verifier even more urgent."

3.12 Alpha-GPT¶

Sources:

ACL Anthology: https://aclanthology.org/2025.emnlp-demos.14/
arXiv: https://arxiv.org/abs/2308.00016

What It Is¶

Alpha-GPT is a human-AI interactive alpha mining system, published at EMNLP 2025 System Demonstrations.

It introduces a workflow where the human quant researcher supplies or iterates on ideas, and the LLM framework turns those ideas into candidate alphas. The ACL abstract reports that Alpha-GPT ranked top-10 among over 41,000 teams in the WorldQuant International Quant Championship.

Why It Matters¶

This is the best reference for your human-expert-in-the-loop angle.

It does not claim that the agent should replace the quant researcher. Instead, it frames LLMs as a way to implement and expand human alpha hypotheses.

That is close to your revised plan:

compare LLM auto-search against discretionary human experts, then study hybrid loops.

Gap For Crypto-Alpha-Bench¶

Alpha-GPT is:

interactive rather than benchmark-controlled;
not crypto-specific;
not centered on cost/fill/statistical correction;
hard to compare under fixed compute budgets because human interaction is part of the loop.

How To Use It¶

Use Alpha-GPT to justify a human-in-loop track:

autonomous track: agent gets dataset + API + budget;
assisted track: human expert can steer agent;
human-only track: discretionary researcher baseline.

This lets the benchmark compare autonomous, assisted, and human-only research under one protocol.

4. Trading Decision Agents / Neighboring Systems¶

These are not direct formula-alpha-search baselines, but they matter because professors may ask whether your benchmark is really about trading agents rather than alpha mining.

4.1 TradingAgents¶

Source:

arXiv: https://arxiv.org/abs/2412.20138

What It Is¶

TradingAgents is a multi-agent LLM trading framework inspired by a trading firm. The abstract lists specialized roles:

fundamental analysts;
sentiment analysts;
technical analysts;
bull/bear researchers;
risk management team;
traders with varied risk profiles.

Why It Matters¶

TradingAgents is a strong example of:

collaborative LLM trading-decision workflow.

It is useful for RQ2 / open-world agent safety and for human-readable decision processes.

Why It Is Not Your Main Baseline¶

It outputs trading decisions, not formulaic alpha factors or benchmarkable alpha-search trajectories.

For Crypto-Alpha-Bench:

include as neighboring context;
not a primary formula-mining baseline.

4.2 QuantAgent¶

Source:

arXiv: https://arxiv.org/abs/2509.09995

What It Is¶

QuantAgent is a price-driven multi-agent LLM framework for short-horizon / HFT-style trading. The abstract says it decomposes trading into four agents:

Indicator;
Pattern;
Trend;
Risk.

It targets structured short-horizon signals rather than long-horizon text/fundamental reasoning.

Why It Matters¶

QuantAgent is the closest neighboring system to your crypto mid-frequency / microstructure framing.

It explicitly criticizes long-horizon LLM trading agents as ill-suited for high-speed precision-critical trading, which overlaps with your concern.

Gap¶

But according to the abstract:

evaluation focuses on predictive accuracy and trading metrics across 1-hour and 4-hour intervals;
it is not an alpha-search benchmark;
fill / queue / adverse selection / DSR/PBO are not central;
"HFT" is used broadly; it does not solve sub-second execution realism.

How To Use It¶

Use QuantAgent as a neighboring baseline / framing contrast:

QuantAgent says structured signals matter for short-horizon trading. Crypto-Alpha-Bench turns that into a fixed benchmark with cost/fill/statistical constraints.

4.3 FinMem¶

Source:

arXiv: https://arxiv.org/abs/2311.13743

What It Is¶

FinMem is a memory-enhanced LLM trading agent. The abstract lists three modules:

Profiling;
layered Memory;
Decision-making.

It aims to imitate aspects of human trader cognition and improve stock trading outcomes.

Why It Matters¶

FinMem is relevant to the human-expert-in-the-loop revision:

layered memory;
trader-like cognitive structure;
self-evolving professional knowledge;
real-time tuning.

Gap¶

It is not primarily:

formula alpha search;
benchmark infrastructure;
crypto execution/fill evaluation;
multiple-testing-corrected alpha discovery.

How To Use It¶

Use it as a baseline for:

memory architecture;
tacit-knowledge extraction;
discretionary-agent imitation.

Do not use it as a core Crypto-Alpha-Bench alpha-mining baseline.

4.4 FinAgent¶

Source:

arXiv: https://arxiv.org/abs/2402.18485

What It Is¶

FinAgent is a multimodal foundation agent for financial trading. The abstract describes:

tool augmentation;
multimodal market intelligence over numerical, textual, and visual data;
dual-level reflection;
diversified memory retrieval;
reasoning for actions;
integration of trading strategies and expert insights.

Why It Matters¶

FinAgent is the strongest representative of:

multimodal generalist financial trading agents.

It helps position the human-expert-in-the-loop and multimodal context-window direction.

Gap¶

FinAgent is broad; Crypto-Alpha-Bench should be narrower:

fixed crypto mid-frequency data;
executable-alpha metrics;
cost and fill realism;
factor-search protocol;
statistical rigor.

4.5 FinRobot¶

Source:

arXiv: https://arxiv.org/abs/2405.14767

What It Is¶

FinRobot is an open-source AI agent platform for financial applications using LLMs. It is a platform rather than a single alpha-mining model.

The abstract describes layers:

Financial AI Agents;
Financial LLM Algorithms;
LLMOps and DataOps;
Multi-source LLM foundation models.

Why It Matters¶

FinRobot is useful as:

platform reference;
LLMOps/DataOps architecture reference;
not a direct competitor benchmark.

Gap¶

It does not define the executable crypto alpha benchmark you need.

5. Evaluation Frameworks¶

5.1 AlphaEval¶

Source:

arXiv: https://arxiv.org/abs/2508.13174

What It Is¶

AlphaEval is a comprehensive and efficient evaluation framework for formula alpha mining.

Its abstract emphasizes:

backtesting is expensive and sequential;
single metrics are incomplete;
five-dimensional evaluation:
    predictive power;
    stability;
    robustness to market perturbations;
    financial logic;
    diversity.

How It Relates To AlphaBench¶

AlphaBench = benchmark for LLM capabilities in formulaic alpha mining.
AlphaEval = evaluation framework for generated formula alphas.

How It Relates To Crypto-Alpha-Bench¶

Crypto-Alpha-Bench can use AlphaEval-like dimensions, then add:

fixed public crypto data;
explicit cost tiers;
fill/tradability gates;
compute-control;
synthetic ground truth;
DSR/PBO;
human expert baseline.
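The synthetic ground-truth item is the one addition that can be verified exactly, because the planted effect is known. A minimal sketch, assuming a linear data-generating process with one true signal and one decoy feature. The coefficient, sample size, and seed are arbitrary illustration values.

```python
import random

def make_synthetic(n=2000, beta=0.5, seed=7):
    """Returns = beta * signal + noise; the decoy is unrelated to returns by construction."""
    rng = random.Random(seed)
    signal = [rng.gauss(0, 1) for _ in range(n)]
    decoy = [rng.gauss(0, 1) for _ in range(n)]
    returns = [beta * s + rng.gauss(0, 1) for s in signal]
    return signal, decoy, returns

def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

signal, decoy, returns = make_synthetic()
slope_signal = ols_slope(signal, returns)
slope_decoy = ols_slope(decoy, returns)
print(round(slope_signal, 2), round(slope_decoy, 2))
```

A search agent that ranks the decoy above the planted signal on such data is failing for reasons unrelated to market noise. That diagnostic is what the real-data tracks cannot provide.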

Soundbite¶

AlphaEval gives the evaluation axes; Crypto-Alpha-Bench gives the executable benchmark substrate.

6. What Crypto-Alpha-Bench Should Claim¶

6.1 Claims To Avoid¶

Avoid:

"first alpha mining benchmark";
"first LLM financial agent benchmark";
"first alpha auto search framework";
"first financial trading agent benchmark."

Those are too broad and likely false after AlphaBench / RD-Agent(Q) / AlphaEval.

6.2 Defensible Claim¶

Use:

Crypto-Alpha-Bench is a benchmark for executable crypto mid-frequency alpha search, designed to evaluate not only formula quality but cost-adjusted, fill-aware, statistically corrected, compute-controlled discovery.

Even sharper:

Existing benchmarks test whether LLMs can generate alpha formulas. Crypto-Alpha-Bench tests whether alpha-search systems can discover tradable crypto alphas under realistic execution and statistical constraints.

6.3 Minimum Differentiators¶

Do not launch Crypto-Alpha-Bench without these:

Fixed crypto perp dataset:
    e.g. Binance USD-M top-N perps;
    versioned manifest;
    gap handling protocol.
Three cost tiers:
    optimistic;
    realistic;
    pessimistic;
    all leaderboard submissions report all three.
Fill / tradability gate:
    spread;
    top-of-book notional;
    depth;
    adverse selection proxy;
    partial-fill handling.
Statistical rigor:
    DSR;
    PBO / CSCV;
    null-search baseline.
Compute-control:
    token budget;
    wall-clock;
    GPU/CPU tier;
    number of candidate evaluations.
Synthetic ground truth:
    known alpha process;
    known regime shift;
    known execution-cost sensitivity.
Reference baselines:
    random;
    gplearn / GP;
    AlphaBench-style CoE/ToT/EA;
    AlphaAgent regularized LLM;
    LLM+MCTS;
    GFlowNet / AlphaSAGE-style;
    your M8.6 tradability baseline.
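A minimal sketch of the three-tier cost report, assuming cost is charged per unit of turnover in basis points. The tier values below are placeholders chosen for illustration, not proposed benchmark constants, and the position series is a toy example.

```python
# Assumed per-side cost in basis points for each tier (illustrative values only).
COST_TIERS_BPS = {"optimistic": 2.0, "realistic": 5.0, "pessimistic": 10.0}

def net_returns(gross, positions, cost_bps):
    """Per-period PnL after turnover cost; positions[t] is held over gross[t]."""
    net, prev = [], 0.0
    for g, p in zip(gross, positions):
        turnover = abs(p - prev)
        net.append(p * g - turnover * cost_bps / 1e4)
        prev = p
    return net

def tier_report(gross, positions):
    """Total net return under every cost tier, as a leaderboard row would carry."""
    return {
        tier: sum(net_returns(gross, positions, bps))
        for tier, bps in COST_TIERS_BPS.items()
    }

# A high-turnover toy strategy that predicts the sign of every return correctly.
gross = [0.001, -0.0005, 0.002, 0.001]
positions = [1.0, -1.0, 1.0, 1.0]
report = tier_report(gross, positions)
print(report)
```

In this toy case the strategy is profitable under the optimistic and realistic tiers and loses money under the pessimistic one, which is the kind of sensitivity the three-tier requirement is meant to expose.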

Optional but distinctive:

Human expert discretionary baseline:
    only if trader cooperation is realistic.
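The fill / tradability gate from the list above can start as a simple pre-trade check on spread and top-of-book notional. The sketch below is one hypothetical form of it. The thresholds, the participation cap, and the `BookSnapshot` fields are assumptions, and depth, adverse selection, and partial fills are left out.

```python
from dataclasses import dataclass

@dataclass
class BookSnapshot:
    bid: float
    ask: float
    bid_size: float  # base-asset quantity resting at the best bid
    ask_size: float  # base-asset quantity resting at the best ask

def tradable(book, order_notional, max_spread_bps=5.0, max_participation=0.2):
    """Allow a trade only if the spread is tight and the order is small versus the touch."""
    mid = (book.bid + book.ask) / 2
    spread_bps = (book.ask - book.bid) / mid * 1e4
    top_notional = min(book.bid_size * book.bid, book.ask_size * book.ask)
    return (spread_bps <= max_spread_bps
            and order_notional <= max_participation * top_notional)

tight = BookSnapshot(bid=100.00, ask=100.02, bid_size=50.0, ask_size=40.0)
wide = BookSnapshot(bid=100.00, ask=100.20, bid_size=50.0, ask_size=40.0)
print(tradable(tight, 500.0), tradable(tight, 2000.0), tradable(wide, 500.0))
```

Signals that fire only when the gate is closed should count as untradable, whatever their paper IC.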

7. Recommended Survey Workplan¶

7.1 Stage 1 · Close Reading Priority¶

Read in this order:

AlphaBench
RD-Agent(Q)
Hubble
FactorMiner
QuantaAlpha
CogAlpha
AlphaAgent
Navigating the Alpha Jungle
AlphaSAGE
AlphaEval
Beyond Prompting
Alpha-GPT
QuantAgent
TradingAgents
FinMem / FinAgent

7.2 Stage 2 · Extraction Template¶

For each paper/system:

Title:
Year / venue:
Task:
Search unit:
Generator:
Verifier:
Feedback granularity:
Data:
Cost model:
Statistical rigor:
Compute control:
Reproducibility:
Main reported result:
Main weakness:
What Crypto-Alpha-Bench should borrow:
What Crypto-Alpha-Bench should beat:
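To keep extractions comparable across papers, the template can be held as a typed record with a completeness check. The sketch below mirrors the sixteen fields above; the field names are a direct snake_case rendering, and the helper name is hypothetical. The example values are copied from the AlphaAgent row of the gap matrix.

```python
from dataclasses import dataclass, fields

@dataclass
class PaperRecord:
    title: str = ""
    year_venue: str = ""
    task: str = ""
    search_unit: str = ""
    generator: str = ""
    verifier: str = ""
    feedback_granularity: str = ""
    data: str = ""
    cost_model: str = ""
    statistical_rigor: str = ""
    compute_control: str = ""
    reproducibility: str = ""
    main_reported_result: str = ""
    main_weakness: str = ""
    borrow: str = ""  # what Crypto-Alpha-Bench should borrow
    beat: str = ""    # what Crypto-Alpha-Bench should beat

def missing_fields(record):
    """Names of template fields that are still empty for this paper."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]

alpha_agent = PaperRecord(
    title="AlphaAgent",
    task="decay-resistant LLM alpha mining",
    search_unit="formula expression",
    verifier="market evaluation across regimes",
    data="CSI500 + SP500",
)
print(missing_fields(alpha_agent))
```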

7.3 Stage 3 · Benchmark Design Decision¶

After reading:

if AlphaBench already covers enough of formula mining, narrow Crypto-Alpha-Bench to executable crypto alphas;
if RD-Agent(Q) is too strong as a system, avoid competing as an "agent architecture"; compete as a "hard evaluation environment";
if QuantaAlpha / AlphaAgent / MCTS outperform simple baselines, include them as reference baselines rather than reinvent them.

8. Updated HKU Talk Adjustment¶

Add one slide before "The Field Has No ImageNet Moment":

Slide: Existing SOTA Is Close, But Not The Same¶

Existing Work	Covers	Missing For My Goal
AlphaBench	LLM formula alpha benchmark	executable crypto cost/fill/statistics
RD-Agent(Q)	full quant R&D agent	benchmark substrate
Hubble / FactorMiner / CogAlpha	safe generation, memory, code evolution	common executable crypto protocol
AlphaAgent / QuantaAlpha / Alpha Jungle	search methods	fixed benchmark discipline
Alpha-GPT	human-AI interactive alpha mining	controlled human/agent comparison
TradingAgents / QuantAgent	trading decisions	alpha-search benchmark
AlphaEval	evaluation dimensions	fixed benchmark + tradability

Then say:

"So the claim is not that nothing exists. The claim is that these systems stop before the executable crypto-alpha layer I need."

This makes the proposal robust to the objection that existing systems already cover the same ground.

9. Final Positioning¶

Best one-sentence positioning:

Crypto-Alpha-Bench is not another LLM alpha-mining agent. It is the executable verifier and benchmark layer that current alpha-mining agents lack.

Best professor-facing version:

"I want to benchmark not just whether an agent can produce a plausible factor, but whether the factor survives costs, fills, time-slice instability, multiple testing, and compute-controlled comparison in crypto markets."

10. Source Index¶

AlphaBench project: https://alphabench.cc/
AlphaBench OpenReview: https://openreview.net/forum?id=d97Q8r7ZKZ
RD-Agent(Q) Microsoft Research: https://www.microsoft.com/en-us/research/publication/rd-agent-quant-a-multi-agent-framework-for-data-centric-factors-and-model-joint-optimization/
RD-Agent(Q) arXiv: https://arxiv.org/abs/2505.15155
QuantaAlpha: https://arxiv.org/abs/2602.07085
AlphaAgent: https://arxiv.org/abs/2502.16789
Navigating the Alpha Jungle: https://arxiv.org/abs/2505.11122
AlphaSAGE: https://arxiv.org/abs/2509.25055
Chain-of-Alpha: https://arxiv.org/abs/2508.06312
Hubble: https://arxiv.org/abs/2604.09601
FactorMiner: https://arxiv.org/abs/2602.14670
CogAlpha: https://arxiv.org/abs/2511.18850
Beyond Prompting: https://arxiv.org/abs/2603.14288
Alpha-GPT ACL Anthology: https://aclanthology.org/2025.emnlp-demos.14/
Alpha-GPT arXiv: https://arxiv.org/abs/2308.00016
TradingAgents: https://arxiv.org/abs/2412.20138
QuantAgent: https://arxiv.org/abs/2509.09995
FinMem: https://arxiv.org/abs/2311.13743
FinAgent: https://arxiv.org/abs/2402.18485
FinRobot: https://arxiv.org/abs/2405.14767
AlphaEval: https://arxiv.org/abs/2508.13174