Alpha Auto Search · Research Plan

A consolidated research plan on AI-driven alpha factor search — what's been done, what's missing, and what's worth doing next.

Last updated: 2026-05-18 · Maintained by Paul Weng (paulweng)


0. Executive Summary

In 2025, AI for Science / AI for Math entered the "AlphaGo Moment" era. The ASI-ARCH paper from GAIR ("AlphaGo Moment for Model Architecture Discovery", arXiv:2507.18074) reported an empirical scaling law for scientific discovery itself: cumulative SOTA findings grow roughly linearly with compute, which suggests research output can be scaled computationally.

This repo systematically maps what these advances mean for automated alpha factor search in quantitative finance, identifies a clear research-shaped gap, and proposes a concrete agenda.

Core thesis (one sentence):

Alpha auto search has the same generator-verifier structure as AlphaProof / AlphaEvolve / ASI-ARCH, but the field has no unified baseline. Building one is both a research contribution in its own right and the prerequisite for everything else.

Concrete next move: Crypto-Alpha-Bench, a unified benchmark for alpha auto search, modeled on ImageNet / SWE-Bench style infrastructure papers (see Section 5).


1. Background · Why This Matters Now

Three things happened in 2024-2026 that make this timing right:

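1.1 AI for Science moved from a theoretical paradigm to a reproducible engineering paradigm

  • FunSearch (Nature 2023) showed that LLM + evolution can solve open problems in mathematics
  • AlphaProof / AlphaGeometry 2 (Nature 2025) reached silver-medal level at the IMO and established the "generator + formally trusted verifier" paradigm
  • AlphaEvolve (DeepMind 2025) turned this approach into an engineered system and found a 48-multiplication algorithm for 4x4 matrix multiplication (for complex-valued matrices; the first improvement over Strassen's 1969 algorithm in 56 years)
  • AI Scientist v2 (Sakana 2025) produced the first AI-written paper to pass peer review (ICLR 2025 workshop)
  • ASI-ARCH (GAIR 2025) discovered 106 new SOTA linear-attention architectures in 20,000 GPU hours and reported the first Scaling Law for Scientific Discovery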

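1.2 LLM-driven alpha mining work arrived in a concentrated burst

In 2025 alone, AlphaAgent (arXiv 2502.16789), Navigating the Alpha Jungle (arXiv 2505.11122), QuantaAlpha, FactorMAD, and others appeared, following earlier systems such as Alpha-GPT. Each paper sets up its own evaluation, so results cannot be compared with one another. FITEE 2025 published a "Survey on LLM-based Alpha Mining" (link.springer.com/article/10.1631/FITEE.2500386), confirming that this is an emerging subfield.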

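1.3 Rigorous statistical methodology is mature but has not yet reached LLM-driven work

The López de Prado line (PBO 2013, DSR 2014), Harvey-Liu-Zhu 2016 RFS, Lucky Factors 2021 JFE, and Hou-Xue-Zhang 2020 RFS together provide a complete toolbox for rigorous multiple testing in finance. Yet across the 2025 LLM-driven alpha literature, almost nothing cites this line. That is a conspicuous research gap: bringing a rigorous multiple-testing framework into LLM-driven evaluation would immediately set the work apart from existing papers in methodological rigor.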


2. Completed Work

This repo currently contains 3 research artifacts:

2.1 alpha_search_baselines.md — Frontier Baseline Notes

Close reading of 5-7 frontier papers, each organized from two angles, "methodology breakdown + transfer path to alpha search":

  • FunSearch (Nature 2023): LLM + evolution + program search
  • AlphaProof / AlphaGeometry 2 (Nature 2025): neuro-symbolic + RL + formal verification
  • AlphaEvolve (arXiv 2506.13131): an engineering-grade upgrade of FunSearch
  • AI Scientist v2 (arXiv 2504.08066): full research loop + Agentic Tree Search
  • ASI-ARCH (arXiv 2507.18074): multi-agent + scaling law
  • OpenSeeker (arXiv 2603.15594): open-source small models + synthetic training data paradigm

Plus an analysis of meta-levels (L0-L4) and a comparison against existing transfer work in finance (AlphaAgent / Alpha Jungle / QuantaAlpha).

2.2 alpha_search_survey_taxonomy_and_bibliography.md — 8-Tradition Systematic Survey

Builds a 6-dimension taxonomy for the whole field (search unit / generator / verifier / knowledge grounding / evaluation rigor / self-evolution level), then compiles a bibliography systematically across 8 traditions:

  1. Classical GP / Symbolic Regression for Finance (gplearn → AlphaSAGE/GFlowNet)
  2. Deep Learning Factor Models (FactorVAE / HIST / FactorGCL)
  3. LLM-Driven Alpha Mining (already covered, supplemented by the FITEE 2025 survey)
  4. AI for Science Transfer (the main focus of the baseline notes)
  5. Rigorous Backtesting Methodology (PBO / DSR / Harvey-Liu / Hou-Xue-Zhang): a blind spot not touched at all before this survey
  6. Time-Series Foundation Models (PatchTST / Chronos / TimesFM / Moirai / Lag-Llama / Encoding Recurrence)
  7. Conformal Prediction for TS & LLM Agents (TCP / Sequential CP / Prune n Predict)
  8. Factor Zoo & Anomaly Replication (Hou-Xue-Zhang / JKP / Taming the Factor Zoo)

The survey closes by surfacing 5 explicit research gaps (Gap 1-5), two of which map directly onto the RQs below.

2.3 alpha_search_deep_reads.md — Three Deep-Read Clusters

Deep-read clusters prepared for the key questions:

  • Cluster 1: the López de Prado line + Harvey-Liu "Lucky Factors" + Hou-Xue-Zhang Replicating Anomalies. This line supplies the language for turning empirical observations into methodological conclusions
  • Cluster 2: time-series foundation models + Encoding Recurrence into Transformers (Li Guodong, ICLR 2023 Oral). This provides the positioning contrast for RQ1
  • Cluster 3: AlphaEval (arXiv 2508.13174), the most recent (2025) "alpha mining unified evaluation framework"

3. Core Thesis

Judgments formed after reading the frontier work, the systematic survey, and the deep reads:

3.1 Pattern Recognition

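All breakthrough work of the past 3 years shares 4 patterns:

  1. Generator-Verifier Separation: creativity is decoupled from guarantees
  2. Cognition Base / Knowledge Grounding: structured domain priors are the hidden causal variable behind the scaling law
  3. Multi-Agent Decomposition: separation of responsibilities matters more than model capability
  4. Scaling Law for Discovery: compute can be converted directly into discovery output (provided the first 3 hold)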
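
| Frontier Pattern | Current state of alpha auto search | What I have delivered (production system) |
| --- | --- | --- |
| Generator-Verifier Separation | Partly implemented in LLM-driven work, but the verifier is too weak | ✅ Production-grade (see whatsapp_AI_Trading_Agent) |
| Hard Verifier | Mostly a single IC/Sharpe, with no multiple-testing correction | ✅ Walk-forward + microstructure gate + adaptive state controller |
| Cognition Base | Almost nobody does this | ❌ Not done at all |
| Researcher Agent | Partly implemented (AlphaAgent / Alpha Jungle) | ❌ Not done at all |
| Compute-scaled discovery | No controlled experiment | ❌ Single-person, local scale |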

I have the Verification half; what is missing is the Generation/Discovery half, and the field as a whole lacks a unified baseline.

3.3 Key Insight: The Field Has No ImageNet Moment

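The deepest judgment after reading through Traditions 1-8:

The alpha auto search field currently has no accepted baseline. Each paper defines its own data, cost model, metrics, and fold structure, so results are not comparable. Every "SOTA" is a claim made under hand-picked evaluation. This is itself the root obstacle to establishing a scaling law in finance: without fixed evaluation, the causal link "compute → discovery" cannot be verified.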


4. Three Core Research Questions

The three RQs support one another and form a research triangle:

RQ1 · Architectural Priors for Crypto Microstructure

Encoding Microstructure Recurrence: Architectural Inductive Biases for High-Frequency Financial Time Series

Hypothesis:

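The decision boundary in the joint (OHLCV × microstructure × time) space contains a recursive structure: price action feeds back into microstructure, and microstructure feeds back into price. Tabular / tree-based models learn this structure as noise. Under this hypothesis, that is the root cause of the significant overfitting of LightGBM / Optuna under a chronological split (val MTM +44 → test MTM -86).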

Approach:

Extend the core idea of Li Guodong's ICLR 2023 Oral Encoding Recurrence into Transformers, namely designing an explicit architectural prior for recursive structure, to financial microstructure data. Concretely:

  • the bookTicker → OHLCV → bookTicker feedback loop, treated as cross-modality recurrence
  • hierarchical recurrence across time scales (15s tick / 1m bar / 1h trend)
  • benchmark (i) tree models, (ii) standard time-series Transformers, and (iii) an inductive-bias-aware architecture on the 526-symbol × 12-fold walk-forward pipeline
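
The walk-forward protocol named in the last bullet can be sketched as follows. This is a minimal illustration, assuming an expanding training window and equal-length test folds. The function name `walk_forward_folds`, the `embargo` gap, and the sample counts are hypothetical and are not taken from the production pipeline.

```python
from typing import Iterator, Tuple


def walk_forward_folds(
    n_samples: int, n_folds: int = 12, min_train: int = 1000, embargo: int = 0
) -> Iterator[Tuple[range, range]]:
    """Yield (train_idx, test_idx) pairs in chronological order.

    Assumption: expanding training window; each test fold starts after the
    training window plus an optional embargo gap, so no test sample ever
    precedes a training sample.
    """
    test_size = (n_samples - min_train - embargo) // n_folds
    if test_size <= 0:
        raise ValueError("not enough samples for the requested folds")
    for k in range(n_folds):
        train_end = min_train + k * test_size
        test_start = train_end + embargo
        yield range(0, train_end), range(test_start, test_start + test_size)


# Hypothetical sizes: 13,000 bars for one symbol, 12 folds.
folds = list(walk_forward_folds(n_samples=13_000, n_folds=12, min_train=1000))
```

Running the same splitter per symbol keeps the fold boundaries identical across the three model families, so differences in out-of-sample results are not an artifact of different splits.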

Why it matters:

  • Academic significance: extends architectural-prior research to a domain with hard verification (real PnL)
  • Positioning: weighs the scaling route of 2024-2026 time-series foundation models against the architectural-prior route; the 2025 position paper "No Champions in Long-Term TSF" indicates the question is far from settled
  • Connection to Prof. Guodong Li: directly extends the ICLR 2023 Oral work

RQ2 · Open-World LLM Agent Safety

Beyond Schema Validation: Statistical Safety Guarantees for LLM Agents in High-Stakes Decision Settings

Hypothesis:

In an open-world setting, the failure modes you can enumerate are not the failure modes that will actually happen.

Current LLM agent safety mechanisms are all deterministic rules (schema validation / cap stack / reduce-only / KMS abstraction / append-only audit). These are necessary in production, but they assume you can write down every rule.

Approach:

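Apply conformal prediction to confidence estimation for LLM agent output, giving each LLM-generated intent a distribution-free coverage guarantee. Design a dynamic safety margin mechanism: when the conformal interval widens (a distribution-drift signal), the system automatically tightens the thresholds of the deterministic gates. Use the existing production system as the testbed, fitting some of the L1-L6 layers with this mechanism for controlled comparison experiments.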
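
One way to make this mechanism concrete is split conformal prediction plus a width-driven gate. The sketch below assumes the nonconformity score is an absolute residual between what the agent predicted and what was realized. The tightening rule `tightened_cap` and all numbers are hypothetical illustrations, not the production design.

```python
import math
from typing import Sequence


def conformal_quantile(scores: Sequence[float], alpha: float = 0.1) -> float:
    """Split-conformal quantile of calibration nonconformity scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, which gives
    marginal coverage of at least 1 - alpha under exchangeability.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def tightened_cap(base_cap: float, baseline_width: float, current_width: float) -> float:
    """Hypothetical rule: shrink a deterministic cap as the interval widens."""
    if current_width <= baseline_width:
        return base_cap
    return base_cap * baseline_width / current_width


# 19 made-up calibration residuals: 0.1, 0.2, ..., 1.9
calib_scores = [0.1 * i for i in range(1, 20)]
q_hat = conformal_quantile(calib_scores, alpha=0.1)

# Interval is prediction +/- q_hat, so its width is 2 * q_hat.
cap = tightened_cap(base_cap=100.0, baseline_width=1.0, current_width=2 * q_hat)
```

The coverage guarantee holds only under exchangeability, which drift violates. That is why the sketch treats a widening interval as a warning signal that tightens the deterministic gate rather than as a guarantee on its own.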

Why it matters:

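  • Academic significance: current LLM safety research sits mainly at the alignment / RLHF level and lacks statistically guaranteed production-time safety mechanisms
  • Connection to Prof. Kai Han: open-world reliability is the stated mission of the Visual AI Lab; crypto markets are an extreme instance of open-world settings
  • Cross-modal: in principle the methodology extends to other open-world settings such as vision agents and robotic agents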

RQ3 · Cognition Base for Financial Domains

Building a Compute-Scalable Cognition Base for Discovery in Open Financial Domains

Hypothesis:

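The quality of the Cognition Base is the hidden causal variable that determines whether the scaling law holds.

ASI-ARCH's scaling law rests on the premise that its Cognition Base injects decades of human domain priors. In replicating this architecture for finance, the largest engineering challenge is not the multi-agent framework but building the Cognition Base, and this Cognition Base must be replication-aware.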

Approach:

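Structure knowledge across data sources:

  • Hou-Xue-Zhang factor zoo + JKP factor library (verified anomalies first)
  • microstructure papers (market microstructure literature)
  • regulatory filings + listed-company announcements
  • post-mortems of historical anomalies + records of their decay

From each entry, extract {mechanism / test method / applicable universe / failure regime / replication strength / time-stamped validation}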

Key operating judgment (driven directly by Hou-Xue-Zhang 2020):

Hou-Xue-Zhang 2020 retested 452 published anomalies and found that 65% fail the single-test hurdle and 82% fail the multiple-test hurdle. The Cognition Base therefore cannot be a mere collection of published anomalies; it must be replication-aware structured knowledge, with every entry carrying a replication-strength weight. Researcher Agent queries are ranked by that weight.
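
The entry schema and weight-ordered query described above could look like the sketch below. The field names follow the extraction list in this section. The class `CognitionEntry`, the function `query`, the [0, 1] weight scale, and the sample records are assumptions for illustration only, since the plan does not specify a scoring rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CognitionEntry:
    """One knowledge-base record; fields mirror the extraction schema."""
    name: str
    mechanism: str
    test_method: str
    universe: str
    failure_regime: str
    replication_strength: float  # assumed to lie in [0, 1]
    validated_on: str            # ISO date of the last time-stamped validation


def query(entries: List[CognitionEntry], universe: str, top_k: int = 3) -> List[CognitionEntry]:
    """Return entries for a universe, strongest replication evidence first."""
    hits = [e for e in entries if e.universe == universe]
    return sorted(hits, key=lambda e: e.replication_strength, reverse=True)[:top_k]


# Placeholder records, not real replication results.
entries = [
    CognitionEntry("factor_a", "placeholder", "t-test", "crypto-perp", "low liquidity", 0.2, "2025-01-01"),
    CognitionEntry("factor_b", "placeholder", "t-test", "crypto-perp", "trend regime", 0.9, "2025-01-01"),
    CognitionEntry("factor_c", "placeholder", "t-test", "us-equity", "post-publication", 0.7, "2025-01-01"),
]
ranked = query(entries, universe="crypto-perp")
```

Keeping the weight as an explicit field makes the ablation in this section straightforward: a "low-quality" Cognition Base can be simulated by shuffling or flattening `replication_strength` while leaving everything else fixed.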

Ablation study:

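Same multi-agent framework + different Cognition Bases (low / medium / high quality) → compare the scaling curves. If quality significantly affects scaling, that supports the hypothesis that the Cognition Base causally drives the scaling law.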


5. The Core Proposal · Crypto-Alpha-Bench

This is the prerequisite infrastructure for the three RQs above, and a research contribution that can stand alone as a publication.

5.1 Motivation: Why a Benchmark Now

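The strongest impression after reading the whole 8-tradition survey: this field lacks an ImageNet moment.

| Current problem | Consequence |
| --- | --- |
| Each paper defines its own evaluation | Methods cannot be compared; SOTA claims cannot be verified |
| No fixed cost model | The same method's results differ several-fold between optimistic and pessimistic costs |
| No compute-controlled comparison | The SOTA of LLM methods may come purely from a compute advantage |
| Deflated Sharpe / PBO almost never reported | False discovery rate is unknown |
| No negative-control baseline | "I beat random" is not a legitimate contribution claim |
| Industrial alpha and academic alpha are not distinguished | High-IR but non-executable factors get treated as winners |

Core claim: until a unified baseline exists, the "compute → discovery" causal link in alpha auto search cannot be verified, and a scaling law cannot be established in finance. Build the benchmark first, then discuss scaling.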

5.2 Six Requirements of a Good Baseline

R1 · Fixed public dataset

Specify the universe (Binance USD-M top-200 perp), the time window (2022-01 → 2025-12), and the frequency (1m / 15s / tick). Use a publicly releasable format (parquet + manifest) and publish on HuggingFace Datasets.

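R2 · Three-tier cost model

| Tier | Slippage | Spread | Queue Priority Penalty |
| --- | --- | --- | --- |
| Optimistic | 0 | mid-price | |
| Realistic | empirical impact | half-spread | partial fill probability |
| Pessimistic | upper-bound impact | full spread | queue priority hard rejection |

Reporting metrics under all three tiers is mandatory.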
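
A minimal sketch of how one gross edge could be reported under all three tiers. Every number below is a placeholder rather than a calibrated value, and the names `CostTier` and `expected_net_bps` are hypothetical. As a simplification, the queue penalty is modeled as a fill probability, so the pessimistic tier's "hard rejection" is approximated by a low probability rather than an explicit reject rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostTier:
    name: str
    slippage_bps: float      # price impact per fill, in basis points
    spread_fraction: float   # share of the quoted spread paid (0 = mid-price)
    fill_probability: float  # chance a passive order fills (queue penalty)


# Placeholder parameters for illustration only.
TIERS = [
    CostTier("optimistic", slippage_bps=0.0, spread_fraction=0.0, fill_probability=1.0),
    CostTier("realistic", slippage_bps=1.0, spread_fraction=0.5, fill_probability=0.7),
    CostTier("pessimistic", slippage_bps=3.0, spread_fraction=1.0, fill_probability=0.4),
]


def expected_net_bps(gross_bps: float, spread_bps: float, tier: CostTier) -> float:
    """Expected net edge per order; unfilled orders earn nothing."""
    cost = tier.slippage_bps + tier.spread_fraction * spread_bps
    return tier.fill_probability * (gross_bps - cost)


# Same hypothetical signal (5 bps gross edge, 2 bps quoted spread) under each tier.
report = {t.name: expected_net_bps(gross_bps=5.0, spread_bps=2.0, tier=t) for t in TIERS}
```

With these placeholder inputs the edge goes from 5 bps in the optimistic tier to zero in the pessimistic one, which is the kind of spread between tiers that mandatory three-tier reporting is meant to expose.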

R3 · Compute-controlled budget

Fix the token budget / GPU hours / wall-clock time. LLM-driven methods and GP differ in compute by 1-2 orders of magnitude; without controlling for it, the comparison is unfair.
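
A budget of this kind can be enforced with a small tracker that every search method has to charge against. The sketch below is one possible shape, not the benchmark's protocol; the class `ComputeBudget`, its caps, and the per-step token cost are hypothetical.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ComputeBudget:
    """Tracks spend against fixed caps; a run stops when any cap is reached."""
    max_tokens: int
    max_gpu_hours: float
    max_wall_seconds: float
    tokens_used: int = 0
    gpu_hours_used: float = 0.0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int = 0, gpu_hours: float = 0.0) -> None:
        self.tokens_used += tokens
        self.gpu_hours_used += gpu_hours

    def exhausted(self) -> bool:
        wall = time.monotonic() - self.started_at
        return (
            self.tokens_used >= self.max_tokens
            or self.gpu_hours_used >= self.max_gpu_hours
            or wall >= self.max_wall_seconds
        )


budget = ComputeBudget(max_tokens=1_000, max_gpu_hours=100.0, max_wall_seconds=3600.0)
evaluated = 0
while not budget.exhausted():
    budget.charge(tokens=300)  # stand-in for one generate-and-verify step
    evaluated += 1
```

Because the loop checks the budget before each step, the last step may overshoot a cap slightly; a stricter protocol would reject any step whose cost does not fit in the remaining budget.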

R4 · Mandatory reporting on 5+ evaluation dimensions

| Dimension | Source |
| --- | --- |
| Predictive Power | AlphaEval 2025 |
| Stability | AlphaEval 2025 |
| Robustness to Market Perturbations | AlphaEval 2025 |
| Financial Logic | AlphaEval 2025 |
| Diversity | AlphaEval 2025 |
| Capacity | Added here (industry relevance) |
| Deflated Sharpe Ratio | López de Prado 2014 |
| PBO | Bailey et al. 2013 |

No cherry-picking.
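
For reference, the Deflated Sharpe Ratio row can be computed with the standard library alone. The sketch follows the published formula (expected maximum Sharpe under multiple trials, then a Probabilistic Sharpe Ratio against that benchmark). The function names and the example inputs are illustrative assumptions, and `n_trials` must be at least 2.

```python
import math
from statistics import NormalDist

_N = NormalDist()
EULER_GAMMA = 0.5772156649015329


def expected_max_sharpe(sr_variance: float, n_trials: int) -> float:
    """Expected maximum Sharpe ratio among n_trials skill-less strategies."""
    g = EULER_GAMMA
    return math.sqrt(sr_variance) * (
        (1 - g) * _N.inv_cdf(1 - 1 / n_trials)
        + g * _N.inv_cdf(1 - 1 / (n_trials * math.e))
    )


def deflated_sharpe(sr: float, n_obs: int, skew: float, kurt: float,
                    sr_variance: float, n_trials: int) -> float:
    """Probability that the true Sharpe ratio exceeds the multiple-testing benchmark.

    sr is per-period (not annualized); kurt is raw kurtosis (3 for a normal);
    sr_variance is the variance of Sharpe ratios across the trials.
    """
    sr0 = expected_max_sharpe(sr_variance, n_trials)
    denom = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return _N.cdf((sr - sr0) * math.sqrt(n_obs - 1) / denom)


# Same observed Sharpe, different numbers of factors tried (made-up inputs).
common = dict(sr=0.1, n_obs=1000, skew=0.0, kurt=3.0, sr_variance=0.001)
dsr_few = deflated_sharpe(n_trials=2, **common)
dsr_many = deflated_sharpe(n_trials=100, **common)
```

The comparison at the end shows why the number of trials has to be reported: the same backtest looks less convincing once it is known to be the best of 100 candidates rather than the best of 2, which matters for LLM-driven search where candidate counts are large.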

R5 · Synthetic ground-truth sub-task

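Alongside the main benchmark, add a synthetic-data sub-task: generate data from a known true alpha and test whether a method recovers it. This isolates "method capability" from "data luck". Toy tasks with known structure have long been standard in vision; almost nobody does this in alpha mining.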
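
A toy version of such a sub-task is sketched below, assuming a single planted linear signal among pure-noise decoys and using IC (Pearson correlation with the return each signal is aligned to) as the recovery criterion. The generator `make_task`, its parameters, and the planted index are hypothetical; a real sub-task would need more realistic dynamics.

```python
import random
from typing import List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation, written out to stay within the standard library."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def make_task(n_obs: int, n_signals: int, planted: int, beta: float,
              seed: int) -> Tuple[List[List[float]], List[float]]:
    """Return (signals, returns) where only signals[planted] carries alpha."""
    rng = random.Random(seed)
    signals = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_signals)]
    returns = [beta * s + rng.gauss(0, 1) for s in signals[planted]]
    return signals, returns


signals, returns = make_task(n_obs=2000, n_signals=20, planted=7, beta=0.3, seed=0)
ics = [pearson(s, returns) for s in signals]
recovered = max(range(len(ics)), key=lambda i: abs(ics[i]))
```

Because the planted index is known, a method's score on this task is a recovery rate rather than a backtest statistic, so it cannot be inflated by favorable market periods.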

R6 · Replication-aware "must-beat" baseline

Anomalies verified on the JKP factor library serve as the minimum baseline. Any new method must beat the JKP-verified baseline before a contribution can be discussed. "I beat random" is not accepted as a legitimate claim.

5.3 Reference Baselines to Implement

At least 6-7 reference baselines, covering the traditions:

| Baseline | Tradition | Purpose |
| --- | --- | --- |
| Random Search | Control | Negative control floor |
| JKP-Verified Anomaly Pool | Factor Zoo | Must-beat baseline |
| gplearn (default config) | Classical GP | Tradition 1 representative |
| FactorVAE | DL Factor | Tradition 2 representative |
| AlphaAgent | LLM-driven | Tradition 3 representative |
| Frozen-LLM-prompting (GPT-4 direct) | LLM | Naïve LLM baseline |
| M8.6 Walk-forward + Adaptive State | Tradability gate | My existing production system |

The last row is the key one: package my existing walk-forward + microstructure gate + adaptive state controller as a standard tradability-gate baseline that distinguishes "academic alpha" from "executable alpha".

5.4 Strategic Positioning

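Why a benchmark paper is more worthwhile than a method paper (at this stage):

  1. Reviewer-friendly: a method paper has to beat SOTA (subject to reviewer discretion), while a benchmark paper only needs a rigorous protocol. The NeurIPS Datasets & Benchmarks track and ICLR Benchmarks both offer dedicated venues
  2. High leverage: every later alpha auto search paper would have to cite or use the protocol, giving 10x+ the impact of a single method paper
  3. Aligned with existing engineering strengths: I already have 70% of the infrastructure; replicating it would take others 6-12 months
  4. Publication vehicle for the three RQs: once the benchmark exists, RQ1 / RQ2 / RQ3 naturally become "establish SOTA or a negative result on the benchmark"
  5. Dual alignment for the HKU application: Prof. Li can own the statistical-rigor part, and Prof. Han the open-world / adversarial-robustness part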

5.5 Risks & Open Questions

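R-1 · Data releasability: Binance USD-M historical data is itself public, but whether my cleaned, aligned, gap-removed version can be released needs confirmation. Similar crypto OHLCV data exists on HuggingFace Datasets, but nobody has published a 15s high-frequency, full-market, cleaned version.

R-2 · Data drift: crypto evolves quickly, and 2022 data is not representative of 2026. The benchmark needs to be versioned (v1 / v2 / v3, refreshed every 6 months). This is a feature rather than a bug: handling distribution drift is itself a capability the benchmark aims to evaluate.

R-3 · Subjectivity of the cost model: how should the concrete parameters of the three cost tiers be set? Proposal: back them out from public data, using Binance fills data to estimate realistic slippage / queue priority, and publish the calibration procedure.

R-4 · Who maintains the leaderboard: I maintain it in the first year. From the second year it needs a community / institutional sponsor (HKU lab? arXiv funding?).

R-5 · Whether the compute budget deters participation: a 100 GPU-hour budget is affordable for an academic lab but tight for individuals. Budgets can be split into small / medium / large tiers, each with its own leaderboard.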


6. Connection to HKU

6.1 Prof. Kai Han (CDS / Visual AI Lab)

  • Open-world reliability: crypto perpetuals are an extreme instance of open-world learning (non-stationary, adversarial, unknown failure modes, costly mistakes)
  • Foundation-model perspective: the LLM L1-L6 agent stack is a deployment case of foundation models in a high-risk decision environment
  • Directly relevant to RQ2: Open-world LLM Agent Safety = his lab's open-world theme + my production testbed
  • He could own the benchmark's robustness/adversarial sub-module

6.2 Prof. Guodong Li (SAAS)

  • Time series + financial econometrics are a direct match
  • The ICLR 2023 Oral Encoding Recurrence is the direct anchor for RQ1: my RQ1 extends that thesis from standard TSF to crypto microstructure
  • High-dimensional ML + Quantile Regression: methodological contributions to both cross-symbol joint modeling and conditional-distribution modeling
  • He could own the benchmark's statistical-rigor part (PBO / DSR / synthetic ground-truth)
  • Small-sample time series (his 2018 collaboration direction with Huawei) overlaps methodologically with the crypto regime-stability problem

6.3 Combined Value Proposition

I bring production-grade infrastructure for the verification half of alpha auto search. HKU's Visual AI Lab brings open-world ML methodology; SAAS brings time series statistical rigor. Together we build the field's first unified benchmark.


7. 12-Week Roadmap

| Phase | Week | Deliverable |
| --- | --- | --- |
| 0. Report | 0 (this week) | HKU talk to Prof Han + Prof Li; share this RESEARCH_PLAN |
| 1. Proposal sharpening | 1-2 | Based on talk feedback, write 8-page benchmark proposal |
| 2. Dataset prep | 3-5 | Clean + release Crypto-Alpha-Bench v0 dataset on HuggingFace |
| 3. Protocol & metrics | 4-6 | Write protocol spec; implement evaluation infrastructure (5+ metrics, 3 cost tiers, DSR + PBO) |
| 4. Reference baselines | 5-8 | Implement 6-7 reference baselines on benchmark |
| 5. Synthetic ground-truth | 6-9 | Build synthetic data generator with known alpha; run reference baselines on it |
| 6. Public leaderboard | 9-10 | Launch v0 leaderboard; document protocol |
| 7. Benchmark paper draft | 10-12 | NeurIPS Datasets & Benchmarks submission draft |

Parallel track (RQ1 first preliminary):

| Week | Sub-deliverable |
| --- | --- |
| 4-7 | RQ1: implement REM-inspired architectural prior; baseline against PatchTST / Chronos zero-shot |
| 7-10 | RQ1: walk-forward results + statistical significance (DSR + PBO) |
| 10-12 | RQ1: combined with benchmark paper as "use case" section |

RQ2 + RQ3 are deferred to Phase 7+ (after the benchmark is established).


8. Risks · Open Questions · Reality Checks

8.1 What if benchmark idea doesn't fly?

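If HKU's feedback is that "a benchmark is not a research contribution", the fallback plan B is:

  • do RQ1 directly as a methodology paper (architectural prior for crypto microstructure)
  • use the existing M8.6 infrastructure as the evaluation testbed (without public release)
  • push the benchmark idea implicitly inside the RQ1 paper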

8.2 What if I'm wrong about "no champion"?

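If time-series foundation models (Chronos / TimesFM) are in fact good enough on high-frequency crypto, the research niche for RQ1 shrinks. A preliminary experiment must first verify that foundation models really do perform poorly on 15s crypto data. This can be run in 2-3 weeks.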

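8.3 What if the Cognition Base has a poor return on investment?

Building a replication-aware Cognition Base is data engineering, and 6-12 months may pass without a method-level breakthrough. Plan B is to narrow the Cognition Base to the crypto microstructure literature (on the order of a hundred papers) and prove a minimum viable case first.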

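8.4 Transition risk: solo → team

I am moving from single-developer work into a research environment, and the biggest risk is collaboration friction. Preemptive mitigation:

  • all code / data / protocol follow strict docs + reproducibility standards
  • use git + CI + standardized workflows from the start
  • use "design philosophy" documents of the self-evolution research reference kind as onboarding material

8.5 Tension between academic pressure and production discipline

Academia rewards novelty; production rewards reliability. I am explicitly aware of this tension and consider "benchmark + rigorous verification" the natural intersection of the two: learn academic novelty while keeping production discipline.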


9. Repository Contents Index

alpha-search-frontier-notes/
├── README (this file is RESEARCH_PLAN.md – also serves as entry point)
├── RESEARCH_PLAN.md                                       ← you are here
├── alpha_search_baselines.md                              ← close reading of 5-7 frontier papers
├── alpha_search_survey_taxonomy_and_bibliography.md       ← 8-tradition systematic bibliography
├── alpha_search_deep_reads.md                             ← three deep-read clusters (talk-version summary)
├── alpha_search_deep_reads_expanded.md                    ← expanded deep reads: paper-by-paper method analysis + experiment plans
├── financial_sota_agent_survey.md                         ← detailed notes on financial SOTA agents / benchmark gaps
├── crypto_alpha_bench_risk_analysis.md                    ← counter-argument risk analysis for the benchmark
├── human_expert_in_loop_research_direction.md             ← revised direction: human expert baseline / tacit knowledge
├── HKU_MEETING_PREP_2026-05-20.md                         ← HKU pre-meeting preparation pack
├── HKU_ONE_PAGE_HANDOUT_2026-05-20.md                     ← one-page summary that can be sent to the professors
├── HKU_12_SLIDE_DECK_2026-05-20.md                        ← 12-slide condensed slide draft
├── HKU_REPORT_OUTLINE_2026-05-20.md                       ← outline for tomorrow's talk + Q&A anchors
├── HKU_REPORT_OUTLINE_COMPACT_2026-05-20.md               ← focused version: recent work → review → baseline suite
├── HKU_30_SLIDE_DECK_2026-05-20.md                        ← 30-slide extended slide outline
├── HKU_Crypto_Alpha_Bench_Report_2026-05-20.pptx          ← PPTX for tomorrow's talk (full benchmark narrative version)
├── HKU_Baseline_Suite_Compact_Report_2026-05-20.pptx      ← PPTX for tomorrow's talk (focused outline version)
└── HKU_30_Slide_Baseline_Suite_Report_2026-05-20.pptx     ← PPTX for tomorrow's talk (30-slide extended version)

Reading order suggested:

  1. First-time visitor → this RESEARCH_PLAN.md only (30 min)
  2. Specific frontier paper → alpha_search_baselines.md (Section 1-5)
  3. Wide-coverage survey → alpha_search_survey_taxonomy_and_bibliography.md
  4. Q&A preparation / methodological rigor → alpha_search_deep_reads.md
  5. Deep technical preparation → alpha_search_deep_reads_expanded.md
  6. SOTA agent / benchmark gap → financial_sota_agent_survey.md
  7. Tomorrow's talk → HKU_30_SLIDE_DECK_2026-05-20.md first; then HKU_30_Slide_Baseline_Suite_Report_2026-05-20.pptx

10. Contact & Collaboration

Maintained by: Paul Weng (paulweng)

Related project: whatsapp_AI_Trading_Agent, a production crypto CTA system that provides the verification testbed referenced throughout this plan

Current status (2026-05-18):

  • Phase 0 (HKU talk this week)
  • Solo research, single-developer
  • Looking for academic collaboration to scale from "production system + research thinking" to "production system + research community + scaled compute"

This document is a living research plan. Section 7 roadmap will be updated weekly during Phase 1-2; quarterly thereafter.


"My production system already implements the philosophy of generator-verifier separation that AlphaProof articulated. The next step — turning a deterministic execution platform into a self-evolving research platform, and building the field's first unified benchmark — is exactly where I want to do my research."