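HKU Presentation: Focused Outline · 2026-05-20

Goal: narrow the content from a “complete project introduction” to a single, clearer research-presentation thread.
Recommended length: 12-15 minutes of presentation, plus discussion.
Core structure: what I did recently → what the review found → which baselines to try next.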

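0. One-sentence opening

I have recently done two kinds of work: first, I pushed a crypto trading agent to the point where it serves as a verifiable production research testbed; second, I systematically reviewed AI-driven alpha search and the financial-agent SOTA. My current judgment is that, rather than building a complex agent right away, it is better to first define a few reproducible baselines and see which approaches actually hold up under a strict crypto verifier.

Shorter version:

I recently completed an executable alpha verifier and a review of financial SOTA agents; the most realistic next research step is to build a set of baselines first, turning the alpha auto-search problem into comparable experiments.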

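1. Part 1: What I have actually done recently

Time: 3-4 minutes.
Purpose: cover only the engineering assets relevant to the research question; do not walk through the full product history.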

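1.1 Production crypto trading agent

Keep only three points:

The LLM never touches the OMS directly; it only produces a schema intent.
Before any order is placed, it must pass through a deterministic risk gate, button confirmation, the OMS, audit, and a kill switch.
The research value of this system is not “letting an LLM trade” but providing a hard verifier.

In one sentence: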

The system is useful for research because it enforces generator-verifier separation in a real execution setting.
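The separation described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `OrderIntent` and `RiskLimits` schemas, their field names, and the limit values are hypothetical and are not the production system's actual interfaces. The point is only that the generator emits a typed intent, and a deterministic function decides whether it may proceed.

```python
from dataclasses import dataclass

# Hypothetical intent schema: the only thing the generator (LLM) may emit.
@dataclass(frozen=True)
class OrderIntent:
    symbol: str
    side: str            # "buy" or "sell"
    notional_usd: float

# Hypothetical limits; real values would come from the risk configuration.
@dataclass(frozen=True)
class RiskLimits:
    allowed_symbols: frozenset
    max_notional_usd: float
    kill_switch_on: bool = False

def risk_gate(intent: OrderIntent, limits: RiskLimits) -> tuple:
    """Deterministic check: the same intent and limits always give the same verdict."""
    if limits.kill_switch_on:
        return False, "kill switch engaged"
    if intent.side not in ("buy", "sell"):
        return False, "unknown side"
    if intent.symbol not in limits.allowed_symbols:
        return False, "symbol not allowed"
    if not (0 < intent.notional_usd <= limits.max_notional_usd):
        return False, "notional out of bounds"
    return True, "ok"

limits = RiskLimits(frozenset({"BTCUSDT", "ETHUSDT"}), max_notional_usd=1000.0)
print(risk_gate(OrderIntent("BTCUSDT", "buy", 250.0), limits))
print(risk_gate(OrderIntent("DOGEUSDT", "buy", 250.0), limits))
```

Because the gate is a pure function of the intent and the limits, its verdicts can be replayed and audited independently of whatever model produced the intent.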

1.2 M8.6 walk-forward verifier

Keep the key numbers:

526 Binance USD-M perpetual symbols.
15s bar + microstructure features.
12-fold rolling walk-forward.
microstructure gate + adaptive state controller.

The point to get across:

This is not an isolated backtest but a verifier that can repeatedly screen, validate, downgrade, and reject candidate symbols/alphas.
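A rolling walk-forward split of the kind listed above can be sketched as follows. This is a generic illustration, not the M8.6 implementation: the function name and the window lengths (6 days of training, 2 days of testing, 30 days of data) are assumptions chosen only so that 12 folds fit exactly.

```python
def rolling_walk_forward(n_bars, n_folds, train_bars, test_bars):
    """Yield (train, test) index ranges in chronological order.

    Each test window starts exactly where its train window ends, and the
    whole window slides forward by test_bars from one fold to the next.
    """
    needed = train_bars + n_folds * test_bars
    if needed > n_bars:
        raise ValueError(f"need {needed} bars, only {n_bars} available")
    for k in range(n_folds):
        start = k * test_bars
        train = range(start, start + train_bars)
        test = range(start + train_bars, start + train_bars + test_bars)
        yield train, test

BARS_PER_DAY = 24 * 60 * 60 // 15   # number of 15s bars in one day
folds = list(rolling_walk_forward(
    n_bars=30 * BARS_PER_DAY,       # assumed 30 days of data
    n_folds=12,
    train_bars=6 * BARS_PER_DAY,    # assumed 6-day train window
    test_bars=2 * BARS_PER_DAY,     # assumed 2-day test window
))
for train, test in folds[:2]:
    print(train.start, train.stop, test.start, test.stop)
```

No test bar ever precedes a bar used to fit the same fold, which is the property that separates this from a shuffled cross-validation.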

1.3 The recent negative result

This is the research signal most worth presenting.

Experiment	Validation / Search	Chronological Test	Interpretation
LightGBM	+44.65 MTM	-86.87 MTM	validation signal did not transfer
Optuna TPE	+44.46 objective	-105.78 objective	search overfit the time slice

Conclusion:

The bottleneck is time-slice stability, not model capacity.

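In other words:

The current problem is not that the model is too weak, but that candidate rules are unstable across time slices.

This leads naturally to Prof. Guodong Li:

If time-slice instability is the core problem, then an explicit recurrence prior / statistical verification may matter more than blindly scaling up the model.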

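2. Part 2: What I reviewed

Time: 5-6 minutes.
Purpose: show that you are not arguing only from a personal project, but have done a structured mapping of the field, and that this review changed your proposal.

You can open like this:

To avoid mistaking my own production experience for a research gap, I did a three-layer review: general paradigms of AI discovery, financial alpha-mining / agent SOTA, and the statistical methodology of backtest overfitting. The final output is not a pile of citations but a baseline design decision.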

2.0 One-page overview of the review workload

Use a single slide to show the artifacts you produced, rather than reading through the papers one by one.

Review Artifact	Coverage	Role
`alpha_search_baselines.md`	FunSearch / AlphaProof / AlphaEvolve / AI Scientist / ASI-ARCH	Abstracts out generator-verifier, cognition base, compute-scaled discovery
`alpha_search_survey_taxonomy_and_bibliography.md`	8-tradition systematic survey	Places alpha auto-search in a more complete research lineage
`alpha_search_deep_reads_expanded.md`	PBO/DSR, time-series foundation models, AlphaEval	Pulls the proposal back from an “agent idea” to rigorous evaluation
`financial_sota_agent_survey.md`	SOTA such as AlphaBench, RD-Agent(Q), Hubble, FactorMiner, CogAlpha	Clarifies which claims cannot be made and which gaps still hold
`crypto_alpha_bench_risk_analysis.md`	7 counter-argument risks + ETH validation experiment	Turns the plan into an experiment that can be refuted and can converge

In one sentence:

After this review, I narrowed the original, fairly broad “build an alpha agent” into the more solid “build a baseline suite / verifier benchmark first”.

2.1 AI for Science / AI discovery review

This layer answers:

Why can alpha search borrow from AI-for-science, rather than being just financial engineering?

Representative work:

FunSearch.
AlphaProof / AlphaGeometry.
AlphaEvolve.
AI Scientist.
ASI-ARCH.

The common patterns I extracted from this work:

Generator-verifier separation.
Structured cognition base / domain prior.
Multi-agent decomposition.
Compute-scaled discovery.

Mapping to my project:

Pattern	My current status
Generator-verifier separation	Already implemented in the production system
Hard verifier	Risk gate / OMS / walk-forward verifier already in place
Cognition base	Not yet
Researcher agent	Not yet
Compute-scaled discovery	Not yet

In one sentence:

I am not starting from agent architecture but from the verifier, which is aligned with the pattern behind successful AI discovery systems.

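2.2 Financial SOTA agent / alpha mining review

This layer answers:

How far has the existing financial agent / alpha-mining SOTA already gone? What can I still claim?

Acknowledge up front: the field is already very close, so you cannot say nobody has done this.

Systems that must be mentioned: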

Category	Examples	What they cover
Formula alpha benchmark	AlphaBench, AlphaEval	formula generation / evaluation
Quant R&D agents	RD-Agent(Q), Beyond Prompting	hypothesis → code → backtest
Alpha search methods	QuantaAlpha, AlphaAgent, Alpha Jungle, AlphaSAGE	LLM search / MCTS / GFlowNet
Safety / memory agents	Hubble, FactorMiner, CogAlpha	sandbox, memory, code evolution
Trading decision agents	TradingAgents, QuantAgent, FinMem	trading action, not alpha benchmark

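In the survey I did not just classify by paper name; I extracted along benchmark-relevant dimensions:

Extraction Dimension	Why it matters for my proposal
Search unit	Formula, factor, code, trajectory, and trading action are different problems
Generator	LLM prompting, evolution, MCTS, GFlowNet, and multi-agent loops should not be conflated
Verifier	Qlib backtest, factor metric, real-market PnL, and sandbox execution differ in strength
Data / market	Claims from equity, A-share, crypto, and multi-asset settings do not transfer directly
Cost / fill	Whether transaction cost, depth, partial fill, and tradability are handled explicitly
Multiple testing	Whether DSR/PBO/null search is reported determines how credible the results are
Compute control	Whether LLM tokens, candidate count, and wall-clock time are compared fairly
Reproducibility	Whether there are fixed data, a fixed protocol, and reproducible experiments

Key judgment: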

The claim is not “nobody has done alpha agents.” The claim is that current work does not jointly evaluate executable crypto alpha under cost, fill, statistical, and compute constraints.

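In other words:

The point is not that nobody has built alpha agents; it is that most existing work stops at formula quality, backtest returns, or trading decisions, and has not yet put crypto executability, cost, fill, statistical correction, and compute budget into a single benchmark protocol.

Here you can add a self-correction that better conveys the amount of work done:

After the review I deliberately dropped a broader but shakier claim: “the field has no alpha benchmark.” The more accurate claim now is: “the field lacks an executable crypto alpha-search benchmark under cost/fill/statistical/compute constraints.”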

2.3 Backtest overfitting / statistical rigor review

This layer answers:

Why can't a baseline report only Sharpe, IC, or MTM?

Review thread:

PBO / CSCV.
Deflated Sharpe Ratio.
Harvey-Liu-Zhu multiple testing.
Hou-Xue-Zhang anomaly replication.

Implications for you:

Any LLM / Optuna / GP / agent search amounts to large-scale multiple testing. Reporting only Sharpe or IC is not enough; DSR, PBO, and a null search baseline must be reported.

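How this connects to your recent project:

My recent result	Statistical interpretation
LightGBM val +44.65 → test -86.87	Typical time-slice instability / overfit warning
Optuna search +44.46 → test -105.78	The search process itself creates multiple-testing pressure
M8.6 adaptive gate	A production screening protocol, but not yet an academic-grade proof of significance

In one sentence:

This statistical review made me realize that the goal of the baseline suite is not to find the highest PnL but to measure whether a search method still has a reproducible edge after multiple-testing correction.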
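To make the multiple-testing point concrete, here is a minimal sketch of the Deflated Sharpe Ratio following the Bailey and López de Prado formulation. The function names and the example numbers are assumptions for illustration, not results from this project. `sr` is the per-observation (non-annualized) Sharpe of the selected candidate, `sr_variance` is the variance of Sharpe estimates across all trials, and `kurt` is raw kurtosis (3 for a normal distribution).

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
STD_NORMAL = NormalDist()

def expected_max_sharpe(n_trials, sr_variance):
    """Approximate expected best Sharpe among n_trials candidates with no true skill."""
    if n_trials < 2:
        return 0.0
    a = STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(sr_variance) * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)

def deflated_sharpe(sr, n_obs, n_trials, sr_variance, skew=0.0, kurt=3.0):
    """Probability that the true Sharpe exceeds the expected maximum under the null."""
    sr0 = expected_max_sharpe(n_trials, sr_variance)
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr * sr)
    z = (sr - sr0) * math.sqrt(n_obs - 1) / denom
    return STD_NORMAL.cdf(z)

# Illustrative numbers only: the same observed Sharpe, judged as a single
# pre-registered trial versus the best of 500 searched candidates.
single = deflated_sharpe(sr=0.10, n_obs=1000, n_trials=1, sr_variance=0.0025)
searched = deflated_sharpe(sr=0.10, n_obs=1000, n_trials=500, sr_variance=0.0025)
print(single, searched)
```

The observed Sharpe is identical in both calls; only the number of trials changes, which is why the candidate count has to be logged for every search baseline.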

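2.4 How the proposal changed after the review

This slide is important because it shows that the review was not decoration; it changed your judgment.

Before review	After review
Wanted to build a self-evolving alpha agent	Build a baseline suite / verifier benchmark first
Might have claimed “the field has no benchmark”	Changed to “an executable crypto alpha benchmark is missing”
Focus on LLM agent architecture	Focus on fixed verifier + comparable baselines
Looked only at walk-forward MTM	Added DSR / PBO / null search / compute budget
Went straight for complex multi-agent	First compare GP, LLM formula search, and a minimal agent loop

Transition to Part 3:

So I do not want to build a big agent right away. I want to build a set of baselines first, reducing the problem to an experiment that a professor can assess and a reviewer can reproduce.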

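3. Part 3: The next step is not a big agent but a baseline suite first

Time: 5-6 minutes.
Purpose: compress the proposal from a “big benchmark” into a small experimental plan that is easy for a professor to evaluate.

3.1 Baseline A · Production verifier baseline

Name: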

M8.6 Tradability Gate Baseline

Input:

15s OHLCV.
spread / depth / best bid-ask notional / sample count.
current adaptive state.

Output:

Whether the symbol is tradable.
shadow / probation / tradable / cooldown.
expected cost-adjusted outcome.

Significance:

This is my existing hand-engineered production baseline. Every learning/search baseline must at least beat it, not just beat random.

Why it matters:

It reflects real execution constraints.
It can distinguish paper alpha from executable alpha.
It is a distinctive asset I have relative to other academic teams.

3.2 Baseline B · Classical symbolic search baseline

Name:

GP / AlphaGen-style Expression Search

Approach:

Define a small DSL: returns, volatility, volume, spread, depth, funding, and cross-symbol primitives.
Use genetic programming / evolutionary search to generate expressions.
Fix the candidate budget.
Evaluate on the same verifier.
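The fixed-budget idea can be sketched with a deliberately simpler search than GP: plain random sampling of expression trees, which would also serve as a floor that a real GP run should beat. All names here are hypothetical, cross-symbol primitives are omitted, and `score_fn` is a placeholder for the shared verifier; the toy scorer below only evaluates the expression on one made-up row.

```python
import random

# Primitive names follow the DSL listed above; cross-symbol primitives are omitted.
PRIMITIVES = ("returns", "volatility", "volume", "spread", "depth", "funding")
BINARY_OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}

def random_expr(rng, depth):
    """Sample an expression tree: a primitive name or a tuple (op, left, right)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(PRIMITIVES)
    op = rng.choice(sorted(BINARY_OPS))
    return (op, random_expr(rng, depth - 1), random_expr(rng, depth - 1))

def evaluate(expr, row):
    """Evaluate an expression tree on one row of feature values."""
    if isinstance(expr, str):
        return row[expr]
    op, left, right = expr
    return BINARY_OPS[op](evaluate(left, row), evaluate(right, row))

def budgeted_search(score_fn, budget, seed=0, max_depth=3):
    """Score exactly `budget` candidates and return the best (expr, score)."""
    rng = random.Random(seed)
    best_expr, best_score = None, float("-inf")
    for _ in range(budget):
        expr = random_expr(rng, max_depth)
        score = score_fn(expr)
        if score > best_score:
            best_expr, best_score = expr, score
    return best_expr, best_score

# Toy scorer standing in for the verifier; the row values are made up.
TOY_ROW = {"returns": 0.01, "volatility": 0.2, "volume": 1.5,
           "spread": 0.05, "depth": 3.0, "funding": -0.01}
best_expr, best_score = budgeted_search(lambda e: evaluate(e, TOY_ROW), budget=200)
print(best_expr, best_score)
```

Because the budget is counted in verifier calls and the seed is explicit, two search methods can be compared on equal candidate counts and rerun exactly.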

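Question to answer:

In a crypto microstructure setting, how far can non-LLM symbolic search get?

Why do it:

This is the cleanest non-LLM baseline.
If the LLM agent cannot even beat it, the agent story becomes weaker.
If it performs well, that indicates the benchmark carries some basic signal.

3.3 Baseline C · AlphaBench-style LLM formula search

Name: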

LLM Formula Search under Fixed Budget

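Approach:

Give the LLM a fixed primitive list and expression grammar.
Generate formula alphas.
Compare several prompting/search strategies:
direct generation;
Chain-of-Experience;
Tree-of-Thought;
evolutionary refinement.
Enforce the same token budget / candidate budget.

Question to answer:

Once AlphaBench-style formula mining is moved onto a crypto executable verifier, does it keep its advantage?

Why do it:

It connects directly to the current SOTA.
It does not require a complex multi-agent system at the outset.
It can quickly yield a publishable baseline table.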

3.4 Baseline D · Agentic R&D baseline

Name:

Minimal RD-Agent-style Research Loop

Minimal version:

Researcher proposes hypothesis.
Engineer writes formula/code.
Verifier runs walk-forward evaluation.
Analyst summarizes failure and suggests next mutation.

Keep the first version simple:

No full autonomous system.
No connection to live execution.
The agent is not allowed to modify the verifier.
It may only iterate within a fixed search space.

Question to answer:

Under the same compute budget, is a multi-step agent loop really better than simple LLM formula search?
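The four roles and the first-version constraints can be sketched as a plain loop. This is an illustrative skeleton under stated assumptions, not RD-Agent's actual design: the role callables, the candidate names in `SEARCH_SPACE`, and the toy scores are all hypothetical, and the stub roles below stand in for LLM calls and the real walk-forward verifier.

```python
# Hypothetical fixed search space; proposals outside it never reach the verifier.
SEARCH_SPACE = ("momentum_1m", "momentum_5m", "spread_reversion", "depth_imbalance")

def research_loop(propose, implement, verify, analyze, budget):
    """Run `budget` iterations of propose -> implement -> verify -> analyze.

    The verifier is passed in and only ever called, never modified, and every
    iteration consumes budget even when the proposal is rejected.
    """
    history, feedback = [], None
    for _ in range(budget):
        hypothesis = propose(feedback, history)
        if hypothesis not in SEARCH_SPACE:
            feedback = "rejected: outside the fixed search space"
            history.append((hypothesis, None, feedback))
            continue
        candidate = implement(hypothesis)
        score = verify(candidate)
        feedback = analyze(hypothesis, score)
        history.append((hypothesis, score, feedback))
    return history

# Stub roles so the loop runs end to end; the scores are made-up numbers.
TOY_SCORES = {"momentum_1m": -0.4, "momentum_5m": 0.1,
              "spread_reversion": 0.3, "depth_imbalance": -0.2}

def propose(feedback, history):
    tried = {h for h, _, _ in history}
    untried = [name for name in SEARCH_SPACE if name not in tried]
    return untried[0] if untried else SEARCH_SPACE[0]

def implement(hypothesis):
    return {"rule": hypothesis}

def verify(candidate):
    return TOY_SCORES[candidate["rule"]]

def analyze(hypothesis, score):
    verdict = "keep" if score > 0 else "mutate"
    return f"{hypothesis} scored {score:+.2f}: {verdict}"

history = research_loop(propose, implement, verify, analyze, budget=3)
for entry in history:
    print(entry)
```

Setting the iteration budget equal to the candidate budget of the formula-search baseline gives one simple way to hold compute constant when comparing the two.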

3.5 Baseline E · Human expert / assisted baseline

Name:

Human Expert vs LLM-assisted vs Autonomous Search

Do this only if there is enough cooperation from traders.

Three tracks:

Track	Input	Output
Human-only	expert watches market / gives rule	candidate rule
LLM-assisted	expert gives intuition, LLM formalizes	formula / code
Autonomous	agent sees data only	formula / code

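Question to answer:

Is the LLM most valuable as a replacement for the expert, or as a way to formalize the expert's tacit knowledge?

This baseline is a differentiating highlight, but it is not required for v0.

4. Recommended minimal v0 experiment

If a professor asks “what exactly will you do next?”, give this answer.

v0 scope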

Universe:

BTC / ETH / SOL, or the top-20 Binance USD-M perps.

Frequency:

15s / 1m; no tick data at first.

Period:

fixed historical window + untouched chronological holdout.

Baselines:

Random / naive momentum.
M8.6 tradability gate.
GP expression search.
LLM formula search.
Minimal agentic loop.

Metrics:

cost-adjusted MTM.
hit rate / stop-loss / timeout distribution.
turnover / cost sensitivity.
DSR.
PBO / CSCV if sample size allows.
compute budget.

Minimal goal:

Produce one clean baseline table showing which search method survives the same executable verifier.

5. Recommended slide structure

This version fits in 8-10 slides and is more focused than the previous one.

#	Slide	Main claim
1	Title	From recent project to baseline suite for crypto alpha search
2	What I recently built	I built the verifier side, not just a trading interface
3	Negative result	Time-slice stability is the bottleneck
4	What I reviewed	AI discovery and financial SOTA point to generator-verifier + benchmark discipline
5	SOTA landscape	Existing systems are close, so the claim must be narrower
6	Evaluation principle	Alpha search is multiple testing; DSR/PBO/null search must be included
7	Baseline suite	A-E baseline options under one verifier
8	Minimal v0	Start with small universe, fixed budget, clean table
9	HKU fit	Prof. Li: time-series/statistics; Prof. Han: open-world reliability
10	Ask	Which baseline should I prioritize first?

6. Final ask for tomorrow's presentation

Do not ask:

Do you think my whole project is good?

Ask instead:

Among these baseline directions, which one would make the strongest first research artifact?

More specifically:

Should I start from M8.6 + GP + LLM formula search as the v0 benchmark table?
Is the agentic loop baseline worth including immediately, or should it be v1?
For Prof. Li: should the first method paper focus on time-slice stability / recurrence prior?
For Prof. Han: is the open-world LLM-agent safety framing credible, or too far from vision/embodied AI?

7. Closing line

I am not trying to claim a finished benchmark tomorrow. I want to start with a small, rigorous baseline suite and use it to decide which alpha-search direction is real.

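In other words:

Tomorrow I am not trying to claim that a complete benchmark is already established; I want to propose a small but rigorous set of baselines and use it to judge which alpha-search directions are really worth scaling up.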