From Production AI Trading Agent to Self-Evolving Research

A Path Through the AlphaGo Moment for Architecture Discovery

Audience: Prof. Kai Han (HKU CDS / Visual AI Lab · open-world learning, foundation models, generative AI) and Prof. Guodong Li (HKU SAAS · time series, financial econometrics, high-dim ML)
Duration: 25-30 min talk + 15-20 min Q&A
Language: primarily Chinese, with key terms kept in English
Purpose: a research presentation with an interview character, showing (1) existing product-grade engineering ability, (2) independent research judgment, and (3) a concrete agenda for work in a research environment

Part A · Talk Script

0. Opening (~2 min · slides 1-2)

EN opener (for English-speaking evaluators in the room):

"Thank you Prof. Han, Prof. Li, for taking the time. I'm Paul. For the past year I've been building a production-grade AI trading agent for the crypto futures market. I'm here today to talk about three things: what I built, what recent research in the style of "AlphaGo Moment for Model Architecture Discovery" is teaching us, and three specific research questions I'd want to work on if I joined a group like yours."

Hello, professors. I'm Paul. Over the past year or so I have single-handedly taken a production-grade AI trading system for crypto futures through milestones M0 to M8.6. It is now in the AWS deployment stage, and Phase 1.5 manual execution already runs end to end on Binance USD-M testnet and live. Today I want to cover three things as a three-act play:

Act 1: what I built in this system, and why I designed it this way;
Act 2: what the wave of AI for Science / AI for Math work from FunSearch to ASI-ARCH means for what I am doing;
Act 3: building on the existing system plus a full 8-tradition survey of the field, I propose one concrete research contribution, Crypto-Alpha-Bench, with 3 RQs as its use cases, and I hope to get directional feedback from both of you.

The thesis of the whole talk in one sentence:

My production system already implements the philosophy of "generator-verifier separation" that AlphaProof articulated. Reading the 8-tradition survey, I found that the field of alpha auto search has no agreed-upon baseline, and that this is the hidden prerequisite blocking all other progress. My concrete proposal is to build it.

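1. Act 1: What I Built (~10 min · slides 3-13)

1.1 User Profile and Core Design Judgment (slides 3-4 · 2 min)

This project is neither a research terminal for quant engineers nor a trading app for ordinary retail users. The user is a veteran trader with very strong intuition and very low tolerance for IT operations. His alpha comes from his feel for the market and his judgment, and it should not be drained by APIs, terminals, or complicated menus. The system's role is a "trading attention protection layer over WhatsApp / Telegram": it translates, fills in gaps, validates, applies risk control, executes, records, and reminds.

From this user profile I derived the project's most central design judgment, which is also the basis for everything that follows in this talk:

The LLM never places orders directly and never reaches the OMS directly. It lives in the NLP layer only, never on the decision path of any real-money action.

This is not a slogan. It is a product commitment backed by twelve hard rules and a full set of infrastructure.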

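1.2 Safety Hard-Rule Stack (slide 5 · 2 min)

A few of the key ones:

LLM output must pass through schema-validated Intent → deterministic risk gate → button confirmation → OMS. Any code path that lets LLM output reach the OMS directly is a P0 issue.
The risk engine is purely rule-driven and never calls an LLM. Each RiskRule is written as a pure function plus a serializable dataclass (in preparation for compiling it directly into the Go execution gateway when Phase 3 high-frequency strategies go live).
Audit is append-only: a Postgres trigger physically forbids DELETE.
API keys go through the KmsProvider abstraction: any os.environ.get("BINANCE_*_API_KEY") in business code is rejected at review as P1.
Four-level kill switch: user / account / exchange / global. Triggering depends neither on the LLM nor on the messenger being online.
Client order IDs must be idempotent: client_order_id is deterministically bound to trace_id, so restarts, disconnects, and retransmissions never send a new order.
Reduce-only enforcement: any order produced by "close X%" or "close all" must carry reduce_only=True.
Risk caps take the strictest of three layers: user / RiskTier / instrument, combined with min rather than letting the user setting take priority.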
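
A minimal sketch of how three of these rules (strictest-of-three cap, reduce-only enforcement, deterministic client_order_id) could look as pure functions over a serializable dataclass. The class name `CapRule`, its field names, the `cta-` prefix, and the hashing scheme are all hypothetical illustrations, not the production code:

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CapRule:
    """Serializable cap configuration; field names are hypothetical."""
    user_cap_usdt: float
    tier_cap_usdt: float
    instrument_cap_usdt: float


def effective_cap(rule: CapRule) -> float:
    # Strictest of the three layers wins; the user setting gets no priority.
    return min(rule.user_cap_usdt, rule.tier_cap_usdt, rule.instrument_cap_usdt)


def client_order_id(trace_id: str, leg: int = 0) -> str:
    # The same (trace_id, leg) always yields the same ID, so a retry after a
    # restart or disconnect reuses the ID instead of minting a new order.
    digest = hashlib.sha256(f"{trace_id}:{leg}".encode()).hexdigest()
    return f"cta-{digest[:24]}"


def check_order(rule: CapRule, notional_usdt: float,
                is_close: bool, reduce_only: bool) -> Tuple[bool, str]:
    # Pure function: no I/O and no LLM call; output depends only on inputs.
    if is_close and not reduce_only:
        return False, "close orders must set reduce_only=True"
    if notional_usdt > effective_cap(rule):
        return False, "notional exceeds strictest cap"
    return True, "ok"
```

Because `check_order` has no side effects, the same inputs always give the same verdict, which is what makes such a rule easy to audit and, in principle, to port to another runtime.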

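The significance of this stack is that when I later read about AlphaProof's use of the Lean kernel as an absolutely trusted verifier, I realized:

In my production system I had already implemented a Lean kernel for the financial domain. It is not mathematically infallible, but it makes the same engineering commitment: every irreversible action must pass through it, and its verdicts are never touched by the LLM.

This point is the foundation of the research agenda in Act 3.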

1.3 Delivered Stack (slide 6 · 1 min · quick pass)

M0-M8.6 are all delivered (ruff and mypy strict fully green, full pytest suite passing):

Domain model: Instrument / Universe / RiskTier / UserTradingProfile / Sub-account / Organization / Multi-tenant / Intent / Order / Fill / Position / AuditEvent
OMS: state machine + four-level KillSwitch + ExchangeAdapterRegistry multi-exchange routing (Binance + Bitget skeleton)
Live trading: Binance USD-M Futures live adapter + listenKey user stream + REST drift detection reconciliation
Telegram ingress: MessengerAdapter Protocol + bounded group dispatch (≤4 people) + message_id deduplication + KmsProvider abstraction (keeping the interface open for the incremental M11 WhatsApp adapter)
Observability: OpenTelemetry + 11 Prometheus metrics
Manual execution app (Phase 1.5): runs end to end on Binance testnet; live requires a double switch plus a small-notional cap

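1.4 LLM L1-L6 Agent Stack (slides 7-8 · 2 min)

This is the part that corresponds most directly to the frontier papers discussed later today:

Layer	Function	Corresponding role in frontier papers
L1	M8-lite replay narrative	Early form of AI Scientist's "automatic writeup"
L2	NL → backtest parameters → preview card → run	Simplified version of ASI-ARCH's Researcher → Engineer chain
L3	Conversational agent with 9 read-only tools	Classic ReAct / tool-calling form
L4	One-sentence LLM explanation for yellow rows on Top-N replay cards	Early form of an Analyst Agent
L5	Daily report narrative	Industrial version of AI Scientist's automatic writeup
L6	Multi-round backtest iteration (the previous request automatically becomes the next defaults)	A "1.5-layer" version of Agentic Tree Search

The whole stack goes through the LLMProvider abstraction: prompt caching (three cached blocks: system / user profile / universe, targeting a cache hit rate ≥80%), per-task timeout, circuit breaker, fallback chain (primary → secondary vendor → TemplateProvider), and a per-user budget guard. LLM scope is strictly limited to four categories: factual data on instruments within the universe, the user's system state, system operations, and plain-language explanations of order rejections. Any evaluation, teaching, prediction, or small talk is refused, and the user is directed to contact the assistant.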
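
To make the fallback chain concrete, here is a minimal sketch of primary → secondary → template fallback with a per-vendor circuit breaker. All names (`CircuitBreaker`, `complete`, `template_provider`) and the failure threshold are assumptions for illustration; timeouts, caching, and budget guards are left out:

```python
from typing import Callable, List, Tuple

Provider = Callable[[str], str]


class CircuitBreaker:
    """Opens after max_failures consecutive failures (threshold is illustrative)."""

    def __init__(self, max_failures: int = 3) -> None:
        self.max_failures = max_failures
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.max_failures

    def record(self, ok: bool) -> None:
        # A success resets the counter; a failure increments it.
        self.failures = 0 if ok else self.failures + 1


def template_provider(prompt: str) -> str:
    # Deterministic last resort; never calls a model.
    return "The assistant is temporarily unavailable. Please try again later."


def complete(prompt: str, chain: List[Tuple[Provider, CircuitBreaker]]) -> str:
    for provider, breaker in chain:
        if breaker.is_open:
            continue  # skip a vendor that keeps failing
        try:
            reply = provider(prompt)
        except Exception:
            breaker.record(ok=False)
            continue
        breaker.record(ok=True)
        return reply
    return template_provider(prompt)
```

The point of ending the chain in a template is that the user always gets a deterministic answer, even when every vendor is down.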

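1.5 M8.6 Rolling Backtest System: key slides (slides 9-11 · 3 min · the part Prof. Li will care about most)

This is the part of the project closest to "research", and the part I most want to discuss with Prof. Li today.

Task setup: across all 526 symbols of the Binance USD-M perpetuals market, find the symbols that are currently "easy to trade" on the 15s timeframe, meaning symbols where a passive maker buy at -0.5% paired with a take-profit sell at +1% can enter and exit cleanly, and use that as the tradability gate for a production auto strategy.

What I built is not a single backtest but a walk-forward verification pipeline:

Data: locally collected 15s OHLCV for all perps, 526 symbols × ≥480 unique 15s bars each
Grid: 4 offsets (0.3% / 0.5% / 0.8% / 1.2%) × 12-fold rolling window
Microstructure gate: hard filter on 5 dimensions: spread / depth5 notional / best bid/ask notional / one-sided / sample count
Three-way verification comparison: OHLCV-only / tradable-only / allow-shadow-scaled (a compromise mode I designed later)

Findings: the pure OHLCV gate admits all 30 folds but is hit hard by stop-losses, with MTM of -17 to -22 USDT. Pure tradable-only is too strict and leaves only 1 fold. The allow-shadow-scaled mode permits running at 0.5x size with an extra 0.003 offset as long as the only failures are thin best bid/ask notional or thin depth5. It narrows the 30 folds to 10, with MTM of -0.13 to +0.58 USDT, cutting execution risk while keeping the reasonable opportunities in thin books.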
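
The allow-shadow-scaled rule can be sketched as a small pure function. The check names (`thin_best_notional`, `thin_depth5`, and so on) and the `GateDecision` type are hypothetical labels; only the 0.5x size and 0.003 extra offset come from the description above:

```python
from dataclasses import dataclass
from typing import AbstractSet

# Failures that are tolerated at reduced size (names are hypothetical).
SOFT_FAILS = frozenset({"thin_best_notional", "thin_depth5"})


@dataclass(frozen=True)
class GateDecision:
    allowed: bool
    size_mult: float
    extra_offset: float


def shadow_scaled_gate(failed_checks: AbstractSet[str]) -> GateDecision:
    if not failed_checks:
        # Every microstructure check passed: trade at full size.
        return GateDecision(True, 1.0, 0.0)
    if failed_checks <= SOFT_FAILS:
        # Only thin-book failures: run at half size with a wider offset.
        return GateDecision(True, 0.5, 0.003)
    # Any other failure (spread, one-sided book, too few samples) blocks the trade.
    return GateDecision(False, 0.0, 0.0)
```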

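Adaptive State Controller (slide 11 · key design):

I subsequently designed a controller that explicitly separates statistical confidence from the actual execution size:

State	Trigger condition	Size multiplier	Offset widening
shadow	Default entry	0.25x	+0.003
probation	1 clean live-grade read	0.5x	+0.0015
tradable	3 consecutive clean reads / 2 clean out of 4	1.0x	0
cooldown / avoid	Failure signal	0 (no execution)	-

The essence of this design is to decouple "how deeply should I trust this symbol" from "how much size should I commit right now". It is in the spirit of Bayesian sequential analysis, but implemented as rules in production code.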
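
One possible reading of the state table as code, offered as a sketch only. Two things are my assumptions rather than facts from the table: that the tradable condition means "the last 3 reads are all clean, or at least 2 of the last 4 are clean", and that a failure signal is a separate flag rather than a non-clean read. The function name `next_state` is also hypothetical:

```python
from typing import Dict, Sequence, Tuple

# state -> (size multiplier, extra offset); cooldown never executes,
# so its offset is unused and set to 0.0 here.
SIZING: Dict[str, Tuple[float, float]] = {
    "shadow": (0.25, 0.003),
    "probation": (0.5, 0.0015),
    "tradable": (1.0, 0.0),
    "cooldown": (0.0, 0.0),
}


def next_state(reads: Sequence[bool], failure: bool = False) -> str:
    """reads: clean / not-clean flags for live-grade reads, oldest first."""
    if failure:
        return "cooldown"
    last3, last4 = reads[-3:], reads[-4:]
    if len(last3) == 3 and all(last3):
        return "tradable"
    if len(last4) == 4 and sum(last4) >= 2:
        return "tradable"
    if any(reads):
        return "probation"
    return "shadow"
```

Keeping sizing in a lookup table separate from the transition logic mirrors the stated design goal: confidence in a symbol and committed size are decided in two different places.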

Latest staged adaptive walk-forward run: under the 30m timeout variant, 16 selected, T/P/S = 2/4/10, 8 positive, 0 stop-loss, 6 timeout, MTM +4.41 USDT (after deducting 0.45 in fees).

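1.6 Negative Result: A Research-Grade Finding (slide 12 · 2 min)

This is a slide I especially want to discuss with Prof. Li.

After finishing the walk-forward work I asked: can ML learn this tradability gate? I ran two sets of experiments:

LightGBM experiment: chronological train / val / test split (not a random split), with OHLCV + spread + depth + bookTicker as features. With threshold 0.36 selected on validation, validation MTM was +44.65; the same threshold on test gave MTM -86.87.

Optuna TPE experiment: 500 trials of automatic tuning for an interpretable rule. The train + search objective was +44.46; the chronological test objective was -105.78, because the selected holdout rows contained too many stop-losses and timeouts.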
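
The evaluation protocol behind these numbers (split by time, pick the threshold on validation only, then apply it unchanged to test) can be sketched as follows. The row layout, function names, and split fractions are illustrative assumptions, not the project's actual code:

```python
from typing import List, Sequence, Tuple

Row = Tuple[int, float, float]  # (timestamp, model score, realized PnL)


def chrono_split(rows: Sequence[Row], train: float = 0.6,
                 val: float = 0.2) -> Tuple[List[Row], List[Row], List[Row]]:
    ordered = sorted(rows, key=lambda r: r[0])  # order by time, never shuffle
    i = round(len(ordered) * train)
    j = i + round(len(ordered) * val)
    return ordered[:i], ordered[i:j], ordered[j:]


def mtm(rows: Sequence[Row], threshold: float) -> float:
    # Total PnL of the rows the gate would have traded.
    return sum(pnl for _, score, pnl in rows if score >= threshold)


def pick_threshold(val_rows: Sequence[Row], grid: Sequence[float]) -> float:
    # The threshold is chosen on validation only; test rows are never consulted.
    return max(grid, key=lambda t: mtm(val_rows, t))
```

A sign flip between `mtm(val_rows, t)` and `mtm(test_rows, t)` for the same `t` is the pattern reported above.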

My conclusion is not "ML is useless in finance"; that would be a cop-out. My diagnosis is:

The bottleneck is time-slice stability, not model capacity. The feature set has ranking signal, but the underlying decision boundary in the (return × microstructure × time) space drifts faster than any model can re-learn from chronological history.

This observation connects directly to the core idea of Prof. Li's ICLR 2023 paper, Encoding Recurrence into Transformers: injecting inductive bias into time-series models matters more than enlarging model capacity. In Act 3 I expand this into a research question.

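1.7 Self-Evolution Research Reference (slide 13 · 1 min)

After finishing M8.6, in May 2026 I wrote an internal document on my own initiative:

"Self-Evolution Research Reference for Crypto CTA Agent"

Its core thesis:

The project can support "self-evolution in the research process," but not "self-evolution in trade execution."

The document draws hard boundaries and identifies five failure modes in advance:

LLM-as-Judge closed loop: LLM generation + LLM judging = hill-climbing on the LLM's preference surface
LLM Prior Leakage: the LLM's training corpus has seen the CTA literature, so a "new" strategy may be mere rebranding
Benchmark Overfitting: the scheduler repeatedly queries the same backtest window and exploits noise
Agent Context Drift: communicating purely in text causes state drift
Demo Cherry-Picking: showing only the best results and ignoring the failure rate

What I especially want to stress is the timing: the ASI-ARCH paper appeared in July 2025, and in May 2026 I independently thought through and wrote down a design judgment from the same lineage. This is not chasing a trend after the fact; it is a parallel, independent research judgment.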

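2. Act 2: Where the Frontier Is (~8 min · slides 14-19)

2.1 Evolution Line (slide 14 · 1 min)

Over the past three years a clear line of evolution has emerged in AI for Science / AI for Math:

2023  FunSearch (Nature)          LLM + evolution + program search
2024  AlphaProof / AlphaGeometry  Neuro-symbolic + RL + formal verification
2025  AlphaEvolve                 Engineering-grade upgrade of FunSearch + Pareto multi-eval
2025  AI Scientist v2 (Sakana)    Full research loop + Agentic Tree Search
2025  ASI-ARCH (GAIR-NLP)         Multi-agent architecture discovery + Scaling Law

2.2 Common Patterns (slide 15 · 2 min)

I read four common patterns out of this line:

Generator-Verifier Separation: the LLM handles generation (creativity) and an independent verifier handles verification (guarantees). AlphaProof uses the Lean kernel as an absolutely trusted verifier. This is the fundamental precondition that lets all of this work scale.
Cognition Base / Knowledge Grounding: ASI-ARCH's real moat is not multi-agent orchestration but structuring 30 years of human architecture-design experience into a knowledge base, so that the Researcher Agent must ground each hypothesis in existing design principles before proposing it.
Multi-Agent Decomposition: ASI-ARCH splits research into three roles, Researcher / Engineer / Analyst, each with internal sub-agents (Planner / Code Checker / Deduplication / Debugger). The key insight is that separation of responsibilities matters more than model capability.
Scaling Law for Discovery: over 20,000 GPU hours and 1,773 experiments, ASI-ARCH reports a linear scaling law between the cumulative number of SOTA architectures and compute, suggesting that research output itself can be scaled with compute. This is the most consequential finding of this wave.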

2.3 Mapping onto My System (slide 16 · 2 min)

Mapping these four patterns onto my project gave a surprisingly tidy result:

Frontier Pattern	My System	Status
Generator-Verifier Separation	LLM L1-L6 → deterministic risk gate → OMS	✅ Implemented, production-grade
Hard Verifier	Postgres audit append-only + KMS + KillSwitch + Cap stack	✅ Implemented
Multi-evaluator Pareto	OHLCV / tradable / shadow / adaptive state (4 dimensions)	✅ Implemented
Walk-forward + microstructure gate	M8.6 12-fold rolling window	✅ Implemented
Cognition Base	-	❌ Not done at all
Researcher Agent	-	❌ Not done at all
Multi-agent collaboration	-	❌ Currently strict separation rather than collaboration
Compute-scaled discovery	Single-developer local scale	❌ No validation at scale

I have the Verification half; what is missing is the Generation/Discovery half.

2.4 To Prof. Han: A Special Form of Open-World Reliability (slide 17 · 1.5 min)

Prof. Han's lab states its mission as "build reliable AI systems for open-world use". What I want to claim is:

Crypto perpetual futures are an extreme instance of open-world learning, just in a different modality:

Open-world signature: non-stationary distribution, adversarial counterparties, unknown failure modes, costly mistakes
Open-world for vision = unseen object categories; open-world for trading = unseen market regimes / new symbol launches / unprecedented microstructure
The spatial intelligence work in Prof. Han's lab faces a similar challenge: an agent must act reliably in an open environment

My system provides an experimental platform for open-world reliability with real risk (real money), a hard verifier (financial reality), and LLM agents (L1-L6). This setting is rare in the vision field.

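2.5 To Prof. Li: Connecting Time-Slice Stability and Encoding Recurrence (slide 18 · 1.5 min)

The core argument of Prof. Li's ICLR 2023 paper Encoding Recurrence into Transformers is that designing an architectural inductive bias for time-series tasks is more effective than simply increasing Transformer capacity.

My failed LightGBM / Optuna experiments offer empirical evidence consistent with this:

On high-frequency crypto data, model capacity is not the bottleneck; the lack of the right inductive bias is.

Specific questions worth raising (expanded in Act 3):

My features are OHLCV + spread + depth + bookTicker. Is there a recursive structure among them? For example, transient bookTicker changes may be modulated by the medium-term OHLCV trend.
Existing deep time-series models (PatchTST, TimesNet, TimeMixer) do not explicitly handle the microstructure → price → microstructure feedback loop.
Can the idea of Encoding Recurrence be extended to financial microstructure data, by explicitly encoding the reactive nature of the order book as a form of recurrence?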

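2.6 Act 2 Summary (slide 19 · 30 sec)

The frontier and my system point at the same gap from different sides.

Frontier research approaches from the theory side: how do we make LLM agents reliable in the open world? How do we automate discovery?

My system approaches from the engineering side: how do we let LLM agents operate safely under real-money risk? How do we make walk-forward verification rigorous?

The piece both sides are missing is Researcher Agent + Cognition Base. That is Act 3.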

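3. Act 3: Research Agenda (~8 min · slides 20-26)

This act is about one concrete thing: a core research contribution I hope to make in a research environment. The three RQs follow naturally from it as its use cases.

3.0 Turning Point: The 3 RQs Share a Prerequisite (slide 20 · 30 sec)

Act 2 covered the four patterns my system shares with the frontier and the two connections to Prof. Han and Prof. Li. Based on these mappings I have three RQs in mind: architectural inductive bias (RQ1, closer to Prof. Li), open-world LLM safety (RQ2, closer to Prof. Han), and the causal role of the Cognition Base (RQ3, spanning both). But after reading a full systematic survey of alpha auto search across 8 traditions, I found that these three RQs share one enabling prerequisite, and that this prerequisite is itself the research contribution most worth doing first.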

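3.1 The Field Has No ImageNet Moment (slide 21 · 1.5 min)

Reading the entire 8-tradition survey, my strongest single observation is this:

Alpha auto search has no agreed-upon baseline. None.

Specific problems with the status quo:

Status quo	Consequence
Every paper defines its own evaluation	Methods are not comparable; "SOTA" is left to reviewer discretion
No fixed cost model	The same method differs severalfold between optimistic and pessimistic assumptions
No compute-controlled comparison	An LLM method's SOTA may come purely from a compute advantage
PBO / DSR almost never reported	The false discovery rate is unknown
No negative-control baseline	"I beat random" is treated as a legitimate contribution
Academic alpha and executable alpha not distinguished	High-IR but capacity-constrained factors are treated as winners

Strong claim: until a unified baseline exists, the causal "compute → discovery" relationship in alpha auto search cannot be verified. In other words, ASI-ARCH's Scaling Law for Scientific Discovery cannot be established in finance at all, not because the methods are wrong but because there is no fixed evaluation to measure against.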

3.2 Proposal: Crypto-Alpha-Bench (slides 22-24 · 4 min)

Title: Crypto-Alpha-Bench: A Unified Benchmark for Alpha Auto Search

My proposed concrete contribution is a unified benchmark for the alpha auto search field, modeled on the role ImageNet plays for vision, SWE-Bench for coding agents, and MMLU for LLM evaluation.

Six Requirements (slide 23):

Fixed public dataset: Binance USD-M top-200 perps × 2022-2025 × 1m/15s/tick, in HuggingFace Datasets format
Three-tier cost model: optimistic / realistic / pessimistic, with mandatory reporting of all three. This directly eliminates "cost-model picking" as a way to cheat.
Compute-controlled budget: fixed token / GPU hour / wall-clock budgets. LLM and GP compute differ by 1-2 orders of magnitude, so comparisons are unfair without control.
Mandatory reporting on 5+ evaluation dimensions: the 5 dimensions of AlphaEval (2025) + capacity + Deflated Sharpe + PBO (López de Prado's strict multiple-testing framework)
Synthetic ground-truth sub-tasks: generate synthetic data from a known true alpha and test whether a method recovers it, isolating "method capability" from "data luck"
Replication-aware must-beat baseline: JKP-verified anomalies as the floor (Hou-Xue-Zhang 2020 show that 82% of published anomalies fail strict multiple testing, so published claims cannot be used directly)
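
As an illustration of the three-tier cost requirement, the sketch below reports net PnL under every tier at once so that no single tier can be cherry-picked. The basis-point values are placeholders I made up for the example, and the linear cost-on-turnover model is a simplifying assumption; real tiers would need calibration:

```python
from typing import Dict

# Placeholder per-unit-turnover costs in basis points (not calibrated values).
COST_TIERS_BPS: Dict[str, float] = {
    "optimistic": 2.0,
    "realistic": 5.0,
    "pessimistic": 10.0,
}


def net_pnl(gross_pnl: float, turnover: float, cost_bps: float) -> float:
    # Simplified linear cost model: cost = turnover * bps / 10,000.
    return gross_pnl - turnover * cost_bps / 10_000.0


def report_all_tiers(gross_pnl: float, turnover: float) -> Dict[str, float]:
    # A submission reports every tier; there is no way to pick just one.
    return {tier: net_pnl(gross_pnl, turnover, bps)
            for tier, bps in COST_TIERS_BPS.items()}
```

With these placeholder tiers, a strategy with 100 USDT gross PnL on 200,000 USDT turnover nets 60, 0, and -100 USDT respectively, which shows how strongly a ranking can depend on the cost assumption.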

Reference Baselines I'd Ship (slide 24):

Baseline	Tradition	Role
Random Search	Control	Negative floor
JKP-Verified Anomaly Pool	Factor Zoo	Must-beat baseline
gplearn (default)	Classical GP	Tradition 1
FactorVAE	DL Factor	Tradition 2
AlphaAgent	LLM-driven	Tradition 3
Frozen-LLM-prompting	LLM	Naïve LLM control
My M8.6 Walk-forward + Adaptive State Controller	Tradability gate	Distinguish academic alpha vs executable alpha

The last row is the key: my existing walk-forward + microstructure gate + adaptive state controller, packaged as a standard tradability gate baseline. It forces every candidate alpha not only to score well on "predictive power" but also to clear a threshold on "real-world executability". To my knowledge, others lack this kind of production-grade tradability infrastructure, an engineering advantage I estimate would take them 6-12 months to reproduce.

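Strategic Positioning (slide 24, footer):

Why benchmark > method paper at this stage:

Reviewer-friendly: a method paper has to beat SOTA (subject to reviewer discretion), while a benchmark paper only needs a rigorous protocol. The NeurIPS Datasets & Benchmarks track and ICLR Benchmarks both offer dedicated venues
High leverage: every subsequent alpha auto search paper would have to cite or use the protocol; the impact of a single method paper does not compare
Aligned with my existing engineering advantage: I already have about 70% of the infrastructure
Publication vehicle for the three RQs: once the benchmark exists, the three RQs naturally become "establish a new SOTA or a negative result on the benchmark"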

3.3 The Three RQs as Benchmark Use Cases (slide 25 · 1.5 min)

Once the benchmark stands, each of my three RQs has a natural publication path:

RQ	Title	Form within the benchmark framework
RQ1	Encoding Microstructure Recurrence	On Crypto-Alpha-Bench, compare (i) tree models, (ii) PatchTST / Chronos / TimesFM, and (iii) my microstructure-recurrence-aware architecture, directly testing the thesis that architectural prior > scaling (connects to Prof. Li's ICLR 2023 Oral)
RQ2	Open-World LLM Agent Safety	Evaluate a conformal-prediction-based safety mechanism on the benchmark's distribution-shift sub-tasks, using the regime-shift sub-tasks to establish a first quantitative metric of open-world reliability (connects to Prof. Han's open-world theme)
RQ3	Cognition Base Causal Hypothesis	Same multi-agent framework with Cognition Bases of different quality (low / medium / high / replication-aware), comparing scaling curves on the benchmark to directly quantify the hypothesis that the Cognition Base is a causal driver of the scaling law

Key insight: the three RQs are upgraded from abstract research questions to "verifiable claims on a standard benchmark". This is the key conversion into an academic-grade contribution.

3.4 12-Week Roadmap (slide 26 · 1 min)

Phase	Week	Deliverable
Proposal sharpening	1-2	8-page benchmark proposal (based on today's feedback)
Dataset prep	3-5	Crypto-Alpha-Bench v0 dataset on HuggingFace
Protocol & metrics	4-6	Evaluation infra (5+ metrics, 3 cost tiers, DSR + PBO)
Reference baselines	5-8	6-7 reference baselines on benchmark
Synthetic ground-truth	6-9	Synthetic data generator + reference baselines on it
Public leaderboard	9-10	Launch v0 leaderboard
Benchmark paper	10-12	NeurIPS Datasets & Benchmarks submission draft

Parallel track (RQ1 preliminary): in weeks 4-12, run RQ1 experiments on the benchmark in parallel, as the "first use case" section of the benchmark paper. RQ2 + RQ3 are deferred to Phase 7+ (after the benchmark is established).

4. Wrap-up + Ask (~2 min · slides 27-28)

4.1 What I Built vs. What I Want to Build (slide 27)

What I built (over the past year, solo):
- Production-grade infrastructure for the Verification side: M0-M8.6 fully delivered, now in the AWS deployment stage
- A real testbed with real money, real LLM agents (L1-L6), real time-series data, real walk-forward verification
- An articulated philosophy on AI safety boundaries (self-evolution research reference, written May 2026, independently of ASI-ARCH)
- A systematic 8-tradition survey of alpha auto search, surfacing the field's missing baseline

What I want to build (with academic collaboration):
- Crypto-Alpha-Bench: the field's first unified benchmark, targeting a NeurIPS Datasets & Benchmarks submission
- RQ1 / RQ2 / RQ3 as benchmark use cases, anchored respectively in the expertise of Prof. Li, Prof. Han, and both
- Eventually: evolve the verification platform into a self-evolving research platform

4.2 What I'm Asking For (slide 28)

"I'm not asking for endorsement of what I've already done. I'm asking: of Crypto-Alpha-Bench and its three use cases, which part fits your group's direction, and how would you sharpen it?"

Broken down by ownership:

Prof. Han: the benchmark's open-world robustness / adversarial sub-module, and the conformal LLM safety work in RQ2
Prof. Li: the benchmark's statistical rigor (PBO / DSR / synthetic ground-truth), and validation of RQ1's architectural prior thesis
Both: overall ownership of the benchmark as cross-discipline research infrastructure

If the direction aligns, I can contribute my existing production system as a research testbed. This is something most students lack when they join a lab, and I have it; it is also the benchmark's core unique selling point.

Closing:

Thank you. I'll be happy to take questions, in Chinese or English.

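Part B · Slide Outline (25-30 slides)

Slide Design Principles

One idea per slide. Split complex content across 2-3 slides
Low visual density: the script carries the takeaways, the slides are anchors
Numbers / charts / comparison tables take priority over plain text
Key terms labeled in both English and Chinese, giving a possible English-speaking evaluator a hook
A speaker-note pointer in the bottom-right corner of each slide (the corresponding section number in the script)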

Full Slide List

#	Title	Core content	Visual suggestion	Speaker note section
1	Title slide	Project name + my name + date + one-sentence thesis	Clean, no logo clutter	§0
2	Talk Outline	Three-act structure + one-sentence thesis	Three-act diagram	§0
3	Project Context	User profile: very strong intuition + low IT tolerance; system role	User photo (anonymized silhouette)	§1.1
4	Core Design Judgment	"LLM never on decision path" in one sentence + outline of the 12 hard rules	One line in large type + list of 12	§1.1
5	Safety Stack	6-8 key rules expanded, marked production-grade	Table: rule / implementation / enforcement level	§1.2
6	Delivered Stack M0-M8.6	One-slide pass, with status marked (all green)	Horizontal timeline + module map	§1.3
7	LLM Agent Stack L1-L6	6 layers + the frontier role each corresponds to	Table + color coding	§1.4
8	LLM Provider Abstraction	Abstraction + caching + fallback + budget	Architecture diagram	§1.4
9	M8.6 Walk-Forward Setup	526 symbols × 4 offsets × 12 folds; microstructure gate	Data-flow diagram	§1.5
10	M8.6 Three-Gate Comparison	Comparison table of OHLCV vs tradable vs shadow-scaled	Three-column comparison table + MTM numbers	§1.5
11	Adaptive State Controller	5-state state machine + staged sizing	State-machine diagram + table	§1.5
12	Negative Result: Time-Slice Stability	LightGBM/Optuna numbers + diagnosis	Large-type title: Time-slice stability is the real bottleneck	§1.6
13	Self-Evolution Reference	One-sentence thesis + 5 failure modes	Document screenshot + 5 failure-mode cards	§1.7
14	Frontier Evolution	FunSearch → AlphaProof → AlphaEvolve → AI Scientist → ASI-ARCH	Timeline	§2.1
15	Four Common Patterns	4 patterns + one-sentence summary	2x2 grid	§2.2
16	Mapping: My System vs Frontier	8-row comparison table (✅ vs ❌)	Comparison table, red/green contrast	§2.3
17	To Prof. Han: Open-World Reliability	crypto = open-world for trading	Analogy diagram (vision OWL vs trading OWL)	§2.4
18	To Prof. Li: Time-Slice Stability ↔ Encoding Recurrence	Connection between your work and my experiments	One-line citation + my experiment numbers	§2.5
19	Act 2 Summary	"Frontier and my system point at the same gap"	One sentence + one summary figure	§2.6
20	Transition: 3 RQs Share a Prerequisite	30-second transition leading into the main line of Act 3	Clean, leads into the next slide	§3.0
21	The Field Has No ImageNet Moment	6-row status quo/consequence table + strong claim	Table + one line in large type	§3.1
22	Proposal: Crypto-Alpha-Bench	Title + motivation	Large type + one-sentence thesis	§3.2
23	Six Requirements	6 requirements	Grid of 6 cards	§3.2
24	Reference Baselines I'd Ship	7-row baseline table (including my M8.6 as the tradability gate)	Table, M8.6 row highlighted	§3.2
25	Three RQs as Benchmark Use Cases	3-row table: RQ × benchmark form × professor connection	Table	§3.3
26	12-Week Roadmap	Phase 1-7 timeline	Horizontal timeline	§3.4
27	What I Built vs What I Want to Build	Two-column comparison (done vs proposed)	Table	§4.1
28	What I'm Asking For	Ask split by benchmark + 3-RQ ownership	Large type + professor-ownership pairing	§4.2
29	Thank You / Q&A	Contact details + GitHub repo link	Clean	-

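Backup Slides (to pull up during Q&A)

#	Title	Use
B1	Detailed Risk Gate Architecture	If asked "how exactly is risk control implemented"
B2	Microstructure Gate 5 Dimensions	If asked how the microstructure gate is computed
B3	Reconciliation Algorithm	If asked how the OMS aligns with exchange state
B4	Why 15s, not other intervals	If asked why 15s was chosen
B5	LightGBM/Optuna detailed numbers table	If asked for details of the negative result
B6	Why ChatGPT / Claude not specific model	If asked about LLM selection
B7	What I'd do differently	If asked "what would you change if you redid it"
B8	5 Failure Modes in detail	If asked about the self-evolution reference
B9	RQ1 Detailed Approach	If asked about the concrete form of the architectural prior
B10	RQ2 Detailed Approach	If asked how conformal prediction is applied to LLMs
B11	RQ3 Cognition Base + Hou-Xue-Zhang	If asked how to handle the 82% failure rate of published anomalies
B12	Benchmark Cost Model: three-tier parameters	If asked how the concrete cost-model numbers are set
B13	Benchmark Compute Budget design	If asked how compute control is enforced
B14	Synthetic Ground-truth Generator design	If asked how the synthetic data is generated
B15	Existing Crypto Data Survey	If asked "Binance data is publicly available, so why is a new benchmark needed"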

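Part C · Q&A Predictions and Answer Prep

Questions Prof. Kai Han may ask (CV / open-world / foundation model perspective)

Q1: "You say crypto trading is an instance of open-world learning. But open-world in vision involves unseen visual categories. What exactly is the "unseen regime" you refer to in trading, and how do you operationalize it?"

A: Good question. My working definition is that a regime has four dimensions (volatility level × directional bias × microstructure thickness × correlation structure). An unseen regime is a region of this four-dimensional joint distribution that never appeared in the training data. It can be operationalized by (a) clustering historical regimes into a set of prototypes and using a new data point's distance to the nearest prototype as an OOD score; or (b) more rigorously, using conformal prediction to compute a prediction interval with a coverage guarantee for each intent of the LLM agent, where a sudden widening of the interval signals an unseen regime. The latter is what I want to do in RQ2.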
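
Option (a) in this answer can be sketched in a few lines. The four-component vector layout, the function names, and the use of plain Euclidean distance are assumptions for illustration; in practice the four dimensions would need to be standardized to comparable scales first, and the prototypes would come from a clustering step not shown here:

```python
import math
from typing import Sequence

# Hypothetical layout: (volatility, directional bias, book thickness, correlation)
Vector = Sequence[float]


def ood_score(x: Vector, prototypes: Sequence[Vector]) -> float:
    # Distance from the new point to the nearest historical regime prototype.
    return min(math.dist(x, p) for p in prototypes)


def is_unseen_regime(x: Vector, prototypes: Sequence[Vector],
                     threshold: float) -> bool:
    # Flag the point when it is far from every known prototype.
    return ood_score(x, prototypes) > threshold
```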

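Q2: "How does your L1-L6 LLM agent guard against hallucination, especially the read-only conversational agent at L3?"

A: Three layers of defense: (1) scope is strictly limited to four categories (universe instrument facts / user system state / system operations / order-rejection reasons), and any evaluation, teaching, or prediction returns ScopeOutOfBoundsRequest; (2) all 9 tools in L3 are read-only data fetches, the LLM is not allowed to generate numbers freely, and every quantitative claim must come from tool output; (3) a silence rule: below confidence 0.7 the agent does not volunteer output. My hallucination defense relies on architectural constraints, not on the LLM's own capability.

Q3: "This is a single-developer project. What challenges arise in scaling this methodology to a larger team?"

A: Three: (1) discipline around the safety rules: one person can keep the 12 hard rules from being bypassed, but multi-person collaboration needs automated enforcement through CI and code review (I have written the rules down in CLAUDE.md, but enforcement is currently in my hands); (2) building a Cognition Base is inherently a collaboration between domain experts and ML researchers; one person can build a prototype, but scaling needs a team; (3) my LLM scope restriction (four categories) was designed around a single user profile, and extending it to multiple users and scenarios requires redoing the scope ontology.

Q4: "On the application of conformal prediction to LLMs in your RQ2: have you read the conformal language modeling work by Cherian, Snell, and others? What would your contribution be?"

A: Yes. That line of work mainly addresses calibration at the token level or for a single output. What I want to do is agent-level: estimating the cumulative risk of an LLM agent in multi-turn / tool-use settings. My understanding of this direction is at an early stage, and I hope to read the literature systematically after joining the lab. What I can contribute is a real-stakes testbed: my production system has a real cost function (PnL) that can ground the practical usability of any conformal method.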

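Questions Prof. Guodong Li may ask (time series / financial econometrics perspective)

Q5: "12 walk-forward folds is too few. How do you handle statistical significance?"

A: I agree that 12 folds is not enough for a rigorous statistical-significance claim. My current positioning is a "tradability gate", which screens out symbols that clearly cannot be traded, rather than a "profitability claim". For the latter, my next-step plan is: (a) extend to a block bootstrap with 100+ folds; (b) introduce the Deflated Sharpe Ratio for multiple-testing correction; (c) add the Probabilistic Sharpe Ratio confidence interval to the verification gate. This is exactly where I need your guidance in RQ1.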
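
For reference, a minimal sketch of the Probabilistic Sharpe Ratio from step (c), following the Bailey and López de Prado formula PSR(SR*) = Φ((SR − SR*)·√(n−1) / √(1 − γ₃·SR + ((γ₄−1)/4)·SR²)), where γ₃ is skewness and γ₄ is non-excess kurtosis. The function name and the choice of moment estimators (sample standard deviation for the Sharpe ratio, population moments for skewness and kurtosis) are my assumptions; the Sharpe ratio here is per period, not annualized:

```python
import math
from statistics import NormalDist, mean, stdev
from typing import Sequence


def probabilistic_sharpe_ratio(returns: Sequence[float],
                               sr_benchmark: float = 0.0) -> float:
    """Probability that the true per-period Sharpe ratio exceeds sr_benchmark."""
    n = len(returns)
    mu = mean(returns)
    sr = mu / stdev(returns)  # per-period Sharpe, not annualized
    # Population central moments for skewness and (non-excess) kurtosis.
    m2 = sum((r - mu) ** 2 for r in returns) / n
    m3 = sum((r - mu) ** 3 for r in returns) / n
    m4 = sum((r - mu) ** 4 for r in returns) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    z = (sr - sr_benchmark) * math.sqrt(n - 1) / denom
    return NormalDist().cdf(z)
```

The Deflated Sharpe Ratio in step (b) uses the same expression but raises `sr_benchmark` to account for the number of strategies tried, which is not shown here.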

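Q6: "You say LightGBM overfits under a chronological split. Have you tried other splitting strategies, such as purged k-fold?"

A: I have tried a chronological split with embargo. The results were similar: parameters that performed well in the validation period performed poorly in the test period. My diagnosis is that the problem is not the splitting scheme but that the underlying (return × microstructure × time) joint distribution really is drifting. Even with proper splitting, the underlying mapping you're trying to learn changes faster than chronological history allows recovery. That is why for RQ1 I propose an architectural prior rather than better splitting or regularization.

Q7: "In Encoding Recurrence into Transformers the recurrence we handle is explicit: the autoregressive structure of the time series. Your "microstructure recurrence" still needs a concrete definition. What is the mathematical relationship between bookTicker and OHLCV?"

A: This is the key question. My working hypothesis: (a) OHLCV is a time-aggregated summary of microstructure events, so OHLCV is a deterministic function of microstructure (at a lower temporal resolution); (b) there is reactive coupling between microstructure events: an aggressive market order immediately changes best bid/ask and depth, which affects the placement of the next order; (c) between these two layers there is a recursion in which the slow trend modulates the fast microstructure. If this hypothesis holds strictly, cross-scale recurrence can be formalized. But I am currently at the hypothesis stage, without formal mathematics, and this is exactly the part I hope to work on with you.

Q8: "Your adaptive state controller looks like Bayesian sequential analysis. Is there an explicit Bayesian formula?"

A: At present it is not strictly Bayesian; it is rule-based staged promotion (promotion only after 3 clean reads). But making it Bayesian is natural: treat a "clean read" as a Bernoulli observation and the per-symbol state as a latent Bayesian belief over "this symbol is tradable". Replacing the hard-count promotion threshold (3 reads) with a posterior probability (P > 0.8) gives a Bayesian update. I did not take this step because production code needs to be deterministic and debuggable, but a research version could.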
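
A minimal sketch of the Bayesian variant described in this answer, using a Beta-Bernoulli model. The uniform Beta(1, 1) prior and the function names are assumptions, and the posterior mean of the clean-read rate is used here as a simple stand-in for the posterior probability mentioned above:

```python
def posterior_mean(clean: int, not_clean: int,
                   a: float = 1.0, b: float = 1.0) -> float:
    # Beta(a, b) prior + Bernoulli reads -> Beta(a + clean, b + not_clean).
    return (a + clean) / (a + b + clean + not_clean)


def should_promote(clean: int, not_clean: int, threshold: float = 0.8) -> bool:
    # Promote only when the posterior mean strictly exceeds the threshold.
    return posterior_mean(clean, not_clean) > threshold
```

Under these assumptions, 3 clean reads out of 3 give a posterior mean of exactly 0.8 and do not promote, while 4 out of 4 give 5/6 and do, so the Bayesian rule is slightly stricter than the hard count of 3.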

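Q9: "What forecast horizon can you handle? High-frequency, or somewhat longer horizons?"

A: M8.6 currently does short-horizon tradability prediction on the 15s timeframe (projecting from the selection window to the next 30-60 min). Longer horizons need a different feature set and verification framework, which I have not built. What I want to stress is that my walk-forward infrastructure is horizon-agnostic: swapping the features and the forward window extends it to intraday / daily / weekly horizons, so infrastructure reuse is high.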

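Questions either professor may ask

Q10: "You built all of this as a single developer. How did you manage the complexity?"

A: Three things: (1) strict milestone-based delivery: each milestone from M0 to M8.6 has explicit acceptance criteria and an audit; (2) documentation discipline: every design judgment has a corresponding doc (the self-evolution reference I showed you is one of them), and doc and code are not allowed to drift apart silently; (3) using AI as a collaborator: Claude Code is the primary implementer, I make product decisions and review, and CLAUDE.md, a file of about 400 lines, serves as the contract between us. This is AI-augmented engineering applied in a real product, which is itself a research observation.

Q11: "Why are you moving from product to research? Do you feel the product has hit a dead end?"

A: No. Product v0.8 is running well; LLM L1-L6 and the rolling backtest are both in production. My reason is the opposite: after M8.6 the bottlenecks I hit were no longer engineering bottlenecks but research bottlenecks. Time-slice stability cannot be solved by writing more code; it needs theoretical work on architectural priors. A Cognition Base cannot be built by piling up data; it needs methodological work on structuring domain knowledge. Open-world LLM safety cannot be solved by adding rules; it needs statistical theory on distribution-free guarantees. I want a research environment because I am standing at the entrance to three research questions. Continuing alone would turn this into hobbyist research; joining a lab is the way to make it serious research.

Q12: "If you had to choose, which of the three RQs would you most want to work on?"

A: RQ1 (Architectural Priors), for three reasons: (1) it has the most concrete experimental design: I already have the walk-forward infrastructure and can run controlled experiments right away; (2) it has the most direct intellectual lineage with Prof. Li's Encoding Recurrence and is the most likely to produce publishable work; (3) its methodological value is the most transferable: if an architectural prior for crypto microstructure works, it can in principle extend to other cross-scale time series (medical ICU, energy markets, meteorology). That said, I am equally willing to commit to RQ2 and RQ3.

Q13: "Will your project code be open-sourced?"

A: It is currently a private repo. If we enter a research collaboration, the methodological parts (walk-forward framework, microstructure gate, adaptive state controller, LLM provider abstraction) can be open-sourced as research artifacts. The production-specific parts (concrete strategy parameters, exchange key management, user data schema) stay private. In making the open-source decision I will weigh both the owner's interests and the contribution to the research community.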

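Benchmark follow-up questions (the most important group; expect deep probing)

Q14: "Doesn't work similar to Crypto-Alpha-Bench already exist? Isn't AlphaEval (2025) exactly that?"

A: There is a key difference. AlphaEval is an evaluation framework: given a set of alpha candidates, how to score them (5 dimensions computed in parallel). Crypto-Alpha-Bench is a benchmark: fixed dataset + fixed protocol + fixed baselines + leaderboard, so that different alpha auto search methods can be compared directly. The two are complementary: Crypto-Alpha-Bench can adopt AlphaEval's 5-dimension evaluation as part of its protocol. Concretely, I would cite AlphaEval as the evaluation backbone and add what it lacks: a fixed dataset, a three-tier cost model, a compute-controlled budget, synthetic ground-truth, and the JKP must-beat baseline.

Q15: "Crypto data is already public. What is the marginal value of building a benchmark?"

A: Three things that public raw data cannot provide:

Cleaned + aligned + gap-free + cross-symbol consistent: I have already built local data infrastructure for 526 symbols × 15s, aligning timestamps, handling missing buckets, and checking cross-symbol consistency of funding events; redoing this would take others 6-12 months
Standardized protocol with three cost tiers and compute budget enforcement: this makes methods genuinely comparable, which public data alone cannot do
Tradability gate as integrated component: my microstructure gate + adaptive state controller are production-grade, while academia generally lacks production-grade tradability infrastructure. This distinguishes academic alpha from executable alpha and is the benchmark's core differentiator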

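Q16: "A benchmark paper is usually considered accepted by the community once it gets 5-10 citations. Why do you believe Crypto-Alpha-Bench can become a community standard?"

A: It may not. This is a calculated bet. I have four reasons for believing in it: (1) the timing is right: the FITEE 2025 survey indicates this is a sub-field in the process of formalizing, which needs a baseline; (2) infrastructure advantage: few researchers have a production-grade trading system as a testbed; (3) cross-tradition design: 6-7 reference baselines cover the GP / DL / RL / LLM traditions, so any subsequent work has reason to cite it; (4) I can commit to maintaining the leaderboard for at least 18 months. If the community does not adopt it, the fallback is RQ1 as a standalone method paper with the benchmark as its evaluation methodology section. The risk is not that the paper cannot be published but that it gains no community-level leverage.

Q17: "How do you handle benchmark gaming, where people optimize for the benchmark without solving the real problem?"

A: Three layers of defense: (1) mandatory reporting under all three cost tiers: a method that scores high under optimistic but collapses under pessimistic is exposed immediately; (2) synthetic ground-truth sub-tasks: with a known alpha, whether a method recovers it is directly measurable and hard to game; (3) held-out future windows: benchmark v0 uses 2022-2024 data, v1 adds 2025 data, and v2 adds 2026 data, so overfitting to the current benchmark is exposed on the future window. This is also why the benchmark needs versioning.

Q18: "What if single-developer maintenance of the benchmark is interrupted? Sustainability is a real problem."

A: I agree this is a real risk. Mitigations: (1) I commit to maintaining it for the first year while looking for an institutional sponsor (an HKU lab, arXiv funding, academic partnerships with crypto exchanges); (2) the protocol, data, and reference baselines are all open source under a permissive license, so anyone can fork and continue maintenance; (3) the design minimizes maintenance burden: the leaderboard uses GitHub Issues plus automated CI validation and needs no dedicated server; (4) if no sponsor is found within 18 months, I will sunset it deliberately rather than maintain it at declining quality, keeping it as a research artifact reference without active updates.

Q19: "Benchmark + 3 RQs sounds like more than 18 months of work. What is your prioritization?"

A: Agreed. My sequencing is: Phase 1 (first 3 months) covers only the benchmark + RQ1 preliminary (testing the architectural prior thesis on the benchmark). RQ2 and RQ3 are follow-ups for Phase 2+, since they need the benchmark in place as a publication vehicle. If I could ship only one thing in 18 months, I would choose the benchmark over an RQ: an RQ on its own is a single paper, whereas an established benchmark benefits the whole lab for years afterward.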

Part D · Checklist for the 24 Hours Before the Talk

Appendix · Key Terms in Chinese and English

Chinese	English
风险闸门	risk gate
生成器与验证器分离	generator-verifier separation
时间切片稳定性	time-slice stability
微结构闸门	microstructure gate
滚动窗口回测	walk-forward / rolling backtest
自适应状态机	adaptive state controller
阶梯式仓位	staged sizing
共形预测	conformal prediction
架构性归纳偏置	architectural inductive bias
知识基底	cognition base
多重检验校正	multiple-testing correction
通缩夏普比率	Deflated Sharpe Ratio
概率夏普比率	Probabilistic Sharpe Ratio
开放世界	open-world
分布漂移	distribution shift / drift
物理不可删	physically append-only / DELETE-prevented
减仓限定	reduce-only
幂等订单 ID	idempotent client_order_id

End of document.

This file contains a ~28 min talk script (Chinese-primary, English terms preserved) + a 29-slide PPT outline (v2, benchmark-centered) + Q&A prep (19 questions, including the Q14-Q19 benchmark follow-ups) + a pre-talk checklist + 15 backup slides.

Key change in v2 (revised 2026-05-18): Act 3 was restructured from "3 abstract RQs" into "Crypto-Alpha-Bench as the concrete proposal, with the 3 RQs as use cases". This change aligns with the research contribution proposed in RESEARCH_PLAN.md.

File path: /Users/paulweng/AI Agent/hku_talk_script_and_ppt.md
Companion repo: alpha-search-frontier-notes/