Crypto-Alpha-Bench · Risk Analysis

Critical examination of the Crypto-Alpha-Bench proposal — what could go wrong, what to mitigate, and what to validate before committing.

2026-05-18 · Paul Weng

0. Why This Document Exists

RESEARCH_PLAN.md makes the case for Crypto-Alpha-Bench. This document is the deliberate counter-perspective — the risks I'd raise if I were the toughest reviewer of my own proposal. Three lenses: (1) academic peer-reviewer, (2) seasoned researcher who's seen benchmarks fail, (3) risk-conscious VC.

If you can't list the failure modes for an ambitious idea, you don't really understand it.

1. Seven Risks · Severity & Mitigation

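Risk 1 · I may not be the best person for this (HIGH severity)

Problem: why should a single developer build this benchmark rather than GAIR-NLP, Stanford AI4Finance Lab, or DeepMind? They already have community influence, institutional sponsors, and the capacity to coordinate across papers.

Evidence this is real:
- AlphaEval (arXiv 2508.13174, 2025-08) is already an attempt in the same direction, from a GAIR sister group (the Peking University ML & DM team)
- The FITEE 2025 survey on LLM-based alpha mining is led by quant teams in mainland China
- These teams may be only weeks to months away from something like Crypto-Alpha-Bench

Mitigation:
- Lean into the unique differentiator: production-grade tradability infrastructure (adaptive state controller, microstructure gate, 3-clean-read promotion). An academic team could not replicate it within 6-12 months. This is the real moat
- Speed > polish: ship v0 as early as possible to claim the framing position
- Open from day 1: a community contribution model turns potential competitors into collaborators

Realistic assessment: Partially mitigatable. The academic teams' advantages are real, but production verification infrastructure is something I have and they do not.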

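Risk 2 · Crypto-only is a double-edged sword (MEDIUM severity)

Problem: mainstream academic financial ML targets US equities and China A-shares. A crypto-only benchmark may strike reviewers as a "narrow setting".

Typical reviewer objections:
- "Your method works on crypto; how do you show it works on equities?"
- Crypto microstructure differs markedly from equity microstructure (24/7 trading, no circuit breakers, funding rates, perpetual contracts vs spot)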

Mitigation options:

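Option	Pro	Con
A · Accept narrow positioning	Narrow but deep; the freshest intersection with LLMs and foundation models	Equity reviewers will push back
B · Crypto in v0, add equity in v1	Greater impact; fits multiple venues	Costly; takes twice as long
C · Multi-asset from the start	Highest ceiling	Highest risk as well; scope creep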

Recommendation: A. The narrow crypto focus is actually an advantage: it is the freshest setting at the intersection of LLM-driven methods, foundation models, and microstructure, whereas academic tooling for equities is already mature.

Mitigation in talk: add one sentence to the Slide 22 motivation: "I focus on crypto perpetuals not because equity is solved, but because crypto's open-world signature makes it the cleaner testbed for the open questions."

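Risk 3 · The verifier may already be overfit (HIGH severity)

Problem (the subtlest and most dangerous of the seven):

My adaptive state controller, microstructure gate (5 dimensions), and 3-clean-read promotion are all design decisions based on phenomena I observed in Binance USD-M data. They are already implicitly fit to the specific statistical regularities of Binance USD-M.

Once this is packaged as a "standard tradability gate baseline", it may draw the following challenge:

"Your baseline itself already leaks test information from the benchmark. You say the microstructure gate screens on 5 dimensions. How were those dimensions chosen? From your experience building the product, right? Then they have already been implicitly 'trained' on Binance historical data."

This is a very sharp objection, and not an easy one to answer.

Mitigation:
- Audit trail discipline: publish the full selection history of the 5 microstructure gate dimensions and the adaptive state parameters (commit history plus a design rationale doc)
- Mandatory sensitivity analysis: the benchmark v0 protocol requires reporting hyperparameter sensitivity for the gate and the state controller
- Enforce a held-out future window: v0 trains baselines on 2022-2024, and 2025 is an untouched holdout
- Acknowledge openly in the paper: state this risk plainly as a limitation, which is better than having a reviewer dig it up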
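
A minimal sketch of how the held-out window and the mandatory sensitivity report could be enforced mechanically. Everything here is an illustrative assumption rather than the v0 protocol itself: the function names, the `(timestamp, payload)` row shape, the UTC boundaries, and the scaling factors are hypothetical.

```python
from datetime import datetime, timezone

# Assumed boundaries, taken from the plan above: develop on 2022-2024, hold out 2025.
TRAIN_START = datetime(2022, 1, 1, tzinfo=timezone.utc)
HOLDOUT_START = datetime(2025, 1, 1, tzinfo=timezone.utc)


def split_by_time(rows):
    """Split (timestamp, payload) rows into development and holdout sets."""
    dev, holdout = [], []
    for ts, payload in rows:
        if ts < TRAIN_START:
            continue  # outside the v0 window entirely
        (holdout if ts >= HOLDOUT_START else dev).append((ts, payload))
    return dev, holdout


def assert_no_leakage(dev, holdout):
    """Fail loudly if any development row falls inside the holdout window."""
    if dev and holdout:
        assert max(ts for ts, _ in dev) < min(ts for ts, _ in holdout)


def sensitivity_sweep(metric_fn, base_params, name, factors=(0.5, 0.75, 1.0, 1.25, 1.5)):
    """Re-evaluate metric_fn with one numeric parameter scaled by each factor."""
    results = {}
    for factor in factors:
        params = dict(base_params)
        params[name] = base_params[name] * factor
        results[factor] = metric_fn(params)
    return results
```

Running `sensitivity_sweep` once per gate or state-controller parameter, on the development set only, gives the table a submission would have to report; a metric that collapses under small perturbations is a hint that the parameter was tuned to Binance USD-M specifics.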

Realistic assessment: Hard to fully mitigate. Candid acknowledgment plus rigorous documentation is the best defense.

Risk 4 · The three-tier cost model may be hard to establish (MEDIUM severity)

Problem:
- The specific parameters of the optimistic / realistic / pessimistic tiers (slippage bps, queue priority, fill probability) have no ground truth; they are a model, not facts
- Cost reality differs across exchanges, time periods, and symbols
- The academic transaction cost analysis literature (Almgren-Chriss, Hasbrouck) is built on US equities and may not transfer

Mitigation:
- Back the parameters out of real fills: use Binance historical fills (public API) for empirical calibration
- Document the calibration process publicly
- Leave room for community PRs across the three tiers: v0 ships one set of defaults, v1+ accepts community-contributed cost model variants
- Don't over-claim: state explicitly that the model "approximates Binance USD-M conditions, transfer to other venues requires re-calibration"
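
One way the empirical calibration could look, as a sketch only. The `Fill` record, the choice of the mid price as reference, and the mapping of slippage quartiles to the three tiers are all assumptions made for illustration; the proposal does not fix any of them, and queue priority and fill probability are left out.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class Fill:
    """Hypothetical fill record; field names are illustrative."""
    side: str          # "buy" or "sell"
    fill_price: float
    mid_price: float   # mid quote at order submission (assumed to be available)


def slippage_bps(fill):
    """Signed cost in basis points versus the mid; positive means worse than mid."""
    sign = 1.0 if fill.side == "buy" else -1.0
    return sign * (fill.fill_price - fill.mid_price) / fill.mid_price * 1e4


def calibrate_tiers(fills):
    """Map empirical slippage quartiles to the three tiers (an illustrative choice)."""
    costs = [slippage_bps(f) for f in fills]
    q1, q2, q3 = quantiles(costs, n=4)  # needs at least two fills
    return {"optimistic": q1, "realistic": q2, "pessimistic": q3}
```

Calibrating per symbol and per time period, then publishing the resulting numbers alongside the script, would make the "model, not facts" caveat auditable rather than just stated.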

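Risk 5 · A compute-controlled budget is hard to enforce in practice (MEDIUM severity)

Problem:
- LLM-driven methods use APIs (metered in tokens) while GP runs on local CPUs (metered in wall-clock time), so the two are not comparable
- Across hardware (A100 vs H100 vs M2), the same GPU-hour delivers several-fold different effective compute
- LLM caching makes token metering inaccurate as well

Among academic benchmarks, MLPerf has done compute-controlled evaluation, but that takes a dedicated team. A single maintainer cannot reach MLPerf's level.

Mitigation:
- Wall-clock plus standardized hardware: require baselines to run on the same hardware spec (a single A100 GPU is recommended)
- Report both token cost and wall-clock time, and let the community decide which is fairer
- Three budget tiers (small/medium/large): participants choose their own tier, and each tier gets its own leaderboard
- Accept imperfect: having something is better than having nothing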
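
A small sketch of dual-metric accounting with budget tiers. The tier ceilings below are placeholder numbers invented for the example, not values from the proposal, and `BudgetMeter` is a hypothetical name. Cached tokens are tracked separately so that a submission cannot shrink its reported budget through caching alone.

```python
from dataclasses import dataclass

# Hypothetical ceilings per tier: (max wall-clock seconds, max tokens).
TIERS = {
    "small": (3_600, 1_000_000),
    "medium": (6 * 3_600, 10_000_000),
    "large": (24 * 3_600, 100_000_000),
}


@dataclass
class BudgetMeter:
    tokens: int = 0            # all tokens processed, cached or not
    cached_tokens: int = 0     # subset served from a cache, reported separately
    wall_clock_s: float = 0.0  # measured on the reference hardware spec

    def record_llm_call(self, tokens, cached=0):
        self.tokens += tokens
        self.cached_tokens += cached

    def record_compute(self, seconds):
        self.wall_clock_s += seconds

    def tier(self):
        """Smallest tier whose two ceilings both cover the run, else None."""
        for name, (max_seconds, max_tokens) in TIERS.items():
            if self.wall_clock_s <= max_seconds and self.tokens <= max_tokens:
                return name
        return None


meter = BudgetMeter()
meter.record_llm_call(tokens=500_000, cached=120_000)
meter.record_compute(seconds=1_800)
print(meter.tier())  # small
```

Requiring a run to fit under both ceilings is deliberately conservative: a GP run with zero tokens and an LLM run with little local compute land in a tier by whichever resource they actually consume.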

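Risk 6 · 18 months may not be enough (MEDIUM severity)

Problem: the NeurIPS Datasets & Benchmarks track has an acceptance rate of ~25-30% and only one deadline per year:

- NeurIPS 2026 deadline ≈ May (already passed)
- NeurIPS 2027 deadline ≈ May
- If v0 only ships in 2026-05, this means committing 18 months to this one project

Mitigation:
- Multi-track publication plan: NeurIPS D&B is the first choice; alternatives are ICLR Benchmarks / ICML Datasets / AAAI / KDD
- Workshop first: publish v0 at the NeurIPS 2026 Workshop on AI in Finance, then submit v1 to a main track
- Methodology paper in parallel: submit RQ1 (architectural prior) to a mainstream venue over the same period as a hedge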

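Risk 7 · Academia ↔ industry tension (HIGH severity, depends on advisor feedback)

Problem: HKU is an academic institution, and academic incentives may not fully align with a community benchmark:

- Academic evaluation: NeurIPS / ICLR / Nature paper > GitHub stars > community adoption
- A benchmark's real impact lies in community adoption and GitHub; the paper is only a byproduct
- If my advisors push for "method paper first, benchmark later", the timeline is disrupted

Mitigation:
- Test this live at the talk: in the ask portion of Act 3, explicitly raise "benchmark vs method paper priority" so that both advisors give a signal on the spot
- Have Plan B ready in advance: the fallback is "benchmark methodology" as the evaluation section of the RQ1 method paper
- Accept a change of direction: if the advisors strongly prefer method-first, I should not fight their direction during the first 6 months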

2. Composite Severity Matrix

Risk	Severity	Mitigatable?	Notes
1 · Not the best person	HIGH	Partial	Rely on unique infra + speed
2 · Crypto-only narrow	MEDIUM	Easy	Frame explicitly
3 · Verifier overfitting	HIGH	Hard	Rigorous documentation + acknowledge
4 · Cost model subjective	MEDIUM	Medium	Empirical calibration + community PR
5 · Compute control hard	MEDIUM	Easy	Accept imperfect, multi-tier budget
6 · 18 months not enough	MEDIUM	Medium	Multi-track publication + workshop first
7 · Academia ↔ industry tension	HIGH	Needs advisor feedback	Test live at the talk

Highest-priority risks: 1 + 3 + 7. These three are the real threats to the benchmark idea; the rest are manageable.

3. The ETH Validation Experiment (strongly recommended)

Before committing to the benchmark, spend 1-2 weeks on a single conviction-validating experiment.

Setup:

Using the existing walk-forward pipeline, rerun the LightGBM / Optuna experiments on ETH/USDT instead of BTC/USDT.

Hypothesis to test:

Is the "time-slice stability bottleneck" (strong on val, weak on test) a BTC-specific phenomenon, or a general feature of crypto microstructure?

Two possible outcomes:

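Outcome	Implication
ETH also reproduces the val→test reversal	Benchmark motivation strengthened: this is not a BTC fluke but a structural problem of crypto microstructure. Conviction +50%
ML performs acceptably on ETH	Benchmark motivation needs re-examination: BTC may have specific anomalous factors that need to be identified. Conviction -30%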

Cost: 1-2 weeks (the infrastructure already exists; this is only a rerun on a different symbol)

Value: a conviction signal well before the talk; at the talk I can say "I've validated this phenomenon on 2 independent symbols".

Implementation checklist:
- [ ] Confirm ETH/USDT data completeness (≥480 unique 15s bars covering the full walk-forward window)
- [ ] Run the same LightGBM threshold sweep
- [ ] Run the same 500-trial Optuna parameter search
- [ ] Report val-period MTM, test-period MTM, and chronological consistency
- [ ] If the results are nuanced (for example, some ETH folds reproduce and others do not), document the diagnostics
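
To make the last two checklist items concrete, here is a sketch of how the val→test reversal could be scored per fold. It does not reproduce the existing pipeline: the fold geometry, the function names, and the rule "val MTM positive, test MTM non-positive" are assumptions chosen for illustration.

```python
def walk_forward_folds(n_bars, train, val, test, step):
    """Yield (train, val, test) index ranges that roll forward in time."""
    start = 0
    while start + train + val + test <= n_bars:
        val_start = start + train
        test_start = val_start + val
        yield (
            range(start, val_start),
            range(val_start, test_start),
            range(test_start, test_start + test),
        )
        start += step


def reversal_rate(fold_results):
    """Share of folds where validation MTM is positive but test MTM is not.

    fold_results is a list of (val_mtm, test_mtm) pairs, one per fold.
    """
    flagged = [val_mtm > 0 and test_mtm <= 0 for val_mtm, test_mtm in fold_results]
    return sum(flagged) / len(flagged) if flagged else 0.0


folds = list(walk_forward_folds(n_bars=100, train=50, val=20, test=20, step=10))
print(len(folds))  # 2
```

Computing `reversal_rate` with identical fold geometry on BTC/USDT and ETH/USDT gives one comparable number per symbol, and the per-fold flags show directly whether only some ETH folds reproduce the effect.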

4. Final Verdict

Do I still recommend the benchmark proposal? Yes — with three conditions.

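Condition 1: run the ETH validation experiment first (a partial hedge for risk 3)

Condition 2: proactively acknowledge risks 1 and 3 at the talk rather than hiding them for reviewers to dig up

Condition 3: Plan B must be truly functional: if the advisors strongly prefer method-first, accept the adjustment. RQ1 as a standalone paper is a legitimate fallback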

Bottom line:

Crypto-Alpha-Bench is still a high-leverage move. The risks are real but mitigatable. The biggest non-mitigatable variable is faculty feedback at the talk: Han/Li may steer the direction in ways I do not anticipate. Stay open.

5. Next-Action Decision Tree

Run the ETH validation experiment?
├─ Yes (strongly recommended)
│  ├─ ETH reproduces → conviction up, proceed with benchmark plan
│  └─ ETH does not reproduce → re-examine benchmark motivation, may pivot to RQ1-only
│
└─ Skip
   ├─ Risk: thin conviction at the talk (BTC is the only example)
   └─ Only sound reason: too little time

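Advisor feedback after the talk:
├─ Benchmark idea accepted
│  ├─ Go straight into the 12-week roadmap
│  └─ Build the mitigations for Risks 1/3/7 into the v0 protocol
│
├─ "Benchmark is too big, do the method first"
│  ├─ Accept, switch to RQ1 standalone
│  └─ Benchmark methodology becomes the evaluation section of the RQ1 paper
│
└─ "A benchmark is not a research contribution"
   ├─ Rare but possible (depends on the advisors' tradition)
   └─ Plan C: switch entirely to the method-paper route; publish the benchmark idea later as a workshop paper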

6. What I'm NOT Worried About

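For balance, here are things I did not list as risks but that others might worry about, and why I am not worried:

- "Crypto data cannot be publicly released": Binance historical data is available through the public API, and distribution is license-friendly. My cleaned and aligned version can be released under a BSD/MIT/Apache license
- "Benchmark maintenance burden": one person can carry it for 18 months; after that, find a sponsor or sunset it. Imperfect but manageable
- "LLM API prices make baselines irreproducible": caching plus a local small-model baseline as a backstop; the OpenSeeker paradigm shows that small-model SFT can reach frontier performance
- "Finance reviewers are conservative": Datasets & Benchmarks track reviewing weighs protocol rigor more heavily and is not dominated by finance domain experts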

End of risk analysis. Time to think, then decide.