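Alpha Auto Search · Three Deep Reads

Close Reading of the "Minimum Increment" Reading List for the HKU Presentation

Three must-read clusters, each likely to come up in the presentation Q&A. Working through them should make Acts 2 and 3 noticeably more rigorous.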

Deep Read 1 · Bailey/López de Prado Series + Harvey-Liu "Lucky Factors"

Why this matters for the presentation: Prof. Li is likely to ask, "How does your 12-fold walk-forward handle statistical significance?" Prof. Han may ask, "You say ML overfits in finance; how do you demonstrate that rigorously?" Both questions call for the PBO / DSR / Lucky Factors framework as the standard answer.

1.1 The Probability of Backtest Overfitting (Bailey, Borwein, López de Prado, Zhu · 2013)

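Core idea:

Traditional cross-validation assumes IID data, which financial time series do not satisfy. Statistical tests built on a conventional in-sample / out-of-sample split therefore look right but are overconfident. The authors propose a new metric, the Probability of Backtest Overfitting (PBO), which directly estimates the probability that the strategy selected as best in-sample performs below the median out-of-sample.

Core method: Combinatorially Symmetric Cross-Validation (CSCV)

Split the backtest sample into S blocks (S is typically 8 to 16 and must be even), choose S/2 blocks as the training set, and use the remaining S/2 as the test set. There are C(S, S/2) such combinations. For each one, record the test-set rank of the strategy that was best on the training set. If the in-sample best frequently ranks below the median on the test set, PBO is high.

Key formula (quotable):

PBO = P(rank_test(strategy*_train) < N/2)

where strategy*_train is the best strategy on the training set for a given combination, N is the number of candidate strategies, and ranks are ascending in performance (rank N is the best performer).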
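The CSCV procedure above can be sketched in a few lines. This is a minimal illustration rather than the authors' reference implementation: it assumes per-period Sharpe as the performance measure, contiguous equal-length blocks, and the relative-rank logit convention (a logit at or below zero counts as "below median"). The function names `sharpe` and `pbo_cscv` are hypothetical.

```python
import itertools
import math
import random
import statistics


def sharpe(returns):
    """Per-period Sharpe ratio; returns 0.0 when the series has no variance."""
    sd = statistics.pstdev(returns)
    return statistics.fmean(returns) / sd if sd > 0 else 0.0


def pbo_cscv(returns_by_strategy, n_blocks):
    """Estimate PBO with CSCV.

    returns_by_strategy: list of N equal-length return series, one per strategy.
    n_blocks: S, an even number of contiguous time blocks.
    Returns (pbo_estimate, number_of_train_test_combinations).
    """
    n_obs = len(returns_by_strategy[0])
    block_len = n_obs // n_blocks
    blocks = [range(b * block_len, (b + 1) * block_len) for b in range(n_blocks)]
    n_strategies = len(returns_by_strategy)
    below_median = 0
    n_splits = 0
    for train_ids in itertools.combinations(range(n_blocks), n_blocks // 2):
        train_set = set(train_ids)
        train_idx = [t for b in train_ids for t in blocks[b]]
        test_idx = [t for b in range(n_blocks) if b not in train_set for t in blocks[b]]
        train_sr = [sharpe([r[t] for t in train_idx]) for r in returns_by_strategy]
        test_sr = [sharpe([r[t] for t in test_idx]) for r in returns_by_strategy]
        best = max(range(n_strategies), key=train_sr.__getitem__)
        # Ascending rank of the in-sample winner on the test set (1 = worst).
        rank = 1 + sum(1 for s in test_sr if s < test_sr[best])
        omega = rank / (n_strategies + 1)  # relative rank in (0, 1)
        logit = math.log(omega / (1 - omega))
        below_median += logit <= 0
        n_splits += 1
    return below_median / n_splits, n_splits


# Synthetic demo: 20 pure-noise strategies, S = 8 blocks, so C(8, 4) = 70 combinations.
rng = random.Random(0)
noise = [[rng.gauss(0.0, 1.0) for _ in range(240)] for _ in range(20)]
pbo, n_splits = pbo_cscv(noise, n_blocks=8)
print(n_splits, round(pbo, 2))
```

For the Optuna case, each trial's per-period return series would take the place of one row of `noise`.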

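Direct implications for your work:

Your LightGBM threshold of 0.36 produced MTM +44.65 in the validation period and -86.87 in the test period. This is classic backtest overfitting, and it calls for reporting an actual PBO number rather than only narrating the reversal.
The 12-fold walk-forward is too small for a strict CSCV (S ≥ 8 is needed to get C(8,4) = 70 combinations, enough for a reasonable PBO estimate). After extending to 100 folds you can run CSCV with S = 16, giving C(16,8) = 12,870 combinations.
The same issue applies to the 500 Optuna trials: treat the trials as N = 500 strategy candidates and run CSCV.

A possible Q&A answer:

"My current walk-forward setup does not have enough folds for a strict PBO. The next step is to extend to 100+ folds plus Combinatorially Symmetric Cross-Validation. My diagnosis of the LightGBM val→test reversal (that time-slice stability is the real bottleneck) is a qualitative observation; quantitatively, it needs PBO to support the claim."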

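1.2 Deflated Sharpe Ratio (Bailey, López de Prado · 2014)

Core idea:

If you run N candidate strategies and pick the one with the highest Sharpe ratio, that Sharpe carries two layers of inflation:
1. Selection bias: you picked the maximum of N, so its expected value is biased upward
2. Non-Normality: financial returns are fat-tailed, while the standard Sharpe formula assumes Normal returns

The DSR formula corrects for both layers and indicates how unusual the observed Sharpe would be if the null hypothesis held.

Key formula (simplified):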

\[DSR = \Phi\left[\frac{(\hat{SR} - SR_0) \sqrt{n-1}}{\sqrt{1 - \hat{\gamma}_3 \hat{SR} + \frac{\hat{\gamma}_4 - 1}{4} \hat{SR}^2}}\right]\]

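where:
- \(\hat{SR}\) = observed Sharpe (per period, not annualized)
- \(SR_0\) = Sharpe under the null hypothesis (usually 0 or some reference value)
- \(\hat{\gamma}_3, \hat{\gamma}_4\) = skewness and kurtosis of returns (these handle non-Normality)
- \(\Phi\) = Normal CDF
- n = sample size (number of return observations)

Key insight: as N grows (more candidates), \(SR_0\) should be replaced by \(E[\hat{SR}_{max}]\), the expected maximum Sharpe under the null. This is the selection-bias correction.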
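A minimal sketch of the formula above, using only the standard library. The expected-maximum approximation in `expected_max_sharpe` is the extreme-value form commonly quoted alongside DSR (with the Euler-Mascheroni constant); the function names and all demo numbers are hypothetical, and `sharpe_std` (the dispersion of Sharpe estimates across trials) is something you would have to estimate from your own trial set.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_N = NormalDist()


def expected_max_sharpe(n_trials, sharpe_std):
    """Approximate E[max SR] across n_trials (>= 2) unskilled strategies with true SR = 0.

    sharpe_std is the standard deviation of the Sharpe estimates across trials.
    """
    a = _N.inv_cdf(1.0 - 1.0 / n_trials)
    b = _N.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return sharpe_std * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe(sr_hat, sr0, n_obs, skew, kurt):
    """The bracketed statistic above, passed through the Normal CDF.

    sr_hat and sr0 are per-period Sharpe ratios; kurt is raw kurtosis (Normal = 3).
    """
    denom = math.sqrt(1.0 - skew * sr_hat + (kurt - 1.0) / 4.0 * sr_hat ** 2)
    return _N.cdf((sr_hat - sr0) * math.sqrt(n_obs - 1) / denom)


# Hypothetical numbers: best of 500 trials, 1000 return observations.
sr0 = expected_max_sharpe(n_trials=500, sharpe_std=0.03)
dsr = deflated_sharpe(sr_hat=0.10, sr0=sr0, n_obs=1000, skew=-0.5, kurt=5.0)
print(round(sr0, 4), round(dsr, 4))
```

Setting `sr0=0.0` instead reproduces the uncorrected case, which makes the size of the selection-bias adjustment easy to see side by side.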

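Implications for your work:

Your M8.6 process for selecting the "easy symbol whitelist" is itself a selection step: 4 whitelist entries chosen out of 526 symbols × 4 offsets. Strictly speaking, the Sharpe of every whitelisted symbol should be deflated.
The current 4-tier classification (whitelist / light sweep / no fill / avoid) is decided by empirical thresholds (3 buy fills / 100% TP / 0 stop losses). The implicit Sharpe threshold behind this rule has no DSR correction.
Suggested upgrade: report DSR(symbol, offset) for every fold, then replace the current empirical thresholds with DSR > critical value.

A possible Q&A answer:

"My current 4-tier classification uses path-quality metrics (TP fill rate, stop-loss count, timeout count), which are more actionable from an engineering standpoint. The next step is to add DSR as a secondary filter, specifically to handle the selection-bias layer."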

1.3 ...and the Cross-Section of Expected Returns (Harvey, Liu, Zhu · 2016 RFS)

Core claim:

"Given hundreds of papers and hundreds of factors, it does not make economic or statistical sense to use the usual significance criteria (t > 2.0). A newly discovered factor needs to clear a much higher hurdle, t > 3.0."

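Multiple-testing framework:

The authors catalogue 316 factors from 313 papers published since 1967 and trace what an honest t-statistic cutoff should be at each point in time. The cutoff rises over time because the cumulative number of tests keeps growing.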
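To see why the cutoff has to rise with the number of tests, here is a small sketch using the Bonferroni adjustment under a Normal approximation. It is not the paper's full procedure (which also considers other adjustments and accounts for unpublished and correlated tests); `bonferroni_t_cutoff` is a hypothetical helper name.

```python
from statistics import NormalDist


def bonferroni_t_cutoff(n_tests, alpha=0.05):
    """Two-sided |t| hurdle that keeps the family-wise error rate at alpha."""
    return NormalDist().inv_cdf(1.0 - alpha / (2.0 * n_tests))


# The hurdle grows as more factors are tested.
for m in (1, 10, 100, 316):
    print(m, round(bonferroni_t_cutoff(m), 2))
```

With a single test the function returns the familiar 1.96; the same logic applies to your own trial counts (for example, the number of Optuna trials or symbol × offset cells).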

Direct quote value for your work:

This paper is the best quote material for opening the presentation or closing Act 1. One way to use it:

"Harvey, Liu, and Zhu (2016) argued that after decades of data mining, a newly discovered factor needs t > 3.0, not t > 2.0. My production system implements this insight not via t-test, but via a multi-layer verifier stack: a deterministic risk gate, walk-forward with embargo, and an adaptive state controller with a conservative promotion rule (3 clean reads required). The philosophy is the same: be deeply skeptical of single-observation evidence."

Mapping your own engineering implementation onto the standards the academic community demands is a very effective line of argument with academic reviewers.

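1.4 Lucky Factors (Harvey, Liu · 2021 JFE)

Core method: a bootstrap framework with re-shuffling simulates the distribution of the N candidate factors' t-statistics under the null hypothesis, then judges whether the observed t-statistic could be luck.

Difference from DSR: DSR is a closed-form correction (it relies on distributional approximations), whereas Lucky Factors is simulation-based (more robust but more expensive).

Implications for your work:

Your LightGBM/Optuna experiments are, in essence, multiple testing. For a rigorous argument you could:

Run N = 500 random hyperparameter configurations on label-shuffled data as the null distribution (the control group)
Compute the percentile rank of the observed best Sharpe within that null
If the observed value is below the 95th percentile of the null, acknowledge that it "may be lucky"

This bootstrap framework suits your walk-forward setup, and the compute cost is manageable (no need to rerun live trading).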
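The three steps above can be sketched as a permutation test on the best-of-N Sharpe. This is a simplified stand-in, not the Harvey-Liu procedure: instead of retraining on shuffled labels, it shuffles the asset returns in time so that any link between candidate positions and returns is broken. All names (`best_sharpe`, `null_percentile`) and the synthetic data are hypothetical.

```python
import random
import statistics


def sharpe(returns):
    """Per-period Sharpe ratio; returns 0.0 when the series has no variance."""
    sd = statistics.pstdev(returns)
    return statistics.fmean(returns) / sd if sd > 0 else 0.0


def best_sharpe(positions_by_candidate, asset_returns):
    """Highest Sharpe among all candidates' position * return series."""
    return max(
        sharpe([p * r for p, r in zip(pos, asset_returns)])
        for pos in positions_by_candidate
    )


def null_percentile(positions_by_candidate, asset_returns, n_shuffles=200, seed=0):
    """Share of shuffled-return runs whose best Sharpe is below the observed best."""
    rng = random.Random(seed)
    observed = best_sharpe(positions_by_candidate, asset_returns)
    shuffled = list(asset_returns)
    below = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        below += best_sharpe(positions_by_candidate, shuffled) < observed
    return observed, below / n_shuffles


# Synthetic demo: 50 candidates with random +1/-1 positions on pure-noise returns.
rng = random.Random(42)
rets = [rng.gauss(0.0, 1.0) for _ in range(300)]
candidates = [[rng.choice((-1.0, 1.0)) for _ in rets] for _ in range(50)]
observed, pct = null_percentile(candidates, rets)
print(round(observed, 3), pct)  # a percentile below 0.95 means "may be lucky"
```

Because the maximum is recomputed across all candidates inside every shuffle, the null already reflects the selection step, which is the point of the comparison.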

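1.5 Replicating Anomalies (Hou, Xue, Zhang · 2020 RFS)

Core data points:

The paper retests 452 published anomalies and finds:

| Criterion | Failure rate |
| --- | --- |
| t < 1.96 (single-test 5% hurdle), all anomalies | 65% |
| t < 2.78 (multiple-testing-aware 5% hurdle), all anomalies | 82% |
| t < 1.96, trading frictions category only | 96% |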

Direct implication for the RQ3 Cognition Base:

Published anomalies cannot be taken directly as Cognition Base content: 82% of them fail strict multiple testing.

The right way to build the Cognition Base:

Start from the JKP (Jensen-Kelly-Pedersen) factor library, which is a comparatively trustworthy, replication-aware starting point
Every anomaly entry must carry a replication status: {verified by Hou-Xue-Zhang / verified by JKP / verified independently / unverified}
When the Researcher Agent queries the Cognition Base, priority weights should be ordered by replication strength, so that the Researcher does not propose hypotheses built on a "false anomaly"
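One way the entry structure and priority ordering above could look in code. This is a hypothetical sketch: the class names, the placeholder anomaly names, and especially the numeric weights are illustrative assumptions. Only the four status values and the idea of ordering by replication strength come from the notes.

```python
from dataclasses import dataclass
from enum import Enum


class Replication(Enum):
    VERIFIED_HXZ = "verified by Hou-Xue-Zhang"
    VERIFIED_JKP = "verified by JKP"
    VERIFIED_INDEPENDENT = "verified independently"
    UNVERIFIED = "unverified"


# Hypothetical weights: only the ordering (verified above unverified) follows the text.
PRIORITY_WEIGHT = {
    Replication.VERIFIED_HXZ: 1.0,
    Replication.VERIFIED_JKP: 1.0,
    Replication.VERIFIED_INDEPENDENT: 0.7,
    Replication.UNVERIFIED: 0.1,
}


@dataclass(frozen=True)
class AnomalyEntry:
    name: str
    source: str
    replication: Replication

    @property
    def priority(self) -> float:
        return PRIORITY_WEIGHT[self.replication]


def rank_for_researcher(entries):
    """Order entries so that better-replicated anomalies are proposed first."""
    return sorted(entries, key=lambda e: e.priority, reverse=True)


# Placeholder entries; names and sources are made up for illustration.
entries = [
    AnomalyEntry("anomaly_a", "paper_a", Replication.UNVERIFIED),
    AnomalyEntry("anomaly_b", "paper_b", Replication.VERIFIED_JKP),
    AnomalyEntry("anomaly_c", "paper_c", Replication.VERIFIED_INDEPENDENT),
]
for entry in rank_for_researcher(entries):
    print(entry.name, entry.replication.value, entry.priority)
```

Keeping the status as an enum rather than a free-text field makes it harder for an unverified entry to slip into the knowledge base without explicit metadata.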

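A sentence to add on the RQ3 slide:

"The Cognition Base I propose is not a mere collection of published anomalies. According to Hou-Xue-Zhang (2020), 65-82% of published anomalies do not survive a strict retest. The Cognition Base must be a replication-aware structured knowledge base in which every entry has an explicit replication-strength weight."

1.6 30-Second Summary of This Cluster (for the presentation)

"PBO estimates the probability of backtest overfitting; DSR corrects the Sharpe ratio for selection bias under multiple testing; Harvey-Liu-Zhu (2016) argue for t > 3.0 rather than t > 2.0; Lucky Factors handles multiple testing via bootstrap; Hou-Xue-Zhang (2020) retest 452 anomalies and find that 82% fail the strict multiple-testing hurdle.

Applying this framework to my walk-forward setup: the current 12 folds are not enough for a strict PBO, so extending to 100 folds + CSCV is the next step; each fold should report DSR in place of empirical thresholds; and the RQ3 Cognition Base must be replication-aware."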

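Deep Read 2 · Time Series Foundation Models + Encoding Recurrence

Why this matters for the presentation: RQ1 is about architectural priors, so it must be positioned between two contemporary lines of work: large-model zero-shot forecasting (Chronos / TimesFM) and small-model inductive bias (Encoding Recurrence). Otherwise RQ1 risks being criticized as "outdated".

2.1 Chronos (Amazon · 2024)

Core idea: treat time series forecasting as a language modeling problem.

Continuous values → discrete tokens: scale, then quantize by binning
T5-style encoder-decoder Transformer
Pretrained on a large collection of public + synthetic time series
Used zero-shot on previously unseen datasets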
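The "scale, then bin" step can be illustrated with a toy tokenizer. This is a sketch in the spirit of that step, not the Chronos code: the bin count, clipping range, and function names are assumptions chosen to keep the example small. It also shows how small high-frequency moves can collapse into a single token.

```python
def tokenize(series, n_bins=64, low=-5.0, high=5.0):
    """Mean-scale a series, then map each value to a uniform bin index."""
    scale = sum(abs(x) for x in series) / len(series) or 1.0
    width = (high - low) / n_bins
    tokens = []
    for x in series:
        z = min(max(x / scale, low), high)  # clip to the binning range
        tokens.append(min(int((z - low) / width), n_bins - 1))
    return tokens, scale


def detokenize(tokens, scale, n_bins=64, low=-5.0, high=5.0):
    """Map bin indices back to bin centers and undo the scaling."""
    width = (high - low) / n_bins
    return [(low + (t + 0.5) * width) * scale for t in tokens]


# Four nearby prices: after mean scaling they all sit close to 1.0.
prices = [100.00, 100.01, 100.02, 99.99]
tokens, scale = tokenize(prices)
print(tokens)                      # every price lands in the same bin
print(detokenize(tokens, scale))   # the tick-level differences are gone
```

In practice one would tokenize returns or differences rather than raw prices, and use many more bins, but the resolution limit is the same in kind: anything smaller than one bin width is invisible to the model.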

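Strengths:
- Zero-shot capability: given a new crypto symbol, it can forecast without finetuning
- The quantization route lets "time series tokens" reuse in-context learning from NLP

Weaknesses (this is the hook for RQ1):
- Quantization is lossy: high-frequency information in 15s OHLCV may be lost during tokenization
- No architectural prior: every series is tokenized as homogeneous data, so structure specific to crypto microstructure (bookTicker reactivity, orderflow imbalance) is flattened by tokenization
- Training data skews toward generic time series: high-frequency financial series make up only a small share of pretraining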

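2.2 TimesFM (Google · 2024)

Core idea: a decoder-only Transformer pretrained on a large time series corpus assembled by Google.

Comparison with Chronos:
- TimesFM is decoder-only (closer to GPT); Chronos is encoder-decoder (closer to T5)
- TimesFM's training data is larger but not public
- Both follow the "time series = language" route, and both lack an architectural prior

2.3 Moirai (Salesforce · 2024)

Core innovation: Any-variate attention, which handles time series with an arbitrary number of variates at once.

Trained on the LOTSA dataset (27 billion observations)
Well suited to multivariate time series
Still on the generic foundation model route

What it means for your RQ1: Moirai shows that multivariate handling can itself be an architectural innovation, but it still carries no prior specific to financial microstructure.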

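2.4 Lag-Llama (2024)

Core: an early foundation model aimed at probabilistic time series forecasting.

The output is a distribution rather than a point estimate
This is especially relevant in finance: risk control needs tail estimates from the distribution, not a mean prediction

What it means for RQ2: probabilistic forecasting + conformal prediction are a natural combination.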
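As a reference point for the conformal side of that combination, here is a minimal split-conformal sketch. It assumes exchangeable calibration errors, which financial series generally violate, so treat it as the textbook baseline rather than something ready for production; the function names and numbers are hypothetical.

```python
import math


def conformal_quantile(calib_errors, alpha=0.1):
    """Split-conformal radius from absolute calibration errors |y - y_hat|."""
    n = len(calib_errors)
    k = math.ceil((n + 1) * (1.0 - alpha))  # finite-sample corrected order statistic
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calib_errors)[k - 1]


def interval(point_forecast, radius):
    """Symmetric prediction interval around a point forecast."""
    return point_forecast - radius, point_forecast + radius


# Hypothetical calibration errors from a held-out window.
errors = [0.4, 1.2, 0.3, 0.9, 2.1, 0.5, 0.7, 1.5, 0.2, 0.8, 1.1, 0.6]
radius = conformal_quantile(errors, alpha=0.2)
print(interval(10.0, radius))
```

With a probabilistic model such as Lag-Llama, the absolute error would typically be replaced by a score built from the predicted quantiles, but the calibration step keeps this shape.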

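2.5 PatchTST (Nie, Nguyen, Sinthong, Kalagnanam · ICLR 2023)

Core idea:
- Univariate patching: each channel is split into patches independently
- Channel-independence: each channel is modeled separately, with no attention across channels
- A simple structure ends up beating more complex time series Transformers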
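The patching and channel-independence ideas can be shown without any model code. The sketch below is a simplification (it drops the final partial window instead of padding it), and the names `make_patches` and `patch_channels` as well as the sample values are hypothetical.

```python
def make_patches(series, patch_len, stride):
    """Split one channel into overlapping patches (the last partial window is dropped)."""
    return [
        series[start:start + patch_len]
        for start in range(0, len(series) - patch_len + 1, stride)
    ]


def patch_channels(multivariate, patch_len=4, stride=2):
    """Channel-independence: patch every channel on its own, with no mixing across channels."""
    return {
        name: make_patches(values, patch_len, stride)
        for name, values in multivariate.items()
    }


# Two hypothetical channels of the same length.
bars = {
    "close": [1.0, 1.1, 1.2, 1.1, 1.3, 1.4, 1.2, 1.5, 1.6, 1.4],
    "volume": [10, 12, 9, 15, 11, 14, 13, 16, 12, 18],
}
patched = patch_channels(bars)
print(len(patched["close"]), patched["close"][0])
```

Each patch then becomes one input token for its own channel, which shortens the sequence the Transformer sees by roughly the stride.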

Comparison with Chronos: PatchTST is supervised (trained separately per task), while Chronos is a self-supervised foundation model. PatchTST lends support to the view that "task-specific + a good architectural prior > foundation model on generic data", which is exactly RQ1's working hypothesis.

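2.6 TimesNet & TimeMixer

TimesNet: finds the dominant periods of a series, folds the 1D series into 2D by period, and models it with 2D convolutions. A frequency-domain inductive bias serves as the architectural prior.
TimeMixer: MLP-based, with multiscale series decomposition mixing. It shows that attention is not required: an MLP plus a good prior can also be strong.

2.7 Position: There are no Champions in Long-Term TSF (2025)

Core claims:
- Different benchmarks produce different winners
- There is no universally best model
- This means the research space for architectural priors is still wide open

This paper is an indirect argument for the research value of RQ1: precisely because there is no champion, inductive bias research is worth doing.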

2.8 Encoding Recurrence into Transformers (Huang, Lu, Cai, Qin, Fang, Tian, Li · ICLR 2023 Oral, Top 5%)

Paper details:

Authors: Feiqing Huang, Kexin Lu, Yuxi Cai, Zhen Qin, Yanwen Fang, Guangjian Tian, Guodong Li
Venue: ICLR 2023, Oral (Top 5%)

Core contributions:

REM (Recurrence Encoding Matrix): decomposes an RNN layer into a lightweight positional encoding matrix
RSA (Self-Attention with Recurrence): injects the REM into multihead self-attention
Gated mechanism: data-driven control of the ratio between recurrent and non-recurrent signals
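A toy, single-head sketch of how the three pieces fit together, under stated assumptions: the REM is taken to be a lower-triangular matrix with entries that decay as a power of the lag, and the gate is a sigmoid of a scalar parameter that blends this matrix with ordinary softmax attention weights. The function names, the scalar parameters `lam` and `mu`, and the restriction to one REM type are simplifications for illustration and should be checked against the paper before being quoted.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def regular_rem(seq_len, lam):
    """Lower-triangular matrix with P[t][s] = lam ** (t - s) for s < t, else 0."""
    return [
        [lam ** (t - s) if s < t else 0.0 for s in range(seq_len)]
        for t in range(seq_len)
    ]


def rsa_weights(scores, lam, mu):
    """Blend softmax attention with the REM using gate g = sigmoid(mu)."""
    gate = 1.0 / (1.0 + math.exp(-mu))
    rem = regular_rem(len(scores), lam)
    attn = [softmax(row) for row in scores]
    return [
        [(1.0 - gate) * a + gate * p for a, p in zip(a_row, p_row)]
        for a_row, p_row in zip(attn, rem)
    ]


# With uniform scores and mu = 0, half the weight comes from the recurrence pattern.
weights = rsa_weights([[0.0] * 3 for _ in range(3)], lam=0.5, mu=0.0)
for row in weights:
    print([round(w, 3) for w in row])
```

Multiplying these weights by the value matrix gives the layer output; a strongly negative `mu` recovers plain self-attention, which is the sense in which the gate is data-driven.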

Key insight (quote this):

"By encoding recurrence as a structural property of attention, RSA achieves better sample efficiency than baseline Transformers, while preserving the ability to model non-recurrent signals through self-attention."

Sample efficiency matters a great deal for your RQ1:

Crypto high-frequency data is plentiful, but regimes are stable only briefly (as your LightGBM experiment shows)
So sample efficiency matters more than total training volume
An architectural prior supplies sample efficiency, whereas the foundation model route (Chronos / TimesFM) bets that large data makes up for the missing prior

RQ1 can be positioned as:

"Huang et al. (2023) showed that architectural prior > model capacity in standard time series forecasting. Foundation model approaches (Chronos, TimesFM, Moirai) bet on the opposite: that enough generic data + scale eliminate the need for an architectural prior. My RQ1 hypothesis is that for high-frequency crypto microstructure, where regime stability is short and microstructure-price recurrence is structurally non-trivial, the architectural-prior route is more promising than scaling. The contribution would be a microstructure-aware extension of REM."

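2.9 30-Second Summary of This Cluster

"Time series foundation models have surged over the past two years: Chronos / TimesFM / Moirai follow the generic foundation route, while PatchTST / TimeMixer / Encoding Recurrence follow the architectural prior route. The 2025 position paper argues that there is no champion model. The ICLR 2023 Oral by Huang et al. (co-authored by Guodong Li) shows that architectural prior > model capacity. My RQ1 extends the Encoding Recurrence idea to crypto microstructure, specifically by modeling cross-scale microstructure-price recurrence."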

Deep Read 3 · AlphaEval (arXiv 2508.13174 · 2025)

Why this matters for the presentation: your walk-forward methodology needs an academic-grade evaluation reference as an anchor. AlphaEval, a 2025 "unified evaluation framework for formula alpha mining", fits this role.

3.1 Paper Details

Title: AlphaEval: A Comprehensive and Efficient Evaluation Framework for Formula Alpha Mining
Authors: Hongjun Ding, Binqi Chen, Jinsheng Huang, Taian Guo, Zhengyang Mao, Guoyi Shao, Lutong Zou, Luchen Liu, Ming Zhang
arXiv: 2508.13174 (2025-08-10)
Code: github.com/LeoDingggg/AlphaEval

3.2 Core Problem

"Backtesting is computationally intensive, inherently sequential, and sensitive to specific strategy parameters."

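Problems the authors identify in the status quo:
- Existing alpha mining work evaluates with each paper's own backtest setup, so results are not comparable
- Backtesting is expensive and sequential (it cannot be parallelized)
- A single metric (IC or Sharpe) discards a lot of information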

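3.3 AlphaEval's Five Evaluation Dimensions

| Dimension | Meaning | Your M8.6 counterpart |
| --- | --- | --- |
| Predictive Power | How well the factor predicts future returns | Your IC + MTM PnL |
| Stability | Stability across subsamples / time windows | Your walk-forward across 12 folds |
| Robustness to Market Perturbations | Robustness when market conditions are perturbed | Your microstructure gate / adaptive state |
| Financial Logic | Whether the factor maps to a plausible economic mechanism | The "must have an economic story" requirement in your self-evolution reference |
| Diversity | Orthogonality to the existing factor library | Weak in your project (single strategy, not a portfolio) |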
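For the Predictive Power row, IC is commonly computed as a correlation between factor values and forward returns across assets, often on ranks. The sketch below is a generic rank-IC helper under that common definition, not AlphaEval's code; the function names and sample numbers are hypothetical.

```python
import statistics


def average_ranks(values):
    """Ranks starting at 1, with ties sharing the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def rank_ic(factor_values, forward_returns):
    """Spearman-style IC: Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(factor_values), average_ranks(forward_returns))


# One hypothetical cross-section of five symbols.
factor = [0.8, -0.2, 0.5, 0.1, -0.6]
fwd_ret = [0.012, -0.004, 0.003, 0.006, -0.010]
print(round(rank_ic(factor, fwd_ret), 3))
```

Computing this per fold and looking at its dispersion across folds connects the Predictive Power and Stability rows directly to the existing walk-forward output.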

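3.4 Key Innovation: Backtest-free Evaluation

AlphaEval's core selling point is evaluation without running a backtest: surrogate metrics for the five dimensions are computed in parallel, with effectiveness reported as comparable to backtesting at a cost several orders of magnitude lower.

Implications for your work:

Your walk-forward backtest is the gold standard, but it is expensive (526 symbols × 12 folds × 4 offsets ≈ 25k experiment units)
AlphaEval surrogates can act as a coarse filter: expand the candidate pool from about 25k to 100k+, let the AlphaEval coarse filter keep 5% (about 5k), then run the walk-forward fine filter on the survivors
This matches the two-stage Exploration → Verification scheduling in ASI-ARCH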
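The arithmetic and the two-stage funnel above can be written down directly. The scoring and verification callables below are placeholders, not AlphaEval or ASI-ARCH APIs; in practice `surrogate_score` would be a cheap surrogate metric and `verify` the walk-forward backtest.

```python
N_SYMBOLS, N_FOLDS, N_OFFSETS = 526, 12, 4
walk_forward_units = N_SYMBOLS * N_FOLDS * N_OFFSETS  # 25,248 experiment units


def two_stage_filter(candidates, surrogate_score, keep_fraction, verify):
    """Rank by a cheap surrogate, keep the top fraction, then run the expensive verifier."""
    ranked = sorted(candidates, key=surrogate_score, reverse=True)
    n_keep = max(1, round(len(ranked) * keep_fraction))
    shortlist = ranked[:n_keep]
    return [c for c in shortlist if verify(c)]


# Toy run: 100,000 integer "candidates", keep the top 5%, verify with a dummy rule.
survivors = two_stage_filter(
    candidates=range(100_000),
    surrogate_score=lambda c: c % 1000,   # placeholder surrogate metric
    keep_fraction=0.05,
    verify=lambda c: c % 7 == 0,          # placeholder for walk-forward verification
)
print(walk_forward_units, len(survivors))
```

The expensive verifier is called only on the shortlist, so total cost scales with `keep_fraction` times the pool size rather than with the full pool.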

3.5 Mapping the Five Dimensions onto Your RQs

| AlphaEval dimension | Corresponding RQ |
| --- | --- |
| Predictive Power | RQ1 (an architectural prior directly improves predictive power) |
| Stability | RQ1 (time-slice stability is the core problem of RQ1) |
| Robustness | RQ2 (one form of open-world reliability) |
| Financial Logic | RQ3 (the Cognition Base is the institutional memory for financial logic) |
| Diversity | RQ3 (the Cognition Base helps propose new directions) |

This mapping lets you restate your RQs inside the AlphaEval framework. Citing AlphaEval (a 2025 paper) in the presentation grounds the work in recent literature.

3.6 30-Second Summary of This Cluster

"AlphaEval (arXiv 2508.13174) is a 2025 unified evaluation framework for alpha mining. Its five dimensions (predictive power / stability / robustness / financial logic / diversity) are computed in parallel without running a backtest. My M8.6 walk-forward is the gold-standard verifier, and AlphaEval can serve as a coarse-grained pre-filter; together they correspond to ASI-ARCH's two-stage exploration → verification. Each of my three RQs has a clear counterpart in AlphaEval's five-dimension framework."

Concrete Revisions to Integrate into the Presentation

Revision 1 · Act 1, Slide 12 (Negative Result)

Original:

"The bottleneck is time-slice stability, not model capacity."

Upgraded:

"The bottleneck is time-slice stability, not model capacity. Under the Bailey-López de Prado (2014) framework, this is a classic case of multiple-testing bias under non-Normality + selection. My current 12-fold walk-forward sample is insufficient for a strict Deflated Sharpe Ratio; scaling to 100+ folds + CSCV-based PBO (Bailey-Borwein-López de Prado-Zhu 2013) is the next step."

With this passage added, the whole negative result moves from an "empirical observation" to a "conclusion under a standard methodology".

Revision 2 · Act 2, Slide 18 (To Prof. Li)

The original cites only Encoding Recurrence. The upgraded version adds the contrast:

"Huang et al. (2023) showed architectural prior > model capacity in time series. The 2024-2025 foundation model wave (Chronos, TimesFM, Moirai) bets on the opposite hypothesis: that scale + generic data eliminate the need for a prior. The 2025 position paper (No Champions) suggests neither has won. My RQ1 is to test the architectural-prior thesis in crypto microstructure, where regime stability is short and microstructure-price recurrence is structurally non-trivial."

Revision 3 · Act 3, Slide 24 (RQ3 Cognition Base)

The original has no anomaly-replication angle. The upgraded version adds a passage:

"Cognition Base is not a mere collection of published anomalies. Hou-Xue-Zhang (2020) retest 452 published anomalies and find 82% fail strict multiple-testing standards. Cognition Base must be replication-aware structured knowledge, with each entry weighted by replication strength (verified by HXZ / JKP / independent retest / unverified). RQ3 is partly about how to build this weighting framework."

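Revision 4 · Q&A Preparation · New Questions

Q14: "Have you computed a Deflated Sharpe Ratio?"

A: Not a strict DSR, because my 12-fold sample is not enough to obtain stable estimates of the non-Normality parameters (skewness / kurtosis). What I use now is an empirical path-quality filter (TP fill rate 100%, 0 stop losses, timeouts ≤ 1). The roadmap: after extending to 100+ folds, run a full DSR + CSCV-based PBO and recalibrate the current empirical filter into DSR > critical value. This is exactly the direction I hope to pursue rigorously once I am in a research environment.

Q15: "How do foundation models such as Chronos / TimesFM perform on financial data?"

A: I have read zero-shot evaluation papers. The conclusion is that generic foundation models perform inconsistently on financial data: reasonable at some frequencies / horizons, but clearly behind task-specific models on high-frequency tick data. My interpretation is that tokenization is lossy at high frequency and that finance makes up a very small share of the pretraining data. My RQ1 working hypothesis is that the architectural-prior route suits crypto microstructure better, but this is a testable hypothesis, not a conviction.

Q16: "What does the 82% failure rate in Hou-Xue-Zhang (2020) mean in practice for your Cognition Base?"

A: Three things: (1) start from the JKP factor library rather than the raw set of published anomalies; (2) every Cognition Base entry must carry replication metadata; (3) when the Researcher Agent proposes hypotheses, priority weights should account for replication strength, so that a "learned hypothesis" is not built on a "likely false anomaly". This is a concrete way of bringing multiple-testing rigor into the AI4Science paradigm.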

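End-of-Read Summary

Core takeaways from the three deep-read clusters:

The López de Prado / Harvey-Liu line gives you the language for turning empirical observations into methodological conclusions: PBO, DSR, t > 3.0, Lucky Factors. With it, your negative-result discussion moves from "honest record" to "rigorous diagnosis".
The time series foundation model line gives you the contrast needed to position RQ1: Chronos / TimesFM represent the scaling route, Encoding Recurrence represents the prior route, and the 2025 position paper indicates there is no champion. Your RQ1 sits between the two but leans toward priors, and extends the notion of a prior to cross-scale recurrence.
AlphaEval gives you an up-to-date reference for a "unified evaluation framework": five dimensions, backtest-free, published in 2025. With it, your walk-forward can be positioned as the gold-standard verification stage alongside AlphaEval, and all three RQs can be aligned within the AlphaEval framework.

With the three clusters read, the revisions for Act 1 (negative result upgrade), Act 2 (Slide 18 upgrade), and Act 3 (Slide 24 upgrade) are listed above. They should be incorporated by Thursday.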

End of Three Deep Reads.