Alpha Auto Search · Expanded Deep Reads¶

Purpose: expand the three "minimum increment" deep-read clusters into a reusable research memo. Use this as the serious version behind alpha_search_deep_reads.md.

Updated: 2026-05-19

0. Executive Map¶

The current deep-read set has three jobs:

Make the negative result academically legible
The Bailey / López de Prado / Harvey-Liu / Hou-Xue-Zhang line gives the language for saying: "this is not just a bad ML experiment; this is a multiple-testing and overfitting pathology."
Position RQ1 correctly
The time-series foundation model wave gives the "scale + generic pretraining" side; PatchTST / TimesNet / TimeMixer / Encoding Recurrence give the "architectural prior" side. Your RQ1 is the finance-microstructure version of that debate.
Turn your verifier into a benchmark proposal
AlphaEval shows that alpha mining needs more than IC/backtest. Crypto-Alpha-Bench can use AlphaEval-like dimensions but add fixed data, costs, compute budgets, synthetic ground truth, DSR/PBO, and tradability.

Recommended speaking hierarchy for HKU:

Mainline: Crypto-Alpha-Bench
Method fallback: RQ1 microstructure recurrence
Statistical rigor hook: PBO / DSR
Differentiator: production-grade tradability verifier + optional human expert baseline

1. Cluster A · Backtest Overfitting & Multiple Testing¶

A1. Bailey, Borwein, López de Prado, Zhu · The Probability of Backtest Overfitting¶

Sources:

PDF: https://www.davidhbailey.com/dhbpapers/backtest-prob.pdf
Journal reference summary: https://colab.ws/articles/10.21314%2FJCF.2016.322

Core Problem¶

Financial research often evaluates many strategy variants and reports the best backtest. Standard train/test splits understate the false-discovery risk because:

returns are non-IID;
strategy candidates are correlated;
the researcher adaptively changes the search space;
the "best" in-sample rule is selected after seeing many alternatives.

The paper's central contribution is Probability of Backtest Overfitting (PBO): estimate how often the in-sample winner becomes a below-median out-of-sample performer.

Method Core: CSCV¶

Combinatorially Symmetric Cross-Validation (CSCV):

Split the full historical sample into S contiguous slices.
For every combination of S/2 slices as train and the complement as test (S must be even, so train and test are the same size):
compute performance for all candidate strategies on train;
select the in-sample winner;
rank that same strategy on test among all candidates.
Count how often the selected strategy has poor OOS rank.
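The CSCV steps above can be sketched in a few lines of NumPy. This is a minimal sketch, not a reference implementation: the function name `cscv_pbo` is hypothetical, the performance statistic is assumed to be the mean per-period return (net MTM or Sharpe can be swapped in), and the relative rank is defined so that a higher value means better OOS performance.

```python
from itertools import combinations

import numpy as np


def cscv_pbo(perf, n_slices=8):
    """Estimate PBO from a (T x N) matrix: rows = time observations,
    columns = candidate strategies. Returns (pbo, logits)."""
    perf = np.asarray(perf, dtype=float)
    n_obs, n_strats = perf.shape
    if n_slices % 2 != 0:
        raise ValueError("n_slices must be even")
    # Contiguous, equal-length slices; trailing remainder rows are dropped.
    usable = n_obs - n_obs % n_slices
    slices = np.split(perf[:usable], n_slices)
    logits = []
    for train_ids in combinations(range(n_slices), n_slices // 2):
        test_ids = [i for i in range(n_slices) if i not in train_ids]
        train = np.concatenate([slices[i] for i in train_ids])
        test = np.concatenate([slices[i] for i in test_ids])
        # Assumed statistic: mean return per period.
        is_score = train.mean(axis=0)
        oos_score = test.mean(axis=0)
        winner = int(np.argmax(is_score))
        # OOS rank of the in-sample winner: 1 = worst, N = best.
        rank = np.sum(oos_score <= oos_score[winner])
        omega = rank / (n_strats + 1)  # relative rank in (0, 1)
        logits.append(np.log(omega / (1.0 - omega)))
    logits = np.array(logits)
    # PBO = share of splits where the winner lands at or below the OOS median.
    return float(np.mean(logits <= 0.0)), logits
```

With `n_slices=8` the loop visits C(8, 4) = 70 train/test splits; with `n_slices=16` it visits C(16, 8) = 12,870, which is why the slice count drives both runtime and the resolution of the PBO estimate.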

Technical note:

The common summary is PBO = P(lambda <= 0), where lambda = ln(omega / (1 - omega)) is the logit of omega, the in-sample winner's relative OOS rank scaled to (0, 1), with larger omega meaning better OOS performance.
In informal speech, you can say: "PBO estimates the probability that the in-sample winner falls below the OOS median." Avoid a rank formula unless you define whether rank 1 means best or worst.

Why It Matters For Your Work¶

Your M8.6 results have exactly the PBO signature:

many candidates: symbols × offsets × thresholds × Optuna trials;
high validation MTM;
poor chronological test MTM;
strategy selection after inspecting historical performance.

The correct research move is not "try a bigger model." It is:

quantify how much of the discovered performance survives combinatorial OOS ranking.

Concrete Integration Plan¶

For your walk-forward setup:

Step	Implementation
Candidate matrix	rows = time slices, columns = candidate strategies, values = MTM / Sharpe / path-quality score
Slice count	start with `S=8` for smoke test; target `S=16` for serious PBO
Strategy candidates	symbol-offset-rule configs or model hyperparameter configs
Metric	use net MTM first; later Sharpe/PSR/DSR
Output	PBO, OOS rank histogram, degradation curve from IS rank to OOS rank

Important correction:

Existing 12-fold walk-forward is good production discipline, but not enough for strong statistical claims.
You need more folds/slices or a CSCV-compatible strategy matrix.

HKU Soundbite¶

"My current LightGBM/Optuna result is a qualitative PBO warning sign. The next research version should report CSCV-based PBO instead of only saying validation reversed on test."

A2. Bailey & López de Prado · The Deflated Sharpe Ratio¶

Sources:

SSRN/PDF: https://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID2460551_code87814.pdf?abstractid=2460551&mirid=1
David Bailey PDF mirror: https://www.davidhbailey.com/dhbpapers/deflated-sharpe.pdf

Core Problem¶

The observed Sharpe ratio is inflated by:

selection bias: you selected the best among many trials;
non-normality: skewness and fat tails distort standard Sharpe inference;
short samples: high Sharpe over short samples is much less convincing.

The DSR asks:

after accounting for non-normal returns and multiple trials, is this Sharpe still statistically meaningful?

Method Core¶

DSR is built on the Probabilistic Sharpe Ratio (PSR), then replaces the benchmark Sharpe with a selection-adjusted hurdle: the expected maximum Sharpe under N trials.

Implementation ingredients:

observed Sharpe;
sample size;
skewness;
kurtosis;
number of trials, and the variance of the Sharpe ratios across those trials;
average correlation among trials or an effective number of independent trials.
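The ingredients above combine roughly as follows. This is a hedged sketch using only the Python standard library: the function names are hypothetical, Sharpe ratios are assumed to be per-period (not annualized), `kurt` is raw kurtosis (3 for a normal), and the expected-maximum term uses the usual extreme-value approximation with the Euler-Mascheroni constant. Check the formulas against the paper before reporting numbers.

```python
import math
from statistics import NormalDist

_NORMAL = NormalDist()
EULER_GAMMA = 0.5772156649015329


def psr(sr, sr_benchmark, n_obs, skew=0.0, kurt=3.0):
    """Probabilistic Sharpe Ratio: P(true SR > sr_benchmark)."""
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    z = (sr - sr_benchmark) * math.sqrt(n_obs - 1) / denom
    return _NORMAL.cdf(z)


def expected_max_sharpe(n_trials, sr_std):
    """Approximate expected maximum SR over n_trials under a zero-mean null.
    sr_std is the standard deviation of SR estimates across trials."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    a = _NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = _NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return sr_std * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)


def dsr(sr, n_obs, n_trials, sr_std, skew=0.0, kurt=3.0):
    """Deflated Sharpe Ratio: PSR against the selection-adjusted hurdle."""
    hurdle = expected_max_sharpe(n_trials, sr_std)
    return psr(sr, hurdle, n_obs, skew, kurt)
```

`n_trials` may be a non-integer effective trial count, which is how the correlation adjustment enters.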

Why It Matters For Your Work¶

Your current whitelist / tradability gate uses path-quality heuristics:

TP fill rate;
stop-loss count;
timeout count;
clean reads;
adaptive state promotion.

These are production-useful, but for a paper they need a statistical sibling:

each selected symbol/offset should be tested against a selection-adjusted Sharpe hurdle.

Concrete Integration Plan¶

Use DSR as a secondary filter, not a replacement for path-quality:

Keep production path-quality filters for safety.
Compute net returns for each candidate strategy.
Estimate N_eff_trials, not raw N, because many offsets/symbols are correlated.
Report PSR and DSR alongside MTM.
In the benchmark protocol, require DSR after any candidate search process.
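One simple way to get N_eff_trials from correlated candidates is to interpolate between 1 and the raw trial count using the average pairwise correlation. This is an assumed heuristic for illustration (the function name is hypothetical); clustering the candidates and counting clusters is an alternative when the correlation structure is blocky.

```python
import numpy as np


def effective_trials(returns):
    """Heuristic effective trial count from a (T x N) return matrix.

    N_eff = rho + (1 - rho) * N, where rho is the mean off-diagonal
    correlation, clipped to [0, 1]. rho = 1 gives one effective trial;
    rho = 0 gives N.
    """
    returns = np.asarray(returns, dtype=float)
    n = returns.shape[1]
    if n < 2:
        return float(n)
    corr = np.corrcoef(returns, rowvar=False)
    off_diag = corr[~np.eye(n, dtype=bool)]
    rho = float(np.clip(off_diag.mean(), 0.0, 1.0))
    return rho + (1.0 - rho) * n
```

Because neighbouring offsets on the same symbol tend to be highly correlated, this usually lands well below the raw candidate count, which lowers the DSR hurdle relative to using raw N.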

HKU Soundbite¶

"My current filters are operationally conservative. The research version should add DSR because the whitelist is selected from many symbol/offset candidates."

A3. Harvey, Liu, Zhu · "... and the Cross-Section of Expected Returns"¶

Sources:

RFS page: https://academic.oup.com/rfs/article/29/1/5/1843824
NBER version: https://www.nber.org/papers/w20592

Core Problem¶

The factor zoo creates a multiple-testing crisis. If hundreds of papers test hundreds of factors, the traditional t > 2 threshold becomes too permissive.

The famous practical takeaway:

a newly discovered factor needs a higher hurdle, often summarized as t > 3.0.

Method Core¶

The paper estimates how the appropriate t-stat cutoff should rise over time as the cumulative number of tested factors grows.

The important conceptual move:

significance standards should depend on the research environment's prior data-mining intensity.

Why It Matters For Your Work¶

Crypto-Alpha-Bench should not accept:

"my factor has IC > 0";
"my Sharpe beats random";
"my backtest is positive."

It should require multiple-testing-aware standards.

This paper gives the academic-finance language for that position.

Concrete Integration Plan¶

In benchmark spec:

every submitted alpha reports t-stat and multiple-testing adjusted threshold;
benchmark leaderboard distinguishes "raw winner" from "statistically credible winner";
new factors must clear a higher hurdle than a single isolated hypothesis test.
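To make "multiple-testing adjusted threshold" concrete, the sketch below computes a Bonferroni |t| cutoff and a Benjamini-Hochberg-Yekutieli (BHY) discovery set. It assumes two-sided tests with a normal approximation to the t-statistic, and the function names are hypothetical. It illustrates the style of adjustment, not the paper's exact calibration.

```python
from statistics import NormalDist

_NORMAL = NormalDist()


def bonferroni_t_cutoff(n_tests, alpha=0.05):
    """Two-sided |t| cutoff that controls family-wise error over n_tests."""
    return _NORMAL.inv_cdf(1.0 - alpha / (2.0 * n_tests))


def two_sided_p(t_stat):
    """Two-sided p-value under a standard normal approximation."""
    return 2.0 * (1.0 - _NORMAL.cdf(abs(t_stat)))


def bhy_discoveries(t_stats, alpha=0.05):
    """Flags for tests that survive BHY false-discovery-rate control."""
    m = len(t_stats)
    c_m = sum(1.0 / j for j in range(1, m + 1))  # dependence correction
    p_values = [two_sided_p(t) for t in t_stats]
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for k, i in enumerate(order, start=1):
        if p_values[i] <= k * alpha / (m * c_m):
            k_max = k  # largest k whose ordered p-value clears its bound
    keep = [False] * m
    for i in order[:k_max]:
        keep[i] = True
    return keep
```

A leaderboard could report both: the Bonferroni cutoff as the strict hurdle, and the BHY flag as the less conservative "statistically credible" label.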

HKU Soundbite¶

"Harvey-Liu-Zhu is the reason I do not want Crypto-Alpha-Bench to be a simple leaderboard. It must encode the fact that finance has already been heavily mined."

A4. Harvey & Liu · Lucky Factors¶

Sources:

JFE page: https://www.sciencedirect.com/science/article/abs/pii/S0304405X21001410
Public PDF: https://jacobslevycenter.wharton.upenn.edu/wp-content/uploads/2015/05/Lucky-Factors.pdf

Core Problem¶

Some factors look significant because they are lucky draws from a huge search space. Closed-form corrections are useful, but simulation can reveal how often "discoveries" arise by chance under realistic dependence.

Method Core¶

Lucky Factors uses bootstrap / resampling logic to simulate the distribution of factor t-stats under a null, then asks whether observed factors remain unusual after accounting for the search process.

Compared with DSR:

| Method | Style | Strength | Weakness |
| --- | --- | --- | --- |
| DSR | analytical / semi-closed-form | fast, easy to report | depends on estimated trial count and moments |
| Lucky Factors | simulation / bootstrap | more flexible | more expensive and design-sensitive |

Why It Matters For Your Work¶

For Optuna / LLM-generated / GP-generated strategies, a simulation null is natural:

randomize labels;
randomize entry timestamps;
preserve return autocorrelation via block bootstrap;
rerun the candidate-search protocol;
compare observed best to null best.

This is stronger than just comparing to a single random strategy.

Concrete Integration Plan¶

For benchmark v0:

Add a "null search" baseline:
same compute budget;
shuffled/block-bootstrapped labels;
same alpha search algorithm.
Report whether discovered alphas exceed the 95th percentile of the null search distribution.
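A minimal sketch of the null-search baseline, under simplifying assumptions: the "search" is reduced to picking the best of a fixed set of candidate position series, the null world is built by circular block bootstrap of the return series (which keeps short-range autocorrelation but breaks the alignment with the signals), and all function names are hypothetical. A real version would rerun the full Optuna / LLM / GP search inside the loop.

```python
import numpy as np


def sharpe(x):
    """Per-period Sharpe ratio; zero if the series has no variance."""
    sd = x.std(ddof=1)
    return 0.0 if sd == 0 else float(x.mean() / sd)


def best_sharpe(positions, returns):
    """Best Sharpe across candidate columns of a (T x N) position matrix."""
    pnl = positions * returns[:, None]
    return max(sharpe(pnl[:, j]) for j in range(pnl.shape[1]))


def block_bootstrap_index(n_obs, block_len, rng):
    """Circular block bootstrap: concatenate random contiguous blocks."""
    n_blocks = -(-n_obs // block_len)  # ceiling division
    starts = rng.integers(0, n_obs, size=n_blocks)
    idx = (starts[:, None] + np.arange(block_len)[None, :]) % n_obs
    return idx.ravel()[:n_obs]


def null_search(positions, returns, n_null=200, block_len=20, seed=0):
    """Return (observed best, 95th percentile of null best, p-value)."""
    rng = np.random.default_rng(seed)
    positions = np.asarray(positions, dtype=float)
    returns = np.asarray(returns, dtype=float)
    observed = best_sharpe(positions, returns)
    null_best = np.array([
        best_sharpe(
            positions,
            returns[block_bootstrap_index(len(returns), block_len, rng)],
        )
        for _ in range(n_null)
    ])
    threshold = float(np.quantile(null_best, 0.95))
    p_value = float((1 + np.sum(null_best >= observed)) / (n_null + 1))
    return observed, threshold, p_value
```

The comparison is best-versus-best: the observed winner is judged against the distribution of winners that the same candidate set produces when the signal-return link is broken.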

HKU Soundbite¶

"The right negative control is not one random factor. It is the best factor found by the same search algorithm under a null world."

A5. Hou, Xue, Zhang · Replicating Anomalies¶

Sources:

NBER: https://www.nber.org/papers/w23394
The published RFS version is linked from the NBER page.

Core Problem¶

The anomaly literature contains many published effects that fail under a common replication protocol.

Important source-verified numbers from the NBER abstract:

447 anomaly variables compiled.
286 anomalies, about 64%, are insignificant at the conventional 5% level under their replication setup.
With a t-value cutoff of 3, 380 anomalies, about 85%, are insignificant.
Liquidity variables are especially fragile: 95 out of 102, about 93%, are insignificant at the 5% level in that abstract's setup.

Note on your existing notes:

The current file uses "65% / 82%" under a multiple-testing-aware cutoff. That is directionally fine, but for a professor-facing exact quote, prefer the source-verified phrasing: roughly two-thirds fail conventional replication; around 85% fail a t=3 hurdle. If using 82%, specify the exact cutoff/protocol.

Why It Matters For Your Work¶

This is the most important paper for RQ3:

Cognition Base cannot be a pile of published anomalies.

It must be replication-aware.

Each knowledge-base entry needs fields like:

mechanism:
original paper:
asset class:
frequency:
reported t-stat:
replicated?:
replication source:
multiple-testing status:
decay evidence:
capacity/friction notes:
last verified timestamp:
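The field list above maps naturally onto a typed record. The sketch below is one possible schema, not a fixed spec: the class name, field types, and defaults are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class CognitionEntry:
    """One knowledge-base entry (hypothetical schema)."""

    mechanism: str
    original_paper: str
    asset_class: str
    frequency: str
    reported_t_stat: Optional[float] = None
    replicated: Optional[bool] = None  # None = not yet checked
    replication_source: Optional[str] = None
    multiple_testing_status: str = "unadjusted"
    decay_evidence: str = ""
    capacity_friction_notes: str = ""
    last_verified: Optional[date] = None

    def is_replicated_mechanism(self) -> bool:
        """True only when a replication has been recorded with its source."""
        return self.replicated is True and self.replication_source is not None
```

Keeping `replicated` as a three-state value (True, False, None) separates "failed replication" from "never checked", which matters when entries are weighted by replication status.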

Concrete Integration Plan¶

Crypto-Alpha-Bench should include:

replication metadata for any "known anomaly" baseline;
a rule that unreplicated factors are hypotheses, not ground truth;
a Cognition Base ablation:
raw published anomalies;
replication-weighted anomalies;
crypto-native microstructure mechanisms;
no knowledge base.

HKU Soundbite¶

"The first step in a financial Cognition Base is not collecting papers. It is distinguishing published claims from replicated mechanisms."

2. Cluster B · Time-Series Foundation Models vs Architectural Priors¶

B1. Chronos · Learning the Language of Time Series¶

Sources:

Amazon Science: https://www.amazon.science/publications/chronos-learning-the-language-of-time-series
Published in TMLR 2024 according to Amazon Science.

Core Idea¶

Chronos treats time series as a language-modeling problem:

scale continuous values;
quantize them into a fixed vocabulary;
train T5-style transformer architectures with cross-entropy;
forecast by sampling future tokens and dequantizing.
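The scale-then-quantize step can be illustrated with a small NumPy sketch. The bin count, bin range, and function names here are assumptions for illustration, not the released model's configuration; the point is to show where resolution is lost.

```python
import numpy as np


def tokenize(context, n_bins=1024, low=-15.0, high=15.0):
    """Mean-scale a series and map it to integer bin ids (uniform bins)."""
    context = np.asarray(context, dtype=float)
    scale = float(np.mean(np.abs(context)))
    if scale == 0.0:
        scale = 1.0  # avoid division by zero on an all-zero context
    scaled = context / scale
    edges = np.linspace(low, high, n_bins + 1)
    tokens = np.clip(np.digitize(scaled, edges) - 1, 0, n_bins - 1)
    return tokens, scale, edges


def detokenize(tokens, scale, edges):
    """Map bin ids back to values using bin centers, then undo the scaling."""
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers[tokens] * scale
```

The round-trip error is bounded by half a bin width times the scale. For 15s returns, that bound can be compared directly against the size of the moves the strategy is trying to trade, which makes the quantization concern testable rather than rhetorical.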

Why It Matters¶

Chronos is the cleanest representative of:

"time series can be handled by generic sequence modeling plus scale."

This is the strongest counterpoint to your RQ1. If Chronos works on crypto microstructure, then custom recurrence priors may be less necessary.

Crypto Microstructure Concern¶

For 15s crypto:

quantization may destroy small but tradable microstructure differences;
univariate tokenization weakens cross-channel structure;
pretraining data may underrepresent adversarial high-frequency financial series;
zero-shot forecasting may optimize point/quantile accuracy but not executable alpha after costs.

Benchmark Experiment¶

Use Chronos as a baseline:

input: close/return series first;
then engineered microstructure aggregates;
horizon: 5s / 15s / 1m / 5m / 30m;
metrics: predictive loss, directional accuracy, IC, cost-adjusted PnL, DSR/PBO.

HKU Soundbite¶

"Chronos is the scale-first baseline. If my recurrence-prior idea cannot beat or complement Chronos under strict walk-forward evaluation, the RQ1 thesis weakens."

B2. TimesFM · Decoder-Only Foundation Model for Time-Series Forecasting¶

Sources:

Google Research blog: https://research.google/blog/a-decoder-only-foundation-model-for-time-series-forecasting/
ICML 2024 per Google blog.

Core Idea¶

TimesFM is a decoder-only Transformer foundation model for forecasting. Google reports pretraining on a large corpus of about 100B real-world time points, with strong zero-shot performance across public benchmarks.

Why It Matters¶

TimesFM is the "GPT-like" side of the time-series foundation model wave:

decoder-only;
large pretraining corpus;
zero-shot evaluation;
open model artifacts according to Google blog links.

Crypto Microstructure Concern¶

TimesFM's core bet is generic forecasting transfer. Your setting stresses it because:

crypto regimes shift quickly;
execution costs matter more than forecast RMSE;
microstructure features are multivariate and reactive;
a model trained mostly on broad time-series corpora may not encode financial adversarial structure.

Benchmark Experiment¶

Use TimesFM as:

zero-shot baseline;
optionally fine-tuned baseline if supported in the used release;
compare against LightGBM, PatchTST, and microstructure-prior model.

Do not claim "TimesFM fails in finance" without testing. Say:

"TimesFM is the fair foundation-model baseline I need to beat."

B3. Moirai · Universal Time Series Forecasting Transformer¶

Sources:

arXiv: https://arxiv.org/abs/2402.02592
ICML/PMLR: https://proceedings.mlr.press/v235/woo24a.html
Salesforce blog: https://www.salesforce.com/blog/moirai/

Core Idea¶

Moirai is a universal forecasting transformer trained on LOTSA, a large-scale open time-series archive. The key engineering ideas include:

handling arbitrary numbers of variates;
multiple patch-size projection layers;
any-variate attention;
flexible predictive distributions.

Why It Matters¶

Moirai is more relevant than univariate-only models because your market state is multivariate:

OHLCV;
spread;
depth;
funding;
OI;
cross-symbol signals.

Crypto Microstructure Concern¶

Any-variate attention handles variable count, but not necessarily:

order-book causality;
microstructure-price feedback;
adverse selection;
cost-sensitive actionability.

Benchmark Experiment¶

Moirai should be your multivariate foundation baseline:

compare univariate close-only vs full multivariate state;
compare model accuracy vs executable PnL;
test whether any-variate attention alone captures microstructure recurrence.

HKU Soundbite¶

"Moirai is the strongest generic multivariate baseline. If microstructure recurrence matters, it should show up as an improvement beyond any-variate attention."

B4. Lag-Llama · Probabilistic Time-Series Foundation Model¶

Sources:

arXiv: https://arxiv.org/abs/2310.08278
Hugging Face model page: https://huggingface.co/time-series-foundation-models/Lag-Llama

Core Idea¶

Lag-Llama is a decoder-only foundation model for probabilistic univariate time-series forecasting. It uses lagged values as covariates and outputs a distribution rather than just a point forecast.

Why It Matters¶

Financial decisions need uncertainty:

position sizing;
kill switch thresholds;
conformal prediction;
drawdown control;
tail-risk filters.

Lag-Llama is useful less as "the winner model" and more as:

a distributional baseline for RQ2 and risk-aware forecasting.

Crypto Microstructure Concern¶

Univariate probabilistic forecasts are not enough for:

cross-asset lead-lag;
order-book state;
strategy-specific labels;
action-dependent fill outcomes.

Benchmark Experiment¶

Use Lag-Llama to test:

forecast interval calibration;
whether probabilistic outputs improve risk gates;
conformal calibration on top of model quantiles.
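Conformal calibration on top of model quantiles can be sketched in the split-conformal style: measure how far held-out outcomes fall outside the model's interval, then widen future intervals by a finite-sample quantile of those misses. The function names are hypothetical, and the coverage guarantee assumes exchangeability, which non-stationary crypto data will violate, so treat the result as a diagnostic.

```python
import numpy as np


def conformal_margin(q_lo, q_hi, y, alpha=0.1):
    """Margin to add to both interval ends, from a calibration set."""
    q_lo, q_hi, y = (np.asarray(a, dtype=float) for a in (q_lo, q_hi, y))
    # Positive score = outcome fell outside the model interval by that much.
    scores = np.sort(np.maximum(q_lo - y, y - q_hi))
    n = len(scores)
    k = int(np.ceil((n + 1) * (1.0 - alpha)))
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return float(scores[k - 1])


def calibrated_interval(q_lo, q_hi, margin):
    """Widen (or shrink, if margin < 0) the model interval by the margin."""
    return np.asarray(q_lo) - margin, np.asarray(q_hi) + margin
```

A negative margin means the model's raw interval was already too wide on the calibration set, which is itself useful information for a risk gate.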

B5. PatchTST · A Time Series Is Worth 64 Words¶

Sources:

OpenReview: https://openreview.net/forum?id=Jbdc0vTOcol
arXiv: https://arxiv.org/abs/2211.14730

Core Idea¶

PatchTST uses:

patching: segment the time series into subseries tokens;
channel independence: each channel is modeled as a univariate sequence sharing weights.

Despite its simplicity, it outperformed many earlier Transformer-based forecasting models.
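Patching with channel independence amounts to slicing each channel into overlapping windows that then act as tokens. The sketch below shows only that slicing step; the patch length, stride, and function name are illustrative assumptions, and the end-padding used in some implementations is omitted.

```python
import numpy as np


def make_patches(series, patch_len=16, stride=8):
    """(n_channels, seq_len) -> (n_channels, n_patches, patch_len).

    Each channel is patched separately, so a downstream model with shared
    weights sees n_channels independent univariate token sequences.
    """
    series = np.asarray(series, dtype=float)
    n_channels, seq_len = series.shape
    n_patches = (seq_len - patch_len) // stride + 1
    starts = np.arange(n_patches) * stride
    idx = starts[:, None] + np.arange(patch_len)[None, :]
    return series[:, idx]
```

For a 64-step context with `patch_len=16` and `stride=8`, each channel becomes 7 tokens instead of 64, which is where the sample-efficiency and attention-cost benefits come from.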

Why It Matters¶

PatchTST is a methodological warning:

a small architectural prior can beat a larger generic architecture.

This supports the intellectual legitimacy of RQ1.

Crypto Microstructure Concern¶

Channel independence is both strength and weakness:

strength: sample efficiency, reduced overfitting;
weakness: microstructure features are coupled, and cross-channel interaction may be the signal.

Benchmark Experiment¶

Use PatchTST as:

supervised architecture-prior baseline;
compare channel-independent vs cross-channel variants;
ablate whether book/price recurrence requires channel interaction.

B6. TimesNet · Temporal 2D-Variation Modeling¶

Sources:

arXiv: https://arxiv.org/abs/2210.02186
OpenReview/PDF: listed under the ICLR 2023 materials.

Core Idea¶

TimesNet models temporal variation through multi-periodicity by transforming 1D time series into 2D representations organized around dominant periods.
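The 1D-to-2D step can be sketched as: find dominant periods from the FFT amplitude spectrum, then fold the series so each row is one period. This is an illustrative sketch with hypothetical function names and zero padding; it leaves out the 2D convolution blocks that operate on the folded tensor.

```python
import numpy as np


def dominant_periods(x, top_k=2):
    """Periods (in samples) of the top_k strongest non-zero frequencies."""
    x = np.asarray(x, dtype=float)
    amp = np.abs(np.fft.rfft(x - x.mean()))
    amp[0] = 0.0  # ignore the DC component
    freqs = np.argsort(amp)[::-1][:top_k]
    freqs = freqs[freqs > 0]
    return [len(x) // int(f) for f in freqs]


def fold_2d(x, period):
    """Reshape a 1D series into (n_rows, period), zero-padding the tail.

    Columns then align the same phase across cycles (inter-period variation);
    rows hold consecutive samples within a cycle (intra-period variation).
    """
    x = np.asarray(x, dtype=float)
    n_rows = -(-len(x) // period)  # ceiling division
    padded = np.zeros(n_rows * period)
    padded[: len(x)] = x
    return padded.reshape(n_rows, period)
```

Running `dominant_periods` on funding-rate or depth series is a cheap first check of whether a periodic prior has anything to latch onto before training a TimesNet-style model.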

Why It Matters¶

It is a frequency/periodicity prior. For crypto:

funding intervals;
session effects;
liquidity cycles;
weekday/weekend structure;
event-driven bursts.

Crypto Microstructure Concern¶

High-frequency tape reading may not be periodic. Some signals are reactive and event-driven, not seasonal.

Benchmark Experiment¶

Use TimesNet-like periodic priors for:

funding-window effects;
intraday liquidity cycles;
compare against recurrence-prior model for reactive signals.

B7. TimeMixer · Decomposable Multiscale Mixing¶

Sources:

ICLR proceedings: https://proceedings.iclr.cc/paper_files/paper/2024/hash/a7ac8a21e5a27e7ab31a5f42a0117bdb-Abstract-Conference.html
arXiv: https://arxiv.org/abs/2405.14616

Core Idea¶

TimeMixer is a fully MLP-based architecture built around multiscale decomposition and mixing:

Past-Decomposable-Mixing;
Future-Multipredictor-Mixing;
fine and coarse temporal scales disentangled.

Why It Matters¶

Your RQ1 is explicitly cross-scale:

100ms/tick order-book dynamics;
5s/15s bars;
1m/5m trend;
30m execution horizon.

TimeMixer is a strong baseline for "multiscale prior without attention."

Crypto Microstructure Concern¶

TimeMixer decomposes scales, but may not encode the feedback loop:

order flow changes price; price changes future order placement.

That feedback is closer to recurrence / state-space structure than generic multiscale mixing.

Benchmark Experiment¶

Compare:

TimeMixer multiscale decomposition;
REM/RSA-inspired recurrence model;
hybrid: multiscale decomposition + recurrence bias.

B8. Brigato et al. · No Champions in Long-Term Time Series Forecasting¶

Sources:

arXiv: https://arxiv.org/abs/2502.14045

Core Idea¶

The paper argues that long-term time-series forecasting lacks stable champions. Small changes in benchmark setup, hyperparameters, or metrics can reverse model rankings. It reports a broad reproducible evaluation over thousands of trained networks and many datasets.

Why It Matters¶

This paper supports both:

benchmark-first thinking;
skepticism toward "we beat SOTA" claims.

It also reinforces Crypto-Alpha-Bench:

if even standard LTSF lacks stable champions, alpha search absolutely needs fixed protocol and statistical reporting.

HKU Soundbite¶

"No Champions is the time-series version of my benchmark argument: without standardized evaluation, model rankings are fragile."

B9. Huang et al. · Encoding Recurrence into Transformers¶

Sources:

OpenReview: https://openreview.net/forum?id=7YfHla7IxBJ
GitHub: https://github.com/neithen-Lu/encoding_recurrence_into_transformers

Core Idea¶

The paper decomposes recurrent dynamics into lightweight positional-encoding-like matrices, called Recurrence Encoding Matrices (REMs), and injects them into self-attention through a module called Self-Attention with Recurrence (RSA).

The key move:

recurrence becomes a structural property inside attention, rather than an external RNN bolted on top.
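A loose single-head sketch of the idea, for intuition only: a fixed decay matrix stands in for the recurrent dynamics and is blended with ordinary softmax attention through a sigmoid gate. The parameterization here (one decay rate `lam`, one scalar gate, zero diagonal, no causal mask on the attention term) is an assumption made for illustration and should be checked against the paper and its repository before any comparison is claimed.

```python
import numpy as np


def regular_rem(seq_len, lam):
    """Lower-triangular decay matrix: lam**(i - j) for i > j, else 0."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    lag = i - j
    return np.where(lag > 0, lam ** np.maximum(lag, 0), 0.0)


def softmax(z):
    """Row-wise softmax with max subtraction for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def rsa_head(q, k, v, lam=0.9, gate_logit=0.0):
    """Gated mix of softmax attention and a recurrence-style decay matrix."""
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))
    gate = 1.0 / (1.0 + np.exp(-gate_logit))  # share given to recurrence
    mix = (1.0 - gate) * attn + gate * regular_rem(len(q), lam)
    return mix @ v
```

In a trained model `lam` and `gate_logit` would be learned; reading the learned gate per head is one way to ask how much of the signal the model attributes to recurrence.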

Why It Matters For Prof. Li¶

This is the cleanest direct anchor to Prof. Guodong Li.

Your respectful framing should be:

"I am not claiming my setting is the same as the original paper. I want to test whether the same principle, explicit recurrence priors for sample efficiency, applies to crypto microstructure."

RQ1 Formalization¶

Your current rough phrase "microstructure-price recurrence" should become more formal:

Let market state be:

x_t = [price_bar_t, order_book_t, trade_flow_t, funding/OI_t, cross_asset_t]

Hypothesized feedback:

order_flow_t -> price_move_{t+1}
price_move_t -> order_book_response_{t+1}
book_thickness_t -> fill_quality_{t+1}
fill_quality_t -> realized strategy return_{t+1}

The recurrence prior should encode:

cross-scale memory;
price-book feedback;
fast/slow state separation;
gating between recurrent and non-recurrent signals.

Minimal Model Idea¶

Start with a conservative extension:

Use PatchTST / TimeMixer preprocessing for multiscale features.
Add an RSA-inspired recurrence matrix over:
time axis;
feature-group axis: price / book / flow / funding / cross-asset.
Gate recurrence strength by regime variables:
volatility;
spread;
depth;
funding window.

Benchmark Experiment¶

RQ1 experiment table:

| Model | Role |
| --- | --- |
| LightGBM | current tabular baseline |
| Optuna rules | interpretable search baseline |
| PatchTST | supervised architectural prior baseline |
| TimesFM / Chronos | foundation model baseline |
| Moirai | multivariate foundation baseline |
| TimeMixer | multiscale baseline |
| REM/RSA-inspired model | proposed method |

Metrics:

predictive loss;
rank IC;
directional accuracy;
cost-adjusted MTM;
DSR;
PBO;
stability across time slices.
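Three of these metrics are simple enough to pin down in code so every model is scored identically. This is a sketch with hypothetical function names and two stated assumptions: ranks are computed without tie averaging, and `position[t]` is the position held over the period that earns `returns[t]`, with cost charged per unit of position change.

```python
import numpy as np


def _ranks(x):
    """Ordinal ranks 0..n-1 (ties are broken by order, not averaged)."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x)
    ranks = np.empty(len(x))
    ranks[order] = np.arange(len(x))
    return ranks


def rank_ic(pred, realized):
    """Spearman-style rank correlation between predictions and outcomes."""
    return float(np.corrcoef(_ranks(pred), _ranks(realized))[0, 1])


def directional_accuracy(pred, realized):
    """Share of periods where the predicted and realized signs agree."""
    return float(np.mean(np.sign(pred) == np.sign(realized)))


def cost_adjusted_pnl(position, returns, cost_per_turnover):
    """Sum of position * return minus cost on every change in position."""
    position = np.asarray(position, dtype=float)
    returns = np.asarray(returns, dtype=float)
    turnover = np.abs(np.diff(position, prepend=0.0))
    return float(np.sum(position * returns - cost_per_turnover * turnover))
```

For example, holding positions `[1, 1, -1]` over returns `[0.01, 0.02, -0.01]` at a cost of 0.001 per unit turnover earns 0.04 gross and pays 0.003 in costs (entry, then a two-unit flip), for 0.037 net.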

HKU Soundbite¶

"The RQ is not 'can a neural net predict crypto.' The RQ is whether explicit recurrence priors improve time-slice stability under strict walk-forward evaluation."

3. Cluster C · AlphaEval & Benchmark Design¶

C1. Ding et al. · AlphaEval¶

Sources:

arXiv: https://arxiv.org/abs/2508.13174
Search results also show implementation references, but verify the exact GitHub repo before citing in slides.

Core Problem¶

Formula alpha mining lacks a common evaluation framework. Existing approaches rely heavily on:

backtesting: expensive, sequential, sensitive to strategy assumptions;
correlation metrics: efficient but too narrow.

AlphaEval proposes a backtest-free, parallelizable evaluation framework.

Five Dimensions¶

Source-verified abstract dimensions:

predictive power;
stability;
robustness to market perturbations;
financial logic;
diversity.

Why It Matters¶

AlphaEval is not the same as Crypto-Alpha-Bench.

| Aspect | AlphaEval | Crypto-Alpha-Bench |
| --- | --- | --- |
| Main object | evaluation framework for generated formula alphas | fixed benchmark infrastructure |
| Dataset | not the main contribution | fixed public dataset is central |
| Cost model | not central in abstract | three cost tiers required |
| Compute control | not central in abstract | required |
| Multiple testing | not central in abstract | DSR/PBO required |
| Tradability | partial / indirect | explicit microstructure tradability gate |
| Synthetic ground truth | not central in abstract | required |

The relationship should be:

Crypto-Alpha-Bench can adopt AlphaEval-style dimensions as its evaluation backbone, then add benchmark infrastructure and executable-alpha constraints.

How To Integrate With Your System¶

Two-stage evaluator:

Exploration evaluator:
AlphaEval-like cheap dimensions;
quickly screen thousands of candidate alphas.
Verification evaluator:
walk-forward backtest;
cost tiers;
microstructure gate;
DSR/PBO;
paper-trading / live-fill calibration if available.

This maps cleanly to AI-for-science systems:

Exploration: cheap parallel evaluation.
Verification: expensive hard verifier.

HKU Soundbite¶

"AlphaEval gives the evaluation dimensions. Crypto-Alpha-Bench adds the missing benchmark substrate: fixed data, costs, compute control, synthetic ground truth, and executable-alpha verification."

4. Cross-Cluster Synthesis¶

4.1 The Three Clusters Form One Argument¶

| Cluster | What It Gives You | What It Changes In The Talk |
| --- | --- | --- |
| Backtest rigor | Language for false discovery | Negative result becomes methodology, not anecdote |
| TS foundation / priors | Positioning for RQ1 | You are not ignoring foundation models; you are testing prior vs scale |
| AlphaEval | Evaluation framework anchor | Crypto-Alpha-Bench becomes a natural extension, not a vague idea |

4.2 Revised Research Claim¶

Old claim:

"I built a trading agent and want it to self-evolve."

Better claim:

"I built a production verifier for executable alpha. The field lacks the benchmark substrate needed to test AI-driven alpha discovery. Crypto-Alpha-Bench is the first contribution; microstructure recurrence is the first method use case."

4.3 What To Avoid Saying¶

Avoid:

"12 folds proves significance."
"LLM can discover profitable alpha if scaled."
"Crypto benchmark will generalize to all finance."
"Chronos / TimesFM are bad for finance" before testing.
"Published anomalies are knowledge" without replication metadata.
"AlphaEval already solves evaluation" or "AlphaEval is insufficient" without nuance.

Say instead:

"12 folds is a production screen; the research version needs PBO/DSR."
"Compute-scaled discovery is a hypothesis requiring a fixed benchmark."
"Crypto is the v0 open-world testbed."
"Foundation models are mandatory baselines."
"Cognition Base must be replication-aware."
"AlphaEval is an evaluation backbone; Crypto-Alpha-Bench is benchmark infrastructure."

5. Implementation Checklist¶

5.1 Statistical Rigor Workstream¶

Build candidate performance matrix.
Implement CSCV/PBO.
Implement PSR/DSR.
Implement null-search bootstrap.
Expand folds/slices beyond current production 12-fold view.
Produce one figure: IS rank vs OOS rank degradation.

5.2 RQ1 Model Workstream¶

Benchmark LightGBM and Optuna rules on ETH as validation.
Add PatchTST baseline.
Add Chronos / TimesFM zero-shot baseline.
Add Moirai multivariate baseline.
Add TimeMixer multiscale baseline.
Specify REM/RSA-inspired microstructure recurrence model.
Report stability, not just accuracy.

5.3 Benchmark Workstream¶

Define v0 dataset manifest.
Define cost tiers.
Define compute budget tiers.
Define required metrics.
Define negative controls.
Add synthetic alpha generator.
Add tradability gate baseline.
Optional: human expert discretionary baseline protocol.

6. One-Minute Deep-Read Summary¶

"The backtest-overfitting literature gives me the statistical discipline: PBO, DSR, and replication-aware standards. The time-series literature tells me RQ1 should be framed as a test between generic foundation models and explicit architectural priors. AlphaEval gives a modern alpha-evaluation backbone, but it is not a full benchmark. My proposal is to combine these: Crypto-Alpha-Bench as fixed evaluation infrastructure, with microstructure recurrence as the first method use case."