Alpha Auto Search · Expanded Deep Reads¶
Purpose: expand the three "minimum increment" deep-read clusters into a reusable research memo. Use this as the serious version behind alpha_search_deep_reads.md.
Updated: 2026-05-19
0. Executive Map¶
The current deep-read set has three jobs:
- Make the negative result academically legible. The Bailey / López de Prado / Harvey-Liu / Hou-Xue-Zhang line gives the language for saying: "this is not just a bad ML experiment; this is a multiple-testing and overfitting pathology."
- Position RQ1 correctly. The time-series foundation model wave gives the "scale + generic pretraining" side; PatchTST / TimesNet / TimeMixer / Encoding Recurrence give the "architectural prior" side. Your RQ1 is the finance-microstructure version of that debate.
- Turn your verifier into a benchmark proposal. AlphaEval shows that alpha mining needs more than IC/backtest. Crypto-Alpha-Bench can use AlphaEval-like dimensions but add fixed data, costs, compute budgets, synthetic ground truth, DSR/PBO, and tradability.
Recommended speaking hierarchy for HKU:
- Mainline: Crypto-Alpha-Bench
- Method fallback: RQ1 microstructure recurrence
- Statistical rigor hook: PBO / DSR
- Differentiator: production-grade tradability verifier + optional human expert baseline
1. Cluster A · Backtest Overfitting & Multiple Testing¶
A1. Bailey, Borwein, López de Prado, Zhu · The Probability of Backtest Overfitting¶
Sources:
- PDF: https://www.davidhbailey.com/dhbpapers/backtest-prob.pdf
- Journal reference summary: https://colab.ws/articles/10.21314%2FJCF.2016.322
Core Problem¶
Financial research often evaluates many strategy variants and reports the best backtest. Standard train/test splits understate the false-discovery risk because:
- returns are non-IID;
- strategy candidates are correlated;
- the researcher adaptively changes the search space;
- the "best" in-sample rule is selected after seeing many alternatives.
The paper's central contribution is Probability of Backtest Overfitting (PBO): estimate how often the in-sample winner becomes a below-median out-of-sample performer.
Method Core: CSCV¶
Combinatorially Symmetric Cross-Validation (CSCV):
- Split the full historical sample into S contiguous slices.
- For every combination of S/2 slices as train and the complement as test:
  - compute performance for all candidate strategies on train;
  - select the in-sample winner;
  - rank that same strategy on test among all candidates.
- Count how often the selected strategy has poor OOS rank.
Technical note:
- The common summary is PBO = P(lambda <= 0), where lambda is the logit transform of the OOS relative rank of the in-sample winner.
- In informal speech, you can say: "PBO estimates the probability that the in-sample winner falls below the OOS median." Avoid a rank formula unless you define whether rank 1 means best or worst.
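The CSCV steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's reference code: the function name `cscv_pbo`, the use of mean return as the performance metric, and the `rank / (N + 1)` convention (higher relative rank means better OOS) are all assumptions chosen for the sketch.

```python
from itertools import combinations

import numpy as np


def cscv_pbo(perf, n_slices=8):
    """Estimate PBO from a (time x strategies) matrix of per-period returns.

    Returns (pbo, logits). pbo is the share of train/test splits in which
    the in-sample winner lands at or below the out-of-sample median.
    """
    perf = np.asarray(perf, dtype=float)
    if n_slices % 2 != 0:
        raise ValueError("n_slices must be even for symmetric splits")
    n_strategies = perf.shape[1]
    # Contiguous, chronologically ordered slices of (nearly) equal length.
    slices = np.array_split(np.arange(perf.shape[0]), n_slices)
    logits = []
    for train_ids in combinations(range(n_slices), n_slices // 2):
        test_ids = [i for i in range(n_slices) if i not in train_ids]
        train = perf[np.concatenate([slices[i] for i in train_ids])]
        test = perf[np.concatenate([slices[i] for i in test_ids])]
        # Illustrative metric: mean return (swap in net MTM or Sharpe).
        is_score = train.mean(axis=0)
        oos_score = test.mean(axis=0)
        winner = int(np.argmax(is_score))
        # Relative OOS rank in (0, 1); values near 1 mean best out of sample.
        rank = np.sum(oos_score <= oos_score[winner])
        omega = rank / (n_strategies + 1)
        logits.append(np.log(omega / (1.0 - omega)))
    logits = np.array(logits)
    return float(np.mean(logits <= 0.0)), logits
```

With S=8 this evaluates 70 splits; with S=16 it evaluates 12,870, so the candidate matrix should be precomputed once rather than rebuilt per split.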
Why It Matters For Your Work¶
Your M8.6 results have exactly the PBO signature:
- many candidates: symbols × offsets × thresholds × Optuna trials;
- high validation MTM;
- poor chronological test MTM;
- strategy selection after inspecting historical performance.
The correct research move is not "try a bigger model." It is:
quantify how much of the discovered performance survives combinatorial OOS ranking.
Concrete Integration Plan¶
For your walk-forward setup:
| Step | Implementation |
|---|---|
| Candidate matrix | rows = time slices, columns = candidate strategies, values = MTM / Sharpe / path-quality score |
| Slice count | start with S=8 for smoke test; target S=16 for serious PBO |
| Strategy candidates | symbol-offset-rule configs or model hyperparameter configs |
| Metric | use net MTM first; later Sharpe/PSR/DSR |
| Output | PBO, OOS rank histogram, degradation curve from IS rank to OOS rank |
Important correction:
- Existing 12-fold walk-forward is good production discipline, but not enough for strong statistical claims.
- You need more folds/slices or a CSCV-compatible strategy matrix.
HKU Soundbite¶
"My current LightGBM/Optuna result is a qualitative PBO warning sign. The next research version should report CSCV-based PBO instead of only saying validation reversed on test."
A2. Bailey & López de Prado · The Deflated Sharpe Ratio¶
Sources:
- SSRN/PDF: https://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID2460551_code87814.pdf?abstractid=2460551&mirid=1
- David Bailey PDF mirror: https://www.davidhbailey.com/dhbpapers/deflated-sharpe.pdf
Core Problem¶
The observed Sharpe ratio is inflated by:
- selection bias: you selected the best among many trials;
- non-normality: skewness and fat tails distort standard Sharpe inference;
- short samples: high Sharpe over short samples is much less convincing.
The DSR asks:
after accounting for non-normal returns and multiple trials, is this Sharpe still statistically meaningful?
Method Core¶
DSR is built on the Probabilistic Sharpe Ratio (PSR), then replaces the benchmark Sharpe with a selection-adjusted hurdle: the expected maximum Sharpe under N trials.
Implementation ingredients:
- observed Sharpe;
- sample size;
- skewness;
- kurtosis;
- number of trials;
- variance of the Sharpe ratio estimates across trials;
- average correlation among trials or an effective number of independent trials.
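The ingredients above combine roughly as follows. This is a sketch of the commonly cited PSR and expected-maximum-Sharpe formulas, assuming per-period (non-annualized) Sharpe ratios and non-excess kurtosis (3.0 for a normal distribution); the function names are illustrative, and the formulas should be checked against the paper before any reported number relies on them.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()
_EULER_GAMMA = 0.5772156649015329


def probabilistic_sharpe(sr, sr_benchmark, n_obs, skew=0.0, kurt=3.0):
    """PSR: probability that the true per-period Sharpe exceeds sr_benchmark."""
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr**2)
    z = (sr - sr_benchmark) * math.sqrt(n_obs - 1) / denom
    return _STD_NORMAL.cdf(z)


def expected_max_sharpe(n_trials, sr_std):
    """Approximate expected best Sharpe among n_trials zero-skill trials.

    sr_std is the standard deviation of Sharpe estimates across trials.
    n_trials may be a non-integer effective count, but must be at least 2.
    """
    if n_trials < 2:
        raise ValueError("need at least 2 trials for a selection hurdle")
    a = _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return sr_std * ((1.0 - _EULER_GAMMA) * a + _EULER_GAMMA * b)


def deflated_sharpe(sr, n_obs, n_trials, sr_std, skew=0.0, kurt=3.0):
    """DSR: PSR evaluated against the selection-adjusted hurdle."""
    hurdle = expected_max_sharpe(n_trials, sr_std)
    return probabilistic_sharpe(sr, hurdle, n_obs, skew, kurt)
```

The same observed Sharpe yields a lower DSR as the trial count grows, which is exactly the penalty the whitelist selection needs to face.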
Why It Matters For Your Work¶
Your current whitelist / tradability gate uses path-quality heuristics:
- TP fill rate;
- stop-loss count;
- timeout count;
- clean reads;
- adaptive state promotion.
These are production-useful, but for a paper they need a statistical sibling:
each selected symbol/offset should be tested against a selection-adjusted Sharpe hurdle.
Concrete Integration Plan¶
Use DSR as a secondary filter, not a replacement for path-quality:
- Keep production path-quality filters for safety.
- Compute net returns for each candidate strategy.
- Estimate N_eff_trials, not raw N, because many offsets/symbols are correlated.
- Report PSR and DSR alongside MTM.
- In the benchmark protocol, require DSR after any candidate search process.
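One way to estimate an effective trial count from correlated candidates is the eigenvalue participation ratio of the return correlation matrix. This is a generic heuristic, not the estimator prescribed by the DSR paper, and the name `effective_trials` is an assumption for illustration. It equals 1 when all candidates are perfectly correlated and N when they are uncorrelated.

```python
import numpy as np


def effective_trials(returns):
    """Participation-ratio estimate of independent trials.

    returns: (time x candidates) matrix of per-period net returns.
    """
    corr = np.corrcoef(np.asarray(returns, dtype=float), rowvar=False)
    # Clip tiny negative eigenvalues caused by floating-point error.
    eig = np.clip(np.linalg.eigvalsh(corr), 0.0, None)
    return float(eig.sum() ** 2 / np.sum(eig**2))
```

The result can be passed as the trial count to a DSR calculation; reporting both raw N and the effective count makes the adjustment auditable.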
HKU Soundbite¶
"My current filters are operationally conservative. The research version should add DSR because the whitelist is selected from many symbol/offset candidates."
A3. Harvey, Liu, Zhu · "... and the Cross-Section of Expected Returns"¶
Sources:
- RFS page: https://academic.oup.com/rfs/article/29/1/5/1843824
- NBER version: https://www.nber.org/papers/w20592
Core Problem¶
The factor zoo creates a multiple-testing crisis. If hundreds of papers test hundreds of factors, the traditional t > 2 threshold becomes too permissive.
The famous practical takeaway:
a newly discovered factor needs a higher hurdle, often summarized as t > 3.0.
Method Core¶
The paper estimates how the appropriate t-stat cutoff should rise over time as the cumulative number of tested factors grows.
The important conceptual move:
significance standards should depend on the research environment's prior data-mining intensity.
Why It Matters For Your Work¶
Crypto-Alpha-Bench should not accept:
- "my factor has IC > 0";
- "my Sharpe beats random";
- "my backtest is positive."
It should require multiple-testing-aware standards.
This paper gives the academic-finance language for that position.
Concrete Integration Plan¶
In benchmark spec:
- every submitted alpha reports t-stat and multiple-testing adjusted threshold;
- benchmark leaderboard distinguishes "raw winner" from "statistically credible winner";
- new factors must clear a higher hurdle than a single isolated hypothesis test.
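To make "multiple-testing adjusted threshold" concrete, the sketch below computes a two-sided Bonferroni t cutoff under a normal approximation and applies the Benjamini-Hochberg-Yekutieli step-up rule to a list of p-values. The function names are illustrative, and the benchmark spec could reasonably choose a different correction.

```python
from statistics import NormalDist

import numpy as np


def bonferroni_t_cutoff(n_tests, alpha=0.05):
    """Two-sided |t| cutoff after a Bonferroni correction (normal approx.)."""
    return NormalDist().inv_cdf(1.0 - alpha / (2.0 * n_tests))


def bhy_reject(p_values, alpha=0.05):
    """Boolean mask of hypotheses rejected under the BHY procedure."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    c_m = np.sum(1.0 / np.arange(1, m + 1))
    thresholds = np.arange(1, m + 1) * alpha / (m * c_m)
    passed = np.nonzero(p[order] <= thresholds)[0]
    reject = np.zeros(m, dtype=bool)
    if passed.size:
        # Step-up rule: reject everything up to the largest passing rank.
        reject[order[: passed[-1] + 1]] = True
    return reject
```

A leaderboard could store both the raw t-stat and the cutoff implied by the number of submissions so far, which separates "raw winner" from "statistically credible winner".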
HKU Soundbite¶
"Harvey-Liu-Zhu is the reason I do not want Crypto-Alpha-Bench to be a simple leaderboard. It must encode the fact that finance has already been heavily mined."
A4. Harvey & Liu · Lucky Factors¶
Sources:
- JFE page: https://www.sciencedirect.com/science/article/abs/pii/S0304405X21001410
- Public PDF: https://jacobslevycenter.wharton.upenn.edu/wp-content/uploads/2015/05/Lucky-Factors.pdf
Core Problem¶
Some factors look significant because they are lucky draws from a huge search space. Closed-form corrections are useful, but simulation can reveal how often "discoveries" arise by chance under realistic dependence.
Method Core¶
Lucky Factors uses bootstrap / resampling logic to simulate the distribution of factor t-stats under a null, then asks whether observed factors remain unusual after accounting for the search process.
Compared with DSR:
| Method | Style | Strength | Weakness |
|---|---|---|---|
| DSR | analytical / semi-closed-form | fast, easy to report | depends on estimated trial count and moments |
| Lucky Factors | simulation / bootstrap | more flexible | more expensive and design-sensitive |
Why It Matters For Your Work¶
For Optuna / LLM-generated / GP-generated strategies, a simulation null is natural:
- randomize labels;
- randomize entry timestamps;
- preserve return autocorrelation via block bootstrap;
- rerun the candidate-search protocol;
- compare observed best to null best.
This is stronger than just comparing to a single random strategy.
Concrete Integration Plan¶
For benchmark v0:
- Add a "null search" baseline:
  - same compute budget;
  - shuffled/block-bootstrapped labels;
  - same alpha search algorithm.
- Report whether discovered alphas exceed the 95th percentile of the null search distribution.
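A toy version of the null-search baseline is sketched below. The "search" here is deliberately trivial (pick the candidate signal with the largest absolute correlation to returns), and circular shifting stands in for a block bootstrap because it keeps the autocorrelation of the return series while breaking its alignment with the signals. All names and defaults are assumptions; the real protocol would rerun the full Optuna / LLM / GP search inside the loop.

```python
import numpy as np


def best_abs_ic(signals, returns):
    """Toy search: largest |correlation| between any signal and returns."""
    s = signals - signals.mean(axis=0)
    r = returns - returns.mean()
    ic = (s * r[:, None]).sum(axis=0) / np.sqrt(
        (s**2).sum(axis=0) * (r**2).sum()
    )
    return float(np.max(np.abs(ic)))


def null_search_distribution(signals, returns, n_null=200, min_shift=50, seed=0):
    """Best score found by the same search when labels are misaligned."""
    rng = np.random.default_rng(seed)
    n = returns.shape[0]
    null_best = np.empty(n_null)
    for i in range(n_null):
        # Shift by at least min_shift so short-range alignment is destroyed.
        shift = int(rng.integers(min_shift, n - min_shift))
        null_best[i] = best_abs_ic(signals, np.roll(returns, shift))
    return null_best
```

The observed best score is then compared with `np.percentile(null_best, 95)`; a discovery that does not clear that bar is indistinguishable from what the search finds in a null world.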
HKU Soundbite¶
"The right negative control is not one random factor. It is the best factor found by the same search algorithm under a null world."
A5. Hou, Xue, Zhang · Replicating Anomalies¶
Sources:
- NBER: https://www.nber.org/papers/w23394
- The published RFS version is linked from the NBER page.
Core Problem¶
The anomaly literature contains many published effects that fail under a common replication protocol.
Important source-verified numbers from the NBER abstract:
- 447 anomaly variables compiled.
- 286 anomalies, about 64%, are insignificant at the conventional 5% level under their replication setup.
- With a t-value cutoff of 3, 380 anomalies, about 85%, are insignificant.
- Liquidity variables are especially fragile: 95 out of 102 insignificant in that abstract's setup.
Note on your existing notes:
- The current file uses "65% / 82%" under a multiple-testing-aware cutoff. That is directionally fine, but for a professor-facing exact quote, prefer the source-verified phrasing: roughly two-thirds fail conventional replication; around 85% fail a t=3 hurdle. If using 82%, specify the exact cutoff/protocol.
Why It Matters For Your Work¶
This is the most important paper for RQ3:
Cognition Base cannot be a pile of published anomalies.
It must be replication-aware.
Each knowledge-base entry needs fields like:
mechanism:
original paper:
asset class:
frequency:
reported t-stat:
replicated?:
replication source:
multiple-testing status:
decay evidence:
capacity/friction notes:
last verified timestamp:
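The field list above maps naturally onto a small typed record. The sketch below is one possible shape, with hypothetical names (`AnomalyEntry`, `status`) and a deliberately strict rule: an entry counts as replicated only when a replication source is recorded.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class AnomalyEntry:
    """One replication-aware knowledge-base record (illustrative schema)."""

    mechanism: str
    original_paper: str
    asset_class: str
    frequency: str
    reported_t_stat: Optional[float] = None
    replicated: Optional[bool] = None  # None means not yet checked
    replication_source: Optional[str] = None
    multiple_testing_status: Optional[str] = None
    decay_evidence: Optional[str] = None
    capacity_friction_notes: Optional[str] = None
    last_verified: Optional[date] = None

    def status(self) -> str:
        """Unreplicated or unsourced entries stay hypotheses."""
        if self.replicated is True and self.replication_source:
            return "replicated"
        return "hypothesis"
```

Keeping `replicated` as a three-state value (True, False, None) avoids silently treating unchecked entries as failures or successes.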
Concrete Integration Plan¶
Crypto-Alpha-Bench should include:
- replication metadata for any "known anomaly" baseline;
- a rule that unreplicated factors are hypotheses, not ground truth;
- a Cognition Base ablation:
  - raw published anomalies;
  - replication-weighted anomalies;
  - crypto-native microstructure mechanisms;
  - no knowledge base.
HKU Soundbite¶
"The first step in a financial Cognition Base is not collecting papers. It is distinguishing published claims from replicated mechanisms."
2. Cluster B · Time-Series Foundation Models vs Architectural Priors¶
B1. Chronos · Learning the Language of Time Series¶
Sources:
- Amazon Science: https://www.amazon.science/publications/chronos-learning-the-language-of-time-series
- Published in TMLR 2024 according to Amazon Science.
Core Idea¶
Chronos treats time series as a language-modeling problem:
- scale continuous values;
- quantize them into a fixed vocabulary;
- train T5-style transformer architectures with cross-entropy;
- forecast by sampling future tokens and dequantizing.
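The scale-then-quantize step can be illustrated with a few lines of NumPy. This sketch uses mean-absolute scaling and uniform bins; the bin count and the [-15, 15] range are illustrative defaults for this memo, not values to cite as the Chronos configuration, and the function names are assumptions.

```python
import numpy as np


def tokenize(context, n_bins=64, low=-15.0, high=15.0):
    """Mean-scale a series and map each value to a uniform bin index."""
    context = np.asarray(context, dtype=float)
    scale = float(np.mean(np.abs(context)))
    scale = scale if scale > 0 else 1.0
    edges = np.linspace(low, high, n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    scaled = np.clip(context / scale, low, high)
    tokens = np.clip(np.digitize(scaled, edges) - 1, 0, n_bins - 1)
    return tokens, scale, centers


def detokenize(tokens, scale, centers):
    """Map token ids back to values via bin centers and the stored scale."""
    return centers[tokens] * scale
```

Feeding raw prices such as 100.00 and 100.01 through a coarse vocabulary maps both to the same token, which is a concrete way to demonstrate the quantization concern for small microstructure moves; tokenizing returns instead of levels changes that picture and is worth testing separately.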
Why It Matters¶
Chronos is the cleanest representative of:
"time series can be handled by generic sequence modeling plus scale."
This is the strongest counterpoint to your RQ1. If Chronos works on crypto microstructure, then custom recurrence priors may be less necessary.
Crypto Microstructure Concern¶
For 15s crypto:
- quantization may destroy small but tradable microstructure differences;
- univariate tokenization weakens cross-channel structure;
- pretraining data may underrepresent adversarial high-frequency financial series;
- zero-shot forecasting may optimize point/quantile accuracy but not executable alpha after costs.
Benchmark Experiment¶
Use Chronos as a baseline:
- input: close/return series first;
- then engineered microstructure aggregates;
- horizon: 5s / 15s / 1m / 5m / 30m;
- metrics: predictive loss, directional accuracy, IC, cost-adjusted PnL, DSR/PBO.
HKU Soundbite¶
"Chronos is the scale-first baseline. If my recurrence-prior idea cannot beat or complement Chronos under strict walk-forward evaluation, the RQ1 thesis weakens."
B2. TimesFM · Decoder-Only Foundation Model for Time-Series Forecasting¶
Sources:
- Google Research blog: https://research.google/blog/a-decoder-only-foundation-model-for-time-series-forecasting/
- ICML 2024 per Google blog.
Core Idea¶
TimesFM is a decoder-only Transformer foundation model for forecasting. Google reports pretraining on a large corpus of about 100B real-world time points, with strong zero-shot performance across public benchmarks.
Why It Matters¶
TimesFM is the "GPT-like" side of the time-series foundation model wave:
- decoder-only;
- large pretraining corpus;
- zero-shot evaluation;
- open model artifacts according to Google blog links.
Crypto Microstructure Concern¶
TimesFM's core bet is generic forecasting transfer. Your setting stresses it because:
- crypto regimes shift quickly;
- execution costs matter more than forecast RMSE;
- microstructure features are multivariate and reactive;
- a model trained mostly on broad time-series corpora may not encode financial adversarial structure.
Benchmark Experiment¶
Use TimesFM as:
- zero-shot baseline;
- optionally fine-tuned baseline if supported in the used release;
- compare against LightGBM, PatchTST, and microstructure-prior model.
Do not claim "TimesFM fails in finance" without testing. Say:
"TimesFM is the fair foundation-model baseline I need to beat."
B3. Moirai · Universal Time Series Forecasting Transformer¶
Sources:
- arXiv: https://arxiv.org/abs/2402.02592
- ICML/PMLR: https://proceedings.mlr.press/v235/woo24a.html
- Salesforce blog: https://www.salesforce.com/blog/moirai/
Core Idea¶
Moirai is a universal forecasting transformer trained on LOTSA, a large-scale open time-series archive. The key engineering ideas include:
- handling arbitrary numbers of variates;
- multiple patch-size projection layers;
- any-variate attention;
- flexible predictive distributions.
Why It Matters¶
Moirai is more relevant than univariate-only models because your market state is multivariate:
- OHLCV;
- spread;
- depth;
- funding;
- OI;
- cross-symbol signals.
Crypto Microstructure Concern¶
Any-variate attention handles variable count, but not necessarily:
- order-book causality;
- microstructure-price feedback;
- adverse selection;
- cost-sensitive actionability.
Benchmark Experiment¶
Moirai should be your multivariate foundation baseline:
- compare univariate close-only vs full multivariate state;
- compare model accuracy vs executable PnL;
- test whether any-variate attention alone captures microstructure recurrence.
HKU Soundbite¶
"Moirai is the strongest generic multivariate baseline. If microstructure recurrence matters, it should show up as an improvement beyond any-variate attention."
B4. Lag-Llama · Probabilistic Time-Series Foundation Model¶
Sources:
- arXiv: https://arxiv.org/abs/2310.08278
- Hugging Face model page: https://huggingface.co/time-series-foundation-models/Lag-Llama
Core Idea¶
Lag-Llama is a decoder-only foundation model for probabilistic univariate time-series forecasting. It uses lagged values as covariates and outputs a distribution rather than just a point forecast.
Why It Matters¶
Financial decisions need uncertainty:
- position sizing;
- kill switch thresholds;
- conformal prediction;
- drawdown control;
- tail-risk filters.
Lag-Llama is useful less as "the winner model" and more as:
a distributional baseline for RQ2 and risk-aware forecasting.
Crypto Microstructure Concern¶
Univariate probabilistic forecasts are not enough for:
- cross-asset lead-lag;
- order-book state;
- strategy-specific labels;
- action-dependent fill outcomes.
Benchmark Experiment¶
Use Lag-Llama to test:
- forecast interval calibration;
- whether probabilistic outputs improve risk gates;
- conformal calibration on top of model quantiles.
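Conformal calibration on top of model quantiles can be sketched as a split-conformal adjustment in the style of conformalized quantile regression. The function name is hypothetical, and the coverage guarantee assumes exchangeable calibration and test data, which walk-forward crypto series will generally violate, so treat the output as a diagnostic rather than a guarantee.

```python
import math

import numpy as np


def cqr_adjustment(q_lo, q_hi, y, alpha=0.1):
    """Width to add to both ends of [q_lo, q_hi] for ~(1 - alpha) coverage.

    q_lo, q_hi: model lower/upper quantile forecasts on a calibration set.
    y: realized values on the same calibration set.
    """
    q_lo, q_hi, y = (np.asarray(a, dtype=float) for a in (q_lo, q_hi, y))
    # Positive score = realized value fell outside the model interval.
    scores = np.maximum(q_lo - y, y - q_hi)
    n = scores.size
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return float(np.sort(scores)[k - 1])
```

The calibrated interval is `[q_lo - adj, q_hi + adj]`; a negative adjustment means the model's raw interval was wider than needed.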
B5. PatchTST · A Time Series Is Worth 64 Words¶
Sources:
- OpenReview: https://openreview.net/forum?id=Jbdc0vTOcol
- arXiv: https://arxiv.org/abs/2211.14730
Core Idea¶
PatchTST uses:
- patching: segment the time series into subseries tokens;
- channel independence: each channel is modeled as a univariate sequence sharing weights.
Despite its simplicity, it outperformed many earlier Transformer-based forecasting models.
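Patching with channel independence amounts to slicing each channel into overlapping windows and treating every channel as its own sequence. The sketch below shows only that data layout; the patch length and stride are illustrative defaults, and it omits the end-padding and the Transformer itself.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_patches(x, patch_len=16, stride=8):
    """Turn (channels, time) into (channels, n_patches, patch_len).

    Each channel is patched separately, so a downstream model with shared
    weights sees channels as independent univariate token sequences.
    """
    x = np.asarray(x, dtype=float)
    windows = sliding_window_view(x, patch_len, axis=-1)
    return windows[:, ::stride, :]
```

A cross-channel variant for the ablation would instead stack the channel values inside each patch, so that one token carries price, book, and flow features together.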
Why It Matters¶
PatchTST is a methodological warning:
a small architectural prior can beat a larger generic architecture.
This supports the intellectual legitimacy of RQ1.
Crypto Microstructure Concern¶
Channel independence is both strength and weakness:
- strength: sample efficiency, reduced overfitting;
- weakness: microstructure features are coupled, and cross-channel interaction may be the signal.
Benchmark Experiment¶
Use PatchTST as:
- supervised architecture-prior baseline;
- compare channel-independent vs cross-channel variants;
- ablate whether book/price recurrence requires channel interaction.
B6. TimesNet · Temporal 2D-Variation Modeling¶
Sources:
- arXiv: https://arxiv.org/abs/2210.02186
- Also available on OpenReview under the ICLR 2023 materials.
Core Idea¶
TimesNet models temporal variation through multi-periodicity by transforming 1D time series into 2D representations organized around dominant periods.
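The 1D-to-2D idea can be illustrated by finding the dominant period with an FFT and folding the series so that rows are cycles and columns are phases. This is a single-period simplification with assumed function names; the paper works with several dominant periods and learned 2D kernels.

```python
import numpy as np


def dominant_period(x):
    """Period (in samples) of the strongest non-constant frequency."""
    x = np.asarray(x, dtype=float)
    amp = np.abs(np.fft.rfft(x - x.mean()))
    amp[0] = 0.0  # ignore the constant component
    k = int(np.argmax(amp))
    if k == 0:
        return len(x)  # no periodic structure detected
    return max(1, len(x) // k)


def fold_to_2d(x, period):
    """Reshape a series into (n_cycles, period), dropping the remainder."""
    x = np.asarray(x, dtype=float)
    n_cycles = len(x) // period
    return x[: n_cycles * period].reshape(n_cycles, period)
```

For funding-window effects the period is known in advance, so the fold can use the funding interval directly instead of the FFT estimate.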
Why It Matters¶
It is a frequency/periodicity prior. For crypto:
- funding intervals;
- session effects;
- liquidity cycles;
- weekday/weekend structure;
- event-driven bursts.
Crypto Microstructure Concern¶
High-frequency tape reading may not be periodic. Some signals are reactive and event-driven, not seasonal.
Benchmark Experiment¶
Use TimesNet-like periodic priors for:
- funding-window effects;
- intraday liquidity cycles;
- compare against recurrence-prior model for reactive signals.
B7. TimeMixer · Decomposable Multiscale Mixing¶
Sources:
- ICLR proceedings: https://proceedings.iclr.cc/paper_files/paper/2024/hash/a7ac8a21e5a27e7ab31a5f42a0117bdb-Abstract-Conference.html
- arXiv: https://arxiv.org/abs/2405.14616
Core Idea¶
TimeMixer is a fully MLP-based architecture built around multiscale decomposition and mixing:
- Past-Decomposable-Mixing;
- Future-Multipredictor-Mixing;
- fine and coarse temporal scales disentangled.
Why It Matters¶
Your RQ1 is explicitly cross-scale:
- 100ms/tick order-book dynamics;
- 5s/15s bars;
- 1m/5m trend;
- 30m execution horizon.
TimeMixer is a strong baseline for "multiscale prior without attention."
Crypto Microstructure Concern¶
TimeMixer decomposes scales, but may not encode the feedback loop:
order flow changes price; price changes future order placement.
That feedback is closer to recurrence / state-space structure than generic multiscale mixing.
Benchmark Experiment¶
Compare:
- TimeMixer multiscale decomposition;
- REM/RSA-inspired recurrence model;
- hybrid: multiscale decomposition + recurrence bias.
B8. Brigato et al. · No Champions in Long-Term Time Series Forecasting¶
Sources:
- arXiv: https://arxiv.org/abs/2502.14045
Core Idea¶
The paper argues that long-term time-series forecasting lacks stable champions. Small changes in benchmark setup, hyperparameters, or metrics can reverse model rankings. It reports a broad reproducible evaluation over thousands of trained networks and many datasets.
Why It Matters¶
This paper supports both:
- benchmark-first thinking;
- skepticism toward "we beat SOTA" claims.
It also reinforces Crypto-Alpha-Bench:
if even standard LTSF lacks stable champions, alpha search absolutely needs fixed protocol and statistical reporting.
HKU Soundbite¶
"No Champions is the time-series version of my benchmark argument: without standardized evaluation, model rankings are fragile."
B9. Huang et al. · Encoding Recurrence into Transformers¶
Sources:
- OpenReview: https://openreview.net/forum?id=7YfHla7IxBJ
- GitHub: https://github.com/neithen-Lu/encoding_recurrence_into_transformers
Core Idea¶
The paper decomposes recurrent dynamics into lightweight positional-encoding-like matrices, called Recurrence Encoding Matrix (REM), and injects them into self-attention via Self-Attention with Recurrence (RSA).
The key move:
recurrence becomes a structural property inside attention, rather than an external RNN bolted on top.
Why It Matters For Prof. Li¶
This is the cleanest direct anchor to Prof. Guodong Li.
Your respectful framing should be:
"I am not claiming my setting is the same as the original paper. I want to test whether the same principle, explicit recurrence priors for sample efficiency, applies to crypto microstructure."
RQ1 Formalization¶
Your current rough phrase "microstructure-price recurrence" should become more formal:
Let market state be:
Hypothesized feedback:
order_flow_t -> price_move_{t+1}
price_move_t -> order_book_response_{t+1}
book_thickness_t -> fill_quality_{t+1}
fill_quality_t -> realized strategy return_{t+1}
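One possible way to write the four feedback edges above in equation form is sketched below. The notation is an assumption introduced here for illustration, not the memo's own definition of the market state: p, b, o, and q stand for price, order-book, order-flow, and fill-quality features, r is the realized strategy return, and each f is an unspecified (possibly learned) map.

```latex
% Assumed notation, for illustration only.
\begin{align}
  s_t &= (p_t,\; b_t,\; o_t,\; q_t) \\
  \Delta p_{t+1} &= f_p(o_t) + \varepsilon^{p}_{t+1} \\
  b_{t+1} &= f_b(\Delta p_t,\; b_t) + \varepsilon^{b}_{t+1} \\
  q_{t+1} &= f_q(b_t) + \varepsilon^{q}_{t+1} \\
  r_{t+1} &= f_r(q_t) + \varepsilon^{r}_{t+1}
\end{align}
```

Writing the edges this way makes the ablations explicit: removing one map at a time tests which feedback link actually carries signal.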
The recurrence prior should encode:
- cross-scale memory;
- price-book feedback;
- fast/slow state separation;
- gating between recurrent and non-recurrent signals.
Minimal Model Idea¶
Start with a conservative extension:
- Use PatchTST / TimeMixer preprocessing for multiscale features.
- Add an RSA-inspired recurrence matrix over:
  - time axis;
  - feature-group axis: price / book / flow / funding / cross-asset.
- Gate recurrence strength by regime variables:
  - volatility;
  - spread;
  - depth;
  - funding window.
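A heavily simplified NumPy sketch of the gated mix is shown below. It is only loosely modelled on the REM/RSA idea: it uses a single causal exponential-decay matrix along the time axis and a scalar gate, whereas the paper's matrices and gating should be taken from the paper itself. All function names are assumptions, and in the proposed extension `gate_logit` would be a function of regime variables such as volatility or spread.

```python
import numpy as np


def regular_rem(seq_len, lam):
    """Causal decay matrix: P[i, j] = lam ** (i - j) for j < i, else 0."""
    idx = np.arange(seq_len)
    lag = idx[:, None] - idx[None, :]
    return np.where(lag > 0, lam ** np.clip(lag, 0, None), 0.0)


def softmax(z):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def gated_recurrent_attention(q, k, v, lam, gate_logit):
    """Blend softmax attention with a fixed recurrence matrix.

    gate_logit -> -inf recovers plain attention; +inf uses recurrence only.
    """
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))
    gate = 1.0 / (1.0 + np.exp(-gate_logit))
    mixed = (1.0 - gate) * attn + gate * regular_rem(q.shape[0], lam)
    return mixed @ v
```

Logging the learned gate per regime gives a direct readout of when the model leans on recurrence, which is useful evidence for or against RQ1 even if headline accuracy is unchanged.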
Benchmark Experiment¶
RQ1 experiment table:
| Model | Role |
|---|---|
| LightGBM | current tabular baseline |
| Optuna rules | interpretable search baseline |
| PatchTST | supervised architectural prior baseline |
| TimesFM / Chronos | foundation model baseline |
| Moirai | multivariate foundation baseline |
| TimeMixer | multiscale baseline |
| REM/RSA-inspired model | proposed method |
Metrics:
- predictive loss;
- rank IC;
- directional accuracy;
- cost-adjusted MTM;
- DSR;
- PBO;
- stability across time slices.
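Two of the metrics above, rank IC and stability across time slices, can be computed with plain NumPy as sketched here. The helper names are illustrative, and ties are ranked in order of appearance rather than averaged, which is adequate for continuous predictions but not for heavily discretized ones.

```python
import numpy as np


def _ranks(a):
    """Ordinal ranks (0..n-1); ties broken by order of appearance."""
    a = np.asarray(a, dtype=float)
    order = np.argsort(a, kind="stable")
    ranks = np.empty(a.size, dtype=float)
    ranks[order] = np.arange(a.size)
    return ranks


def rank_ic(pred, realized):
    """Spearman-style correlation between predictions and outcomes."""
    return float(np.corrcoef(_ranks(pred), _ranks(realized))[0, 1])


def slice_rank_ics(pred, realized, n_slices=8):
    """Rank IC per contiguous time slice, for stability reporting."""
    pred_parts = np.array_split(np.asarray(pred, dtype=float), n_slices)
    real_parts = np.array_split(np.asarray(realized, dtype=float), n_slices)
    return np.array([rank_ic(p, r) for p, r in zip(pred_parts, real_parts)])
```

Reporting the mean, standard deviation, and share of positive slices from `slice_rank_ics` turns "stability" into a number that can sit next to DSR and PBO in the results table.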
HKU Soundbite¶
"The RQ is not 'can a neural net predict crypto.' The RQ is whether explicit recurrence priors improve time-slice stability under strict walk-forward evaluation."
3. Cluster C · AlphaEval & Benchmark Design¶
C1. Ding et al. · AlphaEval¶
Sources:
- arXiv: https://arxiv.org/abs/2508.13174
- Search results also show implementation references, but verify the exact GitHub repo before citing in slides.
Core Problem¶
Formula alpha mining lacks a common evaluation framework. Existing approaches rely heavily on:
- backtesting: expensive, sequential, sensitive to strategy assumptions;
- correlation metrics: efficient but too narrow.
AlphaEval proposes a backtest-free, parallelizable evaluation framework.
Five Dimensions¶
Source-verified abstract dimensions:
- predictive power;
- stability;
- robustness to market perturbations;
- financial logic;
- diversity.
Why It Matters¶
AlphaEval is not the same as Crypto-Alpha-Bench.
| Aspect | AlphaEval | Crypto-Alpha-Bench |
|---|---|---|
| Main object | evaluation framework for generated formula alphas | fixed benchmark infrastructure |
| Dataset | not the main contribution | fixed public dataset is central |
| Cost model | not central in abstract | three cost tiers required |
| Compute control | not central in abstract | required |
| Multiple testing | not central in abstract | DSR/PBO required |
| Tradability | partial / indirect | explicit microstructure tradability gate |
| Synthetic ground truth | not central in abstract | required |
The relationship should be:
Crypto-Alpha-Bench can adopt AlphaEval-style dimensions as its evaluation backbone, then add benchmark infrastructure and executable-alpha constraints.
How To Integrate With Your System¶
Two-stage evaluator:
- Exploration evaluator:
  - AlphaEval-like cheap dimensions;
  - quickly screen thousands of candidate alphas.
- Verification evaluator:
  - walk-forward backtest;
  - cost tiers;
  - microstructure gate;
  - DSR/PBO;
  - paper-trading / live-fill calibration if available.
This maps cleanly to AI-for-science systems:
- Exploration: cheap parallel evaluation.
- Verification: expensive hard verifier.
HKU Soundbite¶
"AlphaEval gives the evaluation dimensions. Crypto-Alpha-Bench adds the missing benchmark substrate: fixed data, costs, compute control, synthetic ground truth, and executable-alpha verification."
4. Cross-Cluster Synthesis¶
4.1 The Three Clusters Form One Argument¶
| Cluster | What It Gives You | What It Changes In The Talk |
|---|---|---|
| Backtest rigor | Language for false discovery | Negative result becomes methodology, not anecdote |
| TS foundation / priors | Positioning for RQ1 | You are not ignoring foundation models; you are testing prior vs scale |
| AlphaEval | Evaluation framework anchor | Crypto-Alpha-Bench becomes a natural extension, not a vague idea |
4.2 Revised Research Claim¶
Old claim:
"I built a trading agent and want it to self-evolve."
Better claim:
"I built a production verifier for executable alpha. The field lacks the benchmark substrate needed to test AI-driven alpha discovery. Crypto-Alpha-Bench is the first contribution; microstructure recurrence is the first method use case."
4.3 What To Avoid Saying¶
Avoid:
- "12 folds proves significance."
- "LLM can discover profitable alpha if scaled."
- "Crypto benchmark will generalize to all finance."
- "Chronos / TimesFM are bad for finance" before testing.
- "Published anomalies are knowledge" without replication metadata.
- "AlphaEval already solves evaluation" or "AlphaEval is insufficient" without nuance.
Say instead:
- "12 folds is a production screen; the research version needs PBO/DSR."
- "Compute-scaled discovery is a hypothesis requiring a fixed benchmark."
- "Crypto is the v0 open-world testbed."
- "Foundation models are mandatory baselines."
- "Cognition Base must be replication-aware."
- "AlphaEval is an evaluation backbone; Crypto-Alpha-Bench is benchmark infrastructure."
5. Implementation Checklist¶
5.1 Statistical Rigor Workstream¶
- Build candidate performance matrix.
- Implement CSCV/PBO.
- Implement PSR/DSR.
- Implement null-search bootstrap.
- Expand folds/slices beyond current production 12-fold view.
- Produce one figure: IS rank vs OOS rank degradation.
5.2 RQ1 Model Workstream¶
- Benchmark LightGBM and Optuna rules on ETH as validation.
- Add PatchTST baseline.
- Add Chronos / TimesFM zero-shot baseline.
- Add Moirai multivariate baseline.
- Add TimeMixer multiscale baseline.
- Specify REM/RSA-inspired microstructure recurrence model.
- Report stability, not just accuracy.
5.3 Benchmark Workstream¶
- Define v0 dataset manifest.
- Define cost tiers.
- Define compute budget tiers.
- Define required metrics.
- Define negative controls.
- Add synthetic alpha generator.
- Add tradability gate baseline.
- Optional: human expert discretionary baseline protocol.
6. One-Minute Deep-Read Summary¶
"The backtest-overfitting literature gives me the statistical discipline: PBO, DSR, and replication-aware standards. The time-series literature tells me RQ1 should be framed as a test between generic foundation models and explicit architectural priors. AlphaEval gives a modern alpha-evaluation backbone, but it is not a full benchmark. My proposal is to combine these: Crypto-Alpha-Bench as fixed evaluation infrastructure, with microstructure recurrence as the first method use case."