HKU Research Discussion Outline · 2026-05-20

Topic: From Production AI Trading Agent to Crypto-Alpha-Bench
Audience: Prof. Kai Han, Prof. Guodong Li
Target length: 18-22 min presentation + discussion
Core goal: get directional feedback on whether to pursue benchmark-first or method-first.

0. One-Sentence Thesis

I have already built the verification half of AI-driven alpha discovery. The research contribution most worth pursuing now is not "yet another trading agent" but an executable, reproducible, and comparable Crypto-Alpha-Bench, so that compute-scaled discovery in finance can be rigorously tested.

Short version:

I have built the verification half of an AI-driven alpha discovery platform. The missing research substrate is an executable crypto alpha-search benchmark.

1. Recommended Talk Arc

Act 1 · I Already Built the Verifier

Purpose: show that you are starting from a production-grade system, not from a paper idea.

Must say:

LLM never reaches OMS directly.
Natural language only becomes schema intent.
Real-money actions pass deterministic risk gate, button confirmation, OMS, audit, reconciliation, kill switch.
M8.6 walk-forward testbed covers 526 Binance USD-M perpetual symbols, 15s bars, 12 rolling folds, microstructure gates, adaptive state controller.
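The control boundary in the points above can be sketched as follows. This is a minimal illustration, not the production code: `OrderIntent`, `RiskLimits`, `risk_gate`, and `submit` are hypothetical names, and the specific checks and limits are assumptions chosen only to show that the gate is deterministic and sits between the schema intent and the OMS.

```python
from dataclasses import dataclass
from typing import Literal


# Hypothetical schema: the only thing natural language is allowed to become.
@dataclass(frozen=True)
class OrderIntent:
    symbol: str
    side: Literal["buy", "sell"]
    notional_usd: float


# Hypothetical limits; real limits would be richer than this.
@dataclass(frozen=True)
class RiskLimits:
    allowed_symbols: frozenset
    max_notional_usd: float
    kill_switch_engaged: bool = False


def risk_gate(intent: OrderIntent, limits: RiskLimits) -> tuple[bool, str]:
    """Deterministic check: no model call, same input always gives the same verdict."""
    if limits.kill_switch_engaged:
        return False, "kill switch engaged"
    if intent.symbol not in limits.allowed_symbols:
        return False, "symbol not allowed"
    if intent.side not in ("buy", "sell"):
        return False, "invalid side"
    if not (0 < intent.notional_usd <= limits.max_notional_usd):
        return False, "notional outside limits"
    return True, "ok"


def submit(intent: OrderIntent, limits: RiskLimits, human_confirmed: bool) -> str:
    """Only intents that pass the gate AND carry a human confirmation move on."""
    ok, reason = risk_gate(intent, limits)
    if not ok:
        return f"rejected: {reason}"
    if not human_confirmed:
        return "pending: awaiting button confirmation"
    return "forwarded to OMS"  # audit and reconciliation would follow here
```

The design point is that the LLM can only produce an `OrderIntent`; every later step is plain code with no model in the loop.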

Key transition:

This is not a trading pitch. It is a research infrastructure pitch: I already have the hard verifier.

Act 2 · The Negative Result Is the Research Signal

Purpose: frame the recent project's failed result as a research question, not an engineering failure.

Use the strongest numbers:

LightGBM chronological validation MTM: +44.65
LightGBM chronological test MTM: -86.87
Optuna search objective: +44.46
Optuna chronological test objective: -105.78

Diagnosis:

The bottleneck is time-slice stability, not model capacity.
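The reversal behind this diagnosis can be made explicit with a few lines of arithmetic. The sketch below uses only the four numbers listed above; the helper name `val_to_test_gap` is hypothetical, and the units are whatever the MTM and objective metrics use.

```python
# Values copied from the validation/test numbers listed above.
results = {
    "LightGBM MTM": (44.65, -86.87),
    "Optuna objective": (44.46, -105.78),
}


def val_to_test_gap(val: float, test: float) -> tuple[float, bool]:
    """Return (test - val, sign_reversed), where sign_reversed means val > 0 > test."""
    return round(test - val, 2), (val > 0 > test)


for name, (val, test) in results.items():
    gap, sign_reversed = val_to_test_gap(val, test)
    print(f"{name}: gap={gap:+.2f}, sign reversed={sign_reversed}")
```

Both rows reverse sign, with gaps of -131.52 and -150.24 respectively. The fact that the tuned Optuna configuration degrades more than the LightGBM baseline is consistent with the claim that more search does not fix the instability.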

Prof. Li hook:

This motivates architectural priors for crypto microstructure recurrence, rather than simply scaling generic models.

Act 3 · Existing SOTA Is Close, But Not Enough

Purpose: proactively address the question of whether others have already done this.

Must say:

AlphaBench already covers benchmarking of formulaic alpha mining.
RD-Agent(Q) already covers multi-agent quant R&D.
Hubble / FactorMiner / CogAlpha push safe generation, memory, and code evolution.
TradingAgents / QuantAgent are trading-decision agents, not alpha-search benchmarks.

Clean positioning:

The claim is not that nothing exists. The claim is that existing work stops before executable crypto alpha under cost, fill, statistical, and compute constraints.

Act 4 · Proposal: Crypto-Alpha-Bench

Purpose: offer a research artifact that can be published, open-sourced, and collaborated on.

Minimum requirements:

Fixed public crypto perpetual dataset.
Three transaction-cost tiers.
Fill / tradability gate.
DSR + PBO + null-search baseline.
Compute-controlled search budget.
Synthetic ground-truth tasks.
Reference baselines, including AlphaBench-style search, RD-Agent-style workflow, GFlowNet/AlphaSAGE, M8.6 tradability gate, and optional human expert baseline.
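To make the statistical requirement concrete, here is a minimal sketch of a Deflated Sharpe Ratio (DSR) check, following the commonly cited Bailey and López de Prado form. It assumes per-period (non-annualized) Sharpe ratios and that the caller supplies the cross-trial variance of Sharpe estimates; the function names and parameters are illustrative, and the formula should be verified against the original paper before it is used in the benchmark.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
_STD_NORMAL = NormalDist()


def expected_max_sharpe(n_trials: int, sr_variance: float) -> float:
    """Approximate expected maximum Sharpe across n_trials when the true Sharpe is zero."""
    if n_trials < 2:
        return 0.0  # a single trial involves no selection, so nothing to deflate
    a = _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(sr_variance) * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe(sr: float, n_obs: int, n_trials: int, sr_variance: float,
                    skew: float = 0.0, kurt: float = 3.0) -> float:
    """Probability that the observed Sharpe exceeds the selection-adjusted benchmark.

    sr          : observed per-period Sharpe of the selected strategy
    n_obs       : number of return observations
    n_trials    : number of strategies tried (the search budget)
    sr_variance : variance of Sharpe estimates across trials
    skew, kurt  : skewness and (non-excess) kurtosis of the returns
    """
    sr0 = expected_max_sharpe(n_trials, sr_variance)
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr * sr)
    z = (sr - sr0) * math.sqrt(n_obs - 1) / denom
    return _STD_NORMAL.cdf(z)
```

Because `n_trials` enters the benchmark directly, the same observed Sharpe scores lower as the search budget grows, which is why the DSR requirement pairs naturally with the compute-controlled search budget.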

Act 5 · HKU Fit + Ask

Prof. Han hook:

Crypto markets are an extreme open-world reliability setting for LLM agents: non-stationary regimes, adversarial counterparties, unknown failure modes, and costly mistakes.

Prof. Li hook:

The statistical core is time-series stability, financial econometrics, and recurrence-aware modeling under strict walk-forward verification.

Final ask:

Should I pursue Crypto-Alpha-Bench first, or narrow the first paper to RQ1: architectural priors for crypto microstructure recurrence?

2. Slide-by-Slide Outline

#	Slide	Claim	Proof Object
1	Title	This is a research infrastructure pitch, not a trading pitch.	Title + thesis rail
2	Meeting Thesis	I have the verifier; the field lacks the benchmark.	Three-part argument map
3	What I Built	The production system already enforces generator-verifier separation.	Architecture flow
4	The Hard Boundary	LLMs assist language and research, never irreversible actions.	Control boundary diagram
5	M8.6 Testbed	The strongest asset is a crypto executable-alpha verifier.	Metric rail + walk-forward pipeline
6	Negative Result	Time-slice stability, not model capacity, is the bottleneck.	Val/test MTM bar proof
7	Frontier Pattern	AI discovery systems scale only when generation is separated from verification.	Pattern mapping
8	SOTA Reality Check	Existing work is close; the gap is narrower and sharper.	Agent landscape matrix
9	Benchmark Gap	Current systems optimize formula quality or trade decisions, not executable crypto alpha.	Gap funnel
10	Crypto-Alpha-Bench	The contribution is a hard benchmark substrate.	7 requirement pillars
11	Baselines	The benchmark should beat or absorb the current SOTA, not ignore it.	Baseline ladder
12	Research Triangle	Benchmark-first makes RQ1/RQ2/RQ3 measurable.	HKU alignment map
13	8-12 Week Plan	Start with a small v0 that tests the riskiest assumptions.	Roadmap + risk ledger
14	Ask	The meeting should decide benchmark-first vs method-first.	Decision slide

3. Compressed 5-Minute Version

I built a production crypto trading agent where LLMs are structurally separated from real-money execution.
The research asset is the verifier: 526 Binance perps, 15s bars, rolling walk-forward, microstructure gates, adaptive state controller.
LightGBM/Optuna showed validation wins but chronological test failures, suggesting time-slice instability.
The latest SOTA already covers formulaic alpha mining and quant R&D agents, so the benchmark claim must be narrower.
My proposal is Crypto-Alpha-Bench: executable crypto alpha search under cost, fill, DSR/PBO, and compute control.
I want advice on whether to publish benchmark-first or start with RQ1 microstructure recurrence.

4. Q&A Anchors

If asked "Isn't AlphaBench already this?"

AlphaBench is the closest sibling. It benchmarks LLM-based formulaic alpha mining. Crypto-Alpha-Bench extends the target to executable crypto mid-frequency alpha under cost, fill, statistical correction, and compute constraints.

If asked "Why crypto?"

Crypto is narrow, but it is a clean open-world testbed: 24/7, public, non-stationary, adversarial, microstructure-rich, and execution-sensitive.

If asked "Why are you the right person?"

I may not have the largest academic platform, but I have production tradability infrastructure, real LLM-agent safety constraints, and a working verifier that academic teams would take months to reproduce.

If asked "What is the smallest next experiment?"

ETH validation experiment: rerun the LightGBM/Optuna time-slice stability test on ETH/USDT. If the validation-to-test reversal reproduces, the motivation for the benchmark strengthens.
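A minimal sketch of how such a rerun could be scored, assuming rolling chronological folds and a per-fold (validation, test) score pair. The fold sizes, the helper names `rolling_folds` and `reversal_rate`, and the definition of reversal (validation positive, test negative) are assumptions for illustration; model fitting and search are left to the caller.

```python
from typing import Iterator


def rolling_folds(n_bars: int, n_folds: int, train: int, val: int,
                  test: int) -> Iterator[tuple[range, range, range]]:
    """Yield chronological (train, val, test) index ranges, advancing by the test length."""
    for k in range(n_folds):
        start = k * test
        train_end = start + train
        val_end = train_end + val
        test_end = val_end + test
        if test_end > n_bars:
            break  # not enough data left for a full fold
        yield range(start, train_end), range(train_end, val_end), range(val_end, test_end)


def reversal_rate(fold_scores: list[tuple[float, float]]) -> float:
    """Fraction of folds where the validation score is positive but the test score is negative."""
    if not fold_scores:
        return 0.0
    reversed_folds = sum(1 for val, test in fold_scores if val > 0 > test)
    return reversed_folds / len(fold_scores)


# Usage: the caller fits and scores a model per fold, then summarizes.
# The scores below are placeholder inputs, not results.
folds = list(rolling_folds(n_bars=100, n_folds=12, train=40, val=10, test=10))
example_scores = [(1.0, -1.0), (1.0, 1.0), (-1.0, -1.0), (2.0, -3.0)]
rate = reversal_rate(example_scores)
```

Reporting the reversal rate across folds, rather than one aggregate number, makes it easier to tell a single bad time slice from a systematic validation-to-test reversal.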

5. Closing Line

I am not proposing to let LLMs trade. I am proposing to benchmark whether AI systems can discover alphas that survive reality.