HKU Meeting Prep · 2026-05-20

Goal: turn the current notes into a sharp professor-facing discussion pack. Audience assumed: Prof. Kai Han + Prof. Guodong Li. Default format: 20-25 min presentation + 15-20 min discussion. If time is shorter, use the compressed path below.

1. Meeting Objective

Do not frame the meeting as "please approve my trading system."

Frame it as:

I have built the verification half of an AI-for-alpha-search platform. After surveying the field, my strongest research judgment is that alpha auto search lacks a unified benchmark. I want your feedback on whether Crypto-Alpha-Bench is the right first academic contribution, or whether I should start from a narrower method paper such as microstructure recurrence.

The desired output from the meeting:

Decide whether benchmark-first is academically credible.
If benchmark-first is too broad, decide whether RQ1 method-first is the right fallback.
Identify which parts naturally belong to Prof. Han / Prof. Li.
Ask whether HKU would be a suitable research environment for turning the existing production system into a reproducible research testbed.

2. One-Sentence Thesis

My production trading system already implements a strict generator-verifier separation. The next research step is not to let LLMs trade, but to build the field's first unified benchmark for alpha auto search, so that compute-scaled discovery can be tested rigorously in finance.

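Chinese version (translated):

My existing system already separates LLM generation from the deterministic verifier. The next research step is not to let the LLM trade directly, but first to establish a unified benchmark for alpha auto search, so that "compute-scaled discovery" in finance can be tested rigorously.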

3. Recommended Storyline

Use benchmark-centered structure. Keep the three RQs as use cases, not as the main object.

Act 1 · I Built a Verification Platform

Takeaway:

I am not starting from an abstract idea. I already have a production-grade verifier and walk-forward testbed.

Show only the minimum:

LLM never reaches OMS directly.
Deterministic risk gate + kill switch + append-only audit + KMS abstraction.
M8.6 walk-forward verification on crypto perp microstructure.
Negative result: LightGBM / Optuna looked strong on validation but failed the chronological test, which points to time-slice instability rather than insufficient model capacity.
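
If they ask what "LLM never reaches OMS directly" means in practice, a small illustration helps. The following is a minimal sketch, not the production gate: the names `Proposal`, `GateConfig`, and `risk_gate`, the symbol allow-list, and the notional limit are all hypothetical. It only shows the shape of the idea: the generator emits a proposal, a deterministic function decides, and every decision is appended to an audit log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    """An order idea from the generator (e.g. an LLM). Hypothetical schema."""
    symbol: str
    side: str            # "buy" or "sell"
    notional_usd: float


@dataclass(frozen=True)
class GateConfig:
    """Hypothetical limits; real values would come from risk policy."""
    max_notional_usd: float = 10_000.0
    allowed_symbols: frozenset = frozenset({"BTCUSDT", "ETHUSDT"})
    kill_switch: bool = False


def risk_gate(p: Proposal, cfg: GateConfig, audit: list) -> bool:
    """Deterministic check. Appends exactly one audit record per decision."""
    if cfg.kill_switch:
        reason = "kill_switch"
    elif p.symbol not in cfg.allowed_symbols:
        reason = "symbol_not_allowed"
    elif p.side not in ("buy", "sell"):
        reason = "bad_side"
    elif not (0 < p.notional_usd <= cfg.max_notional_usd):
        reason = "notional_out_of_range"
    else:
        reason = "ok"
    audit.append((p, reason))   # append-only: records are never edited
    return reason == "ok"
```

Only proposals for which `risk_gate` returns `True` would be forwarded to the order path; the same inputs always give the same decision, which is the property worth stressing.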

Act 2 · Frontier AI Discovery Has a Common Pattern

Takeaway:

FunSearch / AlphaProof / AlphaEvolve / ASI-ARCH all separate generator and verifier. Finance has a verifier problem and a benchmark problem.

Four patterns:

Generator-verifier separation.
Cognition base / knowledge grounding.
Multi-agent decomposition.
Compute-scaled discovery.

Mapping:

My system has verification.
It lacks discovery/generation.
The field lacks a unified baseline, so discovery cannot be compared.

Act 3 · Proposal: Crypto-Alpha-Bench

Takeaway:

Build the missing evaluation substrate first; then RQ1/RQ2/RQ3 become measurable claims.

Six requirements:

Fixed public crypto perp dataset.
Three cost tiers.
Compute-controlled budget.
Multi-metric evaluation: AlphaEval-like dimensions + capacity + DSR + PBO.
Synthetic ground-truth task.
Replication-aware must-beat baseline.
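
The synthetic ground-truth requirement is the easiest one to make concrete. Below is a minimal sketch under stated assumptions: one feature out of several is planted with a known linear effect on returns, and a search method passes if it ranks that feature first. The function names, the Gaussian data, and the values of `beta`, `n_obs`, and `n_features` are illustrative choices, not the benchmark's actual specification.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def make_planted_task(n_obs=2000, n_features=20, planted=7, beta=0.5, seed=0):
    """Gaussian noise features; returns depend only on the planted feature."""
    rng = random.Random(seed)
    features = [[rng.gauss(0, 1) for _ in range(n_obs)]
                for _ in range(n_features)]
    returns = [beta * features[planted][t] + rng.gauss(0, 1)
               for t in range(n_obs)]
    return features, returns


def recovered_rank(features, returns, planted):
    """Rank (1 = best) of the planted feature by absolute correlation."""
    scores = [abs(pearson(f, returns)) for f in features]
    order = sorted(range(len(features)), key=lambda i: -scores[i])
    return order.index(planted) + 1


features, returns = make_planted_task()
print(recovered_rank(features, returns, planted=7))
```

A real task would need weaker, non-linear, or regime-dependent planted effects; the point of the sketch is only that the correct answer is known in advance, so a discovery method can be scored on recovery rather than on backtest performance.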

Then show:

Reference baselines.
12-week roadmap.
Professor-specific ownership.
Ask for feedback.

4. 12-Slide Version

Use this if you have 15-20 minutes.

#	Slide	Main Point	Must Say
1	Title	Production AI Trading Agent → Self-Evolving Research	"This is not a trading pitch; it is a research infrastructure pitch."
2	One Thesis	Generator-verifier separation + missing benchmark	Say the one-sentence thesis.
3	What I Built	M0-M8.6 production stack	LLM is never on the order path.
4	Safety Verifier	deterministic gate / audit / KMS / kill switch	Compare to Lean kernel only as philosophy, not mathematical certainty.
5	Walk-Forward Testbed	526 symbols / 15s bars / microstructure gate / adaptive state	This is your strongest engineering differentiator.
6	Negative Result	LightGBM / Optuna val strong, test weak	"Bottleneck is time-slice stability, not model capacity."
7	Frontier Pattern	FunSearch → AlphaProof → AlphaEvolve → ASI-ARCH	4 common patterns.
8	Gap Mapping	My system vs frontier	I have verification; I lack generation; the field lacks baseline.
9	No ImageNet Moment	alpha auto search has no common benchmark	Strong claim, make it clean.
10	Crypto-Alpha-Bench	6 requirements	This is the concrete proposal.
11	Three Use Cases	RQ1 / RQ2 / RQ3	RQs become benchmark use cases, not disconnected ideas.
12	Ask	benchmark-first or method-first?	Ask for sharpening, not endorsement.

5. 5-Minute Version

Use this if the meeting becomes informal or time is cut.

"I built a production-grade crypto trading agent where LLMs are strictly kept away from the order path."
"The strongest research artifact is the verification testbed: walk-forward, microstructure gates, adaptive state controller, and a negative result showing time-slice instability."
"After surveying AI-for-science and LLM alpha mining, I think the field's bottleneck is not another LLM agent, but the absence of a unified benchmark."
"My proposal is Crypto-Alpha-Bench: fixed dataset, cost tiers, compute budget, DSR/PBO, synthetic ground truth, and reference baselines."
"My question is: should this be the first research contribution, or should I narrow to RQ1 first: architectural priors for crypto microstructure recurrence?"

6. Professor-Specific Positioning

Prof. Kai Han

Verified public alignment:

HKU Visual AI Lab.
Open-world learning, spatial intelligence, foundation models, generative AI, agentic / embodied AI.
Goal includes reliable AI systems for open-world use.
Sources: personal site, HKU CDS profile.

How to speak to him:

Crypto perpetual futures are an extreme open-world reliability setting: non-stationary regimes, adversarial counterparties, unknown failure modes, and costly mistakes. My production agent gives a real-stakes testbed for open-world LLM-agent reliability.

Best hook:

RQ2: statistical safety / conformal uncertainty for LLM agents in high-stakes tool-use workflows.
Human-expert-in-loop add-on: tacit knowledge extraction from discretionary trader decisions as a financial version of learning from expert demonstrations.
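
If RQ2 comes up, it helps to have the basic conformal construction ready. The following is a minimal sketch of standard split conformal prediction with absolute-residual scores; how it would attach to an LLM agent's tool calls is an open design question, and the function names are illustrative.

```python
import math


def conformal_quantile(scores, alpha):
    """Split-conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest score.

    Returns infinity when the calibration set is too small for the
    requested coverage level.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def conformal_interval(prediction, calib_preds, calib_truth, alpha=0.1):
    """Symmetric interval around a new prediction from held-out residuals."""
    scores = [abs(y - p) for p, y in zip(calib_preds, calib_truth)]
    q = conformal_quantile(scores, alpha)
    return prediction - q, prediction + q
```

One possible use, stated as an assumption rather than a result: the agent abstains or escalates to a human when the interval is wider than a preset tolerance. Note that the coverage guarantee relies on exchangeability between calibration and test data, which non-stationary markets violate, and that gap is part of what makes the question research rather than engineering.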

Avoid:

Over-claiming that trading is "the same as vision." Say the link is an analogy about open-world reliability, not modality equivalence.

Prof. Guodong Li

Verified public alignment:

Time series analysis.
Financial econometrics.
Quantile regression.
High-dimensional data analysis.
Machine learning.
Encoding Recurrence into Transformers is a direct anchor.
Sources: HKU Science profile, HKU Institute of Data Science profile.

How to speak to him:

The LightGBM / Optuna failure suggests the bottleneck is not model capacity, but time-slice stability. I want to test whether explicit recurrence priors, inspired by Encoding Recurrence into Transformers, can improve crypto microstructure modeling.

Best hook:

RQ1: microstructure-price cross-scale recurrence.
Benchmark statistics: DSR, PBO, synthetic ground-truth tasks, multiple-testing correction.

Avoid:

Saying "12 folds prove significance." They do not. Say that 12 folds are a production screening protocol; the research version needs 100+ folds together with CSCV, DSR, and PBO.
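
To be ready for a follow-up on what CSCV and PBO actually compute, here is a minimal sketch of the Probability of Backtest Overfitting via combinatorially symmetric cross-validation, following the published Bailey, Borwein, López de Prado, and Zhu procedure. The per-period Sharpe ratio as the performance metric, the block count, and the function names are illustrative choices, and the code is written for clarity rather than speed.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def sharpe(xs):
    """Per-period Sharpe ratio (no annualization); 0.0 if variance is zero."""
    sd = pstdev(xs)
    return mean(xs) / sd if sd > 0 else 0.0


def pbo_cscv(returns, n_blocks=8):
    """returns: list of per-strategy return series, all of equal length.

    For every way of choosing half the time blocks as in-sample (IS),
    pick the best IS strategy and record the logit of its relative
    out-of-sample (OOS) rank. PBO is the share of splits with logit <= 0,
    i.e. where the IS winner lands at or below the OOS median.
    """
    n_strats = len(returns)
    size = len(returns[0]) // n_blocks
    blocks = [range(i * size, (i + 1) * size) for i in range(n_blocks)]
    logits = []
    for is_ids in combinations(range(n_blocks), n_blocks // 2):
        is_rows = [t for b in is_ids for t in blocks[b]]
        oos_rows = [t for b in range(n_blocks) if b not in is_ids
                    for t in blocks[b]]
        is_perf = [sharpe([r[t] for t in is_rows]) for r in returns]
        oos_perf = [sharpe([r[t] for t in oos_rows]) for r in returns]
        best = max(range(n_strats), key=is_perf.__getitem__)
        rank = sum(p <= oos_perf[best] for p in oos_perf)   # 1..n_strats
        omega = rank / (n_strats + 1)
        logits.append(math.log(omega / (1 - omega)))
    return sum(l <= 0 for l in logits) / len(logits)
```

With 8 blocks this evaluates 70 train/test splits from a single return matrix, which is the contrast to draw against a 12-fold walk-forward screen.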

7. The Human-Expert-In-Loop Revision

Use this as a differentiator, not as the main thesis, unless they ask "Why you?"

Core claim:

The unique asset is not only production verification infrastructure. It is production infrastructure plus access to a real discretionary expert whose tacit market-reading decisions can be logged.

How to insert it:

On the reference baseline slide, add optional row: Human Expert Discretionary Baseline.
In Q&A, if asked about uniqueness vs GAIR/Stanford/DeepMind, answer:
They may have stronger institutional resources.
I have a real production trading workflow and a real discretionary expert in the loop.
That enables a benchmark baseline most academic teams cannot easily reproduce.

Do not overdo it:

This direction needs trader cooperation.
It rests on an assumption: the expert's real edge is tape reading / directional timing.
Keep it as a sharp future extension, not a fully proven claim.

8. Risks To Acknowledge Proactively

Mention these before they do. It makes the proposal feel mature.

Risk	Clean Answer
"Why are you the right person?"	I am not the best institutionally positioned, but I have production-grade tradability infrastructure and speed. I should open v0 early and invite collaboration.
"Is crypto too narrow?"	Crypto is narrow but clean: 24/7, public, open-world, adversarial, and microstructure-rich. v0 crypto; v1 can extend.
"Your verifier may be overfit."	Correct. I will publish design history, sensitivity analysis, and held-out future windows.
"Cost model is subjective."	Use three tiers and empirical calibration from fills; do not claim transfer without recalibration.
"Benchmark may be too big."	Plan B is RQ1 method-first; benchmark becomes evaluation protocol.

9. Questions To Ask Them

Ask these explicitly near the end.

"Do you see Crypto-Alpha-Bench as a legitimate first research contribution, or would you recommend a narrower method paper first?"
"For Prof. Li: is the microstructure recurrence framing mathematically defensible, or should I formulate it differently?"
"For Prof. Han: does open-world LLM-agent reliability in financial execution sound aligned with your lab's open-world agenda, or is it too far from the lab's modality focus?"
"What would be the smallest 8-12 week experiment that would convince you this agenda is worth pursuing?"
"If I were to join an academic group, what would you want me to change first: benchmark scope, statistical rigor, or research question framing?"

10. Likely Strongest Q&A

Q: "Isn't this just engineering?"

Answer:

The production system is engineering. The research contribution is the evaluation substrate: fixed data, cost model, compute control, DSR/PBO, synthetic ground truth, and reference baselines. This converts engineering infrastructure into a reproducible scientific instrument.
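
One concrete piece of that substrate is the Deflated Sharpe Ratio, which discounts a reported Sharpe ratio by the number of strategies tried. The following is a minimal sketch of the Bailey and López de Prado formula using only the standard library; the function and argument names are illustrative, `sr` and `sr_std` are per-period (not annualized) values, and `kurt` is raw kurtosis (3.0 for a normal distribution).

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(n_trials, sr_std):
    """Approximate expected maximum Sharpe ratio across n_trials (>= 2)
    unskilled strategies whose Sharpe ratios have std dev sr_std."""
    nd = NormalDist()
    return sr_std * (
        (1 - EULER_GAMMA) * nd.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * nd.inv_cdf(1 - 1 / (n_trials * math.e))
    )


def deflated_sharpe(sr, n_obs, n_trials, sr_std, skew=0.0, kurt=3.0):
    """Probability that the true Sharpe ratio exceeds the selection-bias
    threshold, given n_obs return observations and n_trials attempts."""
    sr0 = expected_max_sharpe(n_trials, sr_std)
    denom = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    z = (sr - sr0) * math.sqrt(n_obs - 1) / denom
    return NormalDist().cdf(z)


# Same observed Sharpe ratio, different search effort.
print(deflated_sharpe(sr=0.1, n_obs=1000, n_trials=10, sr_std=0.05))
print(deflated_sharpe(sr=0.1, n_obs=1000, n_trials=1000, sr_std=0.05))
```

The second value is lower than the first: the same backtest becomes less convincing as the number of trials grows, which is why a compute-controlled budget and DSR reporting belong in the same protocol.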

Q: "Why benchmark rather than a new model?"

Answer:

Because without a fixed evaluation substrate, a new model cannot make a credible SOTA claim. In alpha auto search, every paper defines its own data and metric. A method paper now risks being incomparable; a benchmark makes all later method papers measurable.

Q: "Why crypto?"

Answer:

Not because crypto is universal, but because it is public, high-frequency, open-world, and adversarial. It is a clean testbed for the exact failure modes we care about: regime shift, cost sensitivity, and benchmark overfitting.

Q: "What if the benchmark is gamed?"

Answer:

Three defenses: cost-tier reporting, synthetic ground-truth tasks, and versioned held-out future windows. Benchmark gaming cannot be eliminated, but it can be made visible.
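
Cost-tier reporting, the first of these defenses, is simple enough to show in a few lines. This is a minimal sketch: the tier names and basis-point values are hypothetical placeholders, not calibrated costs, and the linear turnover-times-cost model is an assumption.

```python
# Hypothetical cost per unit of turnover, in basis points.
COST_TIERS_BPS = {"optimistic": 2.0, "base": 5.0, "conservative": 10.0}


def net_returns_by_tier(gross_return, turnover):
    """Net return under each tier: gross minus turnover times unit cost.

    gross_return is a fraction (0.01 = 1%); turnover is traded notional
    divided by capital over the same period.
    """
    return {tier: gross_return - turnover * bps / 10_000
            for tier, bps in COST_TIERS_BPS.items()}


# A strategy that earns 1% gross while turning capital over 10 times.
for tier, net in net_returns_by_tier(0.01, 10.0).items():
    print(f"{tier}: {net:.4f}")
```

Under these placeholder numbers the strategy keeps most of its return at the optimistic tier and none at the conservative tier, so reporting all three makes cost-fragile results visible rather than leaving the choice of cost assumption to each submission.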

Q: "What is the first experiment?"

Answer:

Two candidates. If benchmark-first: release v0 dataset + protocol + random/gplearn/M8.6 baselines. If method-first: ETH validation + microstructure recurrence model vs LightGBM/PatchTST/Chronos-like baselines.

11. 24-Hour Checklist

Decide whether to present 12-slide version or 29-slide version.
Prepare one-page handout from HKU_ONE_PAGE_HANDOUT_2026-05-20.md.
Keep RESEARCH_PLAN.md open for detailed roadmap questions.
Keep crypto_alpha_bench_risk_analysis.md open for risk questions.
Keep human_expert_in_loop_research_direction.md open only as backup.
Memorize the one-sentence thesis.
Practice the 5-minute version once.
Prepare a repo link to send after the meeting.

12. Recommendation

For tomorrow, use the following priority:

Main pitch: Crypto-Alpha-Bench.
Primary fallback: RQ1 microstructure recurrence method paper.
Sharp differentiator: human expert discretionary baseline.
Do not lead with: full self-evolving research platform. It is too broad for a first meeting.