HKU Meeting Prep · 2026-05-20

Goal: turn the current notes into a sharp professor-facing discussion pack. Assumed audience: Prof. Kai Han and Prof. Guodong Li. Default format: 20-25 min presentation + 15-20 min discussion. If time is shorter, use one of the compressed paths below (the 12-slide version or the 5-minute version).


1. Meeting Objective

Do not frame the meeting as "please approve my trading system."

Frame it as:

I have built the verification half of an AI-for-alpha-search platform. After surveying the field, my strongest research judgment is that alpha auto search lacks a unified benchmark. I want your feedback on whether Crypto-Alpha-Bench is the right first academic contribution, or whether I should start from a narrower method paper such as microstructure recurrence.

The desired output from the meeting:

  1. Decide whether benchmark-first is academically credible.
  2. If benchmark-first is too broad, decide whether RQ1 method-first is the right fallback.
  3. Identify which parts naturally belong to Prof. Han / Prof. Li.
  4. Ask whether HKU would be a suitable research environment for turning the existing production system into a reproducible research testbed.

2. One-Sentence Thesis

My production trading system already implements a strict generator-verifier separation. The next research step is not to let LLMs trade, but to build the field's first unified benchmark for alpha auto search, so that compute-scaled discovery can be tested rigorously in finance.

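Alternate phrasing:

My existing system already separates LLM generation from the deterministic verifier. The next research step is not to let an LLM trade directly, but to first establish a unified benchmark for alpha auto search, so that "compute-scaled discovery" in finance can be tested rigorously.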


3. Three-Act Structure

Use a benchmark-centered structure. Keep the three RQs as use cases, not as the main object.

Act 1 · I Built a Verification Platform

Takeaway:

I am not starting from an abstract idea. I already have a production-grade verifier and walk-forward testbed.

Show only the minimum:

  • LLM never reaches OMS directly.
  • Deterministic risk gate + kill switch + append-only audit + KMS abstraction.
  • M8.6 walk-forward verification on crypto perp microstructure.
  • Negative result: LightGBM / Optuna validation looked good but chronological test failed, pointing to time-slice stability rather than model capacity.
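
If they ask what "LLM never reaches OMS directly" means in practice, a minimal sketch can help. Everything below (class names, limits, the reason codes, the hash-chained log) is an illustrative assumption, not the production code: the generator may only emit a proposal, a deterministic gate decides, and every decision lands in an append-only audit log.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class OrderProposal:
    """What a generator (e.g. an LLM) may emit: a proposal, never an order."""
    symbol: str
    side: str            # "buy" or "sell"
    notional_usd: float


class AuditLog:
    """Append-only log; each entry is hash-chained to the previous one."""

    def __init__(self):
        self._entries = []
        self._last_hash = "0" * 64

    def append(self, record):
        payload = json.dumps(record, sort_keys=True)
        digest = hashlib.sha256((self._last_hash + payload).encode()).hexdigest()
        self._entries.append({"prev": self._last_hash, "payload": payload, "hash": digest})
        self._last_hash = digest
        return digest

    def verify(self):
        """Recompute the chain; any edited entry breaks it."""
        prev = "0" * 64
        for entry in self._entries:
            expected = hashlib.sha256((prev + entry["payload"]).encode()).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True

    def __len__(self):
        return len(self._entries)


class RiskGate:
    """Deterministic checks only: no model call anywhere on this path."""

    def __init__(self, max_notional_usd, allowed_symbols, audit):
        self.max_notional_usd = max_notional_usd   # hypothetical limit
        self.allowed_symbols = frozenset(allowed_symbols)
        self.audit = audit
        self.kill_switch = False

    def check(self, proposal):
        if self.kill_switch:
            reason = "kill_switch"
        elif proposal.symbol not in self.allowed_symbols:
            reason = "symbol_not_allowed"
        elif proposal.side not in ("buy", "sell"):
            reason = "bad_side"
        elif not (0 < proposal.notional_usd <= self.max_notional_usd):
            reason = "notional_limit"
        else:
            reason = "ok"
        # Rejections are logged too, so the audit trail covers every proposal.
        self.audit.append({"proposal": asdict(proposal), "decision": reason})
        return reason == "ok"
```

Only proposals for which `check` returns `True` would be forwarded to the order path; the gate itself never consults the generator.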

Act 2 · Frontier AI Discovery Has a Common Pattern

Takeaway:

FunSearch / AlphaProof / AlphaEvolve / ASI-ARCH all separate generator and verifier. Finance has a verifier problem and a benchmark problem.

Four patterns:

  • Generator-verifier separation.
  • Cognition base / knowledge grounding.
  • Multi-agent decomposition.
  • Compute-scaled discovery.

Mapping:

  • My system has verification.
  • It lacks discovery/generation.
  • The field lacks a unified baseline, so discovery cannot be compared.

Act 3 · Proposal: Crypto-Alpha-Bench

Takeaway:

Build the missing evaluation substrate first; then RQ1/RQ2/RQ3 become measurable claims.

Six requirements:

  1. Fixed public crypto perp dataset.
  2. Three cost tiers.
  3. Compute-controlled budget.
  4. Multi-metric evaluation: AlphaEval-like dimensions + capacity + Deflated Sharpe Ratio (DSR) + Probability of Backtest Overfitting (PBO).
  5. Synthetic ground-truth task.
  6. Replication-aware must-beat baseline.
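
Requirement 2 (three cost tiers) could be reported roughly as follows. This is a sketch under stated assumptions: the tier names and basis-point values are hypothetical placeholders, and the cost is modeled as a flat fee on position turnover, which is simpler than any fill-calibrated model.

```python
# Hypothetical tiers in basis points per unit of turnover; real values
# would need empirical calibration.
COST_TIERS_BPS = {"optimistic": 1.0, "base": 5.0, "stress": 10.0}


def net_returns(asset_returns, positions, cost_bps):
    """Per-period strategy return after a flat turnover cost.

    positions[i] is the position (e.g. in [-1, 1]) held during period i;
    the cost is charged on the absolute position change entering period i.
    """
    prev_position = 0.0
    out = []
    for asset_ret, position in zip(asset_returns, positions):
        turnover = abs(position - prev_position)
        out.append(position * asset_ret - turnover * cost_bps / 1e4)
        prev_position = position
    return out


def report_by_tier(asset_returns, positions, tiers=COST_TIERS_BPS):
    """Total net return under every tier, so no single cost assumption is hidden."""
    return {
        name: sum(net_returns(asset_returns, positions, bps))
        for name, bps in tiers.items()
    }
```

Reporting all tiers side by side makes cost sensitivity visible: a signal that only survives the optimistic tier is flagged by the table itself.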

Then show:

  • Reference baselines.
  • 12-week roadmap.
  • Professor-specific ownership.
  • Ask for feedback.

4. 12-Slide Version

Use this if you have 15-20 minutes.

| # | Slide | Main Point | Must Say |
|---|---|---|---|
| 1 | Title | Production AI Trading Agent → Self-Evolving Research | "This is not a trading pitch; it is a research infrastructure pitch." |
| 2 | One Thesis | Generator-verifier separation + missing benchmark | Say the one-sentence thesis. |
| 3 | What I Built | M0-M8.6 production stack | LLM is never on the order path. |
| 4 | Safety Verifier | deterministic gate / audit / KMS / kill switch | Compare to Lean kernel only as philosophy, not mathematical certainty. |
| 5 | Walk-Forward Testbed | 526 symbols / 15s bars / microstructure gate / adaptive state | This is your strongest engineering differentiator. |
| 6 | Negative Result | LightGBM / Optuna val strong, test weak | "Bottleneck is time-slice stability, not model capacity." |
| 7 | Frontier Pattern | FunSearch → AlphaProof → AlphaEvolve → ASI-ARCH | 4 common patterns. |
| 8 | Gap Mapping | My system vs frontier | I have verification; I lack generation; the field lacks baseline. |
| 9 | No ImageNet Moment | alpha auto search has no common benchmark | Strong claim, make it clean. |
| 10 | Crypto-Alpha-Bench | 6 requirements | This is the concrete proposal. |
| 11 | Three Use Cases | RQ1 / RQ2 / RQ3 | RQs become benchmark use cases, not disconnected ideas. |
| 12 | Ask | benchmark-first or method-first? | Ask for sharpening, not endorsement. |

5. 5-Minute Version

Use this if the meeting becomes informal or time is cut.

  1. "I built a production-grade crypto trading agent where LLMs are strictly kept away from the order path."
  2. "The strongest research artifact is the verification testbed: walk-forward, microstructure gates, adaptive state controller, and a negative result showing time-slice instability."
  3. "After surveying AI-for-science and LLM alpha mining, I think the field's bottleneck is not another LLM agent, but the absence of a unified benchmark."
  4. "My proposal is Crypto-Alpha-Bench: fixed dataset, cost tiers, compute budget, DSR/PBO, synthetic ground truth, and reference baselines."
  5. "My question is: should this be the first research contribution, or should I narrow to RQ1 first: architectural priors for crypto microstructure recurrence?"
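
If "walk-forward" needs unpacking during the short version, the idea can be shown in a few lines. This is a minimal sketch, not the M8.6 protocol: the function name, the rolling-window layout, and the `embargo` gap are assumptions chosen for illustration. The key property is that every test window lies strictly after its training window.

```python
def walk_forward_folds(n_obs, train_size, test_size, embargo=0):
    """Rolling chronological folds over observation indices 0..n_obs-1.

    Each fold is (train_range, test_range). An optional embargo gap
    separates the two, to limit leakage from overlapping labels.
    """
    folds = []
    start = 0
    while start + train_size + embargo + test_size <= n_obs:
        train = range(start, start + train_size)
        test_start = start + train_size + embargo
        test = range(test_start, test_start + test_size)
        folds.append((train, test))
        start += test_size  # slide forward by one test window
    return folds


if __name__ == "__main__":
    for train, test in walk_forward_folds(100, 50, 10, embargo=5):
        print(f"train {train.start}-{train.stop - 1} -> test {test.start}-{test.stop - 1}")
```

A model that scores well on a shuffled validation split but poorly across these folds is showing the time-slice instability described in point 2.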

6. Professor-Specific Positioning

Prof. Kai Han

Verified public alignment:

  • HKU Visual AI Lab.
  • Open-world learning, spatial intelligence, foundation models, generative AI, agentic / embodied AI.
  • Goal includes reliable AI systems for open-world use.
  • Sources: personal site, HKU CDS profile.

How to speak to him:

Crypto perpetual futures are an extreme open-world reliability setting: non-stationary regimes, adversarial counterparties, unknown failure modes, and costly mistakes. My production agent gives a real-stakes testbed for open-world LLM-agent reliability.

Best hook:

  • RQ2: statistical safety / conformal uncertainty for LLM agents in high-stakes tool-use workflows.
  • Human-expert-in-loop add-on: tacit knowledge extraction from discretionary trader decisions as a financial version of learning from expert demonstrations.
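
For the RQ2 hook, a small split-conformal sketch shows what "conformal uncertainty" could mean operationally. The function names and the act-or-abstain rule are hypothetical. Note the caveat worth stating out loud: the coverage guarantee of split conformal prediction assumes exchangeable data, which non-stationary crypto regimes are likely to violate, and that gap is part of the research question.

```python
import math


def conformal_quantile(cal_scores, alpha):
    """Split-conformal quantile of calibration scores |y - y_hat|.

    Uses the finite-sample rank ceil((n + 1) * (1 - alpha)); if that rank
    exceeds n, the interval must be infinite to keep the guarantee.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(cal_scores)[k - 1]


def predict_interval(point_pred, q):
    """Symmetric interval around a point prediction."""
    return (point_pred - q, point_pred + q)


def should_act(interval):
    """Hypothetical agent rule: act only if the interval excludes zero,
    i.e. the predicted direction is unambiguous at this confidence level."""
    low, high = interval
    return low > 0 or high < 0
```

An agent wrapped this way abstains whenever its calibrated interval straddles zero, which turns uncertainty into an explicit "do nothing" decision instead of a silent guess.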

Avoid:

  • Over-claiming that trading is "the same as vision." Say the link is an analogy in open-world reliability, not modality equivalence.

Prof. Guodong Li

Verified public alignment:

How to speak to him:

The LightGBM / Optuna failure suggests the bottleneck is not model capacity, but time-slice stability. I want to test whether explicit recurrence priors, inspired by "Encoding Recurrence into Transformers", can improve crypto microstructure modeling.

Best hook:

  • RQ1: microstructure-price cross-scale recurrence.
  • Benchmark statistics: DSR, PBO, synthetic ground-truth tasks, multiple-testing correction.
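
To be ready for a question on how DSR would be computed, here is a sketch of the Deflated Sharpe Ratio as described by Bailey and López de Prado, to the best of my reading. Assumptions: returns are per-period (Sharpe is not annualized), `sr_variance` is the variance of Sharpe ratios across all trials and is supplied by the caller, and the function names are my own.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
STD_NORMAL = NormalDist()


def expected_max_sharpe(sr_variance, n_trials):
    """Approximate expected maximum Sharpe among n_trials zero-skill trials."""
    if n_trials < 2:
        return 0.0
    a = STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(sr_variance) * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe_ratio(returns, n_trials, sr_variance):
    """Probability that the true Sharpe exceeds the selection-bias benchmark."""
    t = len(returns)
    mean = sum(returns) / t
    m2 = sum((r - mean) ** 2 for r in returns) / t
    if t < 3 or m2 == 0:
        raise ValueError("need at least 3 non-constant returns")
    m3 = sum((r - mean) ** 3 for r in returns) / t
    m4 = sum((r - mean) ** 4 for r in returns) / t
    sharpe = mean / math.sqrt(m2 * t / (t - 1))   # sample standard deviation
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2                           # non-excess: normal is 3
    denom = 1.0 - skew * sharpe + (kurt - 1.0) / 4.0 * sharpe ** 2
    denom = max(denom, 1e-12)                     # guard degenerate samples
    benchmark = expected_max_sharpe(sr_variance, n_trials)
    z = (sharpe - benchmark) * math.sqrt(t - 1) / math.sqrt(denom)
    return STD_NORMAL.cdf(z)
```

With `n_trials=1` the benchmark is zero and the result reduces to a Probabilistic Sharpe Ratio against zero; as the number of trials grows, the same backtest earns a lower score. This is the multiple-testing correction in concrete form.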

Avoid:

  • Saying "12 folds prove significance." They do not. Say 12 folds are a production screening protocol; the research version needs 100+ folds plus combinatorially symmetric cross-validation (CSCV), DSR, and PBO.
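
A compact sketch of PBO estimated via CSCV, in case the discussion turns to how the research version would differ from 12-fold screening. It follows my understanding of the Bailey et al. procedure; the function names, the Sharpe-based selection, and the default of 8 blocks are assumptions.

```python
import math
from itertools import combinations


def sharpe(xs):
    """Per-period Sharpe ratio; 0.0 for a constant series."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return mean / math.sqrt(var) if var > 0 else 0.0


def pbo_cscv(returns, n_blocks=8):
    """Probability of Backtest Overfitting.

    returns: list of T rows, each a list of N strategy returns for one period.
    Rows are cut into n_blocks contiguous blocks; every choice of half the
    blocks serves as in-sample (IS), the remainder as out-of-sample (OOS).
    """
    if n_blocks % 2:
        raise ValueError("n_blocks must be even")
    n_strats = len(returns[0])
    size = len(returns) // n_blocks
    blocks = [returns[i * size:(i + 1) * size] for i in range(n_blocks)]
    logits = []
    for is_idx in combinations(range(n_blocks), n_blocks // 2):
        is_rows = [row for i in is_idx for row in blocks[i]]
        oos_rows = [row for i in range(n_blocks) if i not in is_idx
                    for row in blocks[i]]
        is_sr = [sharpe([row[j] for row in is_rows]) for j in range(n_strats)]
        oos_sr = [sharpe([row[j] for row in oos_rows]) for j in range(n_strats)]
        best = max(range(n_strats), key=is_sr.__getitem__)
        # OOS rank of the IS winner: 1 = worst, n_strats = best.
        rank = sum(1 for s in oos_sr if s <= oos_sr[best])
        omega = rank / (n_strats + 1)
        logits.append(math.log(omega / (1.0 - omega)))
    # Share of splits where the IS winner lands at or below the OOS median.
    return sum(1 for v in logits if v <= 0) / len(logits)
```

A PBO near 0 means the in-sample winner tends to stay above median out of sample; a value near 0.5 or higher suggests the selection process is mostly fitting noise.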

7. The Human-Expert-In-Loop Revision

Use this as a differentiator, not as the main thesis unless they ask "why you?"

Core claim:

The unique asset is not only production verification infrastructure. It is production infrastructure plus access to a real discretionary expert whose tacit market-reading decisions can be logged.

How to insert it:

  • On the reference baseline slide, add optional row: Human Expert Discretionary Baseline.
  • In Q&A, if asked about uniqueness vs GAIR/Stanford/DeepMind, answer:
      • They may have stronger institutional resources.
      • I have a real production trading workflow and a real discretionary expert in the loop.
      • That enables a benchmark baseline most academic teams cannot easily reproduce.

Do not overdo it:

  • This direction needs trader cooperation.
  • It rests on an assumption: the expert's real edge is tape reading / directional timing.
  • Keep it as a sharp future extension, not a fully proven claim.

8. Risks To Acknowledge Proactively

Mention these before they do. It makes the proposal feel mature.

| Risk | Clean Answer |
|---|---|
| "Why are you the right person?" | I am not the best institutionally positioned, but I have production-grade tradability infrastructure and speed. I should open v0 early and invite collaboration. |
| "Is crypto too narrow?" | Crypto is narrow but clean: 24/7, public, open-world, adversarial, and microstructure-rich. v0 crypto; v1 can extend. |
| "Your verifier may be overfit." | Correct. I will publish design history, sensitivity analysis, and held-out future windows. |
| "Cost model is subjective." | Use three tiers and empirical calibration from fills; do not claim transfer without recalibration. |
| "Benchmark may be too big." | Plan B is RQ1 method-first; benchmark becomes evaluation protocol. |

9. Questions To Ask Them

Ask these explicitly near the end.

  1. "Do you see Crypto-Alpha-Bench as a legitimate first research contribution, or would you recommend a narrower method paper first?"
  2. "For Prof. Li: is the microstructure recurrence framing mathematically defensible, or should I formulate it differently?"
  3. "For Prof. Han: does open-world LLM-agent reliability in financial execution sound aligned with your lab's open-world agenda, or is it too far from the lab's modality focus?"
  4. "What would be the smallest 8-12 week experiment that would convince you this agenda is worth pursuing?"
  5. "If I were to join an academic group, what would you want me to change first: benchmark scope, statistical rigor, or research question framing?"

10. Likely Strongest Q&A

Q: "Isn't this just engineering?"

Answer:

The production system is engineering. The research contribution is the evaluation substrate: fixed data, cost model, compute control, DSR/PBO, synthetic ground truth, and reference baselines. This converts engineering infrastructure into a reproducible scientific instrument.

Q: "Why benchmark rather than a new model?"

Answer:

Because without a fixed evaluation substrate, a new model cannot make a credible SOTA claim. In alpha auto search, every paper defines its own data and metric. A method paper now risks being incomparable; a benchmark makes all later method papers measurable.

Q: "Why crypto?"

Answer:

Not because crypto is universal, but because it is public, high-frequency, open-world, and adversarial. It is a clean testbed for the exact failure modes we care about: regime shift, cost sensitivity, and benchmark overfitting.

Q: "What if the benchmark is gamed?"

Answer:

Three defenses: cost-tier reporting, synthetic ground-truth tasks, and versioned held-out future windows. Benchmark gaming cannot be eliminated, but it can be made visible.
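
The synthetic ground-truth defense can be illustrated with a toy task: plant a signal of known strength in one feature, then check whether a discovery method recovers it. This is a sketch only. The generator, the linear form of the planted signal, and the correlation-based ranker (standing in for any search method) are all assumptions.

```python
import math
import random


def make_synthetic_task(n_obs=2000, n_features=10, planted=3, strength=0.2, seed=0):
    """Gaussian noise features; the target depends on exactly one of them."""
    rng = random.Random(seed)
    features = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_obs)]
    target = [strength * row[planted] + rng.gauss(0, 1) for row in features]
    return features, target


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def rank_features(features, target):
    """Stand-in 'discovery method': rank features by |correlation| with target."""
    n_features = len(features[0])
    scores = [abs(pearson([row[j] for row in features], target))
              for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: -scores[j])


if __name__ == "__main__":
    feats, tgt = make_synthetic_task(strength=0.5)
    print("top-ranked feature:", rank_features(feats, tgt)[0])
```

Because the planted index is known, a method that reports strong alpha on real data but fails to recover the planted feature here is exposed, which is what "made visible" means in the answer above.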

Q: "What is the first experiment?"

Answer:

Two candidates. If benchmark-first: release v0 dataset + protocol + random/gplearn/M8.6 baselines. If method-first: ETH validation + microstructure recurrence model vs LightGBM/PatchTST/Chronos-like baselines.


11. 24-Hour Checklist

  • Decide whether to present 12-slide version or 29-slide version.
  • Prepare one-page handout from HKU_ONE_PAGE_HANDOUT_2026-05-20.md.
  • Keep RESEARCH_PLAN.md open for detailed roadmap questions.
  • Keep crypto_alpha_bench_risk_analysis.md open for risk questions.
  • Keep human_expert_in_loop_research_direction.md open only as backup.
  • Memorize the one-sentence thesis.
  • Practice the 5-minute version once.
  • Prepare a repo link to send after the meeting.

12. Recommendation

For tomorrow, use the following priority:

  1. Main pitch: Crypto-Alpha-Bench.
  2. Primary fallback: RQ1 microstructure recurrence method paper.
  3. Sharp differentiator: human expert discretionary baseline.
  4. Do not lead with: full self-evolving research platform. It is too broad for a first meeting.