HKU Meeting Prep · 2026-05-20
Goal: turn the current notes into a sharp professor-facing discussion pack. Audience assumed: Prof. Kai Han + Prof. Guodong Li. Default format: 20-25 min presentation + 15-20 min discussion. If time is shorter, use the compressed path below.
1. Meeting Objective
Do not frame the meeting as "please approve my trading system."
Frame it as:
I have built the verification half of an AI-for-alpha-search platform. After surveying the field, my strongest research judgment is that alpha auto search lacks a unified benchmark. I want your feedback on whether Crypto-Alpha-Bench is the right first academic contribution, or whether I should start from a narrower method paper such as microstructure recurrence.
The desired output from the meeting:
- Decide whether benchmark-first is academically credible.
- If benchmark-first is too broad, decide whether RQ1 method-first is the right fallback.
- Identify which parts naturally belong to Prof. Han / Prof. Li.
- Ask whether HKU would be a suitable research environment for turning the existing production system into a reproducible research testbed.
2. One-Sentence Thesis
My production trading system already implements a strict generator-verifier separation. The next research step is not to let LLMs trade, but to build the field's first unified benchmark for alpha auto search, so that compute-scaled discovery can be tested rigorously in finance.
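Alternate phrasing (translated from the Chinese version):
My existing system already separates LLM generation from the deterministic verifier. The next research step is not to let the LLM trade directly, but to first establish a unified benchmark for alpha auto search, so that "compute-scaled discovery" in finance can be tested rigorously.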
3. Recommended Storyline
Use benchmark-centered structure. Keep the three RQs as use cases, not as the main object.
Act 1 · I Built a Verification Platform
Takeaway:
I am not starting from an abstract idea. I already have a production-grade verifier and walk-forward testbed.
Show only the minimum:
- LLM never reaches OMS directly.
- Deterministic risk gate + kill switch + append-only audit + KMS abstraction.
- M8.6 walk-forward verification on crypto perp microstructure.
- Negative result: LightGBM / Optuna validation looked good but chronological test failed, pointing to time-slice stability rather than model capacity.
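
The gap between a strong validation score and a failed chronological test is easiest to explain with a split that never lets training data come after test data. The block below is a minimal sketch of such a rolling split, not the M8.6 implementation; the function name, the window sizes, and the `embargo` gap between train and test are illustrative assumptions.

```python
from typing import Iterator, Tuple


def walk_forward_splits(
    n_samples: int, train_size: int, test_size: int, embargo: int = 0
) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges in strict chronological order.

    `embargo` (an assumption of this sketch) skips a few bars between
    train and test so overlapping labels cannot leak across the boundary.
    """
    start = 0
    while start + train_size + embargo + test_size <= n_samples:
        train = range(start, start + train_size)
        test_start = start + train_size + embargo
        test = range(test_start, test_start + test_size)
        yield train, test
        start += test_size  # roll the whole window forward by one test block


# Hypothetical sizes, chosen only to keep the example small.
folds = list(walk_forward_splits(n_samples=100, train_size=40, test_size=10, embargo=5))
for train, test in folds:
    print(f"train {train.start}-{train.stop - 1} -> test {test.start}-{test.stop - 1}")
```

Reporting per-fold scores from a split like this, instead of one pooled validation number, is what makes time-slice instability visible.
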
Act 2 · Frontier AI Discovery Has a Common Pattern
Takeaway:
FunSearch / AlphaProof / AlphaEvolve / ASI-ARCH all separate generator and verifier. Finance has a verifier problem and a benchmark problem.
Four patterns:
- Generator-verifier separation.
- Cognition base / knowledge grounding.
- Multi-agent decomposition.
- Compute-scaled discovery.
Mapping:
- My system has verification.
- It lacks discovery/generation.
- The field lacks a unified baseline, so discovery cannot be compared.
Act 3 · Proposal: Crypto-Alpha-Bench
Takeaway:
Build the missing evaluation substrate first; then RQ1/RQ2/RQ3 become measurable claims.
Six requirements:
- Fixed public crypto perp dataset.
- Three cost tiers.
- Compute-controlled budget.
- Multi-metric evaluation: AlphaEval-like dimensions + capacity + DSR + PBO.
- Synthetic ground-truth task.
- Replication-aware must-beat baseline.
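
If the professors ask what DSR adds to the metric list, a short sketch helps. The block below follows the Deflated Sharpe Ratio as defined by Bailey and López de Prado: the observed Sharpe ratio is compared against the maximum Sharpe expected from `n_trials` unskilled attempts. It is a sketch under stated assumptions, not the benchmark's scoring code: all Sharpe inputs are per-period (not annualized), `kurt` is raw kurtosis (3 for a normal distribution), and the function and parameter names are my own.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(n_trials: int, sr_variance: float) -> float:
    """Approximate expected maximum Sharpe ratio across n_trials unskilled trials."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    nd = NormalDist()
    return math.sqrt(sr_variance) * (
        (1 - EULER_GAMMA) * nd.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * nd.inv_cdf(1 - 1 / (n_trials * math.e))
    )


def deflated_sharpe_ratio(
    sr: float, n_obs: int, skew: float, kurt: float, n_trials: int, sr_variance: float
) -> float:
    """Probability that the true Sharpe ratio exceeds the multiple-testing hurdle."""
    sr0 = expected_max_sharpe(n_trials, sr_variance)
    # Standard-error term of the Sharpe estimator under non-normal returns.
    denom = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    z = (sr - sr0) * math.sqrt(n_obs - 1) / denom
    return NormalDist().cdf(z)


# Hypothetical inputs: the same observed Sharpe, judged after 10 vs 1000 trials.
dsr_few = deflated_sharpe_ratio(0.2, 500, 0.0, 3.0, n_trials=10, sr_variance=0.01)
dsr_many = deflated_sharpe_ratio(0.2, 500, 0.0, 3.0, n_trials=1000, sr_variance=0.01)
print(f"DSR after 10 trials: {dsr_few:.3f}, after 1000 trials: {dsr_many:.3f}")
```

The same observed Sharpe ratio earns a lower DSR as the number of trials grows, which is why the compute-controlled budget and DSR belong in the same requirement list.
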
Then show:
- Reference baselines.
- 12-week roadmap.
- Professor-specific ownership.
- Ask for feedback.
4. 12-Slide Version
Use this if you have 15-20 minutes.
| # | Slide | Main Point | Must Say |
|---|---|---|---|
| 1 | Title | Production AI Trading Agent → Self-Evolving Research | "This is not a trading pitch; it is a research infrastructure pitch." |
| 2 | One Thesis | Generator-verifier separation + missing benchmark | Say the one-sentence thesis. |
| 3 | What I Built | M0-M8.6 production stack | LLM is never on the order path. |
| 4 | Safety Verifier | deterministic gate / audit / KMS / kill switch | Compare to Lean kernel only as philosophy, not mathematical certainty. |
| 5 | Walk-Forward Testbed | 526 symbols / 15s bars / microstructure gate / adaptive state | This is your strongest engineering differentiator. |
| 6 | Negative Result | LightGBM / Optuna val strong, test weak | "Bottleneck is time-slice stability, not model capacity." |
| 7 | Frontier Pattern | FunSearch → AlphaProof → AlphaEvolve → ASI-ARCH | 4 common patterns. |
| 8 | Gap Mapping | My system vs frontier | I have verification; I lack generation; the field lacks baseline. |
| 9 | No ImageNet Moment | alpha auto search has no common benchmark | Strong claim, make it clean. |
| 10 | Crypto-Alpha-Bench | 6 requirements | This is the concrete proposal. |
| 11 | Three Use Cases | RQ1 / RQ2 / RQ3 | RQs become benchmark use cases, not disconnected ideas. |
| 12 | Ask | benchmark-first or method-first? | Ask for sharpening, not endorsement. |
5. 5-Minute Version
Use this if the meeting becomes informal or time is cut.
- "I built a production-grade crypto trading agent where LLMs are strictly kept away from the order path."
- "The strongest research artifact is the verification testbed: walk-forward, microstructure gates, adaptive state controller, and a negative result showing time-slice instability."
- "After surveying AI-for-science and LLM alpha mining, I think the field's bottleneck is not another LLM agent, but the absence of a unified benchmark."
- "My proposal is Crypto-Alpha-Bench: fixed dataset, cost tiers, compute budget, DSR/PBO, synthetic ground truth, and reference baselines."
- "My question is: should this be the first research contribution, or should I narrow to RQ1 first: architectural priors for crypto microstructure recurrence?"
6. Professor-Specific Positioning
Prof. Kai Han
Verified public alignment:
- HKU Visual AI Lab.
- Open-world learning, spatial intelligence, foundation models, generative AI, agentic / embodied AI.
- Goal includes reliable AI systems for open-world use.
- Sources: personal site, HKU CDS profile.
How to speak to him:
Crypto perpetual futures are an extreme open-world reliability setting: non-stationary regimes, adversarial counterparties, unknown failure modes, and costly mistakes. My production agent gives a real-stakes testbed for open-world LLM-agent reliability.
Best hook:
- RQ2: statistical safety / conformal uncertainty for LLM agents in high-stakes tool-use workflows.
- Human-expert-in-loop add-on: tacit knowledge extraction from discretionary trader decisions as a financial version of learning from expert demonstrations.
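
To make the RQ2 hook concrete, here is a minimal sketch of split conformal prediction used as an abstention gate. It is only a starting point: the coverage guarantee assumes exchangeable data, which non-stationary markets violate, and closing that gap is the research question. The function names and the "abstain when the interval contains zero" rule are hypothetical illustrations, not part of the existing system.

```python
import math
from typing import List, Tuple


def conformal_quantile(calibration_scores: List[float], alpha: float) -> float:
    """Split-conformal threshold for miscoverage level alpha.

    `calibration_scores` are absolute residuals |y - y_hat| on held-out data.
    """
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calibration_scores)[k - 1]


def prediction_interval(point_forecast: float, q: float) -> Tuple[float, float]:
    """Symmetric interval around the point forecast."""
    return point_forecast - q, point_forecast + q


def should_abstain(point_forecast: float, q: float) -> bool:
    """Hypothetical gate: abstain when the interval cannot rule out the opposite sign."""
    lo, hi = prediction_interval(point_forecast, q)
    return lo <= 0.0 <= hi


# Toy calibration residuals, not real data.
scores = [float(i) for i in range(1, 10)]
q90 = conformal_quantile(scores, alpha=0.1)
print(q90, should_abstain(3.0, q90), should_abstain(12.0, q90))
```
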
Avoid:
- Over-claiming that trading is "the same as vision." Say the analogy is about open-world reliability, not modality equivalence.
Prof. Guodong Li
Verified public alignment:
- Time series analysis.
- Financial econometrics.
- Quantile regression.
- High-dimensional data analysis.
- Machine learning.
- Encoding Recurrence into Transformers is a direct anchor.
- Sources: HKU Science profile, HKU Institute of Data Science profile.
How to speak to him:
The LightGBM / Optuna failure suggests the bottleneck is not model capacity, but time-slice stability. I want to test whether explicit recurrence priors, inspired by Encoding Recurrence into Transformers, can improve crypto microstructure modeling.
Best hook:
- RQ1: microstructure-price cross-scale recurrence.
- Benchmark statistics: DSR, PBO, synthetic ground-truth tasks, multiple-testing correction.
Avoid:
- Saying "12 folds prove significance." They do not. Say 12 folds are a production screening protocol; the research version needs 100+ folds / CSCV / DSR / PBO.
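
The CSCV / PBO point can be backed with a small sketch. The block below estimates the Probability of Backtest Overfitting with combinatorially symmetric cross-validation: split the history into blocks, use every half of the blocks as in-sample, pick the in-sample winner, and check where it ranks out of sample. It is a simplified illustration under my own assumptions (per-period Sharpe as the score, equal-length blocks, hypothetical function names), not the research protocol.

```python
import math
from itertools import combinations
from statistics import mean, pstdev
from typing import List, Sequence


def sharpe(returns: Sequence[float]) -> float:
    """Per-period Sharpe ratio; zero when the series has no variance."""
    sd = pstdev(returns)
    return mean(returns) / sd if sd > 0 else 0.0


def pbo_cscv(returns_by_strategy: List[List[float]], n_blocks: int = 8) -> float:
    """Fraction of CSCV splits where the in-sample winner lands at or below
    the out-of-sample median rank (logit <= 0)."""
    n_obs = len(returns_by_strategy[0])
    size = n_obs // n_blocks
    blocks = [range(i * size, (i + 1) * size) for i in range(n_blocks)]
    n_strategies = len(returns_by_strategy)
    logits = []
    for is_ids in combinations(range(n_blocks), n_blocks // 2):
        is_idx = [t for b in is_ids for t in blocks[b]]
        oos_idx = [t for b in range(n_blocks) if b not in is_ids for t in blocks[b]]
        is_perf = [sharpe([r[t] for t in is_idx]) for r in returns_by_strategy]
        oos_perf = [sharpe([r[t] for t in oos_idx]) for r in returns_by_strategy]
        best = max(range(n_strategies), key=is_perf.__getitem__)
        rank = sum(p <= oos_perf[best] for p in oos_perf)  # 1 = worst, N = best
        omega = rank / (n_strategies + 1)  # relative rank in (0, 1)
        logits.append(math.log(omega / (1 - omega)))
    return sum(lam <= 0 for lam in logits) / len(logits)


# Toy strategies that each look good in one half and bad in the other.
up, down = [0.02, 0.01, 0.02, 0.01], [-0.02, -0.01, -0.02, -0.01]
flat = [0.001, -0.001, 0.001, -0.001]
overfit_pbo = pbo_cscv([up + down, down + up, flat + flat], n_blocks=2)
print(f"PBO on the toy overfit set: {overfit_pbo:.2f}")
```

With only a handful of folds the estimate rests on very few splits, which is the reason to say that 12 folds are a screening protocol rather than a significance test.
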
7. The Human-Expert-In-Loop Revision
Use this as a differentiator, not as the main thesis unless they ask "why you?"
Core claim:
The unique asset is not only production verification infrastructure. It is production infrastructure plus access to a real discretionary expert whose tacit market-reading decisions can be logged.
How to insert it:
- On the reference baseline slide, add optional row: Human Expert Discretionary Baseline.
- In Q&A, if asked about uniqueness vs GAIR/Stanford/DeepMind, answer:
- They may have stronger institutional resources.
- I have a real production trading workflow and a real discretionary expert in the loop.
- That enables a benchmark baseline most academic teams cannot easily reproduce.
Do not overdo it:
- This direction needs trader cooperation.
- It rests on an assumption: the expert's real edge is tape reading / directional timing.
- Keep it as a sharp future extension, not a fully proven claim.
8. Risks To Acknowledge Proactively
Mention these before they do. It makes the proposal feel mature.
| Risk | Clean Answer |
|---|---|
| "Why are you the right person?" | I am not the best-positioned candidate institutionally, but I have production-grade tradability infrastructure and speed. I should open v0 early and invite collaboration. |
| "Is crypto too narrow?" | Crypto is narrow but clean: 24/7, public, open-world, adversarial, and microstructure-rich. v0 crypto; v1 can extend. |
| "Your verifier may be overfit." | Correct. I will publish design history, sensitivity analysis, and held-out future windows. |
| "Cost model is subjective." | Use three tiers and empirical calibration from fills; do not claim transfer without recalibration. |
| "Benchmark may be too big." | Plan B is RQ1 method-first; benchmark becomes evaluation protocol. |
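
For the cost-model risk, it may help to show how the same strategy is scored under three tiers. The sketch below charges a proportional cost on turnover; the tier names and basis-point values are placeholders invented for illustration, not calibrated numbers, and the real tiers would come from the empirical fill calibration mentioned in the table.

```python
from typing import Dict, List

# Placeholder tiers (assumption of this sketch): cost per unit of turnover, in bps.
COST_TIERS_BPS: Dict[str, float] = {
    "optimistic": 2.0,
    "base": 5.0,
    "conservative": 10.0,
}


def net_returns(
    asset_returns: List[float], positions: List[float], cost_bps: float
) -> List[float]:
    """Per-period strategy return after proportional trading costs.

    positions[t] is the position held over period t; the account starts flat.
    """
    net = []
    prev_pos = 0.0
    for ret, pos in zip(asset_returns, positions):
        turnover = abs(pos - prev_pos)
        net.append(pos * ret - turnover * cost_bps / 1e4)
        prev_pos = pos
    return net


# Toy inputs: enter long, hold, then exit.
asset_returns = [0.01, 0.0, 0.01]
positions = [1.0, 1.0, 0.0]
totals = {
    tier: sum(net_returns(asset_returns, positions, bps))
    for tier, bps in COST_TIERS_BPS.items()
}
print(totals)
```

Reporting all three totals side by side, rather than one headline number, is what makes the cost assumption inspectable.
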
9. Questions To Ask Them
Ask these explicitly near the end.
- "Do you see Crypto-Alpha-Bench as a legitimate first research contribution, or would you recommend a narrower method paper first?"
- "For Prof. Li: is the microstructure recurrence framing mathematically defensible, or should I formulate it differently?"
- "For Prof. Han: does open-world LLM-agent reliability in financial execution sound aligned with your lab's open-world agenda, or is it too far from the lab's modality focus?"
- "What would be the smallest 8-12 week experiment that would convince you this agenda is worth pursuing?"
- "If I were to join an academic group, what would you want me to change first: benchmark scope, statistical rigor, or research question framing?"
10. Likely Strongest Q&A
Q: "Isn't this just engineering?"
Answer:
The production system is engineering. The research contribution is the evaluation substrate: fixed data, cost model, compute control, DSR/PBO, synthetic ground truth, and reference baselines. This converts engineering infrastructure into a reproducible scientific instrument.
Q: "Why benchmark rather than a new model?"
Answer:
Because without a fixed evaluation substrate, a new model cannot make a credible SOTA claim. In alpha auto search, every paper defines its own data and metric. A method paper now risks being incomparable; a benchmark makes all later method papers measurable.
Q: "Why crypto?"
Answer:
Not because crypto is universal, but because it is public, high-frequency, open-world, and adversarial. It is a clean testbed for the exact failure modes we care about: regime shift, cost sensitivity, and benchmark overfitting.
Q: "What if the benchmark is gamed?"
Answer:
Three defenses: cost-tier reporting, synthetic ground-truth tasks, and versioned held-out future windows. Benchmark gaming cannot be eliminated, but it can be made visible.
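
The synthetic ground-truth defense can be sketched in a few lines: plant one feature with a known correlation to next-period returns, surround it with pure-noise decoys, and check whether a search method recovers the planted feature. The generator below is an assumed toy design (Gaussian data, a single linear signal, hypothetical function names), not the benchmark task itself.

```python
import math
import random
from typing import List, Tuple


def make_planted_task(
    n_obs: int, planted_ic: float, n_decoys: int, seed: int = 0
) -> Tuple[List[float], List[List[float]], List[float]]:
    """Return (signal, decoys, returns); corr(signal, returns) is planted_ic
    in population, while decoys are independent noise."""
    rng = random.Random(seed)
    signal = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    noise_scale = math.sqrt(1.0 - planted_ic ** 2)  # keeps returns at unit variance
    returns = [planted_ic * s + noise_scale * rng.gauss(0.0, 1.0) for s in signal]
    decoys = [[rng.gauss(0.0, 1.0) for _ in range(n_obs)] for _ in range(n_decoys)]
    return signal, decoys, returns


def pearson(x: List[float], y: List[float]) -> float:
    """Sample Pearson correlation."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


signal, decoys, returns = make_planted_task(n_obs=4000, planted_ic=0.2, n_decoys=5)
planted_score = pearson(signal, returns)
decoy_scores = [pearson(d, returns) for d in decoys]
print(f"planted IC estimate: {planted_score:.3f}")
print("decoy ICs:", [round(s, 3) for s in decoy_scores])
```

Because the true answer is known, a method that ranks a decoy above the planted feature is visibly fitting noise, which is how gaming becomes detectable rather than merely suspected.
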
Q: "What is the first experiment?"
Answer:
Two candidates. If benchmark-first: release v0 dataset + protocol + random/gplearn/M8.6 baselines. If method-first: ETH validation + microstructure recurrence model vs LightGBM/PatchTST/Chronos-like baselines.
11. 24-Hour Checklist
- Decide whether to present 12-slide version or 29-slide version.
- Prepare one-page handout from `HKU_ONE_PAGE_HANDOUT_2026-05-20.md`.
- Keep `RESEARCH_PLAN.md` open for detailed roadmap questions.
- Keep `crypto_alpha_bench_risk_analysis.md` open for risk questions.
- Keep `human_expert_in_loop_research_direction.md` open only as backup.
- Memorize the one-sentence thesis.
- Practice the 5-minute version once.
- Prepare a repo link to send after the meeting.
12. Recommendation
For tomorrow, use the following priority:
- Main pitch: Crypto-Alpha-Bench.
- Primary fallback: RQ1 microstructure recurrence method paper.
- Sharp differentiator: human expert discretionary baseline.
- Do not lead with: full self-evolving research platform. It is too broad for a first meeting.