ICLR · 2021-2026

Surveyed 31821 papers across NeurIPS / ICML / ICLR proceedings 2021-2026. Stage A retained 4725; Stage B retained 3535. Final curated set: 120 papers.
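The funnel above can be read as per-stage retention rates. The following is a minimal sketch that uses only the four counts quoted in the summary; the stage labels and the helper name `retention` are illustrative and not taken from the pipeline itself.

```python
# Counts quoted in the summary line above; the labels are illustrative.
funnel = [
    ("surveyed", 31821),
    ("stage_a", 4725),
    ("stage_b", 3535),
    ("curated", 120),
]


def retention(stages):
    """Return (stage name, fraction kept relative to the previous stage)."""
    rates = []
    for (_, prev_count), (name, count) in zip(stages, stages[1:]):
        rates.append((name, count / prev_count))
    return rates


rates = retention(funnel)
for name, rate in rates:
    print(f"{name}: {rate:.1%} of previous stage")

# Fraction of all surveyed papers that reach the curated set.
overall = funnel[-1][1] / funnel[0][1]
print(f"overall: {overall:.2%}")
```

Stage A is by far the most selective step in relative terms before final curation: roughly one paper in seven survives it, while Stage B keeps about three quarters of what remains.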

Auto-generated 2026-06-03T03:07:09Z. Maintained by Paul Weng. Source pipeline: Semantic Scholar API + Codex relevance scoring.

Statistical Overview

Raw Papers Fetched Per Venue Per Year

Venue	2021	2022	2023	2024	2025	2026	Total
ICML	1155	1348	1911	2696	2699	15	9824
ICLR	934	1325	2081	2839	1765	12	8956
NeurIPS	2342	2790	3580	4257	67	5	13041
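As a sanity check, the row totals in the table above can be recomputed from the per-year counts. This sketch simply re-adds the numbers as printed; the variable names are illustrative.

```python
# Raw per-venue counts for 2021..2026, copied from the table above.
raw = {
    "ICML": [1155, 1348, 1911, 2696, 2699, 15],
    "ICLR": [934, 1325, 2081, 2839, 1765, 12],
    "NeurIPS": [2342, 2790, 3580, 4257, 67, 5],
}
reported_totals = {"ICML": 9824, "ICLR": 8956, "NeurIPS": 13041}

# Recompute each row total and the grand total.
venue_totals = {venue: sum(counts) for venue, counts in raw.items()}
grand_total = sum(venue_totals.values())

assert venue_totals == reported_totals
assert grand_total == 31821  # the surveyed-paper count quoted in the summary
```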

Curated Papers Per Venue Per Year

Venue	2021	2022	2023	2024	2025	Total
ICML	8	6	8	9	3	34
ICLR	9	7	11	19	0	46
NeurIPS	12	8	12	8	0	40

Top 10 Most-Cited Papers In Curated Set

Rank	Paper	Venue	Year	Citations	Relevance	Theme
1	Direct Preference Optimization: Your Language Model is Secretly a Reward Model	NeurIPS	2023	8954	4	Reinforcement Learning Foundations
2	Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting	NeurIPS	2021	4676	4	Time Series Foundation Models + Architectural Priors
3	Reflexion: language agents with verbal reinforcement learning	NeurIPS	2023	3716	4	LLM Agent Patterns
4	Efficiently Modeling Long Sequences with Structured State Spaces	ICLR	2021	3636	5	Time Series Foundation Models + Architectural Priors
5	A Time Series is Worth 64 Words: Long-term Forecasting with Transformers	ICLR	2022	3550	5	Time Series Foundation Models + Architectural Priors
6	SWE-bench: Can Language Models Resolve Real-World GitHub Issues?	ICLR	2023	2430	4	Benchmark / Evaluation Methodology Papers
7	Decision Transformer: Reinforcement Learning via Sequence Modeling	NeurIPS	2021	2284	4	Reinforcement Learning Foundations
8	The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games	NeurIPS	2021	2248	5	Reinforcement Learning Foundations
9	TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis	ICLR	2022	2115	4	Time Series Foundation Models + Architectural Priors
10	iTransformer: Inverted Transformers Are Effective for Time Series Forecasting	ICLR	2023	1797	5	Time Series Foundation Models + Architectural Priors

Top 10 By Relevance Score

Rank	Paper	Venue	Year	Citations	Relevance	Theme
1	Efficiently Modeling Long Sequences with Structured State Spaces	ICLR	2021	3636	5	Time Series Foundation Models + Architectural Priors
2	A Time Series is Worth 64 Words: Long-term Forecasting with Transformers	ICLR	2022	3550	5	Time Series Foundation Models + Architectural Priors
3	The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games	NeurIPS	2021	2248	5	Reinforcement Learning Foundations
4	iTransformer: Inverted Transformers Are Effective for Time Series Forecasting	ICLR	2023	1797	5	Time Series Foundation Models + Architectural Priors
5	Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation	NeurIPS	2023	1783	5	Benchmark / Evaluation Methodology Papers
6	CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis	ICLR	2022	1482	5	Symbolic Regression / Program Synthesis
7	Gorilla: Large Language Model Connected with Massive APIs	NeurIPS	2023	1254	5	Symbolic Regression / Program Synthesis
8	Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers	NeurIPS	2021	1139	5	Time Series Foundation Models + Architectural Priors
9	Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs	ICLR	2023	938	5	Benchmark / Evaluation Methodology Papers
10	Large Language Models Are Zero-Shot Time Series Forecasters	NeurIPS	2023	724	5	Time Series Foundation Models + Architectural Priors
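The ordering of the table above is consistent with sorting by relevance score first and by citation count second. That rule is inferred from the rows and is not stated by the pipeline, so the sketch below is an assumption; the helper name `rank_by_relevance` is hypothetical and the titles are shortened.

```python
# A few (title, citations, relevance) rows copied from the two tables above.
papers = [
    ("Direct Preference Optimization", 8954, 4),
    ("Efficiently Modeling Long Sequences with Structured State Spaces", 3636, 5),
    ("A Time Series is Worth 64 Words", 3550, 5),
    ("The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games", 2248, 5),
    ("Reflexion", 3716, 4),
]


def rank_by_relevance(rows):
    """Sort by relevance (descending), breaking ties by citations (descending)."""
    return sorted(rows, key=lambda row: (-row[2], -row[1]))


ranked = rank_by_relevance(papers)
for rank, (title, citations, relevance) in enumerate(ranked, start=1):
    print(rank, title, citations, relevance)
```

Under this rule a relevance-4 paper never outranks a relevance-5 paper, however many citations it has, which matches the absence of Direct Preference Optimization from the relevance table.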

L0-L5 Integration Map

Papers by Theme

Theme 1: Autonomous Research Agents (AI Scientist family + ASI-ARCH lineage) (count: 4)

CycleResearcher: Improving Automated Research via Automated Review · Yixuan Weng, Minjun Zhu et al. (ICLR, 2024) - Venue: ICLR 2024 / arXiv:2411.00816 - Citations: 118 - Core idea: The automation of scientific discovery has been a long-standing goal within the research community, driven by the potential to accelerate knowledge creation. While significant progress has been made using commercial large language models (LLMs) as research assistants or idea generators, the possibility of automating the entire research process with open-source LLMs remains largely unexplored. - Why relevant here: Maps to L1 · Autonomous Research Agents: it is a direct upstream analogue for alpha auto-search as autonomous research. - Code: N/A - Relevance score: 5

LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery · Pingchuan Ma, Tsun-Hsuan Wang et al. (ICML, 2024) - Venue: ICML 2024 / arXiv:2405.09783 - Citations: 80 - Core idea: Large Language Models have recently gained significant attention in scientific discovery for their extensive knowledge and advanced reasoning capabilities. However, they encounter challenges in effectively simulating observational feedback and grounding it with language to propel advancements in physical scientific discovery. - Why relevant here: Maps to L1 · Autonomous Research Agents: it is a direct upstream analogue for alpha auto-search as autonomous research. - Code: N/A - Relevance score: 5

BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments · Yusuf H. Roohani, Jian Vora et al. (ICLR, 2024) - Venue: ICLR 2024 / arXiv:2405.17631 - Citations: 64 - Core idea: Agents based on large language models have shown great potential in accelerating scientific discovery by leveraging their rich background knowledge and reasoning capabilities. In this paper, we introduce BioDiscoveryAgent, an agent that designs new experiments, reasons about their outcomes, and efficiently navigates the hypothesis space to reach desired solutions. - Why relevant here: Maps to L1 · Autonomous Research Agents: it is a direct upstream analogue for alpha auto-search as autonomous research. - Code: N/A - Relevance score: 5

MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses · Zonglin Yang, Wanhao Liu et al. (ICLR, 2024) - Venue: ICLR 2024 / arXiv:2410.07076 - Citations: 58 - Core idea: Scientific discovery plays a pivotal role in advancing human society, and recent progress in large language models (LLMs) suggests their potential to accelerate this process. However, it remains unclear whether LLMs can autonomously generate novel and valid hypotheses in chemistry. - Why relevant here: Maps to L1 · Autonomous Research Agents: it is a direct upstream analogue for alpha auto-search as autonomous research. - Code: N/A - Relevance score: 5

Theme 2: LLM Agent Patterns (Reasoning + Planning + Tool use + Memory) (count: 15)

Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models · Andy Zhou, Kai Yan et al. (ICML, 2023) - Venue: ICML 2023 / arXiv:2310.04406 - Citations: 493 - Core idea: While language models (LMs) have shown potential across a range of decision-making tasks, their reliance on simple acting processes limits their broad deployment as autonomous agents. In this paper, we introduce Language Agent Tree Search (LATS), the first general framework that synergizes the capabilities of LMs in reasoning, acting, and planning. - Why relevant here: Maps to L3 · Reasoning / Planning / Memory / Tool-use: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

Chain of Agents: Large Language Models Collaborating on Long-Context Tasks · Yusen Zhang, Ruoxi Sun et al. (NeurIPS, 2024) - Venue: NeurIPS 2024 / arXiv:2406.02818 - Citations: 236 - Core idea: Addressing the challenge of effectively processing long contexts has become a critical issue for Large Language Models (LLMs). Two common strategies have emerged: 1) reducing the input length, such as retrieving relevant chunks by Retrieval-Augmented Generation (RAG), and 2) expanding the context window limit of LLMs. - Why relevant here: Maps to L3 · Reasoning / Planning / Memory / Tool-use: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

Scaling Large-Language-Model-based Multi-Agent Collaboration · Cheng Qian, Zihao Xie et al. (ICLR, 2024) - Venue: ICLR 2024 / arXiv:2406.07155 - Citations: 213 - Core idea: Recent breakthroughs in large language model-driven autonomous agents have revealed that multi-agent collaboration often surpasses each individual through collective reasoning. Inspired by the neural scaling law (increasing neurons enhances performance), this study explores whether the continuous addition of collaborative agents can yield similar benefits. - Why relevant here: Maps to L3 · Reasoning / Planning / Memory / Tool-use: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

FinCon: A Synthesized LLM Multi-Agent System with Conceptual Verbal Reinforcement for Enhanced Financial Decision Making · Yangyang Yu, Zhiyuan Yao et al. (NeurIPS, 2024) - Venue: NeurIPS 2024 / arXiv:2407.06567 - Citations: 144 - Core idea: Large language models (LLMs) have demonstrated notable potential in conducting complex tasks and are increasingly utilized in various financial applications. However, high-quality sequential financial investment decision-making remains challenging. - Why relevant here: Maps to L3 · Reasoning / Planning / Memory / Tool-use: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast · Xiangming Gu, Xiaosen Zheng et al. (ICML, 2024) - Venue: ICML 2024 / arXiv:2402.08567 - Citations: 138 - Core idea: A multimodal large language model (MLLM) agent can receive instructions, capture images, retrieve histories from memory, and decide which tools to use. Nonetheless, red-teaming efforts have revealed that adversarial images/prompts can jailbreak an MLLM and cause unaligned behaviors. - Why relevant here: Maps to L3 · Reasoning / Planning / Memory / Tool-use: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

Multi-agent Architecture Search via Agentic Supernet · Gui-Min Zhang, Luyang Niu et al. (ICML, 2025) - Venue: ICML 2025 / arXiv:2502.04180 - Citations: 116 - Core idea: Large Language Model (LLM)-empowered multi-agent systems extend the cognitive boundaries of individual agents through disciplined collaboration and interaction, while constructing these systems often requires labor-intensive manual designs. Despite the availability of methods to automate the design of agentic workflows, they typically seek to identify a static, complex, one-size-fits-all system, which, however, fails to dynamically allocate inference resources based on the difficulty and domain of each query. - Why relevant here: Maps to L3 · Reasoning / Planning / Memory / Tool-use: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement · Antonis Antoniades, Albert Örwall et al. (ICLR, 2024) - Venue: ICLR 2024 / arXiv:2410.20285 - Citations: 109 - Core idea: Software engineers operating in complex and dynamic environments must continuously adapt to evolving requirements, learn iteratively from experience, and reconsider their approaches based on new insights. However, current large language model (LLM)-based software agents often follow linear, sequential processes that prevent backtracking and exploration of alternative solutions, limiting their ability to rethink their strategies when initial approaches prove ineffective. - Why relevant here: Maps to L3 · Reasoning / Planning / Memory / Tool-use: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents · Giorgio Piatti, Zhijing Jin et al. (NeurIPS, 2024) - Venue: NeurIPS 2024 / arXiv:2404.16698 - Citations: 97 - Core idea: As AI systems pervade human life, ensuring that large language models (LLMs) make safe decisions remains a significant challenge. We introduce the Governance of the Commons Simulation (GovSim), a generative simulation platform designed to study strategic interactions and cooperative decision-making in LLMs. - Why relevant here: Maps to L3 · Reasoning / Planning / Memory / Tool-use: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs · Andries P. Smit, Paul Duckworth et al. (ICML, 2023) - Venue: ICML 2023 / arXiv:2311.17371 - Citations: 96 - Core idea: Recent advancements in large language models (LLMs) underscore their potential for responding to inquiries in various domains. However, ensuring that generative agents provide accurate and reliable answers remains an ongoing challenge. - Why relevant here: Maps to L3 · Reasoning / Planning / Memory / Tool-use: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems · Shaokun Zhang, Ming Yin et al. (ICML, 2025) - Venue: ICML 2025 / arXiv:2505.00212 - Citations: 88 - Core idea: Failure attribution in LLM multi-agent systems (identifying the agent and step responsible for task failures) provides crucial clues for systems debugging but remains underexplored and labor-intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi-agent systems. - Why relevant here: Maps to L3 · Reasoning / Planning / Memory / Tool-use: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

Simple Hierarchical Planning with Diffusion · Chang Chen, Fei Deng et al. (ICLR, 2024) - Venue: ICLR 2024 / arXiv:2401.02644 - Citations: 82 - Core idea: Diffusion-based generative methods have proven effective in modeling trajectories with offline datasets. However, they often face computational challenges and can falter in generalization, especially in capturing temporal abstractions for long-horizon tasks. - Why relevant here: Maps to L3 · Reasoning / Planning / Memory / Tool-use: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML · Patara Trirat, Wonyong Jeong et al. (ICML, 2024) - Venue: ICML 2024 / arXiv:2410.02958 - Citations: 80 - Core idea: Automated machine learning (AutoML) accelerates AI development by automating tasks in the development pipeline, such as optimal model search and hyperparameter tuning. Existing AutoML systems often require technical expertise to set up complex tools, which is in general time-consuming and requires a large amount of human effort. - Why relevant here: Maps to L3 · Reasoning / Planning / Memory / Tool-use: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

Reflective Multi-Agent Collaboration based on Large Language Models · Xiaohe Bo, Zeyu Zhang et al. (NeurIPS, 2024) - Venue: NeurIPS 2024 / DOI:10.52202/079017-4397 - Citations: 68 - Core idea: Benefiting from the powerful language expression and planning capabilities of Large Language Models (LLMs), LLM-based autonomous agents have achieved promising performance in various downstream tasks. Recently, based on the development of single-agent systems, researchers propose to construct LLM-based multi-agent systems to tackle more complicated tasks. - Why relevant here: Maps to L3 · Reasoning / Planning / Memory / Tool-use: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

STAIR: Improving Safety Alignment with Introspective Reasoning · Yichi Zhang, Siyuan Zhang et al. (ICML, 2025) - Venue: ICML 2025 / arXiv:2502.02384 - Citations: 63 - Core idea: Ensuring the safety and harmlessness of Large Language Models (LLMs) has become as critical as their performance in applications. However, existing safety alignment methods typically suffer from safety-performance trade-offs and susceptibility to jailbreak attacks, primarily due to their reliance on direct refusals for malicious queries. - Why relevant here: Maps to L3 · Reasoning / Planning / Memory / Tool-use: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

Reflexion: language agents with verbal reinforcement learning · Noah Shinn, Federico Cassano et al. (NeurIPS, 2023) - Venue: NeurIPS 2023 / arXiv:2303.11366 - Citations: 3716 - Core idea: Large language models (LLMs) have been increasingly used to interact with external environments (e.g., games, compilers, APIs) as goal-driven agents. However, it remains challenging for these language agents to quickly and efficiently learn from trial-and-error as traditional reinforcement learning methods require extensive training samples and expensive model fine-tuning. - Why relevant here: Maps to L3 · Reasoning / Planning / Memory / Tool-use: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 4

Theme 3: Multi-Agent Systems (debate / collaboration / SOP) (count: 7)

Learning Safe Multi-Agent Control with Decentralized Neural Barrier Certificates · Zengyi Qin, K. Zhang et al. (ICLR, 2021) - Venue: ICLR 2021 / arXiv:2101.05436 - Citations: 168 - Core idea: We study the multi-agent safe control problem where agents should avoid collisions with static obstacles and with each other while reaching their goals. Our core idea is to learn the multi-agent control policy jointly with learning the control barrier functions as safety certificates. - Why relevant here: Maps to L3 · Multi-agent collaboration and debate: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

ToM2C: Target-oriented Multi-agent Communication and Cooperation with Theory of Mind · Yuan-Fang Wang, Fangwei Zhong et al. (ICLR, 2021) - Venue: ICLR 2021 / arXiv:2111.09189 - Citations: 105 - Core idea: Being able to predict the mental states of others is a key factor to effective social interaction. It is also crucial for distributed multi-agent systems, where agents are required to communicate and cooperate. - Why relevant here: Maps to L3 · Multi-agent collaboration and debate: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems · Guibin Zhang, Yanwei Yue et al. (ICLR, 2024) - Venue: ICLR 2024 / arXiv:2410.02506 - Citations: 103 - Core idea: Recent advancements in large language model (LLM)-powered agents have shown that collective intelligence can significantly outperform individual capabilities, largely attributed to the meticulously designed inter-agent communication topologies. Though impressive in performance, existing multi-agent pipelines inherently introduce substantial token overhead, as well as increased economic costs, which pose challenges for their large-scale deployments. - Why relevant here: Maps to L3 · Multi-agent collaboration and debate: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

Secret Collusion among AI Agents: Multi-Agent Deception via Steganography · S. Motwani, Mikhail Baranchuk et al. (NeurIPS, 2024) - Venue: NeurIPS 2024 / arXiv:2402.07510 - Citations: 89 - Core idea: Recent capability increases in large language models (LLMs) open up applications in which groups of communicating generative AI agents solve joint tasks. This poses privacy and security challenges concerning the unauthorised sharing of information, or other unwanted forms of agent coordination. - Why relevant here: Maps to L3 · Multi-agent collaboration and debate: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

Learning Multi-Agent Communication from Graph Modeling Perspective · Shengchao Hu, Li Shen et al. (ICLR, 2024) - Venue: ICLR 2024 / arXiv:2405.08550 - Citations: 71 - Core idea: In numerous artificial intelligence applications, the collaborative efforts of multiple intelligent agents are imperative for the successful attainment of target objectives. To enhance coordination among these agents, a distributed communication framework is often employed. - Why relevant here: Maps to L3 · Multi-agent collaboration and debate: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

Discovering Diverse Multi-Agent Strategic Behavior via Reward Randomization · Zhen-Yu Tang, Chao Yu et al. (ICLR, 2021) - Venue: ICLR 2021 / arXiv:2103.04564 - Citations: 66 - Core idea: We propose a simple, general and effective technique, Reward Randomization for discovering diverse strategic policies in complex multi-agent games. Combining reward randomization and policy gradient, we derive a new algorithm, Reward-Randomized Policy Gradient (RPG). - Why relevant here: Maps to L3 · Multi-agent collaboration and debate: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

Improving Factuality and Reasoning in Language Models through Multiagent Debate · Yilun Du, Shuang Li et al. (ICML, 2023) - Venue: ICML 2023 / arXiv:2305.14325 - Citations: 1710 - Core idea: Large language models (LLMs) have demonstrated remarkable capabilities in language generation, understanding, and few-shot learning in recent years. An extensive body of work has explored how their performance may be further improved through prompting techniques, including verification, self-consistency, and intermediate scratchpads. - Why relevant here: Maps to L3 · Multi-agent collaboration and debate: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 4

Theme 4: Time Series Foundation Models + Architectural Priors (count: 29)

Efficiently Modeling Long Sequences with Structured State Spaces · Albert Gu, Karan Goel et al. (ICLR, 2021) - Venue: ICLR 2021 / arXiv:2111.00396 - Citations: 3636 - Core idea: A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of 10000 or more steps. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 5

A Time Series is Worth 64 Words: Long-term Forecasting with Transformers · Yuqi Nie, Nam H. Nguyen et al. (ICLR, 2022) - Venue: ICLR 2022 / arXiv:2211.14730 - Citations: 3550 - Core idea: We propose an efficient design of Transformer-based models for multivariate time series forecasting and self-supervised representation learning. It is based on two key components: (i) segmentation of time series into subseries-level patches that serve as input tokens to the Transformer; (ii) channel-independence, where each channel contains a single univariate time series that shares the same embedding and Transformer weights across all the series. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 5
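The two components named in this entry, patching and channel independence, can be illustrated with a small pure-Python sketch. It is not the paper's implementation: the function names, the `patch_len` and `stride` values, and the absence of any end-padding are all assumptions made for brevity, and the shared embedding and Transformer are not modeled.

```python
def make_patches(series, patch_len, stride):
    """Split one univariate series into (possibly overlapping) patches."""
    last_start = len(series) - patch_len
    return [series[s:s + patch_len] for s in range(0, last_start + 1, stride)]


def patch_channels(multivariate, patch_len, stride):
    """Channel independence: patch each channel on its own.

    `multivariate` is a list of channels, each a list of floats. In the
    described design every channel would then pass through the same shared
    embedding and Transformer weights; that part is not modeled here.
    """
    return [make_patches(channel, patch_len, stride) for channel in multivariate]


# Two channels, eight time steps each (illustrative values).
data = [
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0],
]
tokens = patch_channels(data, patch_len=4, stride=2)
print(len(tokens), "channels,", len(tokens[0]), "patches per channel")
```

With these settings each channel of eight steps yields three patch tokens instead of eight point tokens, which is where the shorter attention sequence comes from.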

iTransformer: Inverted Transformers Are Effective for Time Series Forecasting · Yong Liu, Tengge Hu et al. (ICLR, 2023) - Venue: ICLR 2023 / arXiv:2310.06625 - Citations: 1797 - Core idea: The recent boom of linear forecasting models questions the ongoing passion for architectural modifications of Transformer-based forecasters. These forecasters leverage Transformers to model the global dependencies over temporal tokens of time series, with each token formed by multiple variates of the same timestamp. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 5
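The entry describes the conventional layout, in which each token holds all variates at one timestamp. The "inverted" layout in the title instead uses one token per variate, covering that variate's whole lookback window. The sketch below contrasts the two layouts only; the function names are hypothetical and no attention or embedding is modeled.

```python
def temporal_tokens(series):
    """Conventional layout: one token per timestamp, holding every variate."""
    return [list(step) for step in series]


def variate_tokens(series):
    """Inverted layout: one token per variate, holding its whole history."""
    return [list(column) for column in zip(*series)]


# Three timestamps, two variates; rows are timestamps (illustrative values).
window = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
]
print(temporal_tokens(window))
print(variate_tokens(window))
```

In the inverted layout the token count equals the number of variates, so attention operates across variates rather than across time steps.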

Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers · Albert Gu, Isys Johnson et al. (NeurIPS, 2021) - Venue: NeurIPS 2021 / arXiv:2110.13985 - Citations: 1139 - Core idea: Recurrent neural networks (RNNs), temporal convolutions, and neural differential equations (NDEs) are popular families of deep learning models for time-series data, each with unique strengths and tradeoffs in modeling power and computational efficiency. We introduce a simple sequence model inspired by control systems that generalizes these approaches while addressing their shortcomings. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 5

Large Language Models Are Zero-Shot Time Series Forecasters · Nate Gruver, Marc Finzi et al. (NeurIPS, 2023) - Venue: NeurIPS 2023 / arXiv:2310.07820 - Citations: 724 - Core idea: By encoding time series as a string of numerical digits, we can frame time series forecasting as next-token prediction in text. Developing this approach, we find that large language models (LLMs) such as GPT-3 and LLaMA-2 can surprisingly zero-shot extrapolate time series at a level comparable to or exceeding the performance of purpose-built time series models trained on the downstream tasks. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 5
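The idea of encoding a series as a string of digits can be sketched as follows. This is a simplified illustration rather than the paper's exact tokenization: the fixed precision, the space between digits, the `" , "` separator, and the restriction to non-negative values are all assumptions, and any rescaling of the series is omitted.

```python
def encode_series(values, precision=2):
    """Encode non-negative floats as a digit string.

    Each value is rounded to `precision` decimals, the decimal point is
    dropped, digits are separated by spaces, and time steps by " , ".
    """
    tokens = []
    for value in values:
        if value < 0:
            raise ValueError("this sketch only handles non-negative values")
        scaled = round(value * 10 ** precision)
        tokens.append(" ".join(str(scaled)))
    return " , ".join(tokens)


def decode_series(text, precision=2):
    """Invert encode_series by re-inserting the implied decimal point."""
    values = []
    for chunk in text.split(","):
        digits = chunk.replace(" ", "")
        values.append(int(digits) / 10 ** precision)
    return values


encoded = encode_series([0.64, 1.2, 12.0])
print(encoded)
print(decode_series(encoded))
```

A forecast would then be obtained by letting a language model continue the encoded string and decoding the continuation with `decode_series`.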

TimeMixer: Decomposable Multiscale Mixing for Time Series Forecasting · Shiyu Wang, Haixu Wu et al. (ICLR, 2024) - Venue: ICLR 2024 / arXiv:2405.14616 - Citations: 610 - Core idea: Time series forecasting is widely used in extensive applications, such as traffic planning and weather forecasting. However, real-world time series usually present intricate temporal variations, making forecasting extremely challenging. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 5

Unified Training of Universal Time Series Forecasting Transformers · Gerald Woo, Chenghao Liu et al. (ICML, 2024) - Venue: ICML 2024 / arXiv:2402.02592 - Citations: 568 - Core idea: Deep learning for time series forecasting has traditionally operated within a one-model-per-dataset framework, limiting its potential to leverage the game-changing impact of large pre-trained models. The concept of universal forecasting, emerging from pre-training on a vast collection of time series datasets, envisions a single Large Time Series Model capable of addressing diverse downstream forecasting tasks. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 5

Resurrecting Recurrent Neural Networks for Long Sequences · Antonio Orvieto, Samuel L. Smith et al. (ICML, 2023) - Venue: ICML 2023 / arXiv:2303.06349 - Citations: 492 - Core idea: Recurrent Neural Networks (RNNs) offer fast inference on long sequences but are hard to optimize and slow to train. Deep state-space models (SSMs) have recently been shown to perform remarkably well on long sequence modeling tasks, and have the added benefits of fast parallelizable training and RNN-like fast inference. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 5

Diagonal State Spaces are as Effective as Structured State Spaces · Ankit Gupta, Jonathan Berant (NeurIPS, 2022) - Venue: NeurIPS 2022 / arXiv:2203.14343 - Citations: 489 - Core idea: Modeling long range dependencies in sequential data is a fundamental step towards attaining human-level performance in many modalities such as text, vision, audio and video. While attention-based models are a popular and effective choice in modeling short-range interactions, their performance on tasks requiring long range reasoning has been largely inadequate. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 5

Gated Linear Attention Transformers with Hardware-Efficient Training · Songlin Yang, Bailin Wang et al. (ICML, 2023) - Venue: ICML 2023 / arXiv:2312.06635 - Citations: 432 - Core idea: Transformers with linear attention allow for efficient parallel training but can simultaneously be formulated as an RNN with 2D (matrix-valued) hidden states, thus enjoying linear-time inference complexity. However, linear attention generally underperforms ordinary softmax attention. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 5
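The first sentence of this entry says linear attention admits both a parallel form and an RNN form with a matrix-valued hidden state. The sketch below checks that equivalence for plain, unnormalized causal linear attention on a toy input. It deliberately omits the gating mechanism and the hardware-efficient training that the paper is about, and the function names are hypothetical.

```python
def causal_linear_attention(queries, keys, values):
    """Parallel form: o_t = sum over s <= t of (q_t . k_s) * v_s, no softmax."""
    outputs = []
    for t, q in enumerate(queries):
        out = [0.0] * len(values[0])
        for s in range(t + 1):
            score = sum(qi * ki for qi, ki in zip(q, keys[s]))
            for j, vj in enumerate(values[s]):
                out[j] += score * vj
        outputs.append(out)
    return outputs


def recurrent_linear_attention(queries, keys, values):
    """Recurrent form: S_t = S_{t-1} + outer(k_t, v_t); o_t = q_t @ S_t."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # matrix-valued hidden state
    outputs = []
    for q, k, v in zip(queries, keys, values):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


# Toy inputs: three time steps, key and value dimension two.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 2.0], [0.0, 1.0], [1.0, 0.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
assert causal_linear_attention(q, k, v) == recurrent_linear_attention(q, k, v)
```

The recurrent form touches each time step once and keeps a fixed-size state, which is the source of the linear-time inference mentioned in the entry.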

Adaptive Conformal Predictions for Time Series · Margaux Zaffran, Aymeric Dieuleveut et al. (ICML, 2022) - Venue: ICML 2022 / arXiv:2202.07282 - Citations: 219 - Core idea: Uncertainty quantification of predictive models is crucial in decision-making problems. Conformal prediction is a general and theoretically sound answer. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 5
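For orientation, standard split conformal prediction can be sketched in a few lines. This is the textbook baseline, which assumes exchangeable data, and not the adaptive variant studied in the paper. The function names and residual values are illustrative.

```python
import math


def conformal_radius(calibration_residuals, alpha):
    """Return the interval half-width from absolute calibration residuals.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest absolute residual, the
    usual finite-sample correction for split conformal prediction.
    """
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]


def predict_interval(point_forecast, radius):
    """Symmetric interval around a point forecast."""
    return (point_forecast - radius, point_forecast + radius)


# Residuals (truth minus forecast) on a held-out calibration set.
residuals = [0.5, -1.0, 0.2, 1.5, -0.3, 0.8, -0.1, 0.4, -0.6]
radius = conformal_radius(residuals, alpha=0.25)
interval = predict_interval(10.0, radius)
print(radius, interval)
```

Time series generally violate the exchangeability assumption behind this baseline, which is the motivation for adaptive schemes that adjust the interval online.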

BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack · Yuri Kuratov, A. Bulatov et al. (NeurIPS, 2024) - Venue: NeurIPS 2024 / arXiv:2406.10149 - Citations: 213 - Core idea: In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess the efficiency of models in handling long contexts. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 5

Do We Really Need Complicated Model Architectures For Temporal Networks? · Weilin Cong, Si Zhang et al. (ICLR, 2023) - Venue: ICLR 2023 / arXiv:2302.11636 - Citations: 207 - Core idea: Recurrent neural network (RNN) and self-attention mechanism (SAM) are the de facto methods to extract spatial-temporal information for temporal graph learning. Interestingly, we found that although both RNN and SAM could lead to a good performance, in practice neither of them is always necessary. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 5

Recurrent Model-Free RL Can Be a Strong Baseline for Many POMDPs · Tianwei Ni, Benjamin Eysenbach et al. (ICML, 2021) - Venue: ICML 2021 / arXiv:2110.05038 - Citations: 169 - Core idea: Many problems in RL, such as meta-RL, robust RL, generalization in RL, and temporal credit assignment, can be cast as POMDPs. In theory, simply augmenting model-free RL with memory-based architectures, such as recurrent neural networks, provides a general approach to solving all types of POMDPs. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 5

How to Train Your HiPPO: State Space Models with Generalized Orthogonal Basis Projections · Albert Gu, Isys Johnson et al. (ICLR, 2022) - Venue: ICLR 2022 / arXiv:2206.12037 - Citations: 161 - Core idea: Linear time-invariant state space models (SSM) are a classical model from engineering and statistics, that have recently been shown to be very promising in machine learning through the Structured State Space sequence model (S4). A core component of S4 involves initializing the SSM state matrix to a particular matrix called a HiPPO matrix, which was empirically important for S4's ability to handle long sequences. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 5

Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling · Liliang Ren, Yang Liu et al. (ICLR, 2024) - Venue: ICLR 2024 / arXiv:2406.07522 - Citations: 155 - Core idea: Efficiently modeling sequences with infinite context length has long been a challenging problem. Previous approaches have either suffered from quadratic computational complexity or limited extrapolation ability in length generalization. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 5

CKConv: Continuous Kernel Convolution For Sequential Data · David W. Romero, Anna Kuzina et al. (ICLR, 2021) - Venue: ICLR 2021 / arXiv:2102.02611 - Citations: 150 - Core idea: Conventional neural architectures for sequential data present important limitations. Recurrent networks suffer from exploding and vanishing gradients, small effective memory horizons, and must be trained sequentially. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 5

Conformal PID Control for Time Series Prediction · Anastasios Nikolas Angelopoulos, E. Candès et al. (NeurIPS, 2023) - Venue: NeurIPS 2023 / arXiv:2307.16895 - Citations: 129 - Core idea: We study the problem of uncertainty quantification for time series prediction, with the goal of providing easy-to-use algorithms with formal guarantees. The algorithms we present build upon ideas from conformal prediction and control theory, are able to prospectively model conformal scores in an online setting, and adapt to the presence of systematic errors due to seasonality, trends, and general distribution shifts. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 5

Provably expressive temporal graph networks · A. Souza, Diego Mesquita et al. (NeurIPS, 2022) - Venue: NeurIPS 2022 / arXiv:2209.15059 - Citations: 81 - Core idea: Temporal graph networks (TGNs) have gained prominence as models for embedding dynamic interactions, but little is known about their theoretical underpinnings. We establish fundamental results about the representational power and limits of the two main categories of TGNs: those that aggregate temporal walks (WA-TGNs), and those that augment local message passing with recurrent memory modules (MP-TGNs). - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 5

Sequential Predictive Conformal Inference for Time Series · Chen Xu, Yao Xie (ICML, 2022) - Venue: ICML 2022 / arXiv:2212.03463 - Citations: 79 - Core idea: We present a new distribution-free conformal prediction algorithm for sequential data (e.g., time series), called sequential predictive conformal inference (SPCI). We specifically account for the fact that time series data are non-exchangeable, and thus many existing conformal prediction algorithms are not applicable. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 5

TILP: Differentiable Learning of Temporal Logical Rules on Knowledge Graphs · Siheng Xiong, Yuan Yang et al. (ICLR, 2024) - Venue: ICLR 2024 / arXiv:2402.12309 - Citations: 77 - Core idea: Compared with static knowledge graphs, temporal knowledge graphs (tKG), which can capture the evolution and change of information over time, are more realistic and general. However, due to the complexity that the notion of time introduces to the learning of the rules, an accurate graph reasoning, e.g., predicting new links between entities, is still a difficult problem. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 5

Temporal Predictive Coding For Model-Based Planning In Latent Space · Tung D. Nguyen, Rui Shu et al. (ICML, 2021) - Venue: ICML 2021 / arXiv:2106.07156 - Citations: 66 - Core idea: High-dimensional observations are a major challenge in the application of model-based reinforcement learning (MBRL) to real-world environments. To handle high-dimensional sensory inputs, existing approaches use representation learning to map high-dimensional observations into a lower-dimensional latent space that is more amenable to dynamics estimation and planning. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 5

SAMformer: Unlocking the Potential of Transformers in Time Series Forecasting with Sharpness-Aware Minimization and Channel-Wise Attention · Romain Ilbert, Ambroise Odonnat et al. (ICML, 2024) - Venue: ICML 2024 / arXiv:2402.10198 - Citations: 65 - Core idea: Transformer-based architectures achieved breakthrough performance in natural language processing and computer vision, yet they remain inferior to simpler linear baselines in multivariate long-term forecasting. To better understand this phenomenon, we start by studying a toy linear forecasting problem for which we show that transformers are incapable of converging to their true solution despite their high expressive power. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 5

Conformal Prediction for Time Series with Modern Hopfield Networks · Andreas Auer, M. Gauch et al. (NeurIPS, 2023) - Venue: NeurIPS 2023 / arXiv:2303.12783 - Citations: 60 - Core idea: To quantify uncertainty, conformal prediction methods are gaining continuously more interest and have already been successfully applied to various domains. However, they are difficult to apply to time series as the autocorrelative structure of time series violates basic assumptions required by conformal prediction. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 5

Learning where to learn: Gradient sparsity in meta and continual learning · J. Oswald, Dominic Zhao et al. (NeurIPS, 2021) - Venue: NeurIPS 2021 / arXiv:2110.14402 - Citations: 59 - Core idea: Finding neural network weights that generalize well from small datasets is difficult. A promising approach is to learn a weight initialization such that a small number of weight changes results in low generalization error. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 5

Deep Latent State Space Models for Time-Series Generation · Linqi Zhou, Michael Poli et al. (ICML, 2022) - Venue: ICML 2022 / arXiv:2212.12749 - Citations: 57 - Core idea: Methods based on ordinary differential equations (ODEs) are widely used to build generative models of time-series. In addition to high computational overhead due to explicitly computing hidden states recurrence, existing ODE-based models fall short in learning sequence data with sharp transitions - common in many real-world systems - due to numerical challenges during optimization. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 5

Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting · Haixu Wu, Jiehui Xu et al. (NeurIPS, 2021) - Venue: NeurIPS 2021 / arXiv:2106.13008 - Citations: 4676 - Core idea: Extending the forecasting time is a critical demand for real applications, such as extreme weather early warning and long-term energy consumption planning. This paper studies the long-term forecasting problem of time series. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 4

TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis · Haixu Wu, Teng Hu et al. (ICLR, 2022) - Venue: ICLR 2022 / arXiv:2210.02186 - Citations: 2115 - Core idea: Time series analysis is of immense importance in extensive applications, such as weather forecasting, anomaly detection, and action recognition. This paper focuses on temporal variation modeling, which is the common key problem of extensive analysis tasks. - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 4

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation · Ofir Press, Noah A. Smith et al. (ICLR, 2021) - Venue: ICLR 2021 / arXiv:2108.12409 - Citations: 1220 - Core idea: Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question has yet to be answered: how does a model achieve extrapolation at inference time for sequences that are longer than it saw during training? - Why relevant here: Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures. - Code: N/A - Relevance score: 4
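
As a concrete illustration of one architectural prior from this theme, the linear attention bias named in the title of the Press et al. entry can be sketched as below. This is a minimal sketch, not the authors' code: it assumes causal attention, a power-of-two head count, and the geometric slope schedule usually described for this method. The function names `alibi_slopes` and `alibi_bias` are hypothetical.

```python
from typing import List


def alibi_slopes(num_heads: int) -> List[float]:
    """Geometric slope schedule; assumes num_heads is a power of two."""
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(seq_len: int, slope: float) -> List[List[float]]:
    """Additive attention bias for one head.

    Entry [i][j] is -slope * (i - j) for keys at or before the query,
    and -inf for future keys so that softmax masks them out.
    """
    neg_inf = float("-inf")
    return [
        [-slope * (i - j) if j <= i else neg_inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)          # one slope per attention head
bias = alibi_bias(4, slopes[0])   # bias matrix for the first head
```

The bias is added to the attention logits before the softmax, so more distant keys are penalized linearly and no learned position embedding is needed.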

Theme 5: Symbolic Regression / Program Synthesis / FunSearch family (count: 16)¶

CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis · Erik Nijkamp, Bo Pang et al. (ICLR, 2022) - Venue: ICLR 2022 / arXiv:2203.13474 - Citations: 1482 - Core idea: Program synthesis strives to generate a computer program as a solution to a given problem specification, expressed with input-output examples or natural language descriptions. The prevalence of large language models advances the state-of-the-art for program synthesis, though limited training resources and data impede open access to such models. - Why relevant here: Maps to L5/L1 · Program search and algorithmic discovery: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

Gorilla: Large Language Model Connected with Massive APIs · Shishir G. Patil, Tianjun Zhang et al. (NeurIPS, 2023) - Venue: NeurIPS 2023 / arXiv:2305.15334 - Citations: 1254 - Core idea: Large Language Models (LLMs) have seen an impressive wave of advances recently, with models now excelling in a variety of tasks, such as mathematical reasoning and program synthesis. However, their potential to effectively use tools via API calls remains unfulfilled. - Why relevant here: Maps to L5/L1 · Program search and algorithmic discovery: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions · Terry Yue Zhuo, Minh Chien Vu et al. (ICLR, 2024) - Venue: ICLR 2024 / arXiv:2406.15877 - Citations: 520 - Core idea: Task automation has been greatly empowered by the recent advances in Large Language Models (LLMs) via Python code, with tasks ranging from software engineering development to general-purpose reasoning. While current benchmarks have shown that LLMs can solve tasks using programs like human developers, the majority of their evaluations are limited to short and self-contained algorithmic tasks or standalone function calls. - Why relevant here: Maps to L5/L1 · Program search and algorithmic discovery: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis · Izzeddin Gur, Hiroki Furuta et al. (ICLR, 2023) - Venue: ICLR 2023 / arXiv:2307.12856 - Citations: 373 - Core idea: Pre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web automation. However, the performance on real-world websites has still suffered from (1) open domainness, (2) limited context length, and (3) lack of inductive bias on HTML. - Why relevant here: Maps to L5/L1 · Program search and algorithmic discovery: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

End-to-end symbolic regression with transformers · Pierre-Alexandre Kamienny, Stéphane d'Ascoli et al. (NeurIPS, 2022) - Venue: NeurIPS 2022 / arXiv:2204.10532 - Citations: 277 - Core idea: Symbolic regression, the task of predicting the mathematical expression of a function from the observation of its values, is a difficult task which usually involves a two-step procedure: predicting the "skeleton" of the expression up to the choice of numerical constants, then fitting the constants by optimizing a non-convex loss function. The dominant approach is genetic programming, which evolves candidates by iterating this subroutine a large number of times. - Why relevant here: Maps to L5/L1 · Program search and algorithmic discovery: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

Neural Symbolic Regression that Scales · Luca Biggio, Tommaso Bendinelli et al. (ICML, 2021) - Venue: ICML 2021 / arXiv:2106.06427 - Citations: 258 - Core idea: Symbolic equations are at the core of scientific discovery. The task of discovering the underlying equation from a set of input-output pairs is called symbolic regression. - Why relevant here: Maps to L5/L1 · Program search and algorithmic discovery: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL · Mohammadreza Pourreza, Hailong Li et al. (ICLR, 2024) - Venue: ICLR 2024 / arXiv:2410.01943 - Citations: 164 - Core idea: In tackling the challenges of large language model (LLM) performance for Text-to-SQL tasks, we introduce CHASE-SQL, a new framework that employs innovative strategies, using test-time compute in multi-agent modeling to improve candidate generation and selection. CHASE-SQL leverages LLMs' intrinsic knowledge to generate diverse and high-quality SQL candidates using different LLM generators with: (1) a divide-and-conquer method that decomposes complex queries into manageable sub-queries in a single LLM call; (2) c... - Why relevant here: Maps to L5/L1 · Program search and algorithmic discovery: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution · Wei Tao, Yucheng Zhou et al. (NeurIPS, 2024) - Venue: NeurIPS 2024 / arXiv:2403.17927 - Citations: 157 - Core idea: In software development, resolving the emergent issues within GitHub repositories is a complex challenge that involves not only the incorporation of new code but also the maintenance of existing code. Large Language Models (LLMs) have shown promise in code generation but face difficulties in resolving GitHub issues, particularly at the repository level. - Why relevant here: Maps to L5/L1 · Program search and algorithmic discovery: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

Transformer-based Planning for Symbolic Regression · P. Shojaee, Kazem Meidani et al. (NeurIPS, 2023) - Venue: NeurIPS 2023 / arXiv:2303.06833 - Citations: 93 - Core idea: Symbolic regression (SR) is a challenging task in machine learning that involves finding a mathematical expression for a function based on its values. Recent advancements in SR have demonstrated the effectiveness of pre-trained transformer-based models in generating equations as sequences, leveraging large-scale pre-training on synthetic datasets and offering notable advantages in terms of inference time over classical Genetic Programming (GP) methods. - Why relevant here: Maps to L5/L1 · Program search and algorithmic discovery: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

Parsel🦆: Algorithmic Reasoning with Language Models by Composing Decompositions · E. Zelikman, Qian Huang et al. (NeurIPS, 2022) - Venue: NeurIPS 2022 / arXiv:2212.10561 - Citations: 83 - Core idea: Despite recent success in large language model (LLM) reasoning, LLMs struggle with hierarchical multi-step reasoning tasks like generating complex programs. For these tasks, humans often start with a high-level algorithmic design and implement each part gradually. - Why relevant here: Maps to L5/L1 · Program search and algorithmic discovery: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

Self-Evolving Multi-Agent Collaboration Networks for Software Development · Yue Hu, Yuzhu Cai et al. (ICLR, 2024) - Venue: ICLR 2024 / arXiv:2410.16946 - Citations: 68 - Core idea: LLM-driven multi-agent collaboration (MAC) systems have demonstrated impressive capabilities in automatic software development at the function level. However, their heavy reliance on human design limits their adaptability to the diverse demands of real-world software development. - Why relevant here: Maps to L5/L1 · Program search and algorithmic discovery: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

Symbolic Regression with a Learned Concept Library · Arya Grayeli, Atharva Sehgal et al. (NeurIPS, 2024) - Venue: NeurIPS 2024 / arXiv:2409.09359 - Citations: 62 - Core idea: We present a novel method for symbolic regression (SR), the task of searching for compact programmatic hypotheses that best explain a dataset. The problem is commonly solved using genetic algorithms; we show that we can enhance such methods by inducing a library of abstract textual concepts. - Why relevant here: Maps to L5/L1 · Program search and algorithmic discovery: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

ODEFormer: Symbolic Regression of Dynamical Systems with Transformers · Stéphane d’Ascoli, Soren Becker et al. (ICLR, 2023) - Venue: ICLR 2023 / arXiv:2310.05573 - Citations: 61 - Core idea: We introduce ODEFormer, the first transformer able to infer multidimensional ordinary differential equation (ODE) systems in symbolic form from the observation of a single solution trajectory. We perform extensive evaluations on two datasets: (i) the existing "Strogatz" dataset featuring two-dimensional systems; (ii) ODEBench, a collection of one- to four-dimensional systems that we carefully curated from the literature to provide a more holistic benchmark. - Why relevant here: Maps to L5/L1 · Program search and algorithmic discovery: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

Leveraging Language to Learn Program Abstractions and Search Heuristics · Catherine Wong, Kevin Ellis et al. (ICML, 2021) - Venue: ICML 2021 / arXiv:2106.11053 - Citations: 60 - Core idea: Inductive program synthesis, or inferring programs from examples of desired behavior, offers a general paradigm for building interpretable, robust, and generalizable machine learning systems. Effective program synthesis depends on two key ingredients: a strong library of functions from which to build programs, and an efficient search strategy for finding programs that solve a given task. - Why relevant here: Maps to L5/L1 · Program search and algorithmic discovery: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

Latent Execution for Neural Program Synthesis Beyond Domain-Specific Languages · Xinyun Chen, D. Song et al. (NeurIPS, 2021) - Venue: NeurIPS 2021 / arXiv:2107.00101 - Citations: 56 - Core idea: Program synthesis from input-output (IO) examples has been a long-standing challenge. While recent works demonstrated limited success on domain-specific languages (DSL), it remains highly challenging to apply them to real-world programming languages, such as C. - Why relevant here: Maps to L5/L1 · Program search and algorithmic discovery: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 5

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs · Yujia Qin, Shi Liang et al. (ICLR, 2023) - Venue: ICLR 2023 / arXiv:2307.16789 - Citations: 1634 - Core idea: Despite the advancements of open-source large language models (LLMs), e.g., LLaMA, they remain significantly limited in tool-use capabilities, i.e., using external tools (APIs) to fulfill human instructions. The reason is that current instruction tuning largely focuses on basic language tasks but ignores the tool-use domain. - Why relevant here: Maps to L5/L1 · Program search and algorithmic discovery: it provides reusable agent-system machinery for the proposed research stack. - Code: N/A - Relevance score: 4
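
The two-step procedure described in the Kamienny et al. entry of this theme (pick a skeleton, then fit its constants) can be illustrated with a toy example. This is only a sketch under simplifying assumptions: every skeleton has a single multiplicative constant, so the fit is closed-form least squares rather than the non-convex optimization that real systems face. The skeleton set and the names `fit_constant` and `best_skeleton` are hypothetical.

```python
import math

# Hypothetical candidate skeletons: each has the form c * f(x) with one free constant c.
SKELETONS = {
    "c*x": lambda x: x,
    "c*x**2": lambda x: x * x,
    "c*sin(x)": math.sin,
}


def fit_constant(f, xs, ys):
    """Closed-form least squares for y ~ c * f(x); returns (c, mean squared error)."""
    feats = [f(x) for x in xs]
    denom = sum(v * v for v in feats)
    c = sum(v * y for v, y in zip(feats, ys)) / denom
    mse = sum((c * v - y) ** 2 for v, y in zip(feats, ys)) / len(xs)
    return c, mse


def best_skeleton(xs, ys):
    """Step 1: enumerate skeletons. Step 2: fit each constant. Keep the lowest-error pair."""
    fits = {name: fit_constant(f, xs, ys) for name, f in SKELETONS.items()}
    name = min(fits, key=lambda n: fits[n][1])
    return name, fits[name][0]


# Synthetic data generated from y = 3 * x**2.
xs = [0.5 * i for i in range(1, 11)]
ys = [3.0 * x * x for x in xs]
name, c = best_skeleton(xs, ys)
```

Genetic-programming and transformer-based approaches differ mainly in how they propose skeletons; the constant-fitting step plays the same role as `fit_constant` here.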

Theme 6: Conformal Prediction & Distribution-Free Methods (count: 6)¶

Prompting GPT-3 To Be Reliable · Chenglei Si, Zhe Gan et al. (ICLR, 2022) - Venue: ICLR 2022 / arXiv:2210.09150 - Citations: 374 - Core idea: Large language models (LLMs) show impressive abilities via few-shot prompting. Commercialized APIs such as OpenAI GPT-3 further increase their use in real-world language applications. - Why relevant here: Maps to L5/RQ2 · Distribution-free reliability: it can supply distribution-free safety/calibration blocks for open-world agent outputs. - Code: N/A - Relevance score: 5

Learning Optimal Conformal Classifiers · David Stutz, K. Dvijotham et al. (ICLR, 2021) - Venue: ICLR 2021 / arXiv:2110.09192 - Citations: 126 - Core idea: Modern deep learning based classifiers show very high accuracy on test data but this does not provide sufficient guarantees for safe deployment, especially in high-stake AI applications such as medical diagnosis. Usually, predictions are obtained without a reliable uncertainty estimate or a formal guarantee. - Why relevant here: Maps to L5/RQ2 · Distribution-free reliability: it can supply distribution-free safety/calibration blocks for open-world agent outputs. - Code: N/A - Relevance score: 5

Language Models with Conformal Factuality Guarantees · Christopher Mohri, Tatsunori Hashimoto (ICML, 2024) - Venue: ICML 2024 / arXiv:2402.10978 - Citations: 117 - Core idea: Guaranteeing the correctness and factuality of language model (LM) outputs is a major open problem. In this work, we propose conformal factuality, a framework that can ensure high probability correctness guarantees for LMs by connecting language modeling and conformal prediction. - Why relevant here: Maps to L5/RQ2 · Distribution-free reliability: it can supply distribution-free safety/calibration blocks for open-world agent outputs. - Code: N/A - Relevance score: 5

Uncertainty Quantification over Graph with Conformalized Graph Neural Networks · Kexin Huang, Ying Jin et al. (NeurIPS, 2023) - Venue: NeurIPS 2023 / arXiv:2305.14535 - Citations: 97 - Core idea: Graph Neural Networks (GNNs) are powerful machine learning prediction models on graph-structured data. However, GNNs lack rigorous uncertainty estimates, limiting their reliable deployment in settings where the cost of errors is significant. - Why relevant here: Maps to L5/RQ2 · Distribution-free reliability: it can supply distribution-free safety/calibration blocks for open-world agent outputs. - Code: N/A - Relevance score: 5

Taming Overconfidence in LLMs: Reward Calibration in RLHF · Jixuan Leng, Chengsong Huang et al. (ICLR, 2024) - Venue: ICLR 2024 / arXiv:2410.09724 - Citations: 83 - Core idea: Language model calibration refers to the alignment between the confidence of the model and the actual performance of its responses. While previous studies point out the overconfidence phenomenon in Large Language Models (LLMs) and show that LLMs trained with Reinforcement Learning from Human Feedback (RLHF) are overconfident with a more sharpened output probability, in this study, we reveal that RLHF tends to lead models to express verbalized overconfidence in their own responses. - Why relevant here: Maps to L5/RQ2 · Distribution-free reliability: it can supply distribution-free safety/calibration blocks for open-world agent outputs. - Code: N/A - Relevance score: 5

Training Uncertainty-Aware Classifiers with Conformalized Deep Learning · Bat-Sheva Einbinder, Yaniv Romano et al. (NeurIPS, 2022) - Venue: NeurIPS 2022 / arXiv:2205.05878 - Citations: 73 - Core idea: Deep neural networks are powerful tools to detect hidden patterns in data and leverage them to make predictions, but they are not designed to understand uncertainty and estimate reliable probabilities. In particular, they tend to be overconfident. - Why relevant here: Maps to L5/RQ2 · Distribution-free reliability: it can supply distribution-free safety/calibration blocks for open-world agent outputs. - Code: N/A - Relevance score: 5
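
Several entries in this theme build on conformal prediction. As a point of reference, the basic split conformal recipe for regression can be sketched as follows. This is a minimal sketch under stated assumptions: calibration and test points are exchangeable (an assumption that the time-series conformal entries elsewhere in this list note is violated by autocorrelated data), the nonconformity score is the absolute residual, and the helper name `conformal_halfwidth` is hypothetical.

```python
import math


def conformal_halfwidth(cal_preds, cal_targets, alpha):
    """Return q such that [pred - q, pred + q] has marginal coverage >= 1 - alpha.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    If that rank exceeds n, only an infinite interval carries the guarantee.
    """
    scores = sorted(abs(p - y) for p, y in zip(cal_preds, cal_targets))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return scores[k - 1]


# Toy calibration set: predictions are all 0.0, targets are 1.0 .. 7.0.
cal_preds = [0.0] * 7
cal_targets = [float(i) for i in range(1, 8)]

q = conformal_halfwidth(cal_preds, cal_targets, alpha=0.25)
new_pred = 10.0
interval = (new_pred - q, new_pred + q)
```

The guarantee is distribution-free but only marginal: it holds on average over calibration and test draws, not conditionally on a particular input.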

Theme 7: Reinforcement Learning Foundations (relevant subset) (count: 28)¶

The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games · Chao Yu, Akash Velu et al. (NeurIPS, 2021) - Venue: NeurIPS 2021 / arXiv:2103.01955 - Citations: 2248 - Core idea: Proximal Policy Optimization (PPO) is a ubiquitous on-policy reinforcement learning algorithm but is significantly less utilized than off-policy learning algorithms in multi-agent settings. This is often due to the belief that PPO is significantly less sample efficient than off-policy methods in multi-agent systems. - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 5

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study · Shusheng Xu, Wei Fu et al. (ICML, 2024) - Venue: ICML 2024 / arXiv:2404.10719 - Citations: 284 - Core idea: Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used method to align large language models (LLMs) with human preferences. Existing RLHF methods can be roughly categorized as either reward-based or reward-free. - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 5

Trajectory Balance: Improved Credit Assignment in GFlowNets · Nikolay Malkin, Moksh Jain et al. (NeurIPS, 2022) - Venue: NeurIPS 2022 / arXiv:2201.13259 - Citations: 281 - Core idea: Generative flow networks (GFlowNets) are a method for learning a stochastic policy for generating compositional objects, such as graphs or strings, from a given unnormalized density by sequences of actions, where many possible action sequences may lead to the same object. We find previously proposed learning objectives for GFlowNets, flow matching and detailed balance, which are analogous to temporal difference learning, to be prone to inefficient credit propagation across long action sequences. - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 5

Online Decision Transformer · Qinqing Zheng, Amy Zhang et al. (ICML, 2022) - Venue: ICML 2022 / arXiv:2202.05607 - Citations: 262 - Core idea: Recent work has shown that offline reinforcement learning (RL) can be formulated as a sequence modeling problem (Chen et al., 2021; Janner et al., 2021) and solved via approaches similar to large-scale language modeling. However, any practical instantiation of RL also involves an online component, where policies pretrained on passive offline datasets are finetuned via task-specific interactions with the environment. - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 5

Prompting Decision Transformer for Few-Shot Policy Generalization · Mengdi Xu, Yikang Shen et al. (ICML, 2022) - Venue: ICML 2022 / arXiv:2206.13499 - Citations: 196 - Core idea: Humans can leverage prior experience and learn novel tasks from a handful of demonstrations. In contrast to offline meta-reinforcement learning, which aims to achieve quick adaptation through better algorithm design, we investigate the effect of architecture inductive bias on the few-shot learning capability. - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 5

Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints · Chaoqi Wang, Yibo Jiang et al. (ICLR, 2023) - Venue: ICLR 2023 / arXiv:2309.16240 - Citations: 183 - Core idea: The increasing capabilities of large language models (LLMs) raise opportunities for artificial general intelligence but concurrently amplify safety concerns, such as potential misuse of AI systems, necessitating effective AI alignment. Reinforcement Learning from Human Feedback (RLHF) has emerged as a promising pathway towards AI alignment but brings forth challenges due to its complexity and dependence on a separate reward model. - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 5

Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning · Chenjia Bai, Lingxiao Wang et al. (ICLR, 2022) - Venue: ICLR 2022 / arXiv:2202.11566 - Citations: 174 - Core idea: Offline Reinforcement Learning (RL) aims to learn policies from previously collected datasets without exploring the environment. Directly applying off-policy algorithms to offline RL usually fails due to the extrapolation error caused by the out-of-distribution (OOD) actions. - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 5

Structured State Space Models for In-Context Reinforcement Learning · Chris Xiaoxuan Lu, Yannick Schroecker et al. (NeurIPS, 2023) - Venue: NeurIPS 2023 / arXiv:2303.03982 - Citations: 146 - Core idea: Structured state space sequence (S4) models have recently achieved state-of-the-art performance on long-range sequence modeling tasks. These models also have fast inference speeds and parallelisable training, making them potentially useful in many reinforcement learning settings. - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 5

Online and Offline Reinforcement Learning by Planning with a Learned Model · Julian Schrittwieser, T. Hubert et al. (NeurIPS, 2021) - Venue: NeurIPS 2021 / arXiv:2104.06294 - Citations: 142 - Core idea: Learning efficiently from small amounts of data has long been the focus of model-based reinforcement learning, both for the online case when interacting with the environment and the offline case when learning from a fixed dataset. However, to date no single unified algorithm could demonstrate state-of-the-art results in both settings. - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 5

Generalized Decision Transformer for Offline Hindsight Information Matching · Hiroki Furuta, Y. Matsuo et al. (ICLR, 2021) - Venue: ICLR 2021 / arXiv:2111.10364 - Citations: 128 - Core idea: How to extract as much learning signal from each trajectory data has been a key problem in reinforcement learning (RL), where sample inefficiency has posed serious challenges for practical applications. Recent works have shown that using expressive policy function approximators and conditioning on future trajectory information -- such as future states in hindsight experience replay or returns-to-go in Decision Transformer (DT) -- enables efficient learning of multi-task policies, where at times online RL is full... - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 5

Q-learning Decision Transformer: Leveraging Dynamic Programming for Conditional Sequence Modelling in Offline RL · Taku Yamagata, Ahmed Khalil et al. (ICML, 2022) - Venue: ICML 2022 / arXiv:2209.03993 - Citations: 124 - Core idea: Recent works have shown that tackling offline reinforcement learning (RL) with a conditional policy produces promising results. The Decision Transformer (DT) combines the conditional policy approach and a transformer architecture, showing competitive performance against several benchmarks. - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 5

Offline RL with No OOD Actions: In-Sample Learning via Implicit Value Regularization · Haoran Xu, Li Jiang et al. (ICLR, 2023) - Venue: ICLR 2023 / arXiv:2303.15810 - Citations: 124 - Core idea: Most offline reinforcement learning (RL) methods suffer from the trade-off between improving the policy to surpass the behavior policy and constraining the policy to limit the deviation from the behavior policy as computing \(Q\)-values using out-of-distribution (OOD) actions will suffer from errors due to distributional shift. The recently proposed *In-sample Learning* paradigm (i.e., IQL), which improves the policy by quantile regression using only data samples, shows great promise because it learns an op... - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 5

Episodic Multi-agent Reinforcement Learning with Curiosity-Driven Exploration · Lu Zheng, Jiarui Chen et al. (NeurIPS, 2021) - Venue: NeurIPS 2021 / arXiv:2111.11032 - Citations: 118 - Core idea: Efficient exploration in deep cooperative multi-agent reinforcement learning (MARL) still remains challenging in complex coordination problems. In this paper, we introduce a novel Episodic Multi-agent reinforcement learning with Curiosity-driven exploration, called EMC. - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 5

Believe What You See: Implicit Constraint Approach for Offline Multi-Agent Reinforcement Learning · Yiqin Yang, Xiaoteng Ma et al. (NeurIPS, 2021) - Venue: NeurIPS 2021 / arXiv:2106.03400 - Citations: 117 - Core idea: Learning from datasets without interaction with environments (Offline Learning) is an essential step to apply Reinforcement Learning (RL) algorithms in real-world scenarios. However, compared with the single-agent counterpart, offline multi-agent RL introduces more agents with the larger state and action space, which is more challenging but attracts little attention. - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 5

Constrained Decision Transformer for Offline Safe Reinforcement Learning · Zuxin Liu, Zijian Guo et al. (ICML, 2023) - Venue: ICML 2023 / arXiv:2302.07351 - Citations: 88 - Core idea: Safe reinforcement learning (RL) trains a constraint satisfaction policy by interacting with the environment. We aim to tackle a more challenging problem: learning a safe policy from an offline dataset. - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 5

Towards Understanding and Improving GFlowNet Training · Max W. Shen, Emmanuel Bengio et al. (ICML, 2023) - Venue: ICML 2023 / arXiv:2305.07170 - Citations: 82 - Core idea: Generative flow networks (GFlowNets) are a family of algorithms that learn a generative policy to sample discrete objects \(x\) with non-negative reward \(R(x)\). Learning objectives guarantee the GFlowNet samples \(x\) from the target distribution \(p^*(x) \propto R(x)\) when loss is globally minimized over all states or trajectories, but it is unclear how well they perform with practical limits on training resources. - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 5

Plan Better Amid Conservatism: Offline Multi-Agent Reinforcement Learning with Actor Rectification · L. Pan, Longbo Huang et al. (ICML, 2021) - Venue: ICML 2021 / arXiv:2111.11188 - Citations: 82 - Core idea: Conservatism has led to significant progress in offline reinforcement learning (RL) where an agent learns from pre-collected datasets. However, as many real-world scenarios involve interaction among multiple agents, it is important to resolve offline RL in the multi-agent setting. - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 5

Elastic Decision Transformer · Yueh-Hua Wu, Xiaolong Wang et al. (NeurIPS, 2023) - Venue: NeurIPS 2023 / arXiv:2307.02484 - Citations: 81 - Core idea: This paper introduces Elastic Decision Transformer (EDT), a significant advancement over the existing Decision Transformer (DT) and its variants. Although DT purports to generate an optimal trajectory, empirical evidence suggests it struggles with trajectory stitching, a process involving the generation of an optimal or near-optimal trajectory from the best parts of a set of sub-optimal trajectories. - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 5

Mastering the Game of No-Press Diplomacy via Human-Regularized Reinforcement Learning and Planning · A. Bakhtin, David J. Wu et al. (ICLR, 2022) - Venue: ICLR 2022 / arXiv:2210.05492 - Citations: 66 - Core idea: No-press Diplomacy is a complex strategy game involving both cooperation and competition that has served as a benchmark for multi-agent AI research. While self-play reinforcement learning has resulted in numerous successes in purely adversarial games like chess, Go, and poker, self-play alone is insufficient for achieving optimal performance in domains involving cooperation with humans. - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 5

Coordinated Proximal Policy Optimization · Zifan Wu, Chao Yu et al. (NeurIPS, 2021) - Venue: NeurIPS 2021 / arXiv:2111.04051 - Citations: 65 - Core idea: We present Coordinated Proximal Policy Optimization (CoPPO), an algorithm that extends the original Proximal Policy Optimization (PPO) to the multi-agent setting. The key idea lies in the coordinated adaptation of step size during the policy update process among multiple agents. - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 5

RMIX: Learning Risk-Sensitive Policies for Cooperative Reinforcement Learning Agents · Wei Qiu, Xinrun Wang et al. (NeurIPS, 2021) - Venue: NeurIPS 2021 / arXiv:2102.08159 - Citations: 64 - Core idea: Current value-based multi-agent reinforcement learning methods optimize individual Q values to guide individuals' behaviours via centralized training with decentralized execution (CTDE). However, such expected, i.e., risk-neutral, Q value is not sufficient even with CTDE due to the randomness of rewards and the uncertainty in environments, which causes the failure of these methods to train coordinating agents in complex environments. - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 5

Coach-Player Multi-Agent Reinforcement Learning for Dynamic Team Composition · Bo Liu, Qiang Liu et al. (ICML, 2021) - Venue: ICML 2021 / arXiv:2105.08692 - Citations: 64 - Core idea: In real-world multi-agent systems, agents with different capabilities may join or leave without altering the team's overarching goals. Coordinating teams with such dynamic composition is challenging: the optimal team strategy varies with the composition. - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 5

On Improving Model-Free Algorithms for Decentralized Multi-Agent Reinforcement Learning · Weichao Mao, Lin F. Yang et al. (ICML, 2021) - Venue: ICML 2021 / arXiv:2110.05707 - Citations: 63 - Core idea: Multi-agent reinforcement learning (MARL) algorithms often suffer from an exponential sample complexity dependence on the number of agents, a phenomenon known as *the curse of multiagents*. In this paper, we address this challenge by investigating sample-efficient model-free algorithms in *decentralized* MARL, and aim to improve existing algorithms along this line. - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 4

Offline Multi-Agent Reinforcement Learning with Knowledge Distillation · W. Tseng, Tsun-Hsuan Wang et al. (NeurIPS, 2022) - Venue: NeurIPS 2022 / DOI:10.52202/068431-0017 - Citations: 57 - Core idea: We introduce an offline multi-agent reinforcement learning (offline MARL) framework that utilizes previously collected data without additional online data collection. Our method reformulates offline MARL as a sequence modeling problem and thus builds on top of the simplicity and scalability of the Transformer architecture. - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 5

Direct Preference Optimization: Your Language Model is Secretly a Reward Model · Rafael Rafailov, Archit Sharma et al. (NeurIPS, 2023) - Venue: NeurIPS 2023 / arXiv:2305.18290 - Citations: 8954 - Core idea: While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 4

Decision Transformer: Reinforcement Learning via Sequence Modeling · Lili Chen, Kevin Lu et al. (NeurIPS, 2021) - Venue: NeurIPS 2021 / arXiv:2106.01345 - Citations: 2284 - Core idea: We introduce a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances in language modeling such as GPT-x and BERT. - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 4

Offline Reinforcement Learning with Implicit Q-Learning · Ilya Kostrikov, Ashvin Nair et al. (ICLR, 2021) - Venue: ICLR 2021 / arXiv:2110.06169 - Citations: 1482 - Core idea: Offline reinforcement learning requires reconciling two conflicting aims: learning a policy that improves over the behavior policy that collected the dataset, while at the same time minimizing the deviation from the behavior policy so as to avoid errors due to distributional shift. This trade-off is critical, because most current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy, and therefore need to either constrain these actions to be in-dis... - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 4

A Minimalist Approach to Offline Reinforcement Learning · Scott Fujimoto, S. Gu (NeurIPS, 2021) - Venue: NeurIPS 2021 / arXiv:2106.06860 - Citations: 1144 - Core idea: Offline reinforcement learning (RL) defines the task of learning from a fixed batch of data. Due to errors in value estimation from out-of-distribution actions, most offline RL algorithms take the approach of constraining or regularizing the policy with the actions contained in the dataset. - Why relevant here: Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops. - Code: N/A - Relevance score: 4
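Several entries in this theme (Decision Transformer and its generalized, Q-learning, constrained, and elastic variants) treat offline RL as sequence modeling conditioned on returns-to-go. As a rough illustration of that conditioning signal, the sketch below computes returns-to-go for one trajectory and interleaves them with states and actions. The function names `returns_to_go` and `to_tokens`, the `gamma` parameter, and the tuple-based token layout are illustrative assumptions, not code from any of the listed papers.

```python
from typing import Any, List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float], gamma: float = 1.0) -> List[float]:
    """Return-to-go at step t: the (optionally discounted) sum of rewards from t onward."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the suffix sum already computed.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


def to_tokens(
    states: Sequence[Any],
    actions: Sequence[Any],
    rewards: Sequence[float],
    gamma: float = 1.0,
) -> List[Tuple[str, Any]]:
    """Interleave (return-to-go, state, action) triples into one flat sequence.

    Hypothetical layout for illustration: each element is a (kind, value) pair.
    """
    rtg = returns_to_go(rewards, gamma)
    tokens: List[Tuple[str, Any]] = []
    for g, s, a in zip(rtg, states, actions):
        tokens.extend([("rtg", g), ("state", s), ("action", a)])
    return tokens
```

At inference time, a model trained on such sequences would be prompted with a desired return as the first return-to-go value. For a generator-verifier loop, the analogous conditioning target would be a verifier score, though that mapping is an assumption of this note rather than a claim from the papers.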

Theme 8: Benchmark / Evaluation Methodology Papers (count: 15)¶

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation · Jiawei Liu, Chun Xia et al. (NeurIPS, 2023) - Venue: NeurIPS 2023 / arXiv:2305.01210 - Citations: 1783 - Core idea: Program synthesis has been long studied with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code. Programming benchmarks, with curated synthesis problems and test-cases, are used to measure the performance of various LLMs on code synthesis. - Why relevant here: Maps to L2/L0 · Benchmark and evaluation methodology: it is useful benchmark design precedent for Crypto-Alpha-Bench. - Code: N/A - Relevance score: 5

Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs · Miao Xiong, Zhiyuan Hu et al. (ICLR, 2023) - Venue: ICLR 2023 / arXiv:2306.13063 - Citations: 938 - Core idea: Empowering large language models to accurately express confidence in their answers is essential for trustworthy decision-making. Previous confidence elicitation methods, which primarily rely on white-box access to internal model information or model fine-tuning, have become less suitable for LLMs, especially closed-source commercial APIs. - Why relevant here: Maps to L2/L0 · Benchmark and evaluation methodology: it is useful benchmark design precedent for Crypto-Alpha-Bench. - Code: N/A - Relevance score: 5

TravelPlanner: A Benchmark for Real-World Planning with Language Agents · Jian Xie, Kai Zhang et al. (ICML, 2024) - Venue: ICML 2024 / arXiv:2402.01622 - Citations: 396 - Core idea: Planning has been part of the core pursuit for artificial intelligence since its conception, but earlier AI agents mostly focused on constrained settings because many of the cognitive substrates necessary for human-level planning have been lacking. Recently, language agents powered by large language models (LLMs) have shown interesting capabilities such as tool use and reasoning. - Why relevant here: Maps to L2/L0 · Benchmark and evaluation methodology: it is useful benchmark design precedent for Crypto-Alpha-Bench. - Code: N/A - Relevance score: 5

SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents · Xuhui Zhou, Hao Zhu et al. (ICLR, 2023) - Venue: ICLR 2023 / arXiv:2310.11667 - Citations: 323 - Core idea: Humans are social beings; we pursue social goals in our daily interactions, which is a crucial aspect of social intelligence. Yet, AI systems' abilities in this realm remain elusive. - Why relevant here: Maps to L2/L0 · Benchmark and evaluation methodology: it is useful benchmark design precedent for Crypto-Alpha-Bench. - Code: N/A - Relevance score: 5

MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback · Xingyao Wang, Zihan Wang et al. (ICLR, 2023) - Venue: ICLR 2023 / arXiv:2309.10691 - Citations: 323 - Core idea: To solve complex tasks, large language models (LLMs) often require multiple rounds of interactions with the user, sometimes assisted by external tools. However, current evaluation protocols often emphasize benchmark performance with single-turn exchanges, neglecting the nuanced interactions among the user, LLMs, and external tools, while also underestimating the importance of natural language feedback from users. - Why relevant here: Maps to L2/L0 · Benchmark and evaluation methodology: it is useful benchmark design precedent for Crypto-Alpha-Bench. - Code: N/A - Relevance score: 5

JudgeBench: A Benchmark for Evaluating LLM-based Judges · Sijun Tan, Siyuan Zhuang et al. (ICLR, 2024) - Venue: ICLR 2024 / arXiv:2410.12784 - Citations: 253 - Core idea: LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges themselves is rarely scrutinized. - Why relevant here: Maps to L2/L0 · Benchmark and evaluation methodology: it is useful benchmark design precedent for Crypto-Alpha-Bench. - Code: N/A - Relevance score: 5

MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation · Qian Huang, Jian Vora et al. (ICML, 2023) - Venue: ICML 2023 / arXiv:2310.03302 - Citations: 229 - Core idea: A central aspect of machine learning research is experimentation, the process of designing and running experiments, analyzing the results, and iterating towards some positive outcome (e.g., improving accuracy). Could agents driven by powerful language models perform machine learning experimentation effectively? - Why relevant here: Maps to L2/L0 · Benchmark and evaluation methodology: it is useful benchmark design precedent for Crypto-Alpha-Bench. - Code: N/A - Relevance score: 5

SMACv2: An Improved Benchmark for Cooperative Multi-Agent Reinforcement Learning · Benjamin Ellis, Skander Moalla et al. (NeurIPS, 2022) - Venue: NeurIPS 2022 / arXiv:2212.07489 - Citations: 175 - Core idea: The availability of challenging benchmarks has played a key role in the recent progress of machine learning. In cooperative multi-agent reinforcement learning, the StarCraft Multi-Agent Challenge (SMAC) has become a popular testbed for centralised training with decentralised execution. - Why relevant here: Maps to L2/L0 · Benchmark and evaluation methodology: it is useful benchmark design precedent for Crypto-Alpha-Bench. - Code: N/A - Relevance score: 5

Scalable Evaluation of Multi-Agent Reinforcement Learning with Melting Pot · Joel Z. Leibo, Edgar A. Duéñez-Guzmán et al. (ICML, 2021) - Venue: ICML 2021 / arXiv:2107.06857 - Citations: 138 - Core idea: Existing evaluation suites for multi-agent reinforcement learning (MARL) do not assess generalization to novel situations as their primary objective (unlike supervised-learning benchmarks). Our contribution, Melting Pot, is a MARL evaluation suite that fills this gap, and uses reinforcement learning to reduce the human labor required to create novel test scenarios. - Why relevant here: Maps to L2/L0 · Benchmark and evaluation methodology: it is useful benchmark design precedent for Crypto-Alpha-Bench. - Code: N/A - Relevance score: 5

OpenSTL: A Comprehensive Benchmark of Spatio-Temporal Predictive Learning · Cheng Tan, Siyuan Li et al. (NeurIPS, 2023) - Venue: NeurIPS 2023 / arXiv:2306.11249 - Citations: 113 - Core idea: Spatio-temporal predictive learning is a learning paradigm that enables models to learn spatial and temporal patterns by predicting future frames from given past frames in an unsupervised manner. Despite remarkable progress in recent years, a lack of systematic understanding persists due to the diverse settings, complex implementation, and difficult reproducibility. - Why relevant here: Maps to L2/L0 · Benchmark and evaluation methodology: it is useful benchmark design precedent for Crypto-Alpha-Bench. - Code: N/A - Relevance score: 5

ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation · Chufan Shi, Cheng Yang et al. (ICLR, 2024) - Venue: ICLR 2024 / arXiv:2406.09961 - Citations: 96 - Core idea: We introduce a new benchmark, ChartMimic, aimed at assessing the visually-grounded code generation capabilities of large multimodal models (LMMs). ChartMimic utilizes information-intensive visual charts and textual instructions as inputs, requiring LMMs to generate the corresponding code for chart rendering. - Why relevant here: Maps to L2/L0 · Benchmark and evaluation methodology: it is useful benchmark design precedent for Crypto-Alpha-Bench. - Code: N/A - Relevance score: 5

Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning · Bahare Fatemi, Mehran Kazemi et al. (ICLR, 2024) - Venue: ICLR 2024 / arXiv:2406.09170 - Citations: 91 - Core idea: Large language models (LLMs) have showcased remarkable reasoning capabilities, yet they remain susceptible to errors, particularly in temporal reasoning tasks involving complex temporal logic. Existing research has explored LLM performance on temporal reasoning using diverse datasets and benchmarks. - Why relevant here: Maps to L2/L0 · Benchmark and evaluation methodology: it is useful benchmark design precedent for Crypto-Alpha-Bench. - Code: N/A - Relevance score: 5

Evaluating Quantized Large Language Models · Shiyao Li, Xuefei Ning et al. (ICML, 2024) - Venue: ICML 2024 / arXiv:2402.18158 - Citations: 90 - Core idea: Post-training quantization (PTQ) has emerged as a promising technique to reduce the cost of large language models (LLMs). Specifically, PTQ can effectively mitigate memory consumption and reduce computational overhead in LLMs. - Why relevant here: Maps to L2/L0 · Benchmark and evaluation methodology: it is useful benchmark design precedent for Crypto-Alpha-Bench. - Code: N/A - Relevance score: 5

SWE-bench: Can Language Models Resolve Real-World GitHub Issues? · Carlos E. Jimenez, John Yang et al. (ICLR, 2023) - Venue: ICLR 2023 / arXiv:2310.06770 - Citations: 2430 - Core idea: Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. - Why relevant here: Maps to L2/L0 · Benchmark and evaluation methodology: it is useful benchmark design precedent for Crypto-Alpha-Bench. - Code: N/A - Relevance score: 4

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code · Naman Jain, King Han et al. (ICLR, 2024) - Venue: ICLR 2024 / arXiv:2403.07974 - Citations: 1609 - Core idea: Large Language Models (LLMs) applied to code-related applications have emerged as a prominent field, attracting significant interest from both academia and industry. However, as new and improved LLMs are developed, existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient for assessing their capabilities. - Why relevant here: Maps to L2/L0 · Benchmark and evaluation methodology: it is useful benchmark design precedent for Crypto-Alpha-Bench. - Code: N/A - Relevance score: 4
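One design idea that could carry over from the contamination-aware benchmarks above to a time-stamped benchmark is a release-date split: evaluate a model only on items published after its training cutoff. The sketch below is a minimal illustration under that assumption. `Problem`, `held_out_after`, and the example dates are hypothetical names and values, not part of any listed benchmark's code.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Problem:
    """Hypothetical benchmark item with a known public release date."""
    problem_id: str
    released: date


def held_out_after(problems: List[Problem], cutoff: date) -> List[Problem]:
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.released > cutoff]


# Example usage with made-up items and a made-up cutoff.
pool = [
    Problem("p1", date(2023, 5, 1)),
    Problem("p2", date(2023, 9, 1)),
    Problem("p3", date(2024, 2, 1)),
]
eval_set = held_out_after(pool, cutoff=date(2023, 9, 1))
```

The comparison is strict so that items released on the cutoff date itself are excluded. Whether that is the right boundary depends on how reliably the cutoff is known.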

Gap Analysis vs Existing Repo¶

Already Referenced In Repo¶

Paper	Existing location(s)
Reflexion: language agents with verbal reinforcement learning	docs/agent_landscape_paper_level.md
Decision Transformer: Reinforcement Learning via Sequence Modeling	docs/agent_landscape_paper_level.md
Sequential Predictive Conformal Inference for Time Series	docs/agent_landscape_paper_level.md

Not Yet Referenced: Highest-Priority Integration Proposals¶

Target file + section	Paper	Why add it
`docs/alpha_search_survey_taxonomy_and_bibliography.md` · Tradition 6 · Time Series Foundation Models	Efficiently Modeling Long Sequences with Structured State Spaces	Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures.
`docs/alpha_search_survey_taxonomy_and_bibliography.md` · Tradition 6 · Time Series Foundation Models	A Time Series is Worth 64 Words: Long-term Forecasting with Transformers	Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures.
`docs/financial_agent_to_general_agent_technical_landscape.md` · L5 · Algorithmic Foundations	The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games	Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops.
`docs/alpha_search_survey_taxonomy_and_bibliography.md` · Tradition 6 · Time Series Foundation Models	iTransformer: Inverted Transformers Are Effective for Time Series Forecasting	Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures.
`docs/agent_research_landscape.md` · L0/L1 benchmark discussion	Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation	Maps to L2/L0 · Benchmark and evaluation methodology: it is useful benchmark design precedent for Crypto-Alpha-Bench.
`docs/alpha_search_baselines.md` · AI for Science / program-search transfer	CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis	Maps to L5/L1 · Program search and algorithmic discovery: it provides reusable agent-system machinery for the proposed research stack.
`docs/alpha_search_baselines.md` · AI for Science / program-search transfer	Gorilla: Large Language Model Connected with Massive APIs	Maps to L5/L1 · Program search and algorithmic discovery: it provides reusable agent-system machinery for the proposed research stack.
`docs/alpha_search_survey_taxonomy_and_bibliography.md` · Tradition 6 · Time Series Foundation Models	Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers	Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures.
`docs/agent_research_landscape.md` · L0/L1 benchmark discussion	Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs	Maps to L2/L0 · Benchmark and evaluation methodology: it is useful benchmark design precedent for Crypto-Alpha-Bench.
`docs/alpha_search_survey_taxonomy_and_bibliography.md` · Tradition 6 · Time Series Foundation Models	Large Language Models Are Zero-Shot Time Series Forecasters	Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures.
`docs/alpha_search_survey_taxonomy_and_bibliography.md` · Tradition 6 · Time Series Foundation Models	TimeMixer: Decomposable Multiscale Mixing for Time Series Forecasting	Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures.
`docs/alpha_search_survey_taxonomy_and_bibliography.md` · Tradition 6 · Time Series Foundation Models	Unified Training of Universal Time Series Forecasting Transformers	Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures.
`docs/alpha_search_baselines.md` · AI for Science / program-search transfer	BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions	Maps to L5/L1 · Program search and algorithmic discovery: it provides reusable agent-system machinery for the proposed research stack.
`docs/agent_landscape_paper_level.md` · L3 · LLM Agent Patterns	Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models	Maps to L3 · Reasoning / Planning / Memory / Tool-use: it provides reusable agent-system machinery for the proposed research stack.
`docs/alpha_search_survey_taxonomy_and_bibliography.md` · Tradition 6 · Time Series Foundation Models	Resurrecting Recurrent Neural Networks for Long Sequences	Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures.
`docs/alpha_search_survey_taxonomy_and_bibliography.md` · Tradition 6 · Time Series Foundation Models	Diagonal State Spaces are as Effective as Structured State Spaces	Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures.
`docs/financial_agent_to_general_agent_technical_landscape.md` · L5 · Algorithmic Foundations	Direct Preference Optimization: Your Language Model is Secretly a Reward Model	Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops.
`docs/alpha_search_survey_taxonomy_and_bibliography.md` · Tradition 6 · Time Series Foundation Models	Gated Linear Attention Transformers with Hardware-Efficient Training	Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures.
`docs/agent_research_landscape.md` · L0/L1 benchmark discussion	TravelPlanner: A Benchmark for Real-World Planning with Language Agents	Maps to L2/L0 · Benchmark and evaluation methodology: it is useful benchmark design precedent for Crypto-Alpha-Bench.
`docs/alpha_search_survey_taxonomy_and_bibliography.md` · Tradition 7 · Conformal Prediction	Prompting GPT-3 To Be Reliable	Maps to L5/RQ2 · Distribution-free reliability: it can supply distribution-free safety/calibration blocks for open-world agent outputs.
`docs/alpha_search_baselines.md` · AI for Science / program-search transfer	A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis	Maps to L5/L1 · Program search and algorithmic discovery: it provides reusable agent-system machinery for the proposed research stack.
`docs/agent_research_landscape.md` · L0/L1 benchmark discussion	SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents	Maps to L2/L0 · Benchmark and evaluation methodology: it is useful benchmark design precedent for Crypto-Alpha-Bench.
`docs/agent_research_landscape.md` · L0/L1 benchmark discussion	MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback	Maps to L2/L0 · Benchmark and evaluation methodology: it is useful benchmark design precedent for Crypto-Alpha-Bench.
`docs/financial_agent_to_general_agent_technical_landscape.md` · L5 · Algorithmic Foundations	Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study	Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops.
`docs/financial_agent_to_general_agent_technical_landscape.md` · L5 · Algorithmic Foundations	Trajectory Balance: Improved Credit Assignment in GFlowNets	Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops.
`docs/alpha_search_baselines.md` · AI for Science / program-search transfer	End-to-end symbolic regression with transformers	Maps to L5/L1 · Program search and algorithmic discovery: it provides reusable agent-system machinery for the proposed research stack.
`docs/financial_agent_to_general_agent_technical_landscape.md` · L5 · Algorithmic Foundations	Online Decision Transformer	Maps to L5 · RL/search foundations for generator-verifier loops: it gives search/control machinery for generator-verifier alpha discovery loops.
`docs/alpha_search_baselines.md` · AI for Science / program-search transfer	Neural Symbolic Regression that Scales	Maps to L5/L1 · Program search and algorithmic discovery: it provides reusable agent-system machinery for the proposed research stack.
`docs/agent_research_landscape.md` · L0/L1 benchmark discussion	JudgeBench: A Benchmark for Evaluating LLM-based Judges	Maps to L2/L0 · Benchmark and evaluation methodology: it is useful benchmark design precedent for Crypto-Alpha-Bench.
`docs/alpha_search_survey_taxonomy_and_bibliography.md` · Tradition 6 · Time Series Foundation Models	Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting	Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures.
`docs/agent_landscape_paper_level.md` · L3 · LLM Agent Patterns	Chain of Agents: Large Language Models Collaborating on Long-Context Tasks	Maps to L3 · Reasoning / Planning / Memory / Tool-use: it provides reusable agent-system machinery for the proposed research stack.
`docs/agent_research_landscape.md` · L0/L1 benchmark discussion	MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation	Maps to L2/L0 · Benchmark and evaluation methodology: it is useful benchmark design precedent for Crypto-Alpha-Bench.
`docs/alpha_search_survey_taxonomy_and_bibliography.md` · Tradition 6 · Time Series Foundation Models	Adaptive Conformal Predictions for Time Series	Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures.
`docs/agent_landscape_paper_level.md` · L3 · LLM Agent Patterns	Scaling Large-Language-Model-based Multi-Agent Collaboration	Maps to L3 · Reasoning / Planning / Memory / Tool-use: it provides reusable agent-system machinery for the proposed research stack.
`docs/alpha_search_survey_taxonomy_and_bibliography.md` · Tradition 6 · Time Series Foundation Models	BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack	Maps to L5/L0 · Algorithmic foundations for RQ1 time-series priors: it informs the RQ1 prior-vs-scale debate for crypto microstructure architectures.

Git-Friendly Insert Guidance

For docs/alpha_search_survey_taxonomy_and_bibliography.md section Tradition 6 · Time Series Foundation Models: add Efficiently Modeling Long Sequences with Structured State Spaces; A Time Series is Worth 64 Words: Long-term Forecasting with Transformers; iTransformer: Inverted Transformers Are Effective for Time Series Forecasting; Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers; Large Language Models Are Zero-Shot Time Series Forecasters.
For docs/financial_agent_to_general_agent_technical_landscape.md section L5 · Algorithmic Foundations: add The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games; Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study; Trajectory Balance: Improved Credit Assignment in GFlowNets; Online Decision Transformer; Prompting Decision Transformer for Few-Shot Policy Generalization.
For docs/agent_research_landscape.md section L0/L1 benchmark discussion: add Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation; Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs; TravelPlanner: A Benchmark for Real-World Planning with Language Agents; SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents; MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback.
For docs/alpha_search_baselines.md section AI for Science / program-search transfer: add CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis; Gorilla: Large Language Model Connected with Massive APIs; BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions; A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis; End-to-end symbolic regression with transformers.
For docs/agent_landscape_paper_level.md section L3 · LLM Agent Patterns: add Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models; Chain of Agents: Large Language Models Collaborating on Long-Context Tasks; Scaling Large-Language-Model-based Multi-Agent Collaboration; FinCon: A Synthesized LLM Multi-Agent System with Conceptual Verbal Reinforcement for Enhanced Financial Decision Making; Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast.
For docs/alpha_search_survey_taxonomy_and_bibliography.md section Tradition 7 · Conformal Prediction: add Prompting GPT-3 To Be Reliable; Learning Optimal Conformal Classifiers.
For docs/financial_agent_to_general_agent_technical_landscape.md section L3 · Multi-agent patterns: add Learning Safe Multi-Agent Control with Decentralized Neural Barrier Certificates.
For docs/agent_landscape_paper_level.md section L1 · Domain Research Agents: add CycleResearcher: Improving Automated Research via Automated Review.

Reproducibility

Pipeline script: scripts/ml_venue_survey.py
Curated JSONL: data/ml_venue_papers.jsonl
Cache: .cache/semantic_scholar
Last refreshed: 2026-06-03T03:07:09Z
To re-run:

python scripts/ml_venue_survey.py \
  --output docs/ml_top3_venue_survey_2021_2026.md \
  --data-output data/ml_venue_papers.jsonl
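
After a re-run, a quick sanity check on the curated JSONL can confirm that the output matches the tables at the top of this report. The sketch below tallies records by one field. The field name `venue` and the helper name `count_by_key` are assumptions made for illustration; the actual record schema written by the pipeline is not documented here, so adjust the key to match it.

```python
import json
from collections import Counter
from pathlib import Path


def count_by_key(jsonl_path, key="venue"):
    """Tally JSONL records by one field.

    `key` defaults to "venue", which is an assumed field name; records
    that lack the field are counted under "<missing>".
    """
    counts = Counter()
    with Path(jsonl_path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            counts[record.get(key, "<missing>")] += 1
    return counts


if __name__ == "__main__":
    path = Path("data/ml_venue_papers.jsonl")
    if path.exists():
        counts = count_by_key(path)
        print(dict(counts), "total:", sum(counts.values()))
```

If the schema matches these assumptions, the total should equal the 120 curated papers reported in the header, and the per-venue counts should match the Curated Papers Per Venue Per Year table.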

Notes:

Semantic Scholar venue metadata can include sparse or alias-sensitive proceedings records; the script retries aliases when a canonical venue/year query is empty.
Stage A is a transparent Codex relevance judgement encoded as weighted topic signals so the run is reproducible without a paid LLM key. Replace judge_paper() with an API LLM call if a stricter LLM-as-judge audit is required.
2026 proceedings are incomplete as of 2026-06-03; zero or low counts for that year reflect the timing of the snapshot, not a venue's absence.
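
To make the Stage A note concrete, here is a minimal sketch of what a relevance judgement "encoded as weighted topic signals" could look like. It is not the pipeline's actual `judge_paper()`: the topic phrases, weights, threshold, signature, and the `Judgement` type are all hypothetical, chosen only to show the shape of a deterministic, key-free scorer.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical topic signals and weights; the values used by the real
# pipeline are not shown in this report.
TOPIC_WEIGHTS = {
    "time series": 2.0,
    "reinforcement learning": 1.5,
    "agent": 1.5,
    "benchmark": 1.0,
    "conformal": 1.0,
}


@dataclass
class Judgement:
    score: float        # sum of the weights of all matched topics
    matched: list[str]  # topic phrases found in the text
    keep: bool          # True when score reaches the threshold


def judge_paper(title: str, abstract: str = "", threshold: float = 1.5) -> Judgement:
    """Score a paper by case-insensitive substring matches on topic phrases.

    The threshold of 1.5 is an illustrative assumption, not the
    pipeline's retention rule.
    """
    text = f"{title} {abstract}".lower()
    matched = [topic for topic in TOPIC_WEIGHTS if topic in text]
    score = sum(TOPIC_WEIGHTS[topic] for topic in matched)
    return Judgement(score=score, matched=matched, keep=score >= threshold)


if __name__ == "__main__":
    print(judge_paper("Adaptive Conformal Predictions for Time Series"))
    print(judge_paper("Neural Symbolic Regression that Scales"))
```

Because the score depends only on the input text and a fixed weight table, two runs over the same cached metadata give the same decisions, which is the reproducibility property the note describes. Swapping in an API LLM call would keep the same return shape while giving up that determinism.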