Amazon Science homepage

Ground truth is a process, not a dataset

Wed, 03 Jun 2026 15:56:57 GMT

Today, the key challenge in AI isn’t only how to build better models; it’s how to build evaluation systems that can keep up. Search-augmented AI systems can now produce deep research reports: long, polished syntheses of many sources that increasingly resemble expert analysis. But those reports are useful only if their claims are supported by the underlying literature.

Most existing fact-checking tools work best when a claim can be matched to a short quote or a single document. But in AI-generated research reports, a single sentence may combine evidence from several sources. It can depend on the surrounding report for context, and it might compare assertions in a way that no single source does on its own.

When Amazon’s Artificial General Intelligence (AGI) group started working on the problem of evaluating AI-generated research reports, we thought that the main technical challenge would be building a stronger AI fact checker. But before you can evaluate an AI fact checker, you need a benchmark, a standardized test set used to measure performance. And in this setting, building the benchmark turned out to be at least as hard as building the model.

Traditionally, we view the ground truth for a problem as a fixed dataset. But we discovered that to evaluate complex AI properly, ground truth has to become a process. We call that process audit-then-score, and we present it, together with two accompanying datasets, in a paper we recently published to arXiv.

When static datasets break down

In the standard method for measuring AI performance, human experts label examples, those labels become the “ground truth” (the undisputed correct answers), and models are scored against them. To test this approach with AI-generated research reports, we recruited PhD-level specialists from fields such as computer science, control theory, education, public health, and environmental engineering.
We asked them to verify claims from reports in their own specialties, mixing in a hidden set of claims whose answers we already knew. The result was sobering. In a controlled study, unassisted experts achieved only 60.8% accuracy on the hidden set of known answers. The issue was not a lack of expertise. It was that assessing deep-research factuality is an unusually demanding task. Verifying a single claim can require long-context reading, cross-document synthesis, and sustained attention.

Normally, in machine learning, when a model disagrees with a benchmark, we assume the model made a mistake. But we realized that, in cognitively demanding tasks like deep research, disagreement should not automatically be treated as a model failure. Sometimes, a model’s “error” is actually a signal that the benchmark itself is ambiguous, incomplete, or wrong.

Audit, then score

Instead of treating the initial expert labels as unquestionable ground truth, we decided to use the models to actively scrutinize the benchmark. This is the core idea behind the audit-then-score protocol. Our paper introduces the protocol alongside DeepFact-Bench, a shared test set for comparing systems, and DeepFact-Eval, a system that checks whether literature supports report claims.

Here is how the protocol works: When our AI fact checker disagrees with the current benchmark answer, it is not simply penalized. Instead, it acts as a challenger and must submit concrete evidence and a written rationale for why it thinks the original human answer is wrong. An auditor (which can be a human expert) then steps in. Crucially, auditors do not start from scratch; they compare the challenger’s new evidence directly against the benchmark’s original rationale. If the challenger makes the stronger case, we revise the benchmark before we score the model.
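As a rough illustration of the protocol described above, the following Python sketch runs a single audit-then-score round. It is not the paper's implementation: the names `BenchmarkItem`, `Challenge`, `toy_checker`, and `toy_auditor` are hypothetical, the data is made up, and the auditor is a trivial stand-in for an expert comparing two rationales.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkItem:
    claim: str
    label: str       # current benchmark verdict
    rationale: str   # rationale behind the current verdict


@dataclass
class Challenge:
    proposed_label: str
    evidence: str
    rationale: str


def audit_then_score(benchmark, checker, auditor):
    """Audit every disagreement first, then score against the revised labels."""
    predictions = []
    revisions = 0
    for item in benchmark:
        verdict, evidence, rationale = checker(item.claim)
        predictions.append(verdict)
        if verdict != item.label:
            # The checker acts as a challenger instead of being penalized.
            challenge = Challenge(verdict, evidence, rationale)
            if auditor(item, challenge):
                item.label = challenge.proposed_label
                item.rationale = challenge.rationale
                revisions += 1
    correct = sum(p == item.label for p, item in zip(predictions, benchmark))
    return correct / len(benchmark), revisions


# Toy data (hypothetical): the second benchmark label is assumed to be wrong.
benchmark = [
    BenchmarkItem("claim A", "supported", "expert: source 1 agrees"),
    BenchmarkItem("claim B", "unsupported", "expert: no source found"),
    BenchmarkItem("claim C", "supported", "expert: source 3 agrees"),
]
toy_outputs = {
    "claim A": ("supported", "source 1, section 2", "matches source 1"),
    "claim B": ("supported", "source 7, table 4", "source 7 reports the figure"),
    "claim C": ("unsupported", "", "could not find support"),
}


def toy_checker(claim):
    return toy_outputs[claim]


def toy_auditor(item, challenge):
    # Stand-in for expert judgment: accept only challenges with concrete evidence.
    return bool(challenge.evidence)


accuracy, revisions = audit_then_score(benchmark, toy_checker, toy_auditor)
print(f"accuracy={accuracy:.3f}, revisions={revisions}")
```

In this toy run the challenge on claim B is accepted and the label is revised before scoring, while the challenge on claim C is rejected for lack of evidence and still counts as a checker error.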
DeepFact-Eval reads the full report context, plans searches to cover the relevant literature, summarizes retrieved documents, and asks follow-up questions when key details are missing. It then produces both a verdict and a written explanation.

This fundamentally changes what a benchmark is.

A new role for human expertise

One of the most striking things we found is that the same experts who were unreliable as one-shot labelers became far more reliable when placed in the role of auditor. Across four rounds of audit-then-score, accuracy on our hidden test set rose from 60.8% to 90.9%. When experts start from a blank page, they have to find the evidence, interpret it, and make a judgment on their own; when they audit a disputed claim, they can focus on comparing two concrete cases.

This shift had a significant impact. On DeepFact-Bench, DeepFact-Eval reached 83.4% accuracy when we used GPT-4.1 as the underlying model. That was higher than the 58.5% of the best traditional fact-checking system we tested and the 69.1% of a strong prior deep-research system.

Evaluation as an evolving infrastructure

This shift has implications beyond one paper or one task. If AI systems continue improving, to the point that they exhibit humanlike expertise, the community will increasingly run into settings where evaluation based on one-time human answers is not enough. In those settings, sustaining benchmark quality may require auditing, revision, calibration, and periodic revalidation. Evaluation will become an ongoing collaboration among humans, models, and the evidence they surface together.

Acknowledgments: Yukun Huang, Leonardo F. R. Ribeiro, Momchil Hardalov, Markus Dreyer

How flat is replacing fat in AWS data center networks

Thu, 28 May 2026 10:30:00 GMT

Routing in today’s data centers is usually governed by a data structure called a “fat tree”, which is similar to a corporate organizational chart, with nodes in each layer connecting to multiple nodes in the layer below. Here, however, the nodes of the bottom layer represent routers that want to send messages to each other, and the layers above them contain extra routers that simplify the routing procedure. A message sent by one bottom-layer router climbs the tree until it reaches the branch that leads to the destination router, and then it is sent down.

This design is easy to implement but inefficient: the extra layers of routers add overhead, and routers at the top of the tree are prone to congestion. The fat-tree structure is also fragile, since the loss of a single router can cut off large regions of the tree.

Theoretically, the best alternative is a “flat” network, in which the routers connect directly to each other. Ideally, one should connect the routers randomly, to maximize the diversity of routes through the network. But this is impractical, because calculating ad hoc paths through a random network is computationally intensive, and randomly connecting routers leads to data centers criss-crossed with wires.

In a paper we recently posted to arXiv, we describe the first-ever scalable flat-network data center. We introduce a “quasi-random” network topology that preserves many of the benefits of random connection and a passive optical component we call a ShuffleBox, which makes it practical to cable a flat network.

The resulting network design, which we call RNG, for resilient network graphs, is now used in AWS data centers and is the default for most new builds globally. It uses 69% fewer routers, delivers up to 33% better throughput, and is projected to cut the electricity consumption of network equipment by 40%.
The secret of randomness

In the early 1990s, mathematicians showed that the optimal network for routing has a random topology, in which each router simply connects randomly to a few others. This is quite counterintuitive, but the overall network ends up having lots of different paths between all pairs of routers. Random networks also demonstrate excellent resilience, since no single router is more important than any other. The loss of 1% of routers results in a roughly 1% capacity loss. Degradation is proportional and predictable rather than catastrophic and concentrated.

Networking researchers have also validated these results through simulations, showing that random, flat topologies achieve better performance than the corresponding fat trees. But these results did not carry over to the real world. Any network design comes with a “routing protocol” that decides how packets reach their destinations. In a random network, computing and implementing the right set of routing paths can take a lot of hardware resources, well beyond what is present in commodity routers. On the other hand, using dedicated hardware for routing would be cost-prohibitive. An even bigger problem is that cabling routers randomly in a data center is completely infeasible.

Our solution is to build a “quasi-random” network topology that has exactly the right mix of random and deterministic components.

Routing without structure

In a fat tree, the hierarchy itself tells packets where to go, and the paths generated are guaranteed to be the shortest possible. In a quasi-random graph, there is no obvious structure to exploit. Standard approaches to multipath routing in flat topologies typically require 20 to 80 times more memory than commodity hardware is equipped with. Our key insight is that we can exploit the random structure of the topology to open up a wide range of path options in a lightweight manner. Our routing algorithm, Spraypoint, has two components.
The source router “sprays” its traffic randomly to all of its neighbors. Every (destination) router has some designated “waypoints” that feed traffic to it. The main scheme is that each data packet sent from the source goes to a random neighbor, after which the classic shortest-path algorithm routes it to a waypoint, and the waypoints feed it to the destination. The utility of spraying is that traffic can take a wide variety of paths to the destination, while the waypoints prevent traffic from congesting near the destination. In the implementation, we create various “rings” around each destination, and traffic is guided from each ring to a closer ring.

By spraying to neighbors, Spraypoint provides nearly twice as many independent paths between routers as standard shortest-path routing techniques. This improves the likelihood that traffic will be routed around congested pathways or failed routers.

Making quasi-random cabling practical

A random graph connects arbitrary pairs of routers that may sit in different rooms, hundreds of meters apart. This is the strength of the topology, since it allows for fast communication between routers. But that is also its drawback, since cabling such a structure is extremely complicated.

This is where our quasi-random solution comes in. Instead of all connections being random, we fix specific parts of the network topology. Our central innovation is a passive optical device called a ShuffleBox. It has router-facing ports on one side and connects to other ShuffleBoxes on the other side. The internal wires are shuffled according to a special pattern, so that random connections between the ShuffleBoxes lead to an overall quasi-random topology. When a new rack arrives, a technician plugs its router into an available port on the local ShuffleBox. No rewiring is needed elsewhere.
The physical-cabling complexity, the number of cable runs, and the installation process are on par with those of a fat tree, even though the logical topology is quasi-random.

Predicting performance before construction

With any new network topology, operators need confidence that it will meet capacity and performance requirements before they commit to construction. Fat-tree topologies come with simple, well-defined models that predict performance and capacity constraints. No equivalent existed for quasi-random graphs.

We developed new mathematical models for various network statistics, such as path lengths, the number of routes, and how much traffic will end up on a particular link. These models give precise formulas that network operators can use to choose design parameters. We validated those models extensively, using 530 processor-years of simulation, the equivalent of running a single CPU for half a millennium, executed on Amazon EC2. An operator can now specify a server count and a target performance level, compute the cheapest compliant topology, and be confident that it will work.

From theory to production

The first quasi-random network went live near Dublin, Ireland, at the end of 2024, carrying real production traffic. We validated performance against the mathematical predictions, identified operational refinements, and applied them in two additional deployments. In end-to-end benchmarks across these production fabrics, our flat topology matched fat-tree performance for multipath-transport workloads and latency-sensitive storage operations. No customer workload changes were required, and the network operates transparently beneath existing applications.

By April 2026, quasi-random wiring became the default architecture for most new AWS data centers globally. The 69% reduction in the number of routers translates directly into reduced power, cooling, and operational overhead at every site.
For customers, it means more resilient infrastructure behind every API call, database query, and machine learning training job, without changing a single line of code.
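To make the spray-then-waypoint idea from the routing section more concrete, here is a minimal Python sketch on a toy random graph. It is an illustration under stated assumptions, not the production algorithm: the helper names (`build_flat_network`, `shortest_path`, `spraypoint_route`) are hypothetical, the waypoints are simply assumed to be two neighbors of each destination, and the "rings" mentioned in the article are not modeled.

```python
import random
from collections import deque


def build_flat_network(n, extra_links, rng):
    """Toy flat topology: a ring (to guarantee connectivity) plus random links."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        u = (v + 1) % n
        adj[v].add(u)
        adj[u].add(v)
    for v in range(n):
        for u in rng.sample(range(n), extra_links):
            if u != v:
                adj[v].add(u)
                adj[u].add(v)
    return adj


def shortest_path(adj, src, dst):
    """Breadth-first search; returns the list of nodes from src to dst."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in sorted(adj[node]):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    path = []
    node = dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def spraypoint_route(adj, waypoints, src, dst, rng):
    """Spray to a random neighbor, route to a waypoint, then hand off to dst."""
    first_hop = rng.choice(sorted(adj[src]))   # the "spray" step
    waypoint = rng.choice(waypoints[dst])      # assumed: a neighbor of dst
    middle = shortest_path(adj, first_hop, waypoint)
    if dst in middle:                          # stop early if we pass through dst
        return [src] + middle[: middle.index(dst) + 1]
    return [src] + middle + [dst]


rng = random.Random(0)
adj = build_flat_network(n=64, extra_links=3, rng=rng)
# Assumption for illustration: each router's waypoints are two of its neighbors.
waypoints = {v: sorted(adj[v])[:2] for v in adj}

sprayed = {tuple(spraypoint_route(adj, waypoints, 0, 40, rng)) for _ in range(200)}
direct = shortest_path(adj, 0, 40)
print(len(sprayed), "distinct sprayed paths; shortest path has", len(direct) - 1, "hops")
```

Plain shortest-path routing as written here always returns the same path for a given pair, whereas the sprayed routes vary with the random first hop and waypoint, which is the path diversity the article attributes to spraying.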

Amazon Research Awards recipients announced

Wed, 27 May 2026 17:21:51 GMT

Amazon Research Awards (ARA) provides unrestricted funds and AWS Promotional Credits to academic researchers investigating various research topics in multiple disciplines. This cycle, ARA received many excellent research proposals from across the world and today is publicly announcing 70 award recipients who represent 49 universities in 11 countries.

This announcement includes awards funded under 6 calls for proposals during the fall 2025 cycle: AI for Information Security, Agentic AI, Automated Reasoning, AWS Cryptography, Cybersecurity and Anti-Abuse Technologies, and Sustainability. Proposals were reviewed for the quality of their scientific content and their potential to impact both the research community and society. Additionally, Amazon encourages the publication of research results, presentations of research at Amazon offices worldwide, and the release of related code under open-source licenses.

Recipients have access to more than 700 Amazon public datasets and can utilize AWS AI/ML services and tools through their AWS Promotional Credits. Recipients also are assigned an Amazon research contact who offers consultation and advice, along with opportunities to participate in Amazon events and training sessions.

"Fraud and abuse evolve at the speed of the technologies that bad actors exploit. Since we can only defend against what we can measure, the science of studying those technologies has to keep pace," said Dhruv Kuchhal, Applied Scientist, Special Projects & Invest-Fixed. "Through ARA, we bring together experts across industry and academia to tackle these problems upstream and publish defenses that systematically raise bad actors' operating costs and erode their ROI as they spread across the ecosystem. This not only strengthens Amazon, but the broader Web, including online shopping customers, sellers and brands who build businesses online, and the platforms and payment rails that tie them together.
We were impressed by the quality and volume of proposals we received, a strong signal that the field is raising the bar for Web users everywhere, and we look forward to working with the new recipients to turn this research into lasting, ecosystem-wide improvements in fraud and abuse prevention."

“AI is reshaping cybersecurity faster than ever in advancing how we detect threats and defend systems,” said Wei Ding, Applied Science Manager, GuardDuty, AWS. “At the same time, agentic AI requires stronger guarantees of safety, robustness, and trustworthiness. Since 2020, our team has funded security research that solves some of the biggest challenges for the industry. We’re pleased to continue our tradition of fostering innovation through these latest research projects addressing agentic AI security, AI-powered incident response, and threat detection in agentic AI systems and cloud environments, among other exciting areas.”

ARA funds proposals throughout the year in a variety of research areas. Applicants are encouraged to visit our call for proposals page for more information or send an email to be notified of future open calls.
The tables below list, in alphabetical order by last name, fall 2025 cycle call-for-proposal recipients, sorted by research area.

AI for Information Security

Recipient | University | Research title
Peng Gao | Virginia Polytechnic Institute and State University | CortexCTI: A Unified Threat Intelligence Engine for Knowledge-Driven Cloud Threat Detection and Response
Guofei Gu | Texas A&M University | New Benchmark and Defense on Prompt Injection in Agentic AI Systems
Xiyang Hu | Arizona State University | Securing Agentic AI: From Local Detection to Global Assurance
Adriana Sejfia | The University of Edinburgh | Exploit-driven AI Agents for vulnerability detection verification
Yue Zhao | University of Southern California | Securing Agentic AI: From Local Detection to Global Assurance

Automated Reasoning

Recipient | University | Research title
Jonathan Aldrich | Carnegie Mellon University | A Visual Debugger for Program Verification
Dalal Alrajeh | Imperial College London | SOLAR: Symbolic Learning for Automated Requirements Consistency
Maria Paola Bonacina | University of Verona | New Data Structure Theories and Quantifiers in CDSAT
Jason Cong | University of California Los Angeles | Breaking the Parallelism Limit with SAT-solving Accelerators
Lucas Cordeiro | The University of Manchester | Combining Formal Methods with Large Language Models in ESBMC: Enabling Automated Program Verification through AI/ML
Werner Dietl | University of Waterloo | Strata-Sphere: Expressive Type Systems and Language Formalizations
Katalin Fazekas | TU Wien | PASSAT: Improved Passing of Assertion Stacks to SAT in Incremental SMT Solvers
Sicun Gao | University of California San Diego | Evaluating and Improving Quantitative Reasoning in LLM Agents Using Sandbox Coding Tasks and Formal Tools
Milos Gligoric | The University of Texas at Austin | Documenting and Recommending Tactics in HOL Light
Ronghui Gu | Columbia University in the City of New York | Scaling Formal Verification of Security Properties for Unmodified System Software
Tyler Josephson | University of Maryland Baltimore County | Autoformalization for Scientific Computing in Lean
Junyi Li | The University of Texas at Austin | Documenting and Recommending Tactics in HOL Light
Xiaorui Liu | North Carolina State University | Neurosymbolic LLM Reasoning with Symbolical Soundness and Logical Consistency
Azalea Raad | Imperial College London | Soteria in Lean: Mechanising the Next Generation of Symbolic Execution Tools
Dominik Schreiber | Karlsruhe Institute of Technology | Resource-Efficient Flexible SAT Solving in HPC and Cloud Environments
Ilya Sergey | National University of Singapore | Linear Types for a Foundational Multi-Modal Program Verifier
Peter Sewell | University of Cambridge | Gradual Lightweight Methods for High-Assurance Cloud Infrastructure
Paulo Shakarian | Syracuse University | Non-Markovian Agentic Meta-Reasoning
Armando Solar-Lezama | Massachusetts Institute of Technology | Synthesizing Library Models for Static Analysis via LLMs and Conformance Testing
Salil Vadhan | Harvard University | Translating Formal Proofs of Differential Privacy via LLMs
Nickolai Zeldovich | Massachusetts Institute of Technology | Verifying Rust distributed system implementations using monotonic ownership state machines in Verus
Xuezhou Zhang | Boston University | Auto-Formalization and Informalization through Two-Stage Reinforcement Learning
Tianyi Zhang | Purdue University | Scaling Interprocedural Data-Flow Analysis with LLMs

AWS Agentic AI

Recipient | University | Research title
Raman Arora | Johns Hopkins University | Multi-Party Differential Privacy: Unlocking Enterprise Agentic AI
Fanglin Che | Worcester Polytechnic Institute | Autonomous Catalyst Design with Agentic AI for Hydrogen Production
Muhao Chen | University of California Davis | FlowGuard: Evolutionary Red-Teaming for Safe Multimodal Web Agents
Ioannis Demertzis | University of California Santa Cruz | CAMEO: Confidential Agentic Multi-component Enclave Orchestration
Caiwen Ding | University of Minnesota Twin Cities | End-to-End Agentic AI for Scalable Chiplet Design with Extreme Parallelism and Heterogeneity
Ariel Felner | Ben-Gurion University of the Negev | Multi-Agent Pathfinding with Unassigned Agents
Zhaomiao Guo | The University of Texas at Austin | From Observation to Intervention: Counterfactual Multi-Agent World Models for Autonomous Driving
Jiangen He | The University of Tennessee-Knoxville | Beyond Walls of Text: Building UI-Native LLM Agents as the Next Gateway to the Internet
Fan Lai | University of Illinois at Urbana-Champaign | Reinforcing Coordination: Streaming, Exploration, and Distillation for Long-Horizon Agent Learning
Ziyang Li | Johns Hopkins University | A Protocol Stack for Resource-Bound Multi-Agent AI
Henry Liu | University of Michigan | Automating Large Scale Deployment of Infrastructure-based Safety Critical Event Detection with Agentic AI
Bryan Low Kian Hsang | National University of Singapore | Self-Configurable Agentic Learning via Co-optimization
Chinmay Maheshwari | Johns Hopkins University | Markov Near-Potential Function Based MARL Training for Mixed Cooperative–Competitive Agentic AI
Arash Noshadravan | Texas A&M University | A Retrieval-Augmented Dual-Attention Vision Framework for Standards-Aligned Infrastructure Inspection
Muhammad Shafique | New York University Abu Dhabi | AVAAS – Automated Vulnerability Analysis Through Advanced Agentic Systems
Roni Stern | Ben-Gurion University of the Negev | Multi-Agent Pathfinding with Unassigned Agents
Zhengzhong Tu | Texas A&M University | FlowGuard: Evolutionary Red-Teaming for Safe Multimodal Web Agents
Lu Wang | University of Michigan | Benchmarking and Monitoring Multi-Agent Scheming
Yuke Wang | Rice University | Empowering Multimodal AI Agents with Continuous Learning
Hamed Zamani | University of Massachusetts Amherst | A Framework for Proactive and Collaborative AI Agents
Yang Zhao | University of Minnesota Twin Cities | End-to-End Agentic AI for Scalable Chiplet Design with Extreme Parallelism and Heterogeneity
Victor Zhong | University of Waterloo | KNOWLEDGESTORE: A Dynamic Hierarchical Memory for Scalable, Enterprise-Ready AI Agents on AWS

AWS Cryptography

Recipient | University | Research title
Sri AravindaKrishnan Thyagarajan | The University of Sydney | Efficient Robust Post-Quantum Distributed Key Generation and Threshold Signatures
Daniel J. Bernstein | University of Illinois at Chicago | Formally verified symmetric cryptography
Jeremiah Blocki | Purdue University | Stronger Memory Hard Functions to Protect Passwords against Brute Force Attacks
Geoffroy Couteau | Paris Cité University | Pseudorandom Correlations for Threshold Cryptography
Yevgeniy Dodis | New York University | Machine Unlearning and Computational Assumptions for AI
Zhengzhong Jin | Northeastern University - United States of America | Practical Watermarking for LLMs via Pseudorandom Codes
Yael Kalai | Massachusetts Institute of Technology | Enhancing AI Safety Using Cryptography
John Liagouris | Boston University | Pushing secure MPC beyond niche applications
Rafail Ostrovsky | University of California Los Angeles | Towards Low-Latency Maliciously Secure MPC for LLMs
Rachel Player | Royal Holloway - University of London | New Approaches for the Linear Transform in BFV/BGV
Elaine Shi | Carnegie Mellon University | Practical Secure Computation At Scale
Akshayaram Srinivasan | University of Toronto | Simultaneous-Message and Succinct Secure Computation
Douglas Stebila | University of Waterloo | FantASM: Fast, Auditable, and Neat Assembly
Ni Trieu | Arizona State University | Fuzzy Secure Computation for Real-World Noisy Data
Xiao Wang | Northwestern University - United States of America | From Signing to Garbling: Exploring the Spectrum of Post-Quantum Primitives
Mark Zhandry | Stanford University | Algorithms for Post-Quantum Cryptography
Jiaheng Zhang | National University of Singapore | Practical Watermarking for LLMs via Pseudorandom Codes
Vassilis Zikas | Georgia Institute of Technology | Fuzzy Secure Computation for Real-World Noisy Data

Cybersecurity and Anti-Abuse Technologies

Recipient | University | Research title
Geoffrey Voelker | University of California San Diego | Detecting Anti-detect Browsers at Scale

Devices Sustainability

Recipient | University | Research title
Udit Gupta | Cornell University | Agent-Driven Life Cycle Carbon Optimization for Sustainable Edge Devices
Adriana Schulz | Brown University | Integrating Sustainability Reasoning into Early-Stage Electronics Design

Diverse reasoning traces teach LLMs to make better decisions

Tue, 26 May 2026 15:17:06 GMT

Large language models (LLMs) are pretrained on huge volumes of unlabeled data, but afterward, they’re typically post-trained on specific tasks such as instruction following, avoiding harmful outputs, and reasoning, or providing justifications for the outputs they generate.

Parallel reasoning, in which multiple, diverse reasoning paths are generated and compared for the same problem, is emerging as a key tool for understanding the limits of LLMs’ reasoning capability. It also underpins techniques for testing LLMs such as self-consistency, where multiple reasoning paths are aggregated to improve accuracy.

LLMs are generally optimized for reasoning through supervised fine-tuning (SFT), in which each training example is labeled with a single, human-verified reasoning trace. Given the usefulness of parallel reasoning for evaluation, the question naturally arises: can we expand the limits of LLMs’ reasoning capacities by training them on diverse reasoning traces for each question?

In a paper we presented at this year’s International Conference on Learning Representations (ICLR), we propose a method for doing just that, which avoids some previously identified pitfalls of parallel reasoning.

To prompt a single LLM to adopt different reasoning strategies, we introduce a set of global forking tokens (shown in the figure below) in the post-training phase, each intended to elicit a distinct reasoning mode. These tokens enable the model to generate diverse, high-quality reasoning paths for the same problem.

However, naïve post-training strategies such as SFT can lead to mode collapse, where different reasoning tokens produce nearly identical behaviors. To address this, we propose set-supervised fine-tuning (SSFT), a simple and principled training approach that enables models to learn multiple distinct reasoning strategies from diverse supervision.
Instead of representing reasoning with a single trace, SSFT models it as a set of complete solution paths, which arrive at the same answer through different strategies. To further teach the model which reasoning strategy to adopt in what contexts, we introduce a reinforcement learning paradigm we call global forking policy optimization. Between these two techniques, we observe gains of roughly 5 to 7 percentage points in single-shot accuracy on standard benchmarks, indicating that improved reasoning-mode selection directly translates to better end-to-end performance.

Supervised fine-tuning

In practice, multiple reasoning traces for the same question can be obtained by prompting multiple teacher models, sampling alternative reasoning paths from a single model, or aggregating solutions from heterogeneous sources. SSFT pairs each such trace with a dedicated forking token, where each token indicates a different reasoning mode. During training, a bipartite matching step assigns traces to tokens for each question, encouraging the model to learn distinct behaviors rather than collapsing to a single pattern.

The training objective sums the next-token prediction (NTP) losses across all matched pairs, evaluating each reasoning trace conditioned on its assigned control token. As a result, each forking token is specialized to a distinct reasoning strategy, and the model produces more diverse solutions (measured by pass@k, the probability that at least one of k generated answers is correct) while maintaining strong single-shot accuracy (pass@1).

Reinforcement learning

While supervised training encourages the model to learn diverse reasoning strategies, it does not explicitly teach the model which strategy to use for a given question. Choosing the right reasoning mode is inherently a decision problem, making it a natural fit for reinforcement learning.
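Before turning to the reinforcement learning stage, the matching step from the supervised stage can be made concrete with a small Python sketch. It assumes, as one plausible reading of the description, that the matching cost is the NTP loss of each trace conditioned on each forking token and that the cheapest one-to-one assignment is chosen; the article does not spell out the cost. The function name `match_traces_to_tokens` and the loss values are hypothetical.

```python
from itertools import permutations


def match_traces_to_tokens(loss):
    """Min-cost one-to-one assignment of traces to forking tokens.

    loss[i][j] is the NTP loss of trace i when conditioned on token j.
    Requires at least as many tokens as traces. Brute force is fine for a
    handful of tokens; a Hungarian-algorithm solver would be used at scale.
    """
    n_traces, n_tokens = len(loss), len(loss[0])
    best_assignment, best_total = None, float("inf")
    for tokens in permutations(range(n_tokens), n_traces):
        total = sum(loss[i][tokens[i]] for i in range(n_traces))
        if total < best_total:
            best_assignment, best_total = tokens, total
    return list(best_assignment), best_total


# Hypothetical losses for three traces (rows) under three tokens (columns).
toy_loss = [
    [0.9, 1.0, 2.0],
    [0.8, 2.5, 2.2],
    [2.1, 1.9, 1.2],
]

# assignment[i] is the token matched to trace i; set_loss is the summed NTP
# loss over the matched pairs, which is what the training objective sums.
assignment, set_loss = match_traces_to_tokens(toy_loss)
print(assignment, round(set_loss, 3))
```

Note that traces 0 and 1 both have their lowest loss under token 0. Assigning each trace greedily would map both to the same token, which is the collapse the matching step is meant to prevent; the one-to-one constraint forces them onto different tokens.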
We address this with global forking policy optimization (GFPO), a lightweight reinforcement learning approach that learns to select the most effective reasoning mode for each input. For a given question x, the model samples a global forking token from a distribution over control tokens. The model then produces an answer conditioned on the sampled token, and the output is verified to obtain a reward signal (e.g., correct or incorrect). These rewards are converted into advantages, which are used to update the policy over forking tokens. Importantly, the generated reasoning traces are treated as rollouts: their gradients are detached and used only for computing rewards, not for direct optimization.

By focusing optimization on the forking-token distribution, GFPO avoids the complexity of token-level reinforcement learning while still capturing the key decision: selecting the right reasoning mode upfront. This makes training both efficient and stable, while directly improving end-to-end performance. Together, SSFT and GFPO enable models to both learn diverse reasoning strategies and select the right one at inference time.

Evaluation

We evaluate SSFT+GFPO on both reasoning and coding benchmarks along two axes: (i) accuracy and (ii) diversity of reasoning. Across all settings, SSFT+GFPO consistently outperforms standard pipelines, such as SFT+GRPO.

Benchmark | Pass@1 | Gain
AIME 2025 | 58.80% | +6.84 vs. SFT+GRPO
AIME 2024 | 64.22% | +5.37 vs. SFT+GRPO
LiveCodeBench-v5 | 52.07% | +4.94 vs. SFT

Beyond accuracy, a key goal of SSFT is to address mode collapse. SSFT explicitly encourages specialization, allowing different tokens to represent distinct reasoning strategies. This leads to two important effects. First, each global forking token consistently triggers a distinct reasoning pattern. Second, this diversity improves pass@k without compromising pass@1. This contrasts with temperature-based sampling, where increasing diversity typically comes at the cost of accuracy.
Below, we present a qualitative example illustrating our approach on a representative problem from the AIME 2025 benchmark, a challenging math reasoning dataset. The same question is solved using multiple qualitatively distinct strategies — such as algebraic manipulation, geometric reasoning, and case-based analysis — depending on the selected global forking token.
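The GFPO update described in the reinforcement learning section can be illustrated with a toy policy-gradient loop over forking tokens. This is a simplified sketch, not the paper's method: it handles a single question with free logits instead of a question-conditioned policy, it assumes advantages are rewards minus the group mean, and `toy_reward` is a hypothetical black box standing in for "generate an answer conditioned on the token, then verify it".

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gfpo_step(logits, reward_fn, rng, group_size=8, lr=0.5):
    """One policy-gradient step on the distribution over forking tokens."""
    probs = softmax(logits)
    tokens = rng.choices(range(len(logits)), weights=probs, k=group_size)
    # Rollouts are only scored; nothing is backpropagated through them.
    rewards = [reward_fn(t) for t in tokens]
    baseline = sum(rewards) / group_size
    grad = [0.0] * len(logits)
    for t, r in zip(tokens, rewards):
        advantage = r - baseline  # assumed: group-mean baseline
        for j in range(len(logits)):
            # d log pi(t) / d logit_j = 1[j == t] - probs[j]
            grad[j] += advantage * ((1.0 if j == t else 0.0) - probs[j])
    return [z + lr * g / group_size for z, g in zip(logits, grad)]


def toy_reward(token):
    # Hypothetical verifier: only reasoning mode 2 solves this question.
    return 1.0 if token == 2 else 0.0


rng = random.Random(0)
logits = [0.0, 0.0, 0.0]
for _ in range(200):
    logits = gfpo_step(logits, toy_reward, rng)

final_probs = softmax(logits)
print([round(p, 3) for p in final_probs])
```

Only the three logits are trained here, which mirrors the point that optimization is focused on the forking-token distribution while the reasoning traces themselves serve purely as rollouts for computing rewards.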

Making LLMs faster without sacrificing accuracy

Fri, 15 May 2026 13:00:00 GMT

Large language models (LLMs) keep getting bigger and better. But the cost of running them (generating text, answering questions, powering real-time applications) is scaling up, too. Obviously, model accuracy is important, but for real-time AI-based web applications, it can’t come at the expense of efficiency.

In a paper we presented at the International Conference on Learning Representations (ICLR), we provide a framework for navigating this accuracy-versus-efficiency tradeoff, by connecting scaling laws directly to architectural-design decisions.

The gap in current scaling laws

In 2022, Google DeepMind announced the results of a study involving an experimental LLM called Chinchilla. The DeepMind researchers demonstrated a scaling law that enabled joint optimization of model size and training data to achieve a desired loss level, given a particular computational budget. More precisely, the law relates the model loss (L) to the number of model parameters (N) and the number of tokens in the training dataset (D):

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

The other variables in this equation (E, A, B, α, and β) are all learnable coefficients. The DeepMind researchers did extensive experimentation to tune those coefficients.

This "Chinchilla law" doesn't specify architectural choices, such as the size of the model's internal representations (the "hidden size") or the relative number of parameters allocated to attention layers and multilayer perceptron (MLP) layers. However, two models with the same billion-scale parameter count, trained on the same data and reaching the same accuracy, can differ by up to 40% in inference-time throughput, depending on additional architectural choices. We set out to deduce scaling laws that can help predict those choices.

The Transformer architecture

The Transformer architecture, which lies at the heart of all LLMs, consists largely of stacked attention and MLP blocks.
Attention blocks determine how much weight to give each prior token (word or word part) when updating the current token's representation; MLP blocks transform that representation further and are where much of the model's learned knowledge is stored. A separate output layer at the end of the stack converts the final representation into a probability distribution over the next token.

The attention mechanism uses three matrices, with names borrowed from information retrieval: the query matrix encodes what each token is looking for in the rest of the sequence; the key matrix encodes what each token has to offer; and the value matrix holds the content each token can contribute when it's attended to. Comparing queries against keys tells the model how relevant each token is to each other token.

Most LLMs use multihead attention: several attention computations run in parallel, each with its own query, key, and value projections. Different heads tend to specialize in different aspects of the input, letting the model capture a richer set of relationships than a single head would.

Our approach: Architecture as a first-class variable

In our ICLR paper, we introduce a scaling law that augments the Chinchilla framework with three architectural factors: the hidden size (the dimension of the vectors that flow through the embedding, attention, and MLP blocks); the ratio of the number of MLP parameters to the number of attention parameters; and grouped-query attention (GQA), in which groups of attention heads share key and value matrices while preserving distinct query matrices. Each factor has a direct impact on inference throughput:

- Hidden size (d_model): Under a fixed parameter budget, larger hidden sizes reduce total inference FLOPs and shrink the key-value cache, improving throughput.
- MLP-to-attention ratio (r_mlp/attn): A higher ratio allocates more parameters to the MLP and fewer to attention, shrinking the key-value cache and reducing memory-bandwidth bottlenecks.
- Grouped-query attention (GQA): Compressing key-value heads further cuts input/output costs during generation.

Adjusting these factors purely for higher throughput comes at a cost of accuracy. Both hidden size and MLP-to-attention ratio exhibit U-shaped loss curves: there is an optimal point for each, and pushing too far in either direction has a negative effect on model accuracy. GQA has a more erratic effect on loss, so we treat it as a discrete hyperparameter tuned through local search.

We deduce our scaling law in two stages. First, we fit the standard Chinchilla law to the model under investigation, calculating values for the coefficients E, A, B, α, and β. This establishes an optimal reference loss. Then we calibrate how each architectural choice (that is, each setting of the three factors we consider) affects that loss. Effectively, we learn a correction surface over the design space. Because the effects of hidden size and MLP-to-attention ratio on loss turn out to be separable, each factor can be optimized independently.

Two model families: Panda and Surefire

This scaling law enabled us to develop a search framework that identifies Pareto-optimal architectures for any given accuracy target. The result of that search was two model families: Panda (which maximizes accuracy) and Surefire (which is Pareto optimal on the accuracy–efficiency frontier).

To validate the framework and identify our families of optimal models, we trained more than 200 models with varying architectures (80 million to three billion parameters, eight billion to 100 billion tokens). The results of our experiments are below (throughput measured on an H200 GPU with a batch size of 128, 4,096 input tokens, and 1,024 output tokens):

| Model | d_model | GQA | r_mlp/attn | Loss | Avg. accuracy | Throughput vs. LLaMA-3.2 (vLLM) | Throughput vs. LLaMA-3.2 (SGLang) |
|---|---|---|---|---|---|---|---|
| LLaMA-3.2-1B | 2048 | 4 | 4.80 | 2.803 | 54.9% | baseline | baseline |
| Panda-1B | 2560 | 4 | 1.07 | 2.782 | 57.0% | -33% | - |
| Surefire-1B | 2560 | 9 | 3.60 | 2.804 | 55.4% | +21% | +47% |
| LLaMA-3.2-3B | 3072 | 3 | 4.80 | 2.625 | 61.9% | baseline | baseline |
| Panda-3B | 4096 | 3 | 1.00 | 2.619 | 62.5% | -23% | - |
| Surefire-3B | 4096 | 7 | 1.00 | 2.620 | 62.6% | +12% | +17% |

The billion-parameter Panda model gains 2.1 percentage points of average accuracy over LLaMA-3.2-1B, and the three-billion-parameter model gains 0.6 percentage points over LLaMA-3.2-3B, at the cost of lower throughput. Surefire models match or exceed LLaMA-3.2 accuracy while improving throughput by 12-47%, with gains reaching up to 42% on A100 (vLLM) and 47% on H200 (SGLang) under different model size and batch size configurations.

Key takeaways

Architecture is not an afterthought. The optimal MLP-to-attention ratio of LLaMA-3.2-style models is around 1.0, far lower than that of existing open-weight versions (e.g., 4.8 for LLaMA-3.2-1B). Current models overallocate to MLP layers. The right combination of hidden size, MLP-to-attention ratio, and GQA setting can unlock large efficiency gains with no accuracy cost.

Small-scale experiments predict large-scale outcomes. The conditional scaling law, calibrated on models with as few as 80 million to 297 million parameters, reliably predicts the best architecture at one billion and three billion parameters, enabling low-cost exploration before expensive full-scale training.

The framework generalizes across hardware and serving systems. Efficiency gains are consistent across A100/H200 GPUs and vLLM/SGLang, making the results directly actionable.
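To make two of the architectural factors discussed above more concrete, here is a minimal bookkeeping sketch. It is not the paper's scaling law or search code: the function names and the example shapes are assumptions chosen for illustration. It shows how a given MLP-to-attention ratio divides a block's parameter budget and how the GQA group size changes the number of key-value bytes cached per token.

```python
def split_block_params(block_params: int, r_mlp_attn: float) -> tuple[int, int]:
    """Split one block's parameter budget into (mlp_params, attention_params).

    r_mlp_attn is the ratio of MLP parameters to attention parameters.
    """
    if r_mlp_attn <= 0:
        raise ValueError("r_mlp_attn must be positive")
    attn = round(block_params / (1.0 + r_mlp_attn))
    return block_params - attn, attn


def kv_cache_bytes_per_token(n_layers: int, n_query_heads: int,
                             gqa_group_size: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of key-value cache stored for each token in the context.

    With GQA, every group of `gqa_group_size` query heads shares one key head
    and one value head, so only n_query_heads / gqa_group_size KV heads are cached.
    """
    if n_query_heads % gqa_group_size != 0:
        raise ValueError("query heads must divide evenly into GQA groups")
    n_kv_heads = n_query_heads // gqa_group_size
    # Factor of 2: one key vector and one value vector per KV head per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    # Hypothetical shapes, for illustration only.
    for group in (1, 4, 8):
        print(group, kv_cache_bytes_per_token(16, 32, group, 64))
    print(split_block_params(58_000_000, 4.8))
    print(split_block_params(58_000_000, 1.0))
```

Under these assumptions, moving from one KV head per query head to groups of four cuts the per-token cache by a factor of four. The sketch says nothing about the accuracy side of the tradeoff, which is what the scaling law is needed for.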

Promptimus: Improving already good LLM prompts with zero manual engineering

Thu, 14 May 2026 13:47:45 GMT

Large language models (LLMs) have become integral to enterprise applications across industries. Under the hood, customers’ inputs to the models are usually augmented with prompts that encode intricate business logic, regulatory requirements, and domain expertise: a healthcare system must use language compliant with the Health Insurance Portability and Accountability Act, for instance, and a financial trading system must follow risk tolerance rules.

These prompts are typically crafted by domain experts over weeks or months. Yet business demands continue to push for further performance gains. The challenge, therefore, is not engineering prompts from scratch but rather elevating already strong performance by discovering nuanced, task-specific refinements without compromising domain requirements.

In this post, we present Promptimus, a method for automatically optimizing well-developed prompts that has several advantages over its predecessors:

- It's model agnostic: It takes a prompt already optimized for a source model, rapidly reoptimizes it for a target model, and compares the optimized prompts across models.
- It's driven by performance criteria: It takes the existing prompt template, task-specific data samples, and user-defined performance metrics and generates targeted improvement strategies, iterating repeatedly to achieve domain-specific optimization objectives.
- It focuses on exploits: It uses a metric-analyzer AI agent to identify failure points and a debugging helper agent to identify root causes, and it surgically refines prompts relative to failures (rather than along random dimensions) for targeted performance improvement.
- It’s fully automated: It analyzes user-defined metrics and uses a code sanitization AI agent to generate debugging checkpoints automatically. Metric functions can be imported as Python code, and performance criteria can be added or modified at any time.
- It has an edit mode: For large, carefully structured prompts with complex business logic, the edit mode makes surgical, targeted modifications instead of rewriting the entire prompt, preserving the parts that already work while fixing exactly what’s broken.

Promptimus supports a wide range of textual and multimodal LLM tasks, including classification, extraction, generation, summarization, code generation, and tool use. In the following sections, we’ll present our methodology, the system architecture, and experimental results on multiple enterprise tasks.

Why good prompts are hard to improve

Attempts to automate prompt optimization are as old as prompt engineering itself, but approaches that work well when generating prompts from scratch struggle to improve well-engineered prompts. Random exploration strategies using generic directions like "be more creative" or "add examples" are ineffective, because the remaining improvements lie in very specific strategic directions. Sparse feedback in the form of scalar scores provides no guidance on why instances fail or how to improve.

On top of growing complexity from business domain demands, rapid model evolution further compounds the challenge of prompt optimization. As providers like Anthropic, OpenAI, Google, Meta, and Alibaba release new models, enterprises face recurring prompt migration challenges. Prompts optimized for one model often underperform on another due to different instruction-following characteristics. Manual reoptimization is costly and time consuming, and regression risks delay adoption of better models.
Methodology and system design

Promptimus addresses these challenges with a methodology built around a four-step iteration loop, with the following inputs:

- the LLM you aim to use for inference
- the initial prompt template
- a small JSONL dataset (typically 20–50 samples) with corresponding variables for prompt templates, split into a development set (for prompt tuning) and a held-out test set (for validation); it is not mandatory for the samples to contain the ground truth
- a user-defined performance-evaluation metric function (you can bring your own Python code)

The four-step iteration loop

Step 1 (evaluation): During initialization, the original prompt is executed on the target LLM using the development set (dev set) to establish baseline evaluation scores. Additionally, the metric-analyzer agent performs analysis of the user-defined metric function, generating checkpoint functions that decompose the evaluation into intermediate validation steps. These checkpoints enable fine-grained failure diagnosis throughout the optimization process. For example, when the checkpoints reveal that 98% of outputs have the correct JSON format, and 95% have valid schemas, but only 88% have valid values, the cause of underperformance is localized to value validation.

After the initial evaluation, Promptimus branches into either standard mode, where it conducts full prompt rewrites, or edit mode, where it modifies prompts with structured find-and-replace edits.

Step 2, standard mode (feedback generation): The LLM-driven feedback generator uses the metric checkpoints precomputed by the metric analyzer to diagnose failure patterns in the current-prompt results. It identifies the bottleneck checkpoint (the one with the lowest pass rate) and collects representative instances, including both failing and passing examples to provide contrast, then analyzes root causes and common failure modes. Finally, it provides actionable suggestions for fixing the prompt (such as "model outputs descriptive text instead of enum codes, suggest adding explicit constraint").

Step 2, edit mode (analysis + strategy + edit generation): After performing the same failure analysis as in the standard mode, the feedback generator proposes targeted find-and-replace edits, pinning changes to the exact locations responsible for specific failures.

Step 3, standard mode (strategy + full rewrite): Based on the feedback from the previous step, along with the metrics and data samples, the metaoptimizer analyzes task characteristics and generates task-specific exploration strategies, while maintaining all domain-specific requirements encoded in the original prompt. Then, for each strategy, the instruction optimizer proposes an improved prompt candidate that addresses the identified weaknesses and specific error patterns. This one-to-one coupling between strategies and candidates ensures diverse exploration of the optimization landscape.

Step 3, edit mode (programmatic edit application): For each proposed edit in step 2, Promptimus deterministically matches the edit to the identified failure with three match levels: exact match, whitespace-normalized fuzzy match, and similarity match near line reference. This process has a 97.3% success rate with zero LLM calls.

Step 4 (candidate evaluation): Each candidate is executed using the dev set, and the best candidate is selected by running the user-defined metric function. The best-performing candidate becomes the starting point for the next iteration. This exploration-focused process runs iteratively for a user-specified number of iterations, with each iteration building on what was learned and achieved in the previous one.

We recommend standard mode for short prompts that need significant expansion, for example, a two-line math prompt that needs to grow into detailed reasoning protocols.
Edit mode is a better choice for longer and already well-crafted prompts containing structured content like API schemas, compliance rules, or domain taxonomies, where full rewrites risk silently dropping or reorganizing carefully crafted sections. For a prompt with 50,000–100,000 tokens, a typical iteration produces three to five edits totaling 500–1,000 tokens, versus regeneration of the entire prompt.

More generally, Promptimus adds content only when the optimization loop surfaces unaddressed failure modes, so prompt length plateaus within the first few iterations. This means that the relative serving-time impact is small for already long production prompts and larger for short starter templates. If the optimized prompt is served as a cached system prompt, the additional cost is one call during the cache's time to live, which becomes negligible at scale.

Empirical experiments and analysis

We evaluated Promptimus against six leading automatic prompt optimization methods across 20 public benchmarks spanning reasoning, math, question answering, text-to-SQL, coding, function calling, instruction following, and multimodal tasks. All methods used the same optimizer model and evaluation budgets with Claude Sonnet 4.6 as the target model, averaged over five random seeds. Each benchmark used 20 dev samples for optimization and 100 held-out test examples for evaluation.

As reported in the table below, Promptimus achieves the best result on 16 of 20 benchmarks and ties on one, outperforming all six baselines on average (0.792 vs. 0.765 for the best-of-six baseline). The largest gains appear on tasks where the metric has a decomposable structure. Notably, Promptimus with edit mode outperforms every baseline on all four multimodal benchmarks, suggesting that vision-language prompts benefit from preserving existing visual-analysis structure rather than rewriting it.
| Benchmark | Metric | No optimization | Best of six baselines | Promptimus | Mode |
|---|---|---|---|---|---|
| BBH-CausalJudge | Acc [0,1] | 0.538 | 0.726 (GEPA) | 0.718 | Standard |
| BBH-DisambigQA | Acc [0,1] | 0.601 | 0.868 (GPO) | 0.908 | Standard |
| BBH-GeoShapes | Acc [0,1] | 0.747 | 0.770 (OPRO) | 0.936 | Standard |
| BBH-RuinNames | Acc [0,1] | 0.918 | 0.926 (GEPA) | 0.928 | Standard |
| BBH-Snarks | Acc [0,1] | 0.324 | 0.920 (OPRO) | 0.908 | Edit |
| GSM8K | Acc [0,1] | 0.658 | 0.964 (MIPROv2) | 0.958 | Standard |
| DAPO-AIME | Acc [0,1] | 0.703 | 0.730 (ProTeGi) | 0.79 | Standard |
| HotPotQA | F1 [0,1] | 0.16 | 0.832 (MIPROv2) | 0.839 | Standard |
| Spider | ExAcc [0,1] | 0.68 | 0.846 (GEPA) | 0.85 | Edit |
| BIRD | ExAcc [0,1] | 0.626 | 0.684 (ProTeGi) | 0.684 | Standard |
| BigCodeBench-hard | Pass@1 [0,1] | 0.339 | 0.336 (ProTeGi) | 0.345 | Standard |
| Codeforces | Pass@1 [0,1] | 0.589 | 0.808 (TextGrad) | 0.818 | Edit |
| BFCL | AST [0,1] | 0.882 | 0.968 (MIPROv2) | 0.98 | Standard |
| NesT-FuL | PMacc [0,1] | 0.375 | 0.429 (TextGrad) | 0.469 | Standard |
| IFBench | Acc [0,1] | 0.498 | 0.509 (GEPA) | 0.53 | Standard |
| IFEval | Strict [0,1] | 0.876 | 0.886 (GPO) | 0.892 | Standard |
| MathVista | Acc [0,1] | 0.433 | 0.606 (GPO) | 0.644 | Edit |
| ChartQA | Relaxed Acc [0,1] | 0.279 | 0.828 (ProTeGi) | 0.834 | Edit |
| AI2D | Acc [0,1] | 0.834 | 0.824 (MIPROv2) | 0.868 | Edit |
| DeFactify | Acc [0,1] | 0.835 | 0.922 (MIPROv2) | 0.938 | Edit |
| Average | | 0.595 | 0.765 | 0.792 | |

The figure below shows convergence through iterations on two representative benchmarks. Promptimus edit mode reaches 90% of its final development score in a median of about 300 metric calls, faster than all baselines. Both modes typically plateau within eight iterations, with the bulk of improvement concentrated in the first three to five iterations.

Importantly, dev set gains transfer to the held-out test set. Sometimes baselines match or even exceed Promptimus on dev but fall behind on test, indicating overfitting. We attribute this to edit mode's surgical modifications, which preserve generalizable prompt structure, and metric probing, which produces failure signals that transfer across examples, as opposed to memorization of dev-set patterns.
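The checkpoint idea behind metric probing can be illustrated with a small sketch. This is a hand-written stand-in for what the metric analyzer is described as generating automatically: the three checkpoint functions, the `label` field, and the allowed enum codes are all hypothetical. The checks are cumulative (format, then schema, then values), mirroring the JSON example from step 1, and the bottleneck is the checkpoint with the lowest pass rate.

```python
import json

ALLOWED_LABELS = {"A1", "B2", "C3"}  # hypothetical enum codes


def parses_as_json(output: str) -> bool:
    """Checkpoint 1: the output is syntactically valid JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def has_valid_schema(output: str) -> bool:
    """Checkpoint 2: the output is a JSON object with a 'label' field."""
    if not parses_as_json(output):
        return False
    obj = json.loads(output)
    return isinstance(obj, dict) and "label" in obj


def has_valid_value(output: str) -> bool:
    """Checkpoint 3: the 'label' field holds one of the allowed enum codes."""
    if not has_valid_schema(output):
        return False
    label = json.loads(output)["label"]
    return isinstance(label, str) and label in ALLOWED_LABELS


CHECKPOINTS = [
    ("json_format", parses_as_json),
    ("schema", has_valid_schema),
    ("values", has_valid_value),
]


def pass_rates(outputs: list[str]) -> dict[str, float]:
    """Fraction of outputs passing each checkpoint."""
    return {name: sum(check(o) for o in outputs) / len(outputs)
            for name, check in CHECKPOINTS}


def bottleneck(rates: dict[str, float]) -> str:
    """The checkpoint with the lowest pass rate."""
    return min(rates, key=rates.get)


if __name__ == "__main__":
    outputs = ['{"label": "A1"}', '{"label": "positive"}', '{"name": 1}', "not json"]
    rates = pass_rates(outputs)
    print(rates, bottleneck(rates))
```

A single scalar score would only report that one output in four is fully correct; the per-checkpoint rates show where outputs start to fail, which is the kind of signal a feedback generator can turn into a targeted suggestion.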
We also evaluated Promptimus across multiple LLMs using a public benchmark and Amazon enterprise use cases, spanning the tasks of classification, text-to-SQL, math reasoning, coding, multimodal understanding, and complex API generation on seven target models. Promptimus improved baseline prompts on all nine tasks, with gains ranging from 3.18% to 90.27%. Dev sets ranged from 30 to 160 examples, with the majority of tasks using fewer than 100, demonstrating the system's sample efficiency. The results also highlight model-agnostic generalizability: the same optimization framework produced meaningful gains across both proprietary and open-source target models without task-specific engineering.

| Task | Target LLM | Performance metric | Dev set size | No optimization | Optimized |
|---|---|---|---|---|---|
| Complex API call generation | GPT-OSS-120B | API Acc (user-defined) [0,1] | 43 | 0.45 | 0.86 |
| Classification_A | Nova Pro | F1 score and FPR score [0,1] | 210 | 0.64 | 0.78 |
| Multimodal classification_B | Haiku-4.5 | Accuracy [0,1] | 160 | 0.51 | 0.76 |
| Classification_C | Nova Lite | Accuracy [0,1] | 85 | 0.56 | 0.58 |
| Text2sql_A | Nova-Micro | Execution Accuracy [0,1] | 50 | 0.72 | 0.83 |
| Math reasoning_A | Qwen3-235B (non-reasoning) | Accuracy (user-defined) [0,1] | 30 | 0.47 | 0.50 |
| Math reasoning_B | Claude-4.5-Opus (non-reasoning) | Accuracy (user-defined) [0,1] | 30 | 0.60 | 0.73 |
| Coding_A | GPT-OSS-120B | Pass@1 [0,1] | 100 | 0.26 | 0.33 |
| Coding_B | GPT-OSS-120B | Pass@1 [0,1] | 31 | 0.56 | 0.64 |

Following are examples of how Promptimus improved already fine-grained prompts to further drive application performance for a variety of use cases.

Example 1: CodeForces (coding benchmark designed to evaluate LLM reasoning)

In this use case, an LLM generates a Python function based on a user-provided problem description. We used 50 dev samples (sampled from the original dev set) and 148 test samples with a user-defined scoring approach. The Promptimus (edit mode) optimization converged in five iterations.

Original vs.
optimized prompt (deletions are prefixed with -, additions with +)

-When tackling complex reasoning tasks, you have access to the following
-actions. Use them as needed to progress through your thought process.
-[ASSESS]
-[ADVANCE]
-[VERIFY]
-[SIMPLIFY]
-[SYNTHESIZE]
-[PIVOT]
-[OUTPUT]
-You should strictly follow the format below:
-[ACTION NAME]
-# Your action step 1
-# Your action step 2
-...
-Next action: [NEXT ACTION NAME]
+You are an expert competitive programmer. Solve the given programming
+problem in Python using the strict 2-phase reasoning structure defined below.
+ ## ABSOLUTE RULE – ONE [OUTPUT] BLOCK ONLY – ZERO EXCEPTIONS
+ The first [OUTPUT] block encountered is the ONLY one evaluated. A second [OUTPUT] block causes
+ immediate evaluation failure and a score of 0.
+ ## CRITICAL CONSTRAINTS
+ Standard Library Only – Use ONLY Python standard library modules. No exceptions.
+ Forbidden: sortedcontainers, numpy, scipy, pandas. Allowed: bisect, heapq, collections, math,
+ itertools, functools, sys.
+ If you need a sorted structure: implement using bisect + a plain list.
+ Sorting Pitfall Warning:
+ Never use sort(reverse=True) when the secondary sort direction differs from the primary.
+ Descending by key A, ascending by key B: items.sort(key=lambda x: (-x[0], x[1]))
+ I/O Consistency Rule:
+ Use exactly ONE I/O method throughout – no mixing.
+ Strategy A: input = sys.stdin.readline at top, then use input() everywhere.
+ Strategy B: use sys.stdin.readline() directly everywhere.
+ Variable Initialization Rule:
+ Declare all variables that are conditionally assigned BEFORE their conditional block.
+ ## STRICT 2-PHASE STRUCTURE
+ ### PHASE 1 – [ASSESS] (ONE block only)
+ 5 mandatory gates (G1–G5). Each gate requires a one-line YES/NO
+ justification.
+ G1 – Brute force feasible? Is O(n^2) within time constraints?
+ G2 – All variables initialized before conditional use?
+ G3 – I/O strategy chosen and consistent? Declare exactly one strategy.
+ G4 – Demo output reproducible by hand? Perform explicit dry run on demo input.
+ G5 – Any mutable structure modified during iteration? Confirm index recomputation.
+ End with: Chosen approach: [algorithm name], O([complexity]) – Tier [1/2/3]
+ Tier 1 = Brute-force correct, Tier 2 = Optimized correct, Tier 3 = Optimal.
+ Fallback Rule: If you cannot confidently implement Tier 2+, commit to Tier 1. A slow, correct
+ solution scores higher than a fast, broken one.
+ ### PHASE 2 – [OUTPUT] (ONE block only, immediately after ASSESS)
+ First line inside [OUTPUT] must declare I/O strategy as a comment.
+ Produce the complete Python solution. No other action types permitted.
+ ## CRITICAL OUTPUT RULES
+ 1. Exactly ONE [OUTPUT] block. Fix mistakes inline – never open a second.
+ 2. Inside [OUTPUT], the ONLY content is the fenced Python code block.
+ 3. Reasoning word budget: entire [ASSESS] block must not exceed 250 words.
+ 4. No trailing empty lines in output.
+ 5. Never end your response with only reasoning – even brute-force is acceptable over no solution.
+ 6. Never output -1 or “no solution” if the problem guarantees a solution always exists.
+ [. . . mandatory code scaffold template with I/O strategy declaration, imports, solve() structure, sorting/mutation reminders, output
+ formatting rules . . . ]
Title: {problem_title}
Time Limit: {time_limit}
Memory Limit: {memory_limit}
Problem Description: {problem_description}
Output Specification: {output_specification}
Demo Input: {demo_input}
Demo Output: {demo_output}
Note: {demo_note}
-Write Python code to solve the problem. Present the code in ```python ... ``` at the end.
+Solve the problem using the 2-phase structure: [ASSESS] block (5 mandatory gates G1–G5, ≤250 words),
+then [OUTPUT] block (fenced Python solution)

Example 2: Multimodal AI agent

This AI agent helps Amazon detect construction defects. The original and optimized prompts are shown below.
We used the vision-language model qwen3-vl-235b-a22b on Amazon Bedrock to examine the images taken by inspectors and identify construction defect categories and risk levels. The optimization process ran for three iterations with 16 dev samples. The recommendations generated by the metric analyzer and instruction optimizer in Promptimus (including providing a role, a task objective, defect categories with examples, a category disambiguation section, analysis instructions with a decision tree, output format requirements, and critical output requirements) improved the image classification accuracy from 0.438 to 0.812. When we applied the optimized prompt to the test sample set (17 samples), accuracy improved from 0.471 to 0.529.

Example 3: Defactify (multimodal fact verification)

This is a comprehensive framework for evaluating an LLM’s ability to perform multimodal fact verification, detect misinformation, and identify AI-generated content. The Promptimus metric analyzer found that the model defaults to “Real” for photorealistic AI-generated images. The optimizer introduces an adversarial dual-hypothesis framework with asymmetric weighting that biases the model toward “AI-generated”. For example, with the original prompt, the model dismisses a clock with garbled numbers as an “artistic design choice” and is fooled by photorealistic textures. After optimization, by contrast, the adversarial dual-hypothesis protocol forces systematic signal enumeration, catching the garbled clock numerals that the baseline dismissed.

Conclusion and future work

Compared to other metric-driven prompt optimization approaches, Promptimus excels at preventing exploitation through targeted and exploitation-focused refinements. It is fully generalizable, adaptive to user-defined metric functions and task domains without manual engineering.
The dense feedback loop drives automatic analysis of metric-function code, identifies debugging checkpoints, and generates adaptive, task-aware exploration strategies that target the specific failure modes of each prompt-and-task combination. In particular, our approach is sample efficient, requiring only a small number of dev examples (typically 20–50) to drive significant improvements, which makes it a good fit for enterprise scenarios where labeled data is scarce or expensive to obtain. Furthermore, its model-agnostic design enables it to rapidly adapt prompts to target models for seamless enterprise-level model migration. We are making this innovation available through Amazon Bedrock to enable model migration for enterprise generative-AI applications with zero manual engineering and minimal labeled datasets.

Navigating uncertainty in Amazon's middle-mile network

Wed, 06 May 2026 13:37:38 GMT

Before the "last mile" delivery driver sets off for your home, your Amazon item has moved through the middle-mile network of fulfillment centers and sort centers, which brings products close enough to customers to make our same-day or next-day shipping promises possible. For years, Amazon engineers and scientists have been pushing computational boundaries to optimize this network under uncertainty, and that push has accelerated as the network has grown more complex. What happens when a huge snowstorm closes major highways, a sort center is hit by a power outage, or demand for a viral product spikes? These headline disruptions get attention because they're obvious system shocks that vividly illustrate the challenge of planning for uncertainty. But the most important sources of uncertainty are far more subtle: the day-to-day variations in demand and travel times that, if you don't look closely enough, erode efficiency across the entire network. We've found that even when we consider just demand variability, optimizing for uncertainty promises potential savings of 0.5%. This is a small percentage, but we obsess over small percentages because real customer experiences lie behind them. And demand variability is just one piece of a puzzle that includes road delays, processing time fluctuations, and countless other microvariations. Months before a customer clicks "Buy Now", Amazon's logistics experts consider a multitude of middle-mile routing questions: What routes should trucks take between warehouses? When should shipments depart? Where should inventory be positioned to meet customer demand? The proactive shaping of the network's structure and timing is called network design. Our challenge is not to optimize for perfect conditions but rather to build plans that remain effective even when things don't go as expected. 
A computational puzzle of staggering complexity

Even if we could count on perfect conditions, optimizing the middle-mile network is challenging because it requires coordinating tens of millions of different products moving through hundreds of facilities, each with limited capacity and specific operating hours.

A key difficulty is the mix of optimization decisions involved. Some are a matter of degree (what volume of packages to send down a particular route). Others are binary (open this shipping lane or not; depart now or wait for more cargo). Put them together, and you get what’s called a mixed-integer optimization problem, a kind of problem where the number of candidate solutions explodes combinatorially, straining both computation time and memory.

Consider that with only 300 yes-or-no decisions, there are already more possible combinations (2^300, or roughly 2 × 10^90) than atoms in the observable universe (commonly estimated at around 10^80). Amazon's network involves millions of such decisions, compounded by delivery windows that restrict when shipments can arrive or depart. State-of-the-art optimization software struggles to solve this problem, even with “perfect information”. In the real world, information is far from perfect, and a plan that looks optimal on paper can unravel when conditions change.

The challenge of handling uncertainty

Uncertainty shows up in two different ways. First, there are the day-to-day fluctuations in variables like demand or travel times. Second, there are the structural shocks: a weather-driven road closure or unexpected facility shutdown.

In academic work, a common strategy is to model many scenarios the network might face and then "robustify" the solution so that it performs well across them. But at Amazon's scale, this approach founders. There is always a staggering number of things that can go wrong, and trying to robustify against each of them individually is a hopeless task. Instead of chasing an impossible guarantee, we shift to a more practical goal: optionality.
Our aim is to design a system with enough alternative routes and workable options that day-to-day fluctuations and shocks trigger effective adaptation rather than crises. In practice, our sought-after flexibility requires designing candidate networks with built-in options and stress-testing those designs against many plausible futures. That’s where Amazon’s in-house computational tools come in.

Making network design tractable

Amazon’s network design tool makes the middle-mile-network problem solvable at scale. It starts with a simple insight: not every possible route is worth considering. If you were planning a road trip, you would naturally focus on a handful of sensible routes. The tool applies this principle by identifying possible “consolidation points”, such as sort centers where packages from multiple origins can share trucks to common destinations, and then finding efficient routes that use them.

We must also respect the clock, because Amazon facilities run on precise operational schedules. For example, a sort center might accept inbound shipments from 2:00 a.m. to 6:00 a.m. and dispatch outbound trucks from 8:00 a.m. to 12:00 noon. Ideally, planners would model these schedules at fine resolution (say, 15-minute intervals), but this creates another explosion of possibilities. On the other hand, a coarse resolution of, say, 24-hour intervals would make for fast but useless planning: packages would arrive after a facility has closed for the night, and trucks would be scheduled to depart before loading their cargo.

Amazon planners overcame this stubborn problem while still supporting operational reality. The optimization approach solves at a fairly coarse time resolution, but for each candidate route, it includes precomputed “timing bounds” (the latest feasible truck departure and earliest feasible arrival) with 15-minute precision. That way, when the tool chooses routes, it's choosing those that will work on real-world schedules.
Risk-aware network adjustments at scale

Even with these algorithmic advances, the solution to a single deterministic planning problem of Amazon’s scale can take hours to compute because of the difficulties of parallelizing the underlying algorithm. Adding uncertainty compounds the challenge.

One naïve way to account for uncertainty on the middle-mile network would be to simplify the problem by assuming that more packages flow between locations that are large and close together. But the middle mile isn’t a set of independent pipes. Product flows interact. A spike in demand at one fulfillment center affects nearby facilities in particular ways; a new delivery station changes a region’s patterns.

To better capture those complex dependencies, we developed an approach enabling risk-aware network design via Monte Carlo methods. Amazon's risk-aware network-design models start by creating many variations of synthetic origin-destination flow data to represent both day-to-day fluctuations in demand and larger structural shocks.

One critical component of the models is a graph attention network model that represents the middle-mile network as two interconnected graphs. The first is a site graph whose nodes represent fulfillment centers and delivery stations, with the edges representing both shipping routes and geographic proximity. This allows the model to learn spatial patterns, such as higher demand around dense population centers.

The second graph works at a higher level: each node represents a specific origin-destination pair. This structure lets us see correlations too subtle for the site graph to capture. It is like understanding traffic patterns: knowing that two highways are close (site graph) doesn't tell you whether they compete for the same commuters (origin-destination graph). To illustrate, consider two nearby fulfillment centers in northern Connecticut, both serving New York City.
A model using only the site graph might estimate that each facility sends 8,000 packages to NYC, when in reality the volumes are much lower because the two facilities share that demand. The site graph understands that the fulfillment centers are proximal, but it doesn't fully capture that their flows to NYC are interdependent. The origin-destination graph solves this by representing each facility-to-destination pair as its own node, allowing the model to learn that when two similar facilities serve the same area, their shipments are interdependent. More broadly, this structure lets the model discover that origin-destination pairs with similar characteristics — such as suburban fulfillment centers delivering to urban areas — may exhibit correlated demand, even when they are far apart. Armed with realistic demand scenarios that respect spatial correlations and understand how network disturbances propagate across space, we can generate candidate network designs that work well under a variety of demand conditions. Keeping delivery promises under uncertainty Because the models train on historical shipping data, we can generate realistic demand scenarios that respect spatial correlations. And crucially, because the tools understand how network disturbances propagate across space, we can produce plausible scenarios the network has never encountered before, such as demand shifts driven by a new facility opening or a major regional weather event. That’s the missing half of the loop: one product designs candidate future networks, while another generates the scenarios to stress-test them. So instead of optimizing a single forecast, Amazon planners can evaluate their network designs across hundreds of plausible scenarios and preserve the options that keep the network flexible in the face of uncertainty. 
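The evaluation loop described above can be sketched in a few lines. This is only an illustration: correlated lognormal shocks stand in for the learned scenario generator, and the two designs, their capacities, costs, and the overflow penalty are invented numbers, not Amazon data.

```python
import numpy as np


def sample_scenarios(base_flow, n_scenarios, rng, corr=0.6, sigma=0.25):
    """Stand-in scenario generator (assumption): correlated lognormal shocks
    around a base forecast, one column per origin-destination pair."""
    n = len(base_flow)
    cov = sigma**2 * (corr * np.ones((n, n)) + (1.0 - corr) * np.eye(n))
    shocks = rng.multivariate_normal(np.zeros(n), cov, size=n_scenarios)
    return np.asarray(base_flow) * np.exp(shocks)


def scenario_costs(capacity, fixed_cost, flows, overflow_penalty=5.0):
    """Cost of one design in every scenario: fixed cost plus a penalty for
    each unit of flow that exceeds the capacity on its lane."""
    overflow = np.clip(flows - np.asarray(capacity), 0.0, None).sum(axis=1)
    return fixed_cost + overflow_penalty * overflow


def stress_test(designs, flows):
    """Return {name: (mean cost, 95th-percentile cost)} across scenarios."""
    summary = {}
    for name, (capacity, fixed_cost) in designs.items():
        costs = scenario_costs(capacity, fixed_cost, flows)
        summary[name] = (float(costs.mean()), float(np.quantile(costs, 0.95)))
    return summary


rng = np.random.default_rng(0)
base_flow = np.array([8000.0, 5000.0, 3000.0])  # hypothetical daily volumes
flows = sample_scenarios(base_flow, n_scenarios=500, rng=rng)
designs = {
    "lean": (base_flow * 1.05, 10_000.0),      # little slack, low fixed cost
    "flexible": (base_flow * 1.40, 14_000.0),  # more slack, higher fixed cost
}
for name, (mean_cost, p95_cost) in stress_test(designs, flows).items():
    print(f"{name}: mean={mean_cost:,.0f}  p95={p95_cost:,.0f}")
```

Comparing the mean with a tail statistic such as the 95th percentile is one simple way to tell apart designs that look cheap on average from designs that hold up under stress.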
Overall, this enables us to distinguish between network designs that appear efficient on average but are fragile under stress and those that may incur slightly higher steady-state costs yet deliver more-stable performance. For customers, this research translates into more-reliable delivery promises, including during peak shopping periods and genuine disruptions. By combining advanced optimization techniques with machine learning, Amazon is building a middle-mile network designed to adapt to the world as it really is. So when a winter storm buries a region under two feet of snow on the same day a new must-have product goes viral, the network can absorb the shock and recover as quickly as conditions allow. But the work of building resilience against uncertainty is not finished. As the network grows, so does our commitment to advancing the computational tools that keep delivery promises reliable, day after day.

How mechanism design theory helps optimize Amazon-vendor collaboration

Tue, 05 May 2026 13:11:23 GMT

When Amazon places a purchase order with a vendor, a deceptively simple question arises: how many units should go to which fulfillment center, and when? Amazon optimizes this decision based on its demand forecasts, inventory positions, and transportation costs. The vendor, meanwhile, has its own production schedules, warehouse locations, and shipping economics. Each side optimizes independently, and the result is often a plan that is suboptimal for both and raises costs for all. This is a classical problem in economics: “coordination under asymmetric information”. Each party holds cost and capacity data the other cannot observe, yet their decisions are deeply intertwined. The theoretical tools for solving such problems have existed for decades, rooted in mechanism design, the branch of economics that asks whether transaction rules can be designed so that self-interested parties nonetheless produce an outcome that is good for everyone. Specifically, solutions to this problem tend to involve the Vickrey-Clarke-Groves (VCG) framework, one of the foundational results in mechanism design. What has been missing is a practical architecture that makes these ideas work at supply chain scale. In new work, my colleagues in Amazon’s Supply Chain Optimization Technologies (SCOT) organization and I show how combining VCG with Amazon's consensus planning protocol (CPP), a distributed, agent-based optimization framework, achieves exactly this. The resulting system, called Flo Pro, was successfully piloted over nine weeks with a prominent consumer-product manufacturer, demonstrating that the theory translates into real cost savings. The coordination gap To understand the opportunity Flo Pro presents, consider what happens today. Amazon issues purchase orders under a just-in-time (JIT) policy: it decides when, where, and how many units it wants, and the vendor decides how much of each order to fulfill. 
Within this sequential, noncooperative process, there’s room for further optimization: A vendor might be able to ship far more cheaply to one fulfillment center than another, but Amazon's JIT orders don't incorporate this information. Conversely, the vendor doesn't know Amazon's downstream demand patterns or outbound transportation costs. This information asymmetry prevents both parties from identifying the most cost-effective solution for all. The potential benefits from collaboration are easy to state in principle. If both parties shared their information, they could compute an optimized supply plan that minimizes total supply-chain cost: inbound and outbound, production and fulfillment. The hard question is how to realize these synergies when neither party wants to fully reveal its proprietary cost structure. From auction theory to supply chain coordination The VCG mechanism, named after William Vickrey, Edward Clarke, and Theodore Groves, achieves two ends simultaneously: social efficiency (the outcome maximizes total welfare) and incentive compatibility (every participant's best strategy is to report truthfully). The classic application is auctions, but the logic is far more general. In our setting, VCG works as follows. Amazon and the vendor each submit their true preferences about supply plans implicitly, through their optimization agents. The mechanism computes the jointly optimal plan, then determines a transfer payment that equals each party's externality on the other. Concretely, the vendor pays Amazon an amount equal to the cost Amazon incurs by deviating from its preferred JIT plan to accommodate the proposed solution, a payment known as a cost-benefit transfer (CBT). Because this payment structure makes truthful reporting a dominant strategy, neither party benefits from misrepresenting its costs, regardless of what the other side does. 
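A toy calculation may help make the transfer concrete. The sketch below assumes a finite set of three candidate plans with invented costs; the plan names, the cost figures, and the `vcg_outcome` helper are all hypothetical and are not taken from Flo Pro.

```python
def vcg_outcome(amazon_cost, vendor_cost):
    """Pick the jointly cheapest plan and the cost-benefit transfer (CBT).

    Both arguments map plan name -> cost for that party (hypothetical inputs).
    """
    plans = list(amazon_cost)
    # Socially efficient plan: lowest combined cost.
    efficient = min(plans, key=lambda p: amazon_cost[p] + vendor_cost[p])
    # Amazon's preferred (JIT) plan: lowest cost to Amazon alone.
    jit = min(plans, key=amazon_cost.get)
    # CBT: the extra cost Amazon bears by moving from JIT to the efficient plan.
    cbt = amazon_cost[efficient] - amazon_cost[jit]
    return efficient, jit, cbt


amazon_cost = {"jit": 100.0, "east_heavy": 112.0, "west_heavy": 125.0}
vendor_cost = {"jit": 90.0, "east_heavy": 60.0, "west_heavy": 55.0}

efficient, jit, cbt = vcg_outcome(amazon_cost, vendor_cost)
print(efficient, cbt)  # east_heavy 12.0

# What each plan would cost the vendor once that plan's transfer is added.
vendor_net = {p: vendor_cost[p] + amazon_cost[p] - amazon_cost[jit] for p in amazon_cost}
print(vendor_net)  # {'jit': 90.0, 'east_heavy': 72.0, 'west_heavy': 80.0}
```

In this example Amazon ends up at its JIT cost (112 minus the 12 it receives), while the vendor's net cost falls from 90 to 72. Because the vendor's net cost differs from the total cost only by a constant, the plan that is cheapest for the vendor is also the jointly efficient one.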
CPP as the computational backbone The theoretical elegance of VCG faces a well-known practical barrier: it requires agents to submit their complete utility functions. In high-dimensional supply-chain problems with dozens of fulfillment centers, multiple products, and rolling weekly horizons, this is infeasible. This is where CPP comes in. CPP is a distributed optimization protocol based on the alternating-direction method of multipliers (ADMM). Rather than asking agents to reveal everything at once, CPP works iteratively. A central coordinator proposes a consensus plan and a set of prices. Each agent responds with its preferred plan given those prices — a "best response" that requires solving only the agent's own local optimization problem. The coordinator then updates the proposal, and the process repeats until convergence. The connection between CPP and VCG is natural and deep. CPP submissions from a truthful agent — one that honestly optimizes in response to each query — are equivalent to the submission of a truthful utility report in the direct VCG mechanism. The CPP iterations serve as the computational engine that finds the socially efficient plan. A second CPP run with one agent removed computes the cost to the remaining agent of its preferred plan; this provides the counterfactual needed to calculate the VCG transfer. The outcome is identical to that of the direct mechanism, but the information requirements are radically lighter: the vendor never reveals its cost structure, only its responses to iterative queries. This property — what we might call “information privacy” — is practically important. Vendors are understandably reluctant to disclose their production costs and capacity constraints. With CPP-based VCG, they don't have to. Their agents communicate preferences implicitly, through their optimization behaviors, and the mechanism extracts only the information needed to compute the efficient plan and the associated transfer payment. 
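The propose-and-respond loop can be illustrated with textbook consensus ADMM. The sketch below is not the CPP implementation: it assumes each agent has a simple quadratic cost around a preferred plan (so the best response has a closed form), and the weights, targets, `best_response`, and `consensus_plan` are hypothetical names and values chosen for illustration.

```python
import numpy as np


def best_response(weight, target, proposal, price, rho):
    """Agent's local problem, solved in closed form for a quadratic cost:
    argmin_x 0.5*weight*||x - target||^2 + price.(x - proposal) + (rho/2)*||x - proposal||^2
    """
    return (weight * target - price + rho * proposal) / (weight + rho)


def consensus_plan(agents, rho=1.0, iters=200):
    """agents: list of (weight, target) pairs. Returns the consensus plan."""
    dim = len(agents[0][1])
    proposal = np.zeros(dim)                   # coordinator's consensus plan
    prices = [np.zeros(dim) for _ in agents]   # one price vector per agent
    for _ in range(iters):
        # Each agent answers the current proposal and prices with its preferred plan.
        responses = [
            best_response(w, t, proposal, y, rho) for (w, t), y in zip(agents, prices)
        ]
        # The coordinator averages the responses (shifted by scaled prices) ...
        proposal = np.mean([x + y / rho for x, y in zip(responses, prices)], axis=0)
        # ... and raises the price for any agent whose response exceeds the proposal.
        prices = [y + rho * (x - proposal) for x, y in zip(responses, prices)]
    return proposal


# Two agents planning units for two fulfillment centers (hypothetical numbers).
amazon = (1.0, np.array([100.0, 40.0]))
vendor = (3.0, np.array([60.0, 80.0]))
plan = consensus_plan([amazon, vendor])
print(np.round(plan, 3))  # approximately [70. 70.]
```

The coordinator only ever sees the agents' responses, never the weights or targets themselves, which mirrors the information-privacy property described above. For these quadratic costs the loop converges to the weighted average of the two targets, which is the plan that minimizes the summed cost.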
From theory to a rolling horizon Real supply chains don't stand still. Demand forecasts shift, supply conditions change, and plans must be updated continuously. We extend the static VCG framework to a dynamic, rolling-horizon setting inspired by the dynamic pivot mechanism that my colleague Juuso Välimäki and I described in 2010 and 2019. Each week, Amazon and the vendor plan a six-week forward-looking horizon. The mechanism issues a purchase order for the current week, computes what the resulting JIT policy would look like going forward, and determines the CBT. The CBT has an intuitive interpretation: it is the immediate cost of the current-week deviation plus the certainty-equivalent cost of that deviation on future periods, that is, the cost that Amazon is willing to pay today to avoid an uncertain cost in the future. Whenever the proposed plan departs from JIT, the vendor compensates Amazon for the additional cost incurred, ensuring that neither Amazon nor the vendor is ever worse off from participating while the overall supply chain cost structure improves. This one-directional payment structure keeps the mechanism simple and robust, though extending it to two-way transfers, where Amazon might also compensate the vendor, remains an open design challenge. A menu of contracts as an alternative For settings where the decision space is lower-dimensional, we also develop a “menu-of-contracts” approach. Here, Amazon computes a set of candidate supply plans and their associated prices, each price equal to the cost Amazon incurs from that plan, and offers the full menu to the vendor. The vendor simply picks the option that maximizes its own utility. By the logic of VCG, the vendor's best strategy is to choose the socially efficient plan. This approach has the advantage of transparency: the vendor sees concrete options and prices, rather than participating in an iterative optimization. 
It also opens the door to incorporating partial information: if a vendor indicates a general preference (say, for shipping more to one region), Amazon can tailor the menu accordingly. In our numerical example, the menu approach recovers the first-best outcome, with the vendor choosing the globally optimal plan from among the offered alternatives. Looking ahead The CPP-VCG framework is, at its core, a general-purpose tool for achieving consistent outcomes when each party acts on information unavailable to the other. The supply chain application we describe here is a natural first use case, but the underlying logic extends well beyond it. Vendor negotiations, Fulfillment-by-Amazon seller collaboration, and multiparty logistics planning all involve the same fundamental structure: interdependent decisions, private costs, and benefits that can be unlocked only through carefully designed incentive-compatible mechanisms. Several open questions remain. How should the mechanism handle the situation in which the vendor faces supply shortages and cannot deliver the agreed-upon quantities? Can mutual commitment structures — where both parties share obligations — improve outcomes when forecasts are highly uncertain? These are questions that sit at the intersection of economic theory and large-scale systems engineering, precisely the space where mechanism design has the most to offer. With Flo Pro, we have taken a first step toward making that potential concrete.

Building trust into AI

Mon, 04 May 2026 15:07:58 GMT

At Amazon, AI now touches everything from warehouse logistics to customer service chatbots to AWS cloud services used by thousands of enterprises, making it a business-critical technology. It’s therefore imperative that the models Amazon develops and deploys are as safe, fair, and robust as possible: responsible AI (RAI) is not an optional add-on. As Rahul Gupta, senior science manager and RAI lead for Amazon’s Artificial General Intelligence (AGI) organization, puts it, “Responsibility is baked into the product design from day one.” Amazon’s commitment to safety and responsibility goes back long before the generative-AI boom. Gupta and researchers on his team worked in the Alexa AI organization, where the company “developed some muscle on defining how RAI should be done.” The focus, he recalls, was on developing policies and implementations as well as methods to evaluate their effectiveness. As Amazon began building its own large models, the RAI expertise from Alexa proved a valuable resource. In concert with Amazon’s policy team, AGI scientists have built an RAI pipeline that addresses four phases of model development: pretraining, post-training, evaluation, and third-party monitoring. At each stage, researchers grapple with distinct challenges to ensure that trustworthy systems can adapt, at scale, across situations, applications, and geographies. From this framework, Amazon has built over 70 internal and external RAI tools, funded or published more than 500 research papers, and delivered tens of thousands of hours of RAI-focused training to its employees. Amazon has a three-pronged approach to RAI: anticipate risks before they materialize, teach models to navigate ambiguity, and build systems that can adapt — to government transitions, high-profile incidents, new regulations, and other social changes. 
Below are some of the scientists across Amazon’s responsible-AI and policy teams who put this approach into practice — each tackling a different phase of the AI lifecycle. Teaching foundations: Pretraining Chentao Ye is a senior applied scientist on the AGI RAI team, working on pretraining, the earliest stage of LLM training, where the model develops general linguistic competences. It’s become increasingly critical to address RAI at this stage, says Ye, to ensure that the model has the information necessary to adapt to policies established by Amazon’s policy team. “Pretraining is the stage where we teach our most fundamental concepts of RAI,” Ye says. “It’s like teaching a child about the world before we expect them to make some decisions.” Pretraining typically involves large volumes of public data, but the RAI team augments that data with datasets specifically designed to instill principles of safety, security, and fairness. Those datasets are vast and diverse — a “rich diet” of content including internal and public RAI guidance, best practices, RAI-related news and incidents, information about domains such as chemical and nuclear engineering and coding security, text, audio, and images. Also included in the corpus is information in different languages and from different cultures, to ensure the model is global and multilingual. To help the model better incorporate this array of information, researchers create training tasks, also known as learning exercises, for it. “Having this data isn't enough. We need to help the model process and understand it effectively,” Ye says. For instance, Ye and his colleagues might take a policy document about privacy and convert it into multiple learning exercises: explaining privacy concepts, answering questions about compliance, and determining whether certain actions would violate privacy guidelines. These varied tasks help the model develop a deeper, more nuanced understanding of RAI principles. 
Another active area of research is how to handle potentially harmful content in the training corpus. “It's not simply about filtering everything out,” Ye explains. “If a model has never encountered certain harmful concepts during pretraining, it won't recognize them as sensitive, making post-training guardrails less effective.” The team is exploring approaches that add educational context to certain filtered content before reintroducing it — teaching the model what harm looks like and why it should be avoided, rather than leaving it entirely unaware. In addition to RAI acquisition, another area of focus is what’s called RAI modality alignment. LLMs need to understand how to apply RAI principles across all the modalities they encounter. Modality alignment maps other modalities into a semantic space they share with text, which is often more readily available, Ye explains. For example, a college textbook might include figures of high-risk chemical, biological, radiological, and nuclear materials (CBRN) and text descriptions of the same concepts. The team designs a range of LLM tasks that effectively encode the data into the same space. One active research area is developing a variety of techniques to test for pretraining quality, says Ye. The team is taking two complementary approaches. The first tests whether the model has actually acquired RAI knowledge during pretraining. “We use metrics like perplexity” — which quantifies how well a probability distribution predicts a given sample — “to measure how well the model can generate content in specific RAI domains,” Ye explains. The second approach tests the way that the model responds to sparse questions that might appear in later testing exercises, where the expected responses — like refusals or deflections — weren't explicitly taught during pretraining. 
“This helps us test whether the RAI knowledge it gained during pretraining enables it to generalize to real-world scenarios with just limited examples or instructions,” Ye says. Post-training: Reinforcement learning from human feedback Once models learn to follow instructions and produce both helpful and harmless responses, they advance to reinforcement learning from human feedback (RLHF). Senior applied scientist Charith Peris, who leads this phase of model development, and applied scientist Yao Ma explain that RLHF focuses on using feedback from or preference comparison with humans to give models a sense of judgement. “RLHF is done to make sure the foundation model aligns with the behavior expected by humans,” says Peris. This stage of training provides the model with a reward based on how well its response to a query meets a predetermined criterion. The rewards are provided by various response verification systems. One approach uses so-called auxiliary-reward models, which are trained on outputs that humans have ranked. For responsible AI, this stage offers the ability to optimize the model to generate responses that are “policy adherent,” hewing to the rules and guidelines devised by Amazon’s policy team. “Providing the right rewards is a critical part of RLHF,” says Ma. In one case, the core model itself is used to generate multiple responses to a range of unsafe and borderline safe queries. These responses are ranked and rated by humans based on their helpfulness and policy adherence and then used to train auxiliary-reward models. Another response verification approach uses an independent LLM as a judge. The model generates a response for each prompt in the training set, and this response, together with a set of rubrics about what makes a response policy adherent, is passed to the judge. The judge is then instructed to provide a score based on how well the response aligns with the rubrics. 
Both the auxiliary-reward models and the judge-based systems can be used individually or in combination to provide RLHF rewards. The model is evaluated in two phases: during and after training. In the first phase, the model is tested at frequent, short intervals using lightweight benchmarks that provide directional signals on performance across critical capabilities. In the second phase, saved checkpoints, each a complete snapshot of the model's state and parameters at a given point in training, are systematically evaluated against a broader set of test data to identify which checkpoint achieved the best overall performance. Behavior in check: Evaluations A major focus of the evaluations team is to build model-breaking datasets: robust collections of prompts that trigger inappropriate, unsafe, or policy-violating responses. “We know models are improving month over month,” says Jwala Dhamala, a senior scientist with Amazon AGI. Bigger, better responsible-AI datasets are playing a large part in this, she says, as well as improved mechanisms to capture how well the models incorporate responsible-AI principles spanning multiple modalities and regions. Working closely with Amazon’s policy team, Dhamala says, is key to developing evaluations for RAI. Amazon’s RAI work has eight pillars: privacy and security; safety; fairness; veracity and robustness; explainability; controllability; governance; and transparency. "For each pillar, we focus on tests that could lead the model to output something that violates responsible-AI policies. Simultaneously, we focus on testing if a model is refusing excessively or refusing to respond to benign requests," Dhamala explains. The data comes from everywhere: human experts known as red teamers who try to break models, external security partners, public benchmarks from universities, even social media where real-world problems surface organically. 
The RAI team evaluates models throughout the model-training and deployment cycle, Dhamala explains, from pretraining to post-training and predeployment, when all scaffolding is attached. Each stage has its own specially designed evaluation processes, and more testing happens in the later stages, when the model is closer to end users. “We collect datasets, evaluate, then collect new datasets, evaluate again,” Dhamala says. She adds that the team is currently working to automate more of the evaluation process. It’s also pushing into newer areas of research. Deception in conversations that require many back-and-forth interactions over weeks or months (also called long-horizon interactions) is emerging as a concern, but there aren't many established benchmarks for detecting it. Creating them requires an understanding of what deception means across different long-horizon contexts, an understanding grounded in social-science research. Another open area of research is an automatic red-teaming framework to evaluate emerging responsible-AI risks. The idea is that an autonomous agent or a system of agents would compete or collaborate in attempts to provoke undesired behaviors. Third-party collaborations: Frontier risks While most RAI work addresses common misuse patterns, Tong Wang, a senior applied scientist with AGI, focuses on a different category of risk: frontier risks, or “systemic risks that could take down entire systems.” These include the use of AI models to research CBRN (chemical, biological, radiological, and nuclear) attacks and to research or launch cyberattacks. These are scenarios where AI capabilities could enable nonexperts to cause catastrophic harm. The evaluation process for frontier risks is exacting. First, automated benchmarks test whether the model has acquired dangerous knowledge. If it passes certain thresholds (answering questions about weapons of mass destruction with concerning accuracy), that triggers human review. 
Third-party experts in relevant domains evaluate whether the model has crossed safety boundaries. And the process is ongoing: with each model update, the team compares the new model’s capabilities against those of earlier models. "We have to be very careful,” Wang says. “False positives and false negatives both have costs." With public models, identified risks are mitigated by guardrails: when a person asks about a particular topic at a particular level of specificity, the model simply won’t respond. But legitimate researchers — scientists at universities and labs with relevant expertise and appropriate oversight — may need access to restricted information for their work. Wang’s team is exploring mechanisms to provide “specialized access with heavy monitoring” for these trusted users. Those mechanisms involve what Wang calls “configurability”, using techniques like low-rank adaptors (LoRA) to make surgical changes to a model's behavior for specific use cases, without retraining the entire model. "We add configuration on top that doesn't touch the base model itself," he says. "You're not retraining a billion parameters, just a few.” Today, this approach is already in use for certain content policies. But extending it to frontier risks like CBRN is a harder problem; both the data collection and computational costs are significantly higher. "It's an open research area, studying which approaches work best," Wang notes. Agreed-upon values: Writing the policies "We partner with the Amazon science team throughout the entire model development lifecycle," explains Claire O'Brien Rajkumar, leader of the responsible-AI policy and product team. The process starts with understanding what a product team wants to launch — whether it's an image generation model or a large language model — and mapping potential harms against Amazon's eight core dimensions of responsible AI. 
Before building an image generator, for instance, the team might anticipate risks such as deepfakes, bias amplification (for instance, an image depicting doctors only as white males), or attempts to generate disturbing content. Identified risks are translated into specific policies that define behavioral boundaries for the model under development. These policies become "backward-working guidelines," O’Brien Rajkumar says, that inform every subsequent decision during model building. For instance, rather than sourcing images from a single vendor that might show only white male doctors, the team ensures diverse data collection that reflects the complexity of the real world. Amazon’s policies are informed by factors including industry trends, customer requests, regulations, and legal requirements (particularly around copyright and content licensing). The team actively participates in industry groups like the Frontier Model Forum and Partnership on AI, collaborating with competitors to establish best practices in an under-regulated space. Academic partnerships help identify emerging risks through the development of benchmarks as well as engagements such as the Trusted AI track of the Amazon Nova AI Challenge, where university students compete to identify safety vulnerabilities in Nova models and the associated fixes. Customer feedback shapes practical policy decisions, such as carving out exceptions for legitimate use cases such as LLM-based security testing, even when the general policy prohibits malware generation. The policy team operates through cross-functional working groups that include legal, public-policy, product, security, and RAI experts. Regulatory developments like the EU AI Act and California's AI Transparency Act directly influence policy evolution. "These are living, breathing things," O'Brien Rajkumar notes, acknowledging that policies must adapt as society becomes more comfortable or less comfortable with certain AI risks. 
Beyond policy development and specific responsible-product guidelines, the team manages the implementation of AI safeguards and oversees red-teaming operations using both in-house experts and third-party vendors. It also conducts manual reviews of model outputs to assess real-world risk. “These are high-judgement decisions, working on the boundaries of what violates policy or not,” says O’Brien Rajkumar. “We have to really understand what each policy means in practice.”

Preserving the privacy of AI training data

Wed, 29 Apr 2026 17:59:07 GMT

Large language models, the highest-profile machine learning (ML) models used today, are trained on huge corpora of public data. But many ML models are trained on smaller, proprietary datasets, which can be highly sensitive and should be kept private. Examples include a hospital fine-tuning a diagnostic model on patient radiology scans, a bank training a fraud detector on transaction histories, or a pharmaceutical company building a drug interaction model from clinical trial records. In each case, the training data itself is the asset that must be protected, but a well-constructed attack on these models can potentially extract information about their underlying training data. Such attacks are possible even when the attacker is restricted to submitting adversarial inference queries to a model trained by a single data owner. Alternatively, when multiple data owners collaborate to train a model through federated learning (FL), in which a central server produces a global model by aggregating model updates generated from siloed datasets (instead of collocating the raw data), there exist attacks in which an adversarial server can reconstruct training data from the model updates. Consider three hospitals collaborating to train a shared cancer-screening model without pooling patient records. If the aggregation server can reconstruct one hospital's training images, then the privacy promise of federated learning is broken, and so is each hospital's compliance with patient consent agreements. Finally, an adversarial FL participant could even potentially reconstruct an honest participant's private training data from the global model. These risks are not hypothetical. A 2023 paper from Google DeepMind demonstrated that GPT-3.5-turbo could be prompted to regurgitate verbatim training data, including personally identifiable information. Smaller, domain-specific models trained on concentrated, sensitive datasets are even more vulnerable. 
As organizations increasingly train models on sensitive financial records, patient health data, and proprietary business intelligence, the attack surface grows proportionally. A successful attack against a healthcare model could reveal whether a specific patient's records were used in training, a violation of regulations such as the US Health Insurance Portability and Accountability Act (HIPAA) and the EU's General Data Protection Regulation (GDPR). An attack against a federated-learning system could reconstruct raw training samples that should never have left their source. For any organization training on private data, understanding and mitigating these threats is no longer optional; it is necessary for responsible AI deployment. In this post, we walk through three escalating attack scenarios: membership inference against a single model, data reconstruction from federated-learning gradients, and training-data extraction from a shared global model. We show how differential privacy and secure multiparty computation defeat each one. An attack on model inference Anyone with query access to a model can potentially determine whether a specific record was used to train it, an attack known as membership inference. Imagine that a hospital deploys a diagnostic model as an API for referring physicians. A malicious actor could probe the API to determine whether a particular patient's records were included in the training data. This would confirm that the patient was treated at the hospital and reveal details about their medical history. In a 2023 paper at the Conference on Neural Information Processing Systems (NeurIPS), Amazon Web Services researchers showed how this works in practice. A trained model tends to produce higher-confidence predictions for inputs it was trained on, a form of overfitting the attacker can exploit. 
The attacker first generates a dataset that approximates the distribution of the model's training data, then records the model's confidence scores on those samples. Using these scores as labels, the attacker trains a proxy model that learns a confidence-score cutoff separating training data from non-training data. Given a candidate record, the attacker evaluates the proxy model to obtain a cutoff, then queries the target model. If the target model's confidence score exceeds the cutoff, the record was likely in the training set. The authors demonstrated this against a ResNet-50 model trained on ImageNet-1k: 97% of records their attack flagged as training data were indeed training data. Mitigation through differential privacy We’ll show how to mitigate such membership inference attacks with differential privacy (DP), a mathematical framework for computing aggregate statistics (e.g., an average) while bounding how much any single input can influence the result. The core idea: if we can randomize the function so that adding or removing one record from the dataset barely changes the distribution of the function output, an attacker cannot confidently determine whether that record was included. Formally, a randomized function is differentially private if, for any single record added to or removed from the input dataset, the probability of any given output changes by at most a factor of e^ε, where e is the base of the natural logarithm and ε is the privacy budget. A smaller ε means tighter privacy but more noise in the computation, and vice versa. While NIST guidance suggests that ε < 1 will generally enforce a low enough privacy risk, many real-world deployments operate between 1 and 10, with situation-dependent privacy outcomes. Empirical studies indicate that ε as high as 3 can still provide meaningful data privacy against attacks like membership inference, though our understanding of the effective guarantees of DP against such attacks continues to evolve. 
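In symbols, the definition above is commonly written as follows. The names are notation introduced here for illustration: M is the randomized function, D and D' are any two datasets that differ in a single record, and S is any set of possible outputs.

```latex
\Pr\bigl[\,M(D) \in S\,\bigr] \;\le\; e^{\varepsilon}\,\Pr\bigl[\,M(D') \in S\,\bigr]
\quad \text{for all neighboring datasets } D, D' \text{ and all output sets } S .
```

Because D and D' can be swapped, the bound holds in both directions. For a sense of scale, the factor is about 2.72 at ε = 1 and about 20.1 at ε = 3.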
DP defeats membership inference because the attack relies on a gap between the model's confidence on training data and on unseen data. DP narrows that gap by ensuring the model would have learned nearly the same parameters whether or not any particular record was included in its training data. How can this approach be applied to ML? Neural networks are trained using stochastic gradient descent (SGD), in which the difference between the model’s output on a training sample and the target output for the sample is propagated back through the model, and the model parameters are adjusted to reduce the difference; the adjustment corresponding to the sample is called a gradient. In practice, the model parameters are typically adjusted according to a batch gradient — the average of sample-specific gradients for a batch of samples. In a landmark 2016 paper, Google researchers introduced DP-SGD, which adds calibrated Gaussian noise to each batch gradient during training. We implemented DP-SGD and trained a neural network on the EMNIST handwritten-letter dataset. The DP model achieved 78% test accuracy at ε = 1.5 and 82% at ε = 3.0, compared to 90% without DP. DP addresses attacks on a single model, but what happens when multiple organizations collaborate to train one? Federated learning introduces a different attack surface, one that targets the training process itself. Data leakage from federated learning Federated learning is a method of decentralized ML in which a global model is trained on datasets distributed across multiple parties, without direct sharing of the datasets. Each party trains an initial model on a local training batch, obtaining a local gradient. The local gradients are then sent to a central server, which averages them into a global gradient. The parties then produce copies of the global model by updating their local models with the global gradient. 
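The round of federated learning just described can be sketched in a few lines. This is an illustrative toy, assuming a linear model with squared-error loss and hypothetical function names; it is not our experimental setup. Note that each party's local gradient reaches the server in the clear.

```python
def local_gradient(weights, batch):
    """Average squared-error gradient of a linear model over one party's batch."""
    grad = [0.0] * len(weights)
    for x, y in batch:
        pred = sum(w * xi for w, xi in zip(weights, x))
        for j, xi in enumerate(x):
            grad[j] += 2.0 * (pred - y) * xi / len(batch)
    return grad


def federated_round(weights, party_batches, lr=0.1):
    """One round: parties compute local gradients, the server averages them,
    and every party applies the same global gradient to its model copy."""
    local_grads = [local_gradient(weights, batch) for batch in party_batches]
    global_grad = [
        sum(g[j] for g in local_grads) / len(local_grads)
        for j in range(len(weights))
    ]
    return [w - lr * g for w, g in zip(weights, global_grad)]


if __name__ == "__main__":
    # Two parties, each holding one private sample of the relation y = 2x.
    batches = [[([1.0], 2.0)], [([2.0], 4.0)]]
    weights = [0.0]
    for _ in range(30):
        weights = federated_round(weights, batches)
    print(weights)  # approaches [2.0]
```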
However, in a 2019 NeurIPS paper, a team of MIT researchers demonstrated a surprising result: the parties' local gradients leak information about the training samples from which they're computed, enabling model inversion attacks in which the server can reconstruct the parties' training samples. Even in scenarios in which the server is not viewed as adversarial, this attack demonstrates that the gradients leak the parties' training data, defeating the privacy goals of FL. This attack relies on the observation that a gradient directly contains data about the sample from which it is computed. Consequently, a sample can generally be reconstructed from its gradient, and two semantically distinct training batches are unlikely to admit the same batch gradient. Therefore, the attacker frames the problem of reconstructing a party's batch samples from its local gradient as an optimization problem: find the training batch whose gradient is minimally distant from the target gradient. The attacker can then approximately compute the solution (the training batch) by applying SGD. In our experiments on the EMNIST dataset, the attack recovered single-sample batches exactly and three samples from a batch of size seven. Preventing this data leakage requires ensuring that no party, including the server, ever sees another party's gradient in the clear. Mitigation through secure multiparty computation Secure multiparty computation (MPC) is a cryptographic protocol that lets multiple parties jointly compute a function over their private inputs, without revealing anything beyond the function's output. Intuitively, the parties exchange only encrypted intermediate values, so no party ever sees another's raw input. A simple example illustrates the core idea: suppose three parties hold private values x, y, and z. Each party splits its value into three random shares that sum to it, then distributes one share to each party. Each party sums the shares it receives. 
The resulting sums are themselves random, but they add up to x + y + z. After exchanging these sums, all parties learn the total but nothing about each other's individual inputs. Private federated learning (PFL) applies this secure-sum technique to FL: instead of sending raw local gradients to a server, the parties secret-share their gradients and aggregate them via MPC, so the server only ever sees the summed result. More efficient PFL protocols exist, including one presented in a 2023 paper coauthored by Amazon senior principal scientist Tal Rabin, but the core security principle is the same. We ran our model inversion attack against a party's local gradient computed under our PFL protocol, again using the EMNIST dataset. The attack was unable to reconstruct any training samples. MPC protects the gradients exchanged during FL, but the global model itself is shared with all participants. Can an adversarial participant exploit the model to recover others' data? We’ll explore this problem in the next section. An attack on FL global models and mitigation with DP We've seen how PFL enables n parties to securely compute a global FL model. However, the 2022 paper of Fowl et al. and 2025 paper of Shi et al. together describe an attack that enables an adversarial FL participant to reconstruct another participant's training data from the global model itself. In this attack, the attacker adds a preprocessing layer with ReLU activation (a common neural-network activation function that outputs positive inputs verbatim but outputs zeros for negative inputs) to the model. That layer consists of nB neurons, where B is the batch size. This is because each of the n parties produces a local gradient that is an average of B sample-specific gradients, so the global FL gradient is an average of nB sample-specific gradients; each of the nB neurons in the preprocessing layer will be used to reconstruct a distinct training sample. 
The attacker carefully crafts the preprocessing layer's parameters so that ReLU activates the signals of all samples in the first neuron of the global gradient, all but one sample in the second neuron of the global gradient, all but two samples in the third neuron of the global gradient, etc. Therefore, the attacker simply examines the entries of the global gradient corresponding to the nB neurons and successively subtracts the components between adjacent neurons to tease apart the nB sample-specific gradients. As we mentioned earlier, a training sample can be directly recovered from its gradient. In our experiments on the EMNIST dataset, the attack recovered all but one of the parties' local batch samples from the global gradient. But after altering our private FL protocol to instead output a differentially private global gradient — computed via DP-SGD with privacy budget of 1.5 — the attack failed to recover any meaningful information from the global gradient. Taken together, DP and MPC form complementary layers of defense: MPC protects what is exchanged during training, and DP protects what the final model reveals. Building defenses before attacks scale The experiments above have clear implications: attacks on ML training data are practical today, and the private-computing tools to defeat them are mature enough to deploy. The privacy-utility tradeoff is real: our DP-SGD models retained 78–82% accuracy at meaningful privacy budgets, compared to 90% without DP. It is worth noting that the accuracy impact of DP depends heavily on the task and dataset. Our EMNIST experiments used a relatively small model on handwritten letters, where the noise has an outsized effect. In practice, larger models trained on richer datasets absorb DP noise more gracefully. NIST SP 800-226 notes that large models pretrained on public data show strong privacy-utility tradeoffs when fine-tuned with DP-SGD. 
For many production use cases, such as fraud detection or clinical risk scoring, a modest accuracy reduction is an acceptable cost when the alternative is exposing protected data to the attacks described above. The right privacy budget is ultimately application dependent: a model screening radiology scans may tolerate less accuracy loss than one flagging suspicious transactions, and organizations should calibrate ε to their specific risk and regulatory requirements. These techniques are already in use at Amazon. We are building private-computing capabilities — differentially private training pipelines and secure aggregation for federated learning across organizational boundaries — into production systems. For instance, our fraud prevention teams use differentially private training to protect customer financial data while maintaining detection accuracy. If your organization trains models on sensitive data, we invite you to explore AWS's privacy-preserving ML capabilities and connect with our team.
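As a recap of the secure-sum example from the MPC section, the following sketch simulates additive secret sharing among several parties in a single process. The modulus, the function names, and the use of integer inputs are assumptions for illustration; real gradients would first need a fixed-point encoding, and a real protocol would send shares over authenticated channels rather than keep them in one list.

```python
import secrets

MODULUS = 2**61 - 1  # illustrative choice; shares are uniform modulo this value


def split_into_shares(value, n_parties):
    """Split value into n_parties random shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_values):
    """Simulate the protocol: deal shares, sum locally, then combine the sums."""
    n = len(private_values)
    # dealt[i][j] is the share that party i sends to party j.
    dealt = [split_into_shares(v, n) for v in private_values]
    # Each party j adds up the shares it received; the result looks random.
    partial_sums = [sum(dealt[i][j] for i in range(n)) % MODULUS for j in range(n)]
    # Exchanging the partial sums reveals only the total.
    return sum(partial_sums) % MODULUS


if __name__ == "__main__":
    print(secure_sum([5, 11, 26]))  # the three private inputs x, y, z
```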

How catastrophic is your LLM?

Mon, 27 Apr 2026 19:01:26 GMT

As large language models (LLMs) become increasingly useful across a variety of domains, the stakes of keeping them safe rise accordingly. Because bad actors might, for instance, try to use LLMs to write malicious code or make step-by-step guides for synthesizing toxic compounds, researchers are developing rigorous safeguards to keep LLMs from generating content that could pose serious public safety and security risks. The most common way to assess the risks to LLMs is called red-teaming, where human evaluators design adversarial prompts intended to elicit harmful responses. But expert-curated sets of prompts cannot capture the full range of possible outcomes. Moreover, many evaluations focus on isolated prompts rather than conversations, which are where harmful behavior often emerges. Finally, today’s benchmark failure metrics provide only a single score, rather than confidence bounds on worst-case conversational risks. This makes the findings unreliable and non-generalizable to the vast space of possible conversations. In a paper we presented at this year’s International Conference on Learning Representations (ICLR), we, along with researchers from the University of Illinois Urbana-Champaign (UIUC), address these red-teaming limitations by focusing on the failures within conversational threat models and then assigning a probability to an attack rate, which is defined as the number of successful attacks divided by the total number of attacks. Our approach, called the C3LLM (certifying catastrophic conversational risks in LLMs) framework, shifts the focus of benchmarking failure from empirical spot-checking to statistical certification. How to model a conversation In order to build our framework, we first needed to model conversations, also known as “multiturn dialogues.” We used a graph where each node corresponds to a prompt. The edges that connect the nodes indicate that the prompts are semantically related. 
This graph approximates plausible conversational transitions, capturing how a user might naturally progress through related questions. In this way, we generate a more complete picture of queries, one that maintains the complexity of possible conversations. The graph also lets us define the distribution of conversational threats, allowing us to determine the probability of harm across a range of adversarial capabilities. We simulate the lowest level of adversarial capability by sampling prompts independently, which is similar to traditional benchmarking, focusing on a single node or query at a time. This approach is denoted Random Node with Jailbreak (RNwJ). The next level up involves sampling a sequence that follows semantically connected paths through the graph. We developed two variants: the first, termed Graph Path vanilla (GPv), samples each query by following the graph; the second, Graph Path harmful target constraint (GPh), restricts the final query to come from a target harmful set. For the most advanced level of bad-actor capabilities, we approximate adversarial steering, when a bad actor coaxes an LLM toward a harmful output. For this level, we sample adaptively, examining prior movements throughout the graph-based conversation to map the distance to a query that ultimately produces the harmful output. This approach, Adaptive with Rejection (AwR), can mimic realistic red-teaming where an attacker adapts their phrasing to circumvent safety mechanisms. The graph gives us the ability to create sets of multiturn-dialogue prompts (specific sequences of queries) that we can run on a target LLM. We then label the LLM responses as catastrophic or non-catastrophic using a separate ChatGPT-based judging mechanism that determines whether the model responses are harmful. This produces empirical estimates of the attack success rates under each conversational distribution.
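The difference between independent node sampling and graph-path sampling can be sketched on a toy graph. The code below is only an illustration under stated assumptions: the graph is a dictionary from node IDs to neighbor lists, the function names are hypothetical, and the retry loop used for the final-node constraint is our simplification, not necessarily how C3LLM enforces it.

```python
import random


def sample_random_nodes(graph, length, rng):
    """Independent draws: each turn ignores the previous one."""
    nodes = sorted(graph)
    return [rng.choice(nodes) for _ in range(length)]


def sample_graph_path(graph, length, rng, final_targets=None, max_tries=1000):
    """Random walk along edges; optionally require the last node to be in
    final_targets, retrying until a walk satisfies the constraint."""
    nodes = sorted(graph)
    for _ in range(max_tries):
        path = [rng.choice(nodes)]
        while len(path) < length and graph[path[-1]]:
            path.append(rng.choice(sorted(graph[path[-1]])))
        if len(path) < length:
            continue  # dead end before reaching the requested length
        if final_targets is None or path[-1] in final_targets:
            return path
    raise RuntimeError("no path satisfying the constraint was found")


if __name__ == "__main__":
    toy_graph = {"q0": ["q1"], "q1": ["q0", "q2"], "q2": ["q1"]}
    rng = random.Random(0)
    print(sample_random_nodes(toy_graph, 4, rng))
    print(sample_graph_path(toy_graph, 4, rng))
    print(sample_graph_path(toy_graph, 4, rng, final_targets={"q2"}))
```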
Given the attack success rate, C3LLM uses the Clopper-Pearson method to calculate the lower and upper bounds on the probability of catastrophic risk. Application: How does C3LLM perform on frontier LLMs? UIUC researchers applied the proposed C3LLM framework to frontier proprietary models available at the time of the study, such as Claude-Sonnet-4 and Nova Premier, as well as open-weights models (models whose trained parameters are publicly available). The following figures show the certification results on the chemical/biological benchmark. Each panel shows the distribution of lower bounds and upper bounds under different specifications for one LLM. The following figures show the certification results on the cybercrime benchmark. Each panel shows the distribution of lower and upper bounds under different specifications for one LLM. The results reveal that catastrophic risks are nontrivial for all frontier LLMs, with notable differences in safety across models. By comparing the bounds, we observe that among the models evaluated, Claude-Sonnet-4 and Nova Premier are safer than the others, while Mistral-Large and DeepSeek-R1 exhibit higher risks. In particular, Nova Premier demonstrates consistently low risk levels, largely because its built-in guardrails often block potentially unsafe content. On the other hand, DeepSeek-R1 reaches a certified lower bound of over 70% in cybercrime scenarios under RNwJ distributions. Unlike prior work that reports attack success rates on fixed benchmarks, our approach provides high-confidence probabilistic bounds over large conversation spaces, enabling meaningful comparisons across models. We open-sourced the C3LLM framework for reproducibility and hope it enables researchers in industry and academia to perform more-principled safety studies.
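For readers unfamiliar with the Clopper-Pearson method mentioned above, here is a minimal, dependency-free sketch of the interval computation. It finds the bounds by bisection on the exact binomial distribution; the function names and the default significance level are assumptions, and this is not the C3LLM implementation.

```python
from math import comb


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def _solve_decreasing(func, target):
    """Bisection for func(p) = target on [0, 1], where func decreases in p."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if func(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(successes, trials, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for a binomial rate."""
    k, n = successes, trials
    if k == 0:
        lower = 0.0
    else:
        # Lower bound: P(X >= k | p) = alpha / 2.
        lower = _solve_decreasing(lambda p: binomial_cdf(k - 1, n, p), 1.0 - alpha / 2.0)
    if k == n:
        upper = 1.0
    else:
        # Upper bound: P(X <= k | p) = alpha / 2.
        upper = _solve_decreasing(lambda p: binomial_cdf(k, n, p), alpha / 2.0)
    return lower, upper


if __name__ == "__main__":
    # Hypothetical numbers: 7 harmful responses out of 20 sampled conversations.
    print(clopper_pearson(7, 20))
```

The interval always contains the empirical rate, and it narrows as the number of sampled conversations grows.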

Isabelle/HOL: The proof assistant behind the Nitro Isolation Engine

Fri, 17 Apr 2026 13:00:00 GMT

At Amazon’s 2025 re:Invent conference, Amazon Web Services (AWS) announced the Nitro Isolation Engine (NIE), a software module tasked with providing resources to AWS clients while ensuring the security of customer data. AWS also announced the formal verification of the isolation engine’s correctness and security guarantees, using a proof assistant called Isabelle/HOL. As the first formally verified cloud hypervisor, NIE sets a new standard for cloud security. A proof assistant is an automated tool that can help human users develop formal proofs — of mathematical theorems, the validity of hardware or software systems, or anything in-between. Several proof assistants are in common use, and we chose Isabelle/HOL because it struck the right balance among expressiveness, automation, proof readability, and scalability. So what do I mean by that? Logical reasoning by computer There is no fixed language of mathematics, but we can create languages for expressing mathematical reasoning, just as programming languages express computational tasks. And just as programming languages involve trade-offs between expressiveness and performance, mathematical languages involve trade-offs between expressiveness and ease of automation. Automation is vital because the construction of a formal proof is both time consuming and extremely tedious, analogous to constructing a ship in a bottle. The most elementary mathematical language is Boolean logic, the world of the binary operators AND, OR, and NOT. Because this language is so simple, powerful automatic solvers exist for it. In 2016, Carnegie Mellon professor Marijn Heule — now an Amazon Scholar — and his colleagues encoded into Boolean logic an unsolved mathematical question, the Boolean Pythagorean Triples Problem, and used automatic solvers to help create the largest proof ever, 200 terabytes long. 
A richer mathematical language called first-order logic allows us to talk about some domain of interest — the integers, say — and to define functions over that domain. And we can go beyond Boolean logic by including the quantifiers "for all" and "there exists" in assertions. In this sort of language, we can express statements such as "every prime number greater than two is odd". We can also prove the following theorem, due to Lewis Carroll: No ducks waltz; no officers ever decline to waltz; all my poultry are ducks. Hence, none of my poultry are officers. However, most people prefer a still stronger mathematical language, where they can define types, as they do in programming. In higher-order logic, there are even function types, as found in functional programming languages such as Haskell. Higher-order logic is much richer than first-order logic, able to express statements such as Every set containing the number 1 and closed under addition contains all the positive integers. It appears to be rich enough to express most of mathematics. The richest mathematical languages — called dependent-type theories — even allow types to take arbitrary values as parameters, e.g., T(i), where i is an integer. The best-known such languages are Lean and Rocq. Powerful automatic theorem provers exist for first-order logic, but for higher-order logic and beyond, full automation is not available. This is the price of expressiveness. A proof assistant allows users to build proofs interactively, supported by partial automation and the possibility of coding their own proof searches. A proof assistant enforces strict compliance with the laws of logic, typically through a kernel architecture that gives only a limited portion of the code the right to create a theorem. A proof assistant also supports the interactive development of a possibly huge formal-specification hierarchy. 
For example, verification of the Nitro Isolation Engine (NIE) rests on specifications of the architecture of the Graviton-5 processor, the Rust code of the hypercalls and their functional correctness, and the security properties that are to be proved. These take up much of the quarter of a million lines constituting the formal proof. Higher-order logic is supported by HOL and HOL Light, two closely related proof assistants, and has been used to verify hardware designs, floating-point algorithms, and pure mathematics since the 1990s. AWS senior principal applied scientist John Harrison developed HOL Light, and he has used it to improve the performance of digital signatures on Amazon’s Graviton2 chip by up to 94%, by verifying an optimized version of the cryptographic algorithms. The code was delicate, and exhaustive testing is not feasible; only a formal verification of full functional correctness would do before the deployment of such critical software. But today we are interested in Isabelle/HOL. Overview of Isabelle/HOL The most visible difference between Isabelle/HOL and the other HOL systems — which are all based on higher-order logic — is its specification and proof language. With most proof assistants, users state what they want to prove, followed by lists of commands that replace the original goals with series of subgoals in a kind of whack-a-mole game. In Isabelle and to some extent Lean, the proof language allows desired intermediate goals to be written out explicitly, allowing a better controlled proof process and a more legible proof document. There are plenty of examples online. 
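As a small illustration of writing intermediate goals out explicitly, here is a sketch of the Lewis Carroll syllogism quoted earlier. It is written in Lean 4 rather than Isabelle, purely as an example of the same structured style; the predicate and hypothesis names are invented for the example.

```lean
theorem poultry_not_officers {α : Type}
    (Duck Waltzes Officer Poultry : α → Prop)
    (no_duck_waltzes : ∀ x, Duck x → ¬ Waltzes x)
    (officers_waltz : ∀ x, Officer x → Waltzes x)
    (poultry_are_ducks : ∀ x, Poultry x → Duck x) :
    ∀ x, Poultry x → ¬ Officer x := by
  intro x hPoultry hOfficer
  -- Intermediate facts are stated explicitly before being used.
  have hDuck : Duck x := poultry_are_ducks x hPoultry
  have hWaltzes : Waltzes x := officers_waltz x hOfficer
  exact no_duck_waltzes x hDuck hWaltzes
```

Each `have` line names a fact a reader can check on its own, which is what makes proofs in this style easier to follow than a bare list of commands.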
Other notable features are as follows: a user-configurable parser, which allowed us to embed a significant fragment of the Rust language into our specifications; type classes for principled overloading, so that say + can be given its natural meaning, not just for a variety of numeric types but for machine words and in other appropriate contexts; locales, a lightweight module system allowing a hierarchy of specifications to be defined and interpreted in various ways, even within a proof; powerful built-in automation through simplification and backchaining proof search; sledgehammer: one-click access to even more powerful external automation; counterexample-finding tools, for identifying claims that are actually false; code generation from executable higher-order specifications, which we used to test conformance. For the verification of NIE, we began by implementing a specialized language called separation logic on top of Isabelle/HOL. Separation logic is designed for verifying program code operating on shared resources. We coded our own proof automation and also used what was built in. We therefore could use separation logic but also plain higher-order logic when we wanted to. Isabelle turned out to be resilient enough and efficient enough to cope with the truly gigantic subgoals. It could run that quarter-million-line proof in half an hour using an off-the-shelf laptop. Some applications of Isabelle/HOL The single most impressive application of Isabelle prior to NIE is probably the verification of seL4, a widely used microkernel. This proof was also about a quarter of a million lines when first announced, although it is now much longer. The seL4 developers proved that the microkernel’s C implementation refined the abstract specification, yielding full functional correctness of the core operations. 
And they have observed no bugs in the verified parts of the code, although testing still plays a role in covering unverified parts and certain assumptions that cannot be formalized. Isabelle was also used in the following projects: to formalize the semantics of the WebAssembly language, to identify errors and, in particular, to prove the soundness of its type system; to create a verification framework for the Cogent programming language; to prove the correctness of algorithms for conflict-free replicated data types, which are used for distributed editing; to formalize numerous results in pure mathematics; to verify cryptographic protocols at an abstract level. Isabelle is free, open source, and available to download. It runs on all the main operating systems on any machine that has enough memory.

Customized Amazon Nova models improve molecular-property prediction in drug discovery

Wed, 15 Apr 2026 16:10:55 GMT

In recent years, large language models (LLMs) have become indispensable assistants for software engineers and knowledge workers. Nimbus Therapeutics enlisted us at Amazon’s Generative AI Innovation Center and Artificial General Intelligence (AGI) organization to investigate whether it’s possible to make equally capable assistants for medicinal chemists discovering new drugs. Such an agent could significantly speed up drug discovery, potentially saving lives. AI in drug discovery has traditionally involved models called graph neural networks, or GNNs. GNNs are the workhorses of molecular-property prediction across pharmaceutical R&D, and for good reason: they deliver strong accuracy on well-defined tasks. Typically, multiple GNNs, specialized for different molecular properties, have to be built and maintained in-house — an expensive, operationally complex process. In recent years, the success of LLMs in a variety of research domains has caught the eye of biotech firms, but for drug discovery, general, off-the-shelf LLMs have proven to be less accurate than GNNs or other computational methods. We have adopted a new approach that combines the accuracy of GNNs with the generalizability and reasoning ability of LLMs. Using supervised fine tuning (SFT) and reinforcement fine tuning (RFT) to customize a general-purpose LLM, we were able to achieve results comparable to those of using multiple GNNs, at a fraction of the time and labor. Fine-tuned LLMs offer a significantly simplified workflow. In the traditional setting, each GNN has a separate interface, with its own quirks, data formats, and failure modes. Results come back as disconnected numbers that the chemist must manually integrate. When a new property needs to be predicted, someone must construct a multitask dataset and train and validate an entirely new model, a process that can take weeks. 
In contrast, a single, fine-tuned LLM allows a chemist to submit one query and receive predictions on all molecular properties of interest. Adding a new property requires incremental fine tuning rather than building a new model from scratch. Moreover, a language model opens the door to a qualitatively different capability: conversation. With a fine-tuned LLM, it’s now possible to ask for the reasoning behind the model outputs or to suggest molecular modifications that might yield the desired properties. This points toward an assistant that unifies molecular-property prediction and generation in one interactive experience, which we see as the ideal next step for AI-assisted drug design. Customized LLMs unlock domain-specific scientific assistants, giving lean biotech teams a practical way to collaborate with AI systems that speak their scientific language. Today, bringing a single drug to market takes 10 to 15 years and costs on average over $2 billion, with only about 8 percent of drug candidates that enter clinical trials receiving FDA approval. We believe that AI assistants could particularly improve productivity in the early stages of this pipeline, where chemists design molecules with druglike properties. Increasing the speed of development and the number of viable candidates would maximize the chances of delivering a safe and efficacious drug to the clinic. What we looked at Our work with Nimbus Therapeutics focused on properties spanning three categories critical to drug development: Lipophilicity (which has one associated property) determines whether a molecule can cross biological membranes. It is fundamental to drug absorption and distribution and affects all other characteristics of a drug. Permeability (four associated properties) measures how easily a drug enters the body via the bloodstream. Clearance (six properties) determines how quickly the body eliminates a drug. 
A drug that takes too long to be cleared could become toxic; one that is cleared too quickly won’t be effective. These properties span different value ranges and exhibit complex interdependencies, in practice requiring separate multitask GNN models. We tested the general-purpose LLMs Claude Sonnet 4 and Nova 2 Lite on the task of predicting all three sets of properties for particular molecules. Despite their impressive capabilities elsewhere, the models significantly underperformed specialized GNNs, with a gap in root mean squared error (RMSE) that ranged from 40% to over 200%, depending on the property. However, we discovered that Nova 2 Lite with supervised fine tuning (SFT), followed by reinforcement fine tuning (RFT), could close that gap. Our single, fine-tuned LLM predicted 11 different molecular properties with accuracy similar to that of multiple separately trained multitask GNN models. How we did it Our approach to fine-tuning the LLM follows a principle common to both human-expertise development and machine learning: foundational knowledge must precede performance optimization. During SFT, the model learned core concepts such as molecular structure and property relationships. Then, during RFT, training shifted to the development of predictive judgment through practice and feedback. During SFT, we exposed Nova 2 Lite to more than 55,000 molecules labeled with experimental measurements across 11 properties. SFT was essential because the domain-specific tasks we asked the model to perform fall far outside Nova 2 Lite’s generalized pretraining data. For example, we use a notation called SMILES (simplified molecular-input line entry system) to represent chemical structures. Without SFT, the LLM wouldn’t have been able to perform a task like “predict chemical property from SMILES strings in structured JSON format”. 
The second training stage, reinforcement fine tuning (RFT), is especially critical for properties with limited experimental data, where SFT alone struggles to generalize. RFT also enables the intramodel transfer of learning across properties. For instance, lipophilicity affects permeability, and both can inform metabolism predictions. Further, RFT shifts the learning objective from pattern matching ("given molecule X, output value Y based on similar examples") to quality optimization ("minimize prediction error across all properties"). We tested the SFT and RFT models on 15,000 molecules unseen during training. We also built a system prompt that encompassed knowledge of both core chemistry and our 11 chemical properties of interest, including their definitions and expected value ranges. During the RFT stage, we experimented with three strategies for generating rewards, which guided the learning process. Molecular-property prediction is particularly amenable to reward engineering for RFT since the output is a single number, which allows us to measure exactly how far off each prediction is. Our first strategy was to use an exponential decay function, so predictions closer to the true value received exponentially higher rewards. But at high error, improving from “terrible” to merely “bad” yielded almost no reward difference, keeping the model from learning from its worst predictions, while at low error, small changes resulted in large reward differences, which made the reward signal noisy and ultimately unhelpful. Our second strategy, binary pass/fail rewards, created the opposite problem. The model received zero reinforcement for gradual improvement: it either crossed an arbitrary threshold (in our case, correct within 10 percent) or learned nothing. Rewards based on the Huber loss (a metric proposed in 1964 by the Swiss statistician Peter Huber, which limits the influence of outliers) solved both issues.
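The three reward shapes can be compared in a short sketch. The exact functional forms, scales, and thresholds used in training are not given here, so everything below is an assumption for illustration: in particular, the Huber reward is taken to be the negated Huber loss, and the binary threshold is taken to be a relative error of 10 percent.

```python
import math


def exponential_reward(pred, target, scale=1.0):
    """Reward decays exponentially with absolute error; nearly flat at large error."""
    return math.exp(-abs(pred - target) / scale)


def binary_reward(pred, target, tolerance=0.10):
    """Pass/fail: 1 if within a relative tolerance of the target, else 0."""
    return 1.0 if abs(pred - target) <= tolerance * abs(target) else 0.0


def huber_loss(error, delta=1.0):
    """Quadratic for small errors, linear for large ones."""
    a = abs(error)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def huber_reward(pred, target, delta=1.0):
    """Assumed mapping from loss to reward: simply negate the Huber loss."""
    return -huber_loss(pred - target, delta)


if __name__ == "__main__":
    target = 0.0
    for pred in (10.0, 8.0, 1.0, 0.5, 0.1):
        print(
            pred,
            round(exponential_reward(pred, target), 5),
            binary_reward(pred, 1.0),
            huber_reward(pred, target),
        )
```

In this sketch, moving a prediction's error from 10 to 8 barely changes the exponential reward and leaves the binary reward at zero, while the Huber reward still improves by a fixed amount per unit of error.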
Unlike exponential decay, Huber rewards don't become negligible on large errors — the model always receives a meaningful signal to improve — yet they remain stable near the correct answer, refining predictions without overreacting to small fluctuations. This yielded our best result, a 4.9% R² improvement over baseline, and we used the Huber reward as the default for training the model on multiple molecular properties simultaneously. Carrying this forward into multiproperty training, we fine-tuned a single model to predict all 11 properties simultaneously. Our best-performing model was Nova 2 Lite with RFT on top of full-rank SFT, meaning that all the model parameters were updated. It outperforms Claude Sonnet 4 by 39% and base Nova 2 Lite by 37% on average RMSE. While averaging 5% behind the baseline GNN, it matches or outperforms the GNN on 7 of 11 properties — a striking result given that a single LLM is going toe-to-toe with multiple independently trained multitask GNN models, reducing not just model count but the entire infrastructure footprint around training, deployment, and maintenance. It’s important to note that Nova Forge — a service that allows Amazon Web Services customers to use proprietary data during both pretraining and SFT — supports both SFT and RFT on SageMaker, enabling extensive model customization. Since SageMaker handles the training framework and infrastructure maintenance internally, organizations avoid the cost of building and maintaining custom training pipelines from scratch. What’s next? Based on these initial experiments and results, Nimbus Therapeutics recently deployed its Novus model on Amazon Bedrock. Novus is the company’s custom-built LLM, created through Nova Forge. In its current form, Novus handles molecular-property prediction with an accuracy that is competitive with purpose-built GNNs. 
The next milestone is extending those capabilities toward molecular design, enabling the model to propose structural modifications, predict their downstream properties, and explain its reasoning, all in a single conversation. Acknowledgements Leela Dodda (Nimbus), Aarush Garg (Nimbus), Matthew Medina (Nimbus), Md Tamzeed Islam, Elyse Zhang, Clement Perrot, Rohit Thekkanal, Shiv Vitaladevuni

AWS and Hopkins Engineering announce groundbreaking database for AI/ML antibody design

Tue, 14 Apr 2026 14:00:00 GMT

In 1986 the US Food and Drug Administration issued its first approval for human use of a therapeutic antibody. Despite steady advances in methodology, genetic sequencing, and biomedical science, 40 years later the process of discovering and optimizing therapeutic antibodies often remains prohibitively expensive, in terms of both cost and time. Recent experiences with pandemic-style infectious-disease outbreaks lend an even greater urgency to the need to more quickly and efficiently identify and develop these antibodies. Artificial-intelligence- and machine-learning-guided approaches to antibody design, in the form of biological foundation models (BioFM), represent a significant opportunity to address these challenges. Models built using protein language models (pLMs) and structure-based deep-learning frameworks have significant potential to predict antibody developability properties — the characteristics that determine whether a molecule is manufacturable, stable, and safe as a therapeutic. The development of those tools could drastically shorten discovery timelines while also reducing experimental costs. That potential, however, has been hindered by the lack of a public dataset that would allow researchers to benchmark those tools, a crucial step in the development of trustworthy in-silico tools for drug discovery. While there are existing public antibody datasets, they are too frequently limited by a focus on a single antibody format or target. Others are composed of naturally occurring or clinically advanced antibodies, a bias that severely limits their utility for training or evaluating predictive models. “Trust in the predictions made by these models must be grounded in evaluations against experimental data that is sufficiently large and diverse,” explained Luca Giancardo, an applied scientist with Amazon Web Services (AWS) who works on the Amazon Bio Discovery team. 
“That data must be representative of the real sequence space encountered during antibody engineering and balanced in terms of developability outcomes.” Jeffrey Gray is a professor in the Chemical and Biomolecular Engineering Department at the Johns Hopkins Whiting School of Engineering, where he leads the Gray Lab, which focuses on the computational prediction and design of protein structures. He is also the original developer of RosettaDock, a tool for the prediction of the structure of protein complexes from their constituent proteins. Gray noted that while AI has made tremendous progress in the prediction and design of antibody properties, his own lab’s benchmarks have shown that current models do not yet reliably predict critical developability features, such as solubility and specificity, needed for efficient design of therapeutics. He cited the lack of diverse data in standardized conditions as a primary limitation for training models. That, coupled with the absence of a comprehensive, heterogeneous, large-scale database, has acted as a significant drag on the potential of developing AI tools for antibody development. Antibody developability benchmark To that end, AWS, in collaboration with the Gray Lab and Johns Hopkins Engineering, is announcing the launch of the Antibody Developability Benchmark, powered by the largest and most diverse antibody dataset in the public literature. This is the first large-scale benchmark of antibody biophysical and biochemical properties designed to support the development and rigorous evaluation of in-silico antibody property predictors. The Antibody Developability Benchmark is 20 times as diverse — in terms of antibody formats, targets, and developability profiles — as benchmarks currently available in the scientific literature. While other datasets may contain more individual antibody designs, they typically explore a single target or antibody framework with limited property coverage. 
The Antibody Developability Benchmark is unique in its combination of scale and heterogeneity, encompassing 50 seed antibodies, four structural formats, and 42 antigens. It also includes both favorable and unfavorable developability outcomes. Gray lauded the opportunity to work with AWS experts, noting that the collaboration has enabled the creation of a dataset larger and more diverse than any of the publicly available datasets. He called the project an important next step toward fulfilling the promise of AI to improve human health. The Antibody Developability Benchmark includes the first heterogeneous antibody-property dataset explicitly designed to capture favorable and unfavorable developability profiles across multiple antigens and mutation strategies. Crucially, all data was affirmed via wet-lab experiments, providing ground truth validation that existing public benchmarks lack. “This dataset will allow researchers to confidently be able to answer ‘Which model is better suited for our purposes?’,” noted Giancardo, whose Bio Discovery team led the development of the dataset. “Today there are many computational models coming out that are mostly evaluated on either proprietary data or public datasets, which are not representative of antibody heterogeneity. That means deciding what is better or worse is very, very hard — if not impossible.” The unmatched diversity and deliberate heterogeneity of the Antibody Developability Benchmark will help make those determinations possible. Michael Chungyoun, a PhD researcher at JHU who worked on the project, observed that the benchmark covers a wide space of antibodies, particularly in terms of their properties. He noted that allowing researchers to check against a very diverse benchmark can save time and labor by helping them compare models and choose the best approach. 
The antibody dataset The dataset consists of 50 clinically and scientifically relevant seed antibodies spanning four structural formats — IgG, VHH, NearGermline-IgG, and scFv — targeting 42 distinct antigens. It measures expression, purity, thermostability, aggregation, polyreactivity, and hydrophobicity — six traits that are essential in the development of viable therapeutic antibodies. “The composition is a deliberate design choice,” Giancardo noted. “We strove to find a balance between heterogeneity of antibody classes, therapeutic targets, and mutation types, with the aim of creating benchmarks that would be generalizable across the structural diversity of the modern therapeutic-antibody landscape.” Researchers at the Gray Lab, assisted by a sponsored research grant from AWS, helped select the seed antibodies for inclusion in the dataset. They were intentional about the seeds they chose, Chungyoun noted, opting in some cases for existing clinical-stage antibodies or FDA-approved antibodies. The team also selected antibodies more akin to those that circulate in the human body but aren't approved therapeutics. Those are called germline antibodies. Chungyoun explained that germline antibodies are those found in the human body, and they have important biophysical characteristics. While some of those characteristics are shared with therapeutic antibodies, there are also differences between the two. The extent of those differences, and how to bridge that gap, is a vital and unanswered question. Traditional antibody-based drug discovery begins with antibodies that come from animals or humans. Chungyoun explained that germline antibodies occasionally need to be modified to look more like therapeutics. That process is one researchers are still exploring. Mutation strategy The dataset also includes engineered variants of each seed antibody, generated by applying systematic mutation strategies to each seed. 
“Initially, the hardest thing was essentially coming up with example sequences that would cover the broad spectrum of properties and the ways of mutating these sequences,” Giancardo explained. “It's challenging because you have to do it a priori until you do it, and then you don't know what will come out.” Working with Johns Hopkins Engineering, Giancardo and his team systematically engineered variants employing a variety of approaches, including protein-language-model-guided (pLM-guided) versus non-pLM-guided mutation selection and amino acid substitutions versus insertions/deletions. “Protein language models are essentially the equivalent of large language models [LLMs] for the protein world,” Giancardo said. “There are multiple ways of looking at proteins. A common way is expressing them as a string of amino acids, which are essentially letters.” When some of the letters in the amino acid chains are masked, the models can be trained to fill in the gaps — the same "self-supervised" approach used to train LLMs. The models can also be trained to predict what changes inserting a different letter or letters — i.e., mutation — will yield. That approach resulted in a wide variety of mutations — up to 99 engineered variants per seed. The breadth and depth of those mutations contribute to another distinguishing feature of the Antibody Developability Benchmark: its deliberate heterogeneity. The inclusion of both favorable, or developable, and unfavorable, or poorly developable, examples sets it apart from existing datasets. “This range is essential for training and evaluating machine learning models, which require balanced label distributions and exposure to the failure modes they are intended to predict and avoid,” Giancardo explained. He also clarified that those failures still fall within a range of viability. “These are not examples that are obviously wrong but rather bad examples that have a fighting chance," he added. 
"These all still meet some baseline quality assessment, meaning researchers could reasonably send them to a wet-lab partner to test.” Zero-shot learning Gray and his team at Hopkins Engineering also collaborated with their AWS counterparts by selecting and running existing open-source antibody design and prediction models on their own. They then shared their findings with the Bio Discovery team, who compared the results those models generated against the benchmarking dataset without exposing those models to the information in that dataset. “This is essentially zero-shot inference,” Giancardo said. That siloed approach allowed both sides to have greater confidence in the results the Antibody Developability Benchmark generated. “The fact that we operated separately gave us confidence that we were not introducing errors. There is no data leakage of any sort, even from an external perspective.” The teams compared their data and used those results to further fine-tune the Antibody Developability Benchmark. That iterative process means researchers who utilize the benchmark can have greater confidence about the viability of their models before the necessary, and costly, step of working with a wet lab partner. That can also shorten the overall timeline in terms of experimentation. “When you are confident enough to do a screen, then you can turn to the wet lab, get new metrics, and further train on those results, which will be much, much, much more meaningful,” Giancardo explained. The future Researchers at both AWS and Hopkins Engineering emphasized the importance of sharing model benchmarks based on the Antibody Developability Benchmark Dataset with the larger scientific community. The benchmark results are now available as part of Amazon Bio Discovery; additional benchmarks will be added over time and released in a paper later this year. 
The sharp uptick in proposed protein AI models has researchers excited, but the expense and time commitment of wet labs has meant researchers have thus far been unable to compare those models head to head, Chungyoun observed. He noted that the launch of this dataset means those researchers now have an opportunity to learn which model properties improve performance. That can serve to illuminate the connection between what models learn and how those models can be improved to better predict those properties. The dataset won’t remain static either: more models and properties will be added in the future. "The database has the potential to surface models and tools that may have previously gone unrecognized — research published in lesser-known venues or work that simply didn't receive the attention it deserved," said Nina Cheng, a senior science manager in the AWS Life Sciences organization. "This database can play a key role in bringing that kind of overlooked work to light." Acknowledgements Amazon Bio Discovery Science and product team: Luca Giancardo, Yue Zhao, Melih Yilmaz, Kemal Sonmez, Lan Guo, Gordon Trang, Edward Lee, Chuanyui Teh, Fangda Xu, Nina Cheng, Jiwon Kim.

How Amazon uses agentic AI for vulnerability detection at global scale

Wed, 08 Apr 2026 16:17:20 GMT

In 2025, the National Vulnerability Database published more than 48,000 new common vulnerabilities and exposures (CVEs), reflecting the impact of automated and AI-powered tools on vulnerability discovery. For security teams, however, knowing about new vulnerabilities isn’t enough; they must translate each disclosure into robust detection logic fast enough to protect large, complex systems. At AWS, we built RuleForge, an agentic-AI system that generates detection rules directly from examples of vulnerability-exploiting code, achieving a 336% productivity advantage over manual rule creation while maintaining the precision required for production security systems and enhanced customer security. Closing the gap between disclosure and defense At Amazon, detection rules are written in JSON and applied to data such as requests to MadPot, a global “honeypot” system that uses digital decoys to capture the behavior of malicious hackers, and likely exploit attempts flagged by our internal detection system, Sonaris. We expect the number of high-severity vulnerabilities published to the NVD to continue to grow, which means that AI-powered automation is essential for security at scale. By automating rule generation, we’re closing that gap while expanding our coverage. Our teams can now turn high-severity CVEs into validated detection rules at a pace and scale that would be impossible with traditional methods, providing more comprehensive protection for customers. The manual-detection rule workflow Before RuleForge, creating a detection rule for a new CVE was a multistep, analyst-driven process: Download and analyze. A security analyst located publicly available proof-of-concept exploit code — code that demonstrates how to trigger a vulnerability — and studied it to understand the attack mechanism, inputs, and expected behavior. Write detection logic. 
The analyst authored a rule to catch malicious traffic targeting the vulnerability, then wrote queries to measure the rule's accuracy against traffic logs. Validate and iterate. The analyst ran those queries, reviewed the results, tuned the rule to reduce false positives, and repeated until the rule performed well enough for production. Peer review and deploy. Finally, the analyst submitted the rule for code review by another security engineer before deployment. This workflow produced high-quality rules, but the time investment meant the team had to carefully prioritize which vulnerabilities to cover first. Reframing rule creation as an agentic-AI pipeline RuleForge reimagines this workflow as an agentic-AI system — a set of specialized AI agents that collaborate to generate, evaluate, and refine detection rules, with humans remaining in the loop for final approval. Rather than attempting to solve the end-to-end problem with a single model, RuleForge decomposes the task into stages that mirror how human experts work: Automated ingestion and prioritization. RuleForge downloads publicly available exploit proof-of-concept code demonstrating how to target a specific vulnerability. It scores each exploit using content analysis and threat intelligence sources. This ensures that rule generation focuses on the threats that matter most. Parallel rule generation. For each prioritized CVE, a generation agent running on AWS Fargate with Amazon Bedrock proposes multiple candidate detection rules in parallel. Each candidate can be refined across several iterations based on feedback from later stages, enabling the system to explore different detection strategies before selecting the most promising ones. Instead of relying on one expert working rule by rule, RuleForge treats detection engineering as a pipeline where AI proposes options and humans decide what ships. AI-powered evaluation. A separate evaluation agent reviews each candidate. 
This is one of RuleForge's key innovations: rather than having the generation model judge its own work, RuleForge uses a dedicated "judge" model to score each rule on two dimensions that human experts use to assess detection rules: Sensitivity: What is the probability that this rule will fail to flag malicious requests described in the CVE? Specificity: What is the probability that this rule targets a feature that correlates with the vulnerability rather than the vulnerability itself? Multistage validation. Rules that pass the judge move through a pipeline of increasingly rigorous tests. Synthetic testing generates both malicious and benign test cases to verify basic detection accuracy. Rules are then validated against traffic logs, such as those from MadPot, to confirm they perform as expected. Rules that fail at any stage get sent back to the generation agent with specific feedback explaining why, creating a closed loop of improvement. Human review and deployment. The best-performing rule enters code review, just as before. A security engineer reviews it, and any feedback goes back to the generation agent for revision. Human judgment remains the final gate before production deployment. Why a separate judge model matters When we asked the rule generation model to report its confidence in its own candidate rules, it thought almost everything it produced was good. This aligns with research showing poor LLM calibration on security topics. The solution was separating generation from evaluation. Using a dedicated judge model reduced false positives by 67% while maintaining the same number of true positive detections. Two main design choices made the judge effective: Negative phrasing improves accuracy. Asking "what is the probability that the rule fails to flag malicious requests?" produces better calibration than asking "what is the probability that the rule correctly flags all malicious requests?" 
Given that LLMs tend toward affirmation, framing the evaluation as a search for problems yields more honest assessments. Domain-specific prompts outperform generic ones. Simply asking the model to rate its overall confidence in a rule produced poor calibration. The questions that worked encoded what security engineers actually look for: whether the rule targets the vulnerability mechanism itself versus a correlated surface feature and whether the rule covers the full range of exploit variations. The system also generates reasoning chains explaining its scores. We evaluated those reasoning chains against human assessments and found that the AI judge's reasoning matched expert human reasoning for six out of nine rules. For example, when a human evaluator noted, "That SQL injection regex is too loose," the judge had independently determined that "the regex pattern will catch any query parameter with a single quote, which is broader than just the specific vulnerability." Results and what’s next We deployed the confidence scoring system in August 2025, accelerating how quickly our analysts can deploy new detection rules. Over the final four months of the year, RuleForge enabled our team to produce and validate rules 336% faster than it could manually, while maintaining the high accuracy required for production security systems. By shifting analyst focus from authoring to review, we’ve multiplied overall throughput without compromising quality. We’re closing the gap between vulnerability disclosure and defense more effectively than ever before and ensuring that the managed protections that help safeguard customer workloads on AWS are updated faster and cover more high-severity CVEs. RuleForge demonstrates that agentic AI can augment human security expertise at production scale while meeting precision requirements. 
The key innovations are architectural: separating rule generation from rule evaluation, using multiple specialized agents rather than a single model, and keeping humans in the loop for final approval. As the rate of vulnerability disclosures continues to accelerate, these design principles will help us keep defenses current. For a deeper look at the technical details behind RuleForge, including the evaluation methodology and experimental results, see our paper on arXiv.
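The generate, judge, validate loop described in this article can be sketched as follows. This is a minimal illustration of the control flow only: every name here (`refine_rule`, `JudgeScore`, the `max_risk` threshold, the rule dictionaries and their fields) is hypothetical and does not reflect RuleForge's actual interfaces or rule schema.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class JudgeScore:
    # Both questions are phrased negatively, so lower is better.
    p_misses_malicious: float   # chance the rule fails to flag malicious requests
    p_targets_correlate: float  # chance the rule keys on a correlated feature

def refine_rule(
    generate: Callable[[List[str]], List[dict]],
    judge: Callable[[dict], JudgeScore],
    validate: Callable[[dict], Tuple[bool, str]],
    max_rounds: int = 3,
    max_risk: float = 0.2,  # hypothetical acceptance threshold
) -> Optional[dict]:
    """Return the best surviving rule for human review, or None."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        survivors = []
        for rule in generate(feedback):
            score = judge(rule)  # a separate model from the generator
            if max(score.p_misses_malicious, score.p_targets_correlate) > max_risk:
                feedback.append(f"judge rejected {rule['name']}: {score}")
                continue
            ok, reason = validate(rule)  # synthetic tests, then traffic logs
            if not ok:
                feedback.append(f"validation failed for {rule['name']}: {reason}")
                continue
            total_risk = score.p_misses_malicious + score.p_targets_correlate
            survivors.append((total_risk, rule))
        if survivors:
            return min(survivors, key=lambda pair: pair[0])[1]
    return None

# Toy stand-ins for the agents, for demonstration only.
def toy_generate(feedback: List[str]) -> List[dict]:
    if not feedback:  # first round: an overly broad rule
        return [{"name": "any-single-quote", "pattern": "'"}]
    return [{"name": "specific-endpoint", "pattern": "/login?user='"}]

def toy_judge(rule: dict) -> JudgeScore:
    too_broad = len(rule["pattern"]) < 5
    return JudgeScore(0.05, 0.9 if too_broad else 0.1)

def toy_validate(rule: dict) -> Tuple[bool, str]:
    return True, ""

if __name__ == "__main__":
    best = refine_rule(toy_generate, toy_judge, toy_validate)
    print(best["name"] if best else "no rule passed")
```

In the toy run, the first candidate is rejected by the judge as too broad, the rejection is fed back to the generator, and the narrower second candidate survives. In the workflow described above, that surviving rule would still go to a security engineer for code review before deployment.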

Verifying and optimizing post-quantum cryptography at Amazon

Tue, 07 Apr 2026 15:00:00 GMT

Today, secure online communication is enabled by public-key cryptography, primarily RSA and elliptic-curve cryptography (ECC), whose security depends on the assumption that certain computational problems are intractable. However, while believed to be intractable for conventional computers, the problems underlying RSA and ECC may be tractable for sufficiently large quantum computers. “Store now, decrypt later” attacks, in which an adversary intercepts encrypted information and holds onto it until quantum computers can decrypt it, mean that protection is required long before such decryption becomes technically feasible. Post-quantum cryptography (PQC) is cryptography running on classical computers but secure in the face of quantum computing. In 2024, following an eight-year standardization effort, the US National Institute of Standards and Technology (NIST) published standard FIPS-203, which specifies the Module-Lattice-Based Key Encapsulation Mechanism, or ML-KEM, as a mechanism for key agreement believed to be secure against attacks from quantum computers. In this post, we describe how Amazon’s Automated Reasoning Group, AWS Cryptography, and the open-source community have collaborated to create an open-source, formally verified, and optimized implementation of ML-KEM, protecting customers against store-now-decrypt-later attacks with the highest assurance and minimal cost. What is good cryptographic engineering? In keeping with Amazon’s customer obsession, we prioritize three goals when working on cryptographic solutions: The security of the customer’s data: Cryptography is notoriously hard to implement securely, and any flaw can endanger the customer’s privacy; The customer experience: Cryptography is a computational tax that we minimize to ensure the lowest cost and best experience for our customers; Our ability to maintain the solution going forward: The less time we need to spend on maintenance, the more we can innovate on behalf of our customers. 
There are, however, tensions between these goals: Simple code is easiest to maintain and write securely but tends to be slow. Fast code tends to be more difficult to audit and prone to errors. Automated reasoning allows us to resolve these tensions and provide our customers with cryptographic solutions that are secure, fast, and maintainable, all at once. Yet another implementation of ML-KEM? ML-KEM — formerly known as Kyber — is well studied from an implementation perspective: On the one hand, the Kyber reference code provides a clean C implementation that has been scrutinized for years. On the other hand, numerous research papers describe how to optimize ML-KEM for various metrics and platforms. The challenge faced by AWS Cryptography and the Automated Reasoning Group in 2024 was to combine the simplicity of the reference implementation and the optimization potential revealed in the research works in a single production-ready implementation. Around the same time, AWS became a founding member of the Linux Foundation’s Post-Quantum Cryptography Alliance (PQCA), which created the Post-Quantum Cryptography Package (PQCP), “a collection of open-source projects aiming to build high-assurance software implementations of standards-track post-quantum cryptography algorithms”. Therefore, rather than brewing our own code, members of our team joined the PQCP and soon after launched mlkem-native, a high-assurance, high-performance C implementation of ML-KEM aiming to combine the ML-KEM reference implementation with research on optimization and formal verification. Coding, fast and slow Mlkem-native’s modular design combines a frontend covering the high-level logic of ML-KEM with a backend responsible for all performance-critical subroutines. Each subroutine — including the Keccak permutation underlying SHA3 and the number-theoretic transform (NTT) underlying fast polynomial arithmetic — has multiple, highly efficient implementations written natively for specific hardware. 
In addition to the default C implementation, mlkem-native provides assembly/intrinsics backends for AArch64, x86_64, and RISC-V64. Importantly for maintainability, the interface between frontend and backend is fixed: a developer adding optimizations for a new target architecture implements select backend functionality against the backend specification, while the frontend stays the same. The development of the backend specification turned out to be less obvious than it sounds, as we explain below. Knowing your limits Memory safety A well-known challenge with the C programming language is the risk of buffer overflows: writing past the designated limits of a memory region can corrupt data structures and, when maliciously exploited, lead to unprivileged code execution. The umbrella term for such issues is memory safety. Memory-safe languages such as Rust can limit the impact of out-of-bounds accesses — by, for example, panicking instead of exhibiting undefined behavior — but they don’t prevent the mistake itself. Type safety Another well-known challenge, this time with implementing ML-KEM, is the risk of integer overflows — an aspect of type safety. Like RSA and ECC, ML-KEM relies on modular arithmetic, in which the results of operations are divided by a particular number — in ML-KEM’s case, the prime 3,329, designated MLKEM_Q or just q — and only the remainder is carried forward. The modulo operator is represented by the percentage symbol, %. Logically, if two numbers x and y need adding or multiplying in ML-KEM, one needs to compute (x + y) % q and (x * y) % q; for example, (294 * 38) % q = 11,172 % q = 1,185. Such “eager” arithmetic modulo q, which constantly applies modular reduction to represent data in the “canonical” range {0, 1, 2, … , q-1}, is prohibitively slow. 
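The eager arithmetic described above can be written out in a few lines. This is a sketch for illustration, not mlkem-native code: the helper names are hypothetical, and Python integers do not overflow, so the 16-bit limit is modeled explicitly as a constant.

```python
MLKEM_Q = 3329          # the ML-KEM prime modulus q
INT16_MAX = 2**15 - 1   # largest value of a signed 16-bit integer

def mod_add(x: int, y: int, q: int = MLKEM_Q) -> int:
    """Eager modular addition: reduce immediately to the range 0..q-1."""
    return (x + y) % q

def mod_mul(x: int, y: int, q: int = MLKEM_Q) -> int:
    """Eager modular multiplication: reduce immediately to the range 0..q-1."""
    return (x * y) % q

def max_unreduced_terms(q: int = MLKEM_Q, limit: int = INT16_MAX) -> int:
    """Worst case: how many canonical values (each at most q - 1) can be
    summed without reduction before the total could exceed `limit`."""
    return limit // (q - 1)

if __name__ == "__main__":
    print(mod_mul(294, 38))       # 11,172 % 3,329 = 1185, as in the text
    print(mod_add(3328, 1))       # wraps around to 0
    print(max_unreduced_terms())  # 9 terms fit in the worst case
```

The last function hints at the bookkeeping involved once reductions are skipped: with every input at its maximum of q - 1, only nine values can be accumulated in a signed 16-bit integer before an overflow becomes possible.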
Efficient ML-KEM implementations instead use “lazy” arithmetic modulo q: data is operated on without modular reduction for as long as possible, and only once there is a worst-case risk of overflow does reduction happen. Further, this allows the use of imperfect reduction algorithms such as Montgomery reduction, which are fast but don’t always give fully reduced outputs. In the case of ML-KEM, data modulo q = 3,329 is typically stored in signed 16-bit integers. When dealing with lazy arithmetic across the numerous arithmetic routines in ML-KEM, it is therefore essential to track the worst-case bounds of the data and insert modular reductions where those bounds would exceed the limits of 16-bit integers. Small mistakes in this domain can evade testing — because average bounds tend to be much smaller than worst-case bounds — and then randomly surface in production. Tracking buffer bounds and especially arithmetic bounds is time consuming and error prone: for example, weakening the output bounds of a low-level arithmetic function might lead to a rare arithmetic overflow in an entirely different function. Checking this by hand not only requires meticulous documentation and skilled auditors but also slows down development. In mlkem-native, we use a tool called the C Bounded Model Checker (CBMC) to automatically verify memory safety and type safety at the C level: for every function, we add machine- and human-readable contracts to the source code to specify the bounds of buffers and arithmetic data, and we have CBMC automatically verify that, with respect to those bounds, no buffer overflow or arithmetic overflow can happen. Let’s look at a simple example of modular reduction: Focusing on the relevant parts one at a time: First, note the __contract__( ... ) . Slightly simplified, the memory_no_alias and memory_slice lines specify which memory the code can read and write; this relates to memory safety. 
The ensures(array_bound(...)) clause relates to type safety: it specifies that the function will guarantee that upon return, the data is within the interval [0, q). In the proof, you see the __loop__(invariant(...)), specifying how the loop gradually establishes this bound: in the ith iteration, it holds up to the ith coefficient. Finally, the implementation effectively composes mlk_barrett_reduce and mlk_scalar_signed_to_unsigned_q. CBMC does not look inside these but replaces them with their contracts: You can see that mlk_barrett_reduce first establishes a symmetric output interval (-q/2, q/2), and then mlk_scalar_signed_to_unsigned_q maps it to [0, q). In this instance, it is easy to confirm by eye that the specifications line up in the desired way, but for more complex examples, this is less obvious. Either way, CBMC checks it for us automatically. Going fast, staying safe The CBMC proofs described above establish memory safety and type safety for mlkem-native's C code. However, the most performance-critical parts of mlkem-native — the Keccak permutation and number-theoretic transform — are implemented in hand-optimized assembly for AArch64 and x86_64. To gain assurance for the assembly implementations in mlkem-native while maintaining high performance, we use three components: SLOTHY, an assembly superoptimizer; HOL Light, a theorem prover; and s2n-bignum, a verification infrastructure for assembly built on HOL Light. Together, they enable a workflow where developers write clean, maintainable assembly, while deployed code achieves peak performance with formal guarantees of correctness. Writing high-performance assembly by hand creates a fundamental tension: clean, auditable code that clearly expresses the computation is slow, while fast code is dense, microarchitecture specific, and difficult to maintain. 
SLOTHY resolves this tension by automating microarchitecture-specific optimizations: it converts an assembly program into a constraint satisfaction problem, finds optimal instruction schedules and register allocations using a constraint solver, and outputs optimized assembly. Developers write clean code emphasizing the logic of the computation, and SLOTHY generates the fast code. We prove functional correctness for all AArch64 and x86_64 assembly routines using HOL Light and s2n-bignum. Where SLOTHY is used, the proofs are written to be agnostic to the specific instruction ordering and register allocation; we can therefore reoptimize the code for a specific microarchitecture without having to adjust the proofs. This “post-hoc” verification approach establishes the mathematical correctness of the computation represented by the assembly regardless of how it came about; in particular, SLOTHY is removed from the trusted computing base. Keeping it honest Formal verification is never absolute. Every proof links formal objects — specifications and models — to informal, real-world requirements and systems, and these links introduce gaps. Does the formal specification capture what we actually need? Does the formal model faithfully reflect the real system? Is the proof infrastructure itself sound? Earning and maintaining customer trust requires being transparent about these limits. We therefore developed and published a document titled SOUNDNESS.md, where we map out what is proved in mlkem-native, what is assumed, and where the residual risks lie — from the fidelity of the hardware models used in HOL Light proofs, to the larger trusted computing base of CBMC, to the manual bridge between the two verification stacks. For each gap, we describe mitigations in place and outline future work. Our goal is not to claim perfection but to earn trust through transparency. 
We encourage the community to read SOUNDNESS.md critically, challenge our assumptions, and help us close the remaining gaps. Getting on the road Mlkem-native is integrated into AWS-LC, Amazon's open-source cryptographic library, which underpins secure communication across AWS services. The integration uses an automated importer that pulls mlkem-native source code directly from the upstream repository, ensuring that AWS-LC stays synchronized with the latest verified implementation. The integration is designed for minimal friction: mlkem-native's modular architecture allows AWS-LC to import the core ML-KEM logic while providing its own implementations of platform-specific components. For example, AWS-LC maps mlkem-native's cryptographic primitives to its existing FIPS-202 (SHA-3) implementation, uses AWS-LC's random-number generation and memory zeroization functions, and enables FIPS-mode features like pairwise consistency tests when required. Enabling this is a thin compatibility layer that bridges mlkem-native's API to AWS-LC's infrastructure without modifying the verified code. Critically, the CBMC contracts that prove memory safety and type safety are preserved in the imported source code. While the preprocessor removes them from compiled binaries, they remain in the source as machine-checkable documentation of the code's guarantees — a form of "living proof" that travels with the implementation. Moreover, because both mlkem-native and AWS-LC are open source and permissively licensed, their benefits extend beyond AWS. Anyone can integrate mlkem-native into their systems and gain the same combination of performance and assurance. The formal verification artifacts — CBMC contracts and HOL Light proofs — are part of the repository, all tools involved are open source, and scripts are provided for setup and proof checking, inviting an independent validation of our security claims. 
Impact

The development of mlkem-native demonstrates that the three goals of cryptographic engineering (security, performance, and maintainability) are not in conflict when automated reasoning is applied systematically. CBMC freed us from manually tracking bounds through complex arithmetic, catching errors that would evade testing and surface randomly in production. The annotations stay in the source code as machine-checkable documentation, making the code simultaneously more maintainable and more secure. HOL Light and s2n-bignum allowed us to deploy aggressive assembly optimizations with mathematical certainty of correctness. SLOTHY let us write clean, auditable code while achieving peak performance for specific microarchitectures. And because the proofs are written to be optimization agnostic, we can retarget the code without redoing the verification. The result is an implementation that is simultaneously more secure, faster, and easier to maintain than what traditional development could achieve. We didn't compromise between customer security, customer experience, and our ability to innovate: automated reasoning delivered all three.

| Platform | Operation | AWS-LC-FIPS 3.1 | AWS-LC-FIPS 4.0 | Ratio |
|---|---|---|---|---|
| c7i | Keygen | 30899 | 65146 | 2.1 |
| c7i | Encaps | 30623 | 61233 | 2.0 |
| c7i | Decaps | 25141 | 51545 | 2.0 |
| c7g | Keygen | 29617 | 71134 | 2.4 |
| c7g | Encaps | 28482 | 66874 | 2.3 |
| c7g | Decaps | 23919 | 64765 | 2.3 |

Performance impact of switching from the ML-KEM reference implementation to mlkem-native in Amazon’s cryptography library AWS-LC. ML-KEM-768 performance is measured on c7i and c7g EC2 instances. The numbers represent operations per second (higher is better). The baseline is an AWS-LC-FIPS 3.1 release that contains the ML-KEM C reference implementation. The AWS-LC-FIPS 4 release is built with mlkem-native. The platforms are c7i with Intel(R) Xeon(R) Platinum 8488C and c7g with Graviton 3 processor.
Acknowledgments We thank our colleague John Harrison, senior principal applied scientist at the Automated Reasoning Group, for providing the bulk of the AArch64 assembly proofs in HOL Light and for maintaining the HOL Light interactive theorem prover and the s2n-bignum verification infrastructure. Mlkem-native is a collaborative effort involving not only AWS but many members of the open-source community. Foremost, we thank our co-maintainer Matthias Kannwischer from zeroRISC, who started mlkem-native with us and has since been instrumental in the success of the project.

Improving quality and robustness in LLM-based text-to-speech systems

Wed, 01 Apr 2026 18:13:19 GMT

Text-to-speech models based on large language models (LLMs) have gotten very good at producing natural-sounding speech, even in voices cloned from short audio files. But some problems with these models still persist. One is accent leakage in polyglot text to speech. It should be possible to transfer a voice recorded in English to French, German, or Spanish with the correct accent and without loss of voice identity. But with most systems, the reference speaker's native accent leaks into the target language, or the target language's accent overwrites characteristics of the speaker’s voice. Expressiveness is another challenge, including the laughs, sighs, hesitations, and other indications of emotion that make speech engaging. And then there’s reliability. Unlike traditional text-to-speech (TTS) systems, LLM-based systems are autoregressive, meaning they generate speech tokens one at a time, without explicitly modeling duration. This can cause hallucinated repetitions, unexpected cutoffs, and inconsistent pronunciation. At Amazon, we're working to address all these issues. Mitigating accent leakage in polyglot TTS We use a locale-specific data augmentation approach to address the problem of accent leakage. Specifically, we use low-rank adaptation (LoRA) to fine-tune our polyglot models on data that is heavily weighted toward target locales. This also allows us to do accent-free polyglot voice cloning: the cloned voice speaks the target language with native-like pronunciation but without loss of speaker identity. Improving expressiveness We use classifier-free guidance (CFG) to generate synthetic reference audio samples with enhanced expressiveness. Using these as conditioning during inference pushes the model toward more expressive prosodic styles. Originally developed for diffusion modeling, CFG controls how strongly generation follows conditioning. 
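As a rough illustration of how classifier-free guidance controls the strength of conditioning, the sketch below blends conditional and unconditional next-token logits with a guidance scale. The function name, the toy logits, and the scale values are assumptions made for illustration; the article does not say how CFG is wired into Amazon's models.

```python
def cfg_logits(cond, uncond, scale):
    """Blend logits as uncond + scale * (cond - uncond).

    scale = 0 ignores the conditioning, scale = 1 reproduces the
    conditional logits, and scale > 1 extrapolates past them.
    """
    if len(cond) != len(uncond):
        raise ValueError("logit vectors must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy logits over a three-token vocabulary (hypothetical values).
conditional = [2.0, 0.5, -1.0]
unconditional = [1.0, 1.0, 0.0]

for scale in (0.0, 1.0, 2.0):
    print(scale, cfg_logits(conditional, unconditional, scale))
```

Raising the scale widens the gap between tokens that the conditioning favors and tokens it does not, which is the sense in which CFG "controls how strongly generation follows conditioning."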
CFG-based reference samples decouple speaker identity from accent, teaching the model to preserve voice characteristics while adopting native pronunciation in the target language. This allows us to scale a small number of recorded voices to many new locales and languages, while increasing expressiveness. Scored according to MUSHRA (multiple stimuli with hidden reference and anchor) listening tests, the quality of our models’ polyglot outputs across nine locales spanning English, French, Italian, German, and Spanish improved 5% to 20% over those of our previous model family.

| Locale | Improvement over baseline |
|---|---|
| US-English | +12.43% |
| Southern US-English | +20.05% |
| Great Britain-English | +5.97% |
| Australia-English | +5.50% |
| US-Spanish | +11.78% |
| Spain-Spanish | +13.23% |
| France-French | +8.44% |
| Germany-German | +14.12% |
| Italy-Italian | +9.80% |

Robustness

Traditional TTS had failure modes, but hallucination and random truncation weren't chief among them. LLM-based TTS can generate confident-sounding speech that doesn't match the input, and it will sometimes stop mid-sentence.

Chain-of-thought for autoregressive TTS

Traditional TTS pipelines have explicit stages: grapheme-to-phoneme conversion, duration prediction, and acoustic generation. More recent, non-autoregressive end-to-end models like FastSpeech predict durations explicitly before speech generation. LLM-based TTS takes an alternate approach. Duration emerges implicitly from autoregressive generation. There's no explicit plan for how long the utterance should be or how long each phoneme should take. This is why these models hallucinate (keep generating past the intended content) or truncate (stop too early). To address this problem, we add chain-of-thought reasoning to the model: before generating speech tokens, the model predicts phoneme sequences and estimates duration (total length and per-phoneme timing). This isn't the same as traditional TTS pipelines.
Bolting duration prediction onto an autoregressive architecture is a different problem than building it into a non-autoregressive one, and it has its own challenges. Phoneme prediction enables the model to handle heteronyms ("read," "lead") and unusual names more reliably. Duration prediction gives the model a timing plan, which reduces both hallucination and truncation. These predictions are also useful for debugging, as you can see what the model "thought" it was going to generate before it started generating. Guardrails Our guardrails use the chain-of-thought predictions as checkpoints. We know the expected phoneme count and approximate speech duration before generation starts. After generation, we do a pair of checks: does the output duration match the prediction, and is the output length reasonable given the phoneme count? Large deviations flag likely hallucinations or truncations. When an agent detects problems, it can prompt the TTS system to regenerate with different sampling parameters or fall back to alternative approaches. Data filtering To filter the text data passing to the TTS model, we combine speech-recognition-based metrics with metrics based on the LLM’s attention mechanism. Automatic speech recognition (ASR) catches actual transcription errors. Taken together, the metrics keep data that's genuinely well aligned while preserving expressiveness that ASR-only filtering would discard. On generic long-form text, our full array of techniques reduces critical errors to an average of less than one second per hour, where “critical errors” include hallucinations, cutoffs beyond one word, and mismatches between input text and output speech. Conclusion LLM-based TTS models sound noticeably more natural than traditional systems. However, in our experience, they introduce new failure modes that need to be addressed before they can be deployed reliably in production. 
We have found that LoRA-based fine tuning addresses the heavy accent leakage observed in polyglot TTS, while classifier-free guidance is a useful tool for improving expressiveness. As for reliability, we find that smart data filtering and chain-of-thought reasoning coupled with guardrails and agentic regeneration can significantly reduce hallucination.
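The pair of guardrail checks described in the Guardrails section can be sketched as follows. This is a minimal illustration, not the production logic: the `Plan` record, the flag names, and every threshold (`rel_tol`, `min_s_per_phoneme`, `max_s_per_phoneme`) are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    """Chain-of-thought predictions made before speech generation."""
    phoneme_count: int   # predicted number of phonemes
    duration_s: float    # predicted total duration in seconds


def check_output(plan, actual_duration_s, rel_tol=0.25,
                 min_s_per_phoneme=0.03, max_s_per_phoneme=0.30):
    """Return guardrail flags; an empty list means both checks pass.

    All thresholds are hypothetical defaults for this sketch.
    """
    flags = []
    # Check 1: does the output duration match the predicted duration?
    if abs(actual_duration_s - plan.duration_s) > rel_tol * plan.duration_s:
        flags.append("duration_mismatch")
    # Check 2: is the output length plausible for the phoneme count?
    per_phoneme = actual_duration_s / max(plan.phoneme_count, 1)
    if per_phoneme > max_s_per_phoneme:
        flags.append("possible_hallucination")
    elif per_phoneme < min_s_per_phoneme:
        flags.append("possible_truncation")
    return flags


plan = Plan(phoneme_count=40, duration_s=3.2)
print(check_output(plan, 3.3))    # close to the plan
print(check_output(plan, 0.9))    # far too short for 40 phonemes
print(check_output(plan, 14.0))   # far too long for 40 phonemes
```

A non-empty result is the kind of signal an agent could use to trigger regeneration with different sampling parameters.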

Formally verified AES-XTS: The first AES algorithm to join s2n-bignum

Fri, 20 Mar 2026 16:38:44 GMT

Cryptographic encryption algorithms are mathematical procedures that transform readable data into ciphertext that looks like a stream of random bits. The ciphertext can be decrypted only with the corresponding decryption algorithm and the correct key. For data at rest — information stored on disks or in databases — algorithms like AES-XTS encrypt each block before it’s written to storage, protecting against physical theft or unauthorized access to storage systems. For data in transit — information traveling across networks — protocols like TLS combine multiple algorithms: asymmetric encryption algorithms (RSA or elliptic curves) establish secure connections, while fast symmetric encryption algorithms (like AES-GCM) protect the actual data stream and verify that it hasn't been tampered with. At Amazon Web Services (AWS), we use AES-XTS to protect customer data in services like EBS, Nitro cards, and DynamoDB, while TLS with AES-GCM secures all network communication between services and to customers. We took on the challenge of formally verifying an optimized Arm64 assembly implementation of AES-XTS decryption, where “formal verification” is the process of proving mathematically that an engineered system meets a particular specification. Our work follows the IEEE Standard 1619 for cryptographic protection of block-oriented storage devices and focuses on the AES-256-XTS variant of AES-XTS. The “256” specifies the size of the encryption key. Unlike algorithms that process fixed-size blocks, AES-XTS handles variable-length data from 16 bytes up to 16 megabytes, with special logic for incomplete blocks. The assembly code verified was a 5x-unrolled version, meaning that its loops were executed in parallel across five registers (each containing an input block), and it had been optimized for modern CPU pipelines. It was complex enough that manual review couldn't guarantee correctness yet critical enough that errors could compromise customer data security. 
As part of Amazon Web Services’ s2n-bignum library of formally verified big-number operations, we contributed an improved Arm64 assembly implementation of AES-XTS encryption and decryption, as well as a specification and formal verification using the HOL Light interactive theorem prover, which was developed by a member of our team (John Harrison). This was an experiment in the proof-driven development of a large function with multiple paths based on the input length. It resulted in the largest proof so far in the s2n-bignum library. For the typical input size of 512 bytes, the performance of the algorithm either stayed close to that of the original code or improved slightly on highly optimized Arm cores. By adding this algorithm and its proof to the s2n-bignum library, we pave the way for more AES-based algorithms to be added.

Description of the algorithm

AES is a block cipher that implements a keyed permutation. This means that it processes the plaintext in blocks (in this case, blocks of 128 bits), and for any given key, it defines a bijective (one-to-one and invertible) function mapping each plaintext block to a unique ciphertext block. This mathematical property ensures that decryption can uniquely recover the original plaintext. AES-XTS is the mode specifically designed for storage encryption. It uses AES as its underlying building block but adds position-dependent tweaks and ciphertext stealing (a method for handling partial blocks) to address the unique requirements of disk encryption, where you need random access to any sector and must preserve the exact data size. AES-XTS encrypts storage data using a two-key approach where each 128-bit block and its position-dependent tweak are subjected to an exclusive-OR operation (XOR), a binary operation that outputs a one only if the input values differ.
The result of the operation is encrypted with AES, then XORed with the tweak again, ensuring that identical data at different disk locations produces different ciphertext. The tweak is generated by encrypting the sector number with a second key, then multiplying by powers of α in a Galois field, creating unique values for each block position. When the final block isn't a full 128 bits, ciphertext stealing kicks in. Ciphertext stealing borrows bytes from the previous block, allowing encryption of data of any length without padding or wasted space. This lets you read or write any sector independently — critical for disk encryption — while basing each block's encryption on its position. That is a desired feature since the security model of disk encryption allows the adversary to access sectors other than those in question, modify them, and request their decryption. It also ensures that the size of the ciphertext is exactly the same as that of the plaintext, so it fits in its place. Control flow of the assembly implementation We started from an existing implementation of AES-XTS in Amazon’s AWS-LC cryptographic library. AES-XTS loops through the plaintext in 128-bit blocks, and encryption of each block requires 15 steps, each with its own “round key” derived from the encryption key. The existing implementation is 5x unrolled, meaning it processes blocks in parallel, five at a time. If the final block is less than 128 bits in length, there’s a risk of “buffer overread”, or reading beyond the limits of the input buffer. To avoid overread, the existing implementation does complex manipulation over the pointer to the current location in the input buffer. This requires a sophisticated control flow that can be hard to follow: the loop counter is incremented and decremented multiple times before and during the loop, and the loop has two additional exit points other than the final loop-back branch. 
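The XOR-encrypt-XOR structure, the tweak update, and ciphertext stealing from the algorithm description can be sketched in a few lines. This is a structural illustration only: `toy_encrypt` and `toy_decrypt` are hypothetical stand-ins for AES (a trivial keyed permutation with no security), and none of the function names come from AWS-LC or s2n-bignum. The tweak update follows the usual IEEE 1619 convention of a little-endian shift with reduction constant 0x87.

```python
BLOCK = 16
MASK128 = (1 << 128) - 1


def toy_encrypt(key, block):
    """Stand-in keyed permutation on 16 bytes. NOT AES and not secure."""
    mixed = bytes(b ^ k for b, k in zip(block, key))
    return mixed[1:] + mixed[:1]              # rotate left by one byte


def toy_decrypt(key, block):
    mixed = block[-1:] + block[:-1]           # rotate right by one byte
    return bytes(b ^ k for b, k in zip(mixed, key))


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def next_tweak(tweak):
    """Multiply the tweak by alpha in GF(2^128)."""
    t = int.from_bytes(tweak, "little") << 1
    if t >> 128:
        t = (t & MASK128) ^ 0x87
    return t.to_bytes(BLOCK, "little")


def xex(cipher, key, block, tweak):
    """XOR with the tweak, apply the block cipher, XOR with the tweak again."""
    return xor(cipher(key, xor(block, tweak)), tweak)


def xts_encrypt(key1, key2, sector, data):
    if len(data) < BLOCK:
        raise ValueError("XTS needs at least one full block")
    tweak = toy_encrypt(key2, sector.to_bytes(BLOCK, "little"))
    full, tail = divmod(len(data), BLOCK)
    out = []
    for j in range(full):
        out.append(xex(toy_encrypt, key1, data[j * BLOCK:(j + 1) * BLOCK], tweak))
        tweak = next_tweak(tweak)
    if tail:
        # Ciphertext stealing: pad the partial block with the end of the
        # previous ciphertext block, and emit only the start of that block.
        cc = out.pop()
        out.append(xex(toy_encrypt, key1, data[full * BLOCK:] + cc[tail:], tweak))
        out.append(cc[:tail])
    return b"".join(out)


def xts_decrypt(key1, key2, sector, data):
    if len(data) < BLOCK:
        raise ValueError("XTS needs at least one full block")
    tweak = toy_encrypt(key2, sector.to_bytes(BLOCK, "little"))
    full, tail = divmod(len(data), BLOCK)
    whole = full - 1 if tail else full
    out = []
    for j in range(whole):
        out.append(xex(toy_decrypt, key1, data[j * BLOCK:(j + 1) * BLOCK], tweak))
        tweak = next_tweak(tweak)
    if tail:
        # On decryption the last two tweaks are applied in swapped order.
        t_prev, t_last = tweak, next_tweak(tweak)
        pp = xex(toy_decrypt, key1, data[whole * BLOCK:full * BLOCK], t_last)
        out.append(xex(toy_decrypt, key1, data[full * BLOCK:] + pp[tail:], t_prev))
        out.append(pp[:tail])
    return b"".join(out)


key1, key2 = bytes(range(16)), bytes(range(16, 32))
message = bytes(range(40))                    # two full blocks plus 8 bytes
ciphertext = xts_encrypt(key1, key2, 7, message)
print(len(ciphertext) == len(message))
print(xts_decrypt(key1, key2, 7, ciphertext) == message)
```

The ciphertext has exactly the same length as the plaintext, which is the size-preservation property the description relies on.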
One exit point is for the case when four blocks remain during the final iteration of the loop; the other exit point is for the case of one to three blocks remaining. The flow of the loop interleaves the load/store instructions with the AES and XOR instructions in an effort to maximize pipeline usage. After the loop exit, the processing of the remaining blocks is intertwined for the lengths of four down to one; if there’s a partial block at the end, the algorithm performs the ciphertext-stealing procedure. Additionally, only seven of the 15 round keys were kept in registers; the other eight were repeatedly loaded from memory in the loop and outside it. We first investigated whether we could improve the performance of the main loop by letting SLOTHY, a superoptimizer for Arm code, rearrange the instructions to maximize pipeline usage. SLOTHY contains simplified models of various Arm microarchitectures. It uses a constraint solver to provide an optimal instruction schedule, register renaming, and periodic loop interleaving. However, for SLOTHY to identify and optimize a loop, the loop has to exhibit typical loop behavior, decreasing the counter at the end of each iteration and then jumping back to the beginning. SLOTHY also cannot handle the nested loop created by loading the eight round keys. This gave us a reason to start “cleaning up” the loop. First, we freed up registers to permanently hold all round keys; this was possible as the optimized order of instructions required fewer temporary registers than the original code. Second, we removed the multiple exit points and the manipulation of the loop counter to handle the remaining blocks. The value of the counter always indicates the number of five-block chunks remaining in the buffer, conforming to SLOTHY’s requirement; the loop ends before the handling of the remaining blocks.
Once the loop ends, we have a separate processing branch to handle each possible number of remaining blocks, from one to four; all four branches end in ciphertext stealing. This can be seen in the control flow graphs of the encrypt and decrypt algorithms (see below). Throughout the code, we maintained the constant-time design mindset; that is, branching and special processing are based not on secret data but only on public values, the input byte lengths. Performance With our modifications to the code, we were able to use SLOTHY to optimize the encrypt algorithm. This resulted in slight performance gains on the AWS Graviton family of Arm processors, although the gains were smaller on the more advanced chips, which have an optimized out-of-order pipeline. Compared to the original implementation, keeping round keys in registers throughout the algorithm’s execution, to save on loading them from memory, allowed us to offset the effects of not interleaving the AES instructions with other ones. Having a cleaner flow of instructions in the loop and modular exit processing allowed us to experiment with various unrolling factors for the loop iterations. We experimented with 3x, 4x, and 6x factors and concluded that 5x is still the best choice across various microarchitectures. Ensuring correctness through formal verification To deploy optimized cryptographic code in production, we need mathematical certainty that it works correctly. While random testing quickly checks simple cases, we rely on formal verification to deliver the highest level of assurance for our AES-XTS implementation. Why HOL Light for AES-XTS? To prove that our implementation matches the IEEE 1619 specification, we use HOL Light, an interactive theorem prover developed by our colleague John Harrison. HOL Light is a particularly simple implementation of the "correct by construction" approach to software development, in which code is verified as it’s written. 
HOL Light’s trusted kernel is just a few hundred lines of code, which implements basic logical inference rules. This means that even if there's a bug in our proof tactics or automation, it cannot cause HOL Light to accept an incorrect proof. At worst, a bug prevents us from completing a proof, but it cannot make a false statement provable. We chose HOL Light for several reasons specific to AES-XTS verification: Assembly-level verification: We write our implementations directly in assembly rather than relying on compiled code. While more challenging, this makes our proofs independent of any compiler. HOL Light reasons directly about machine code bytes using CPU instruction specifications, providing assurance at the lowest level of the software stack. Existing cryptographic infrastructure: S2n-bignum already provides extensive support for cryptographic verification, including symbolic simulation that strips away execution artifacts and leaves purely mathematical problems, specialized tactics for word operations, and byte list handling. We add proven lemmas about AES operations that we can reuse for the proofs of other AES modes. Complex control flow handling: Unlike fully automated provers that might fail on complex proofs without enough explanation, HOL Light's interactive approach lets us guide proofs through the intricate invariants required for our 5x-unrolled loops, processing arbitrarily long blocks of data and performing the complex memory reasoning required by variable-length inputs and partial blocks. The s2n-bignum framework Using s2n-bignum to implement AES-XTS serves two purposes: it's both a framework for formally verifying assembly code in x86-64 and Arm architectures and a collection of fast, verified assembly functions for cryptography. 
The library already contains verified implementations of numerous cryptographic algorithms, especially those pertaining to big-number mathematical operations (hence the name), which are the foundation of public-key cryptographic primitives. For details on how HOL Light was used to prove public-key algorithms as part of s2n-bignum, please refer to the previous Amazon Science blog posts “ Formal verification makes RSA faster — and faster to deploy” and “Better-performing ‘25519’ elliptic-curve cryptography”. As we mentioned, AES-XTS is one of the modes of the AES block cipher. AES is based on a substitution-permutation network (SPN) structure, which combines substitution operations (SubBytes using the S-box), permutation operations (ShiftRows, MixColumns), and key mixing. By expanding s2n-bignum to include the AES instruction set architecture (ISA) found in Arm64 and x86_64 processors, specifications for the AES block cipher, and additional specifications for AES-XTS, we're paving the way for the same rigorous verification of more AES-based algorithms. Developing and testing the specification The SPN nature of AES and the modes that are based on it cannot be expressed using simple mathematical formulae — such as modular multiplication, which is fundamental to public-key cryptography — that can be innately understood by a theorem prover. They require writing descriptions of the steps for processing the data. This is why, before verifying the assembly, we needed confidence that our HOL Light specification accurately captured the IEEE standard. We wrote the specification to mirror the standard's structure, using byte lists for input/output and 128-bit words for internal block operations. Then we developed conversions, HOL Light functions that we used to evaluate specifications with concrete inputs while generating proofs that the evaluations are mathematically correct. 
We validated our specification by conducting unit tests that cover different AES-XTS encryption/decryption scenarios, exercising the processing of all blocks (using recursion) and ciphertext stealing. These tests confirmed that our specification matched the IEEE standard before we tackled the more complex assembly verification. This two-phase approach — first ensuring that the specification is correct through testing, then formally verifying that the implementation matches the specification — gave us confidence we were proving the right thing. The proof strategy Our proofs are compositional, meaning they break the overall problem into subproblems that can be proved separately. Depending on the subproblem, the subproofs can be bounded — true only for a range of inputs — or unbounded. For inputs with fewer than five (or six, in the case of decrypt) blocks, we wrote bounded proofs that exhaustively verify each case. For inputs with five (six, in the case of decrypt) or more blocks, we developed loop invariants — mathematical statements that remain true throughout loop execution — to prove correctness for arbitrarily long inputs. The loop invariants track three critical factors until the loop exit condition is met: register states at each iteration, the evolution of "tweaks" (which make each block's encryption unique), and memory contents as blocks are processed. For partial-block (tail) handling, we proved a separate theorem for ciphertext stealing that could be reused across all cases. The top-level correctness theorem composes all proofs together, asserting the following statement: Memory safety and constant-time proofs Most recently, s2n-bignum was equipped with new functions and tactics for formally defining the constant-time and memory safety properties of assembly functions. 
With these resources, many assembly subroutines in s2n-bignum were verified to be constant time and memory safe, including top-level scalar-multiplication functions in elliptic curves, big-integer arithmetic for RSA, and the Arm implementation of the ML-KEM cryptography standard (the subject of a forthcoming blog post on Amazon Science). All assembly subroutines identified for use in AWS-LC as of October 2025 were formally verified to be constant time and memory safe. We are exploring whether the new tactics can easily be used to verify assembly subroutines that have subsequently been added, such as AES-XTS. As we mentioned, AES-XTS has a remarkably complex control flow, which resulted in a long and involved functional-correctness proof. That complexity is also a challenge for safety proofs. The process is ongoing, but we have already proved safety properties for the ciphertext-stealing subroutines of the decryption and encryption algorithms. These first proofs focused on crucial memory access procedures that are prone to buffer overread. Proofs for the remaining parts of the decryption and encryption algorithms can use the same methodology, where the constant-time and memory-safety proofs follow the same structure as the functional-correctness proofs but are simpler, since their proof goal is more focused. Continuous assurance of correctness We've integrated formal verification into s2n-bignum's continuous-integration (CI) workflow. This provides assurance that no changes to our AES-XTS implementation can be committed without successfully passing a formal proof of correctness. As part of CI, the CPU instruction modeling is validated through randomized testing against real hardware, "fuzzing out" inaccuracies to ensure our specifications are correct and the proofs hold in practice. Furthermore, the proof guarantees correctness for all possible inputs, since they’re represented in the proof as symbols. 
This overcomes the typical shortcoming of coverage testing, which may cover all paths of the code but may not be able to cover all input values. For example, a constant-time code, like the one used here, is written without branching on secret values. Typically, then, secret values are incorporated into the operation through the use of masks derived from them. The same instructions are executed irrespective of the secret value. Hence, achieving line coverage is usually within the reach of a developer, but achieving value coverage is left to the formal verification of correctness. This same methodology has enabled AWS to deploy optimized cryptographic implementations with mathematical guarantees of correctness while achieving significant performance improvements. This allows the developer and tools to further optimize the code freely without worrying about introducing bugs, since these will be automatically caught by the proof. Our experience with AES-XTS shows that proof-driven development of assembly code yields a control flow that is easier to understand, review, maintain, and optimize while never ceasing to be provably correct.
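The mask-based style mentioned above can be illustrated with a branch-free select. This is only a sketch of the data flow: Python integers carry no timing guarantees, and `ct_select` is a hypothetical helper, not a function from s2n-bignum or AWS-LC.

```python
MASK64 = (1 << 64) - 1


def ct_select(bit, a, b):
    """Return a if bit == 1 else b, without branching on bit (64-bit words)."""
    mask = (-bit) & MASK64          # all ones if bit == 1, all zeros if bit == 0
    return (a & mask) | (b & ~mask & MASK64)


# Both calls execute exactly the same statements, so a single test already
# reaches full line coverage, yet it exercises only one value of the mask.
print(hex(ct_select(1, 0xAAAA, 0x5555)))
print(hex(ct_select(0, 0xAAAA, 0x5555)))
```

Because no line is skipped for either value of `bit`, line coverage says nothing about whether the masked arithmetic is right for all inputs, which is the gap that a symbolic proof closes.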

Optimizing LoRA target module selection for efficient fine tuning

Thu, 19 Mar 2026 14:39:23 GMT

Fine-tuning a large language model (LLM) on a specific task requires updates to billions of parameters across trillions of tokens, with the attendant costs in GPU resources and time. Low-rank adaptation (LoRA) is a more efficient alternative that freezes the original model weights but introduces lightweight matrices into specific model sublayers, or “modules”. These matrices (commonly referred to as “adapters”) modify the modules’ weights, enabling not only efficient fine tuning but also on-demand model serving, which dramatically lowers inference costs; base-model sharing across GPUs, which cuts memory requirements; lower download overhead; and parallel inference across multiple adapters. The question is where to insert these adapters across the model. Empirically, targeting more and larger modules tends to boost performance, because it allows more flexibility in customization; but it also increases training and inference costs. Using a smaller, well-chosen subset preserves most gains with significantly better efficiency. Using Amazon’s Nova 2.0 Lite multimodal reasoning LLM as our base model, we set ourselves the goal of identifying a subset of standardized target-module configurations that works effectively across the vast majority of customer use cases. Through an ablation study, we identified a module known as o_proj as the single module where adding an adapter achieves the best trade-off between efficiency and accuracy. (The o_proj module is a linear transformation that mixes representations across attention heads into a single, cohesive form for the rest of the model to use.)

The Transformer architecture

Transformer models, the models responsible for all of AI’s remarkable recent gains, consist largely of blocks that are repeated multiple times.
Each block in turn has two main components: an attention mechanism, which determines the relevance of previously seen tokens to the token currently being processed, and a feed-forward network, a conventional neural network that does additional processing on the outputs of the attention mechanism. The attention mechanism involves three different matrices, which take their names from database design: the query matrix represents how relevant the current token is to the other tokens in the input sequence; the key matrix represents how relevant other tokens are to one another; and the value matrix represents the raw content of those other tokens. Multiplying the three matrices together creates, essentially, a recipe for the Transformer's next output. To reduce computational complexity, these multiplications take place in a space with reduced dimensions. The matrices themselves and the results of their multiplication then have to be projected back up to the original dimensions of the input. LoRA approximates weight updates using a product of two smaller matrices, drastically reducing the number of trainable parameters. The technique is typically applied to attention projection layers and feed-forward network layers. These modules are ideal candidates because they constitute the bulk of Transformer parameters, directly govern representation learning, and exhibit natural alignment with low-rank approximations. Empirical evidence shows weight changes in these layers often lie within a low-dimensional subspace during fine tuning. Target module selection Selecting the right target modules directly affects accuracy, latency, and computational efficiency. The optimal choice of target modules is primarily a function of (a) the base model being fine-tuned (i.e., its architecture, pre- and post-training data distributions, etc.) and (b) customization domain/modality. 
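To make the low-rank update concrete, the sketch below applies an adapter to a frozen weight matrix and compares trainable-parameter counts. It assumes the standard LoRA formulation, in which the effective weight is W + (alpha / r) · B · A; the layer dimensions, rank, and alpha are hypothetical and are not Nova 2.0 Lite's actual values.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(w, b, a, alpha, r):
    """Return W + (alpha / r) * B @ A, with W: d x k, B: d x r, A: r x k."""
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d, k, r):
    """Parameters updated by full fine tuning versus a rank-r adapter."""
    return {"full": d * k, "lora": r * (d + k)}


# Tiny worked example: a 2 x 2 frozen weight with a rank-1 adapter.
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]
a = [[0.5, 0.0]]
print(lora_effective_weight(w, b, a, alpha=2, r=1))

# Hypothetical 4096 x 4096 projection layer with a rank-16 adapter.
print(trainable_params(4096, 4096, 16))
```

For the hypothetical 4096 x 4096 layer, the adapter trains 131,072 parameters instead of 16,777,216. Each additional target module adds its own pair of adapter matrices, which is why the choice of modules affects cost.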
When fine-tuning Nova 2.0 Lite, we balanced two competing objectives: maximizing accuracy across diverse tasks and modalities and minimizing latency to preserve LoRA's efficiency benefits. We investigated the application of LoRA to four different modules in each Transformer block: the query, key, and value projection layers (qkv); the o_proj layer; and two different fully connected layers in the feed-forward network, gate_up_proj and gate_down_proj (referred to as fc1 and fc2). Below are the trade-offs for these modules, both singly and in combination, based on results published in the literature and empirical studies.

| Combination | Expected accuracy | Expected latency | Use case |
|---|---|---|---|
| qkv only | Good (baseline) | Lowest | Resource-constrained environments; tasks where attention mechanisms are critical (e.g., classification, lightweight generation); prioritizes speed over maximum accuracy |
| o_proj only | Moderate | Lowest | Ultralow-latency scenarios; tasks where refining attention outputs is sufficient (e.g., simple sentiment analysis); plays an important role in reasoning; less effective than qkv, but very efficient |
| qkv + o_proj | High | Low to moderate (+5–10%) | Attention-focused tasks (e.g., machine translation, summarization); balances refinement of both attention context (o_proj) and query/key/value projections (qkv); best accuracy-to-latency ratio for most NLP tasks |
| qkv + fc1 / fc2 | Very high (close to full fine tuning) | Moderate (+10–15%) | Complex generation tasks (e.g., translation, long-form summarization); when feed-forward layers (fc1/fc2) significantly influence output quality as they store and retrieve factual knowledge; prioritizes accuracy over speed |
| o_proj + fc1 / fc2 | Good to high | Moderate (+5–10%) | Tasks requiring adaptation of both attention output (o_proj) and feed-forward layers (e.g., text classification, sentiment analysis); suitable when qkv adaptation is unnecessary |
| qkv + o_proj + fc1 / fc2 | Highest (near-full fine tuning) | High (+15–20%) | Maximum accuracy for critical tasks (e.g., research benchmarks, high-stakes generation); when all components of the Transformer block need adaptation; avoid for production if latency matters |
| All modules (qkv, o_proj, fc1, fc2) | Maximum | Highest (+20–25%) | Prototyping/research with no latency constraints; rarely justified in practice; marginal gains over qkv + o_proj + fc1/fc2 |

Trade-offs of accuracy and latency across target modules, based on literature review and empirical evidence.

Experimental methodology

We conducted a comprehensive ablation study, training multiple supervised-fine-tuning (SFT) LoRA variants on seven datasets spanning both text and visual data, across reasoning (i.e., the training datasets themselves include reasoning content) and non-reasoning tasks. The datasets covered diverse challenges from simple question answering to long-context summarization and structured JSON extraction.

| Dataset | Modality | Reasoning traces | Domain | Tasks | Training size | Eval size | Eval metric | Source |
|---|---|---|---|---|---|---|---|---|
| FinCOT | Txt | Yes | Finance | Financial-reasoning dataset. Samples consist of complex financial queries, along with reasoning traces obtained from GPT-4o. Predictions are typically complex tables or calculations based on the input. | 7436 | 1147 | Accuracy | https://huggingface.co/datasets/TheFinAI/FinCoT |
| GovReport | Txt | No | Government Doc | Large-context (30-40K tokens) summarization | 17457 | 837 | RougeLsum | https://gov-report-data.github.io/ |
| MedMCQA | Txt | No | Medical | Dataset for multiple-choice QA (also used in Nova 1.0) | 20k | 3683 | Accuracy | https://huggingface.co/datasets/openlifescienceai/medmcqa |
| MedReason | Txt | Yes | Medical | Medical-reasoning dataset that consists of questions and answers compiled from various medical benchmarks (MedQA, MedMCQA, etc.), along with synthetic, high-quality reasoning traces. (This uses the same eval set as MedMCQA.) | 31682 | 3683 | Accuracy | https://huggingface.co/datasets/UCSC-VLAA/MedReason |
| CoCoHD | Txt | No | Political Doc | A complex benchmark consisting of large-context (>20K tokens) transcripts of congressional hearings. The output is expected to be a summary in a specific JSON format, consisting of the members present, topic discussed, outcomes, etc. | 732 | 1053 | Averaged key and value match rate | https://github.com/gtfintechlab/CoCoHD |
| Llava-COT | Image | Yes | Image understanding, General/Science | Multimodal, image benchmark consisting of Q&A reasoning questions. The dataset includes high-quality reasoning traces. | 10k | 270 | Exact match rate | https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k |
| Invoice OCR | Image | No | Image understanding | OCR benchmark that takes an input image and produces a JSON file with fields from the image. | 1400 | 447 | Accuracy | |

Summary of the experiment datasets

All experiments used the Nova 2.0 Lite general-availability checkpoint with consistent hyperparameters across target modules, including learning-rate ratio and alpha values. 
| Target dataset | Setting | SFT LoRA target performance | Nova 2.0 Lite performance |
|---|---|---|---|
| Fin-COT | qkv | 67.09% | 72.12% |
| | o_proj | 68.30% | |
| | fc1 | 75.35% | |
| | fc2 | 60.24% | |
| | o_proj + fc1 | 61.38% | |
| | qkv + fc2 | 60.31% | |
| | o_proj + fc2 | 62.79% | |
| | qkv + fc1 | 68.37% | |
| | All target modules | 66.15% | |
| CoCoHD | qkv | 19.64% | 45.14% |
| | o_proj | 65.88% | |
| | fc1 | 41.96% | |
| | fc2 | 17.62% | |
| | o_proj + fc1 | 76.83% | |
| | qkv + fc2 | 66.47% | |
| | o_proj + fc2 | 79.14% | |
| | qkv + fc1 | 45.45% | |
| | All target modules | 82.75% | |
| GovReport | o_proj | 41.25% | 38.90% |
| | fc1 | 39.69% | |
| | o_proj + fc1 | 41.74% | |
| | o_proj + fc2 | 42.16% | |
| | qkv + fc1 | 41.66% | |
| | qkv + fc2 | 39.02% | |
| | All target modules | 41.95% | |
| Llava-COT | qkv | 64.26% | 16.22% |
| | o_proj | 64.26% | |
| | fc1 | 65.92% | |
| | fc2 | 65.02% | |
| | o_proj + fc1 | 63.21% | |
| | qkv + fc2 | 62.76% | |
| | o_proj + fc2 | 66.37% | |
| | qkv + fc1 | 66.52% | |
| | All target modules | 63.96% | |
| Invoice OCR | o_proj | 89.07% | 14.10% |
| | o_proj + fc1 | 90.03% | |
| | qkv + fc2 | 87.84% | |
| | o_proj + fc2 | 89.47% | |
| | qkv + fc1 | 88.55% | |
| | All target modules | 90.11% | |
| MedReason | o_proj | 24.55% | 1.68% |
| | o_proj + fc1 | 20.88% | |
| | qkv + fc2 | 8.39% | |
| | o_proj + fc2 | 20.36% | |
| | qkv + fc1 | 4.32% | |
| | All target modules | 26.72% | |
| MedMCQA | qkv | 62.18% | 1.68% |
| | o_proj | 63.10% | |
| | fc1 | 12.90% | |
| | fc2 | 59.98% | |
| | o_proj + fc1 | 61.39% | |
| | qkv + fc2 | 65.63% | |
| | o_proj + fc2 | 64.95% | |
| | qkv + fc1 | 57.21% | |
| | All target modules | 66.11% | |

Ablation study for target module selection. Some benchmarks have fewer variations, to save on computation and time. MedMCQA and MedReason use the MedMCQA test set for evaluation. On this task, Nova 2.0 Lite fails mainly due to formatting inconsistencies, even though it produces the right answer. For consistency’s sake, we use the same strict parser for SFT models.

Key findings

1. O_proj is the most robust single target
The o_proj-only configuration demonstrated remarkable consistency, never failing outright on any task and typically performing within a few percentage points of the best configuration (i.e., using all target modules). On MedMCQA, CoCoHD, GovReport, LLaVA-CoT, and Invoice OCR, o_proj-only either matched or came very close to optimal performance, making it an attractive default choice that balances performance and simplicity. There is emerging evidence that this module plays a key role in reasoning, which may explain its effectiveness here.

2. Qkv-only shows instability
While qkv-only performed well on MedMCQA, it exhibited extreme variability, performing below baseline on CoCoHD and showing unremarkable results elsewhere. This aligns with the hypothesis that attention-only LoRA can underfit on tasks requiring richer features from the feed-forward network, rather than relying on modified token routing.

3. Module combinations provide modest gains
Combinations like o_proj + fc2 or "all target modules" often achieved the highest per-dataset scores (particularly on CoCoHD, MedReason, and Invoice OCR). However, improvements over the best single module were typically modest, usually 1-3 percentage points.

4. Task difficulty amplifies configuration impact
On challenging benchmarks where the base model performed poorly, the choice of target modules had greater impact. For example, on CoCoHD (long-context, complex JSON generation), o_proj + fc2 achieved a +15% absolute improvement over the base model, compared to only +3% with o_proj alone.

5. LoRA consistently outperforms base models
Across nearly all datasets, any reasonable LoRA configuration dramatically outperformed the base model. For instance, MedReason, MedMCQA, LLaVA-CoT, and Invoice OCR showed improvements from a baseline accuracy of ~1-16% to 60-90%+ with LoRA. The notable exception was Fin-COT, where only certain configurations (notably fc1) exceeded baseline performance, suggesting task-specific sensitivity to adaptation strategy.

Recommendations

For accuracy-prioritized scenarios, we recommend o_proj + fc2 as the optimal configuration for both text and multimodal tasks, showing 2-12% improvements over o_proj alone across benchmarks. For balanced efficiency and performance, o_proj-only provides an excellent default, offering robust performance with minimal latency overhead, which is particularly valuable when serving multiple adapters or operating under resource constraints. 
For challenging tasks, such as benchmarks with long context or complex generation requirements or other tasks where base models struggle, the additional accuracy from o_proj + fc2 justifies the modest latency increase. Future directions Our research opens several promising avenues for further optimization: Modality and task-specific configurations: Segmenting target module selection by modality and task difficulty (e.g., long-context scenarios) could yield specialized configurations with better accuracy-latency trade-offs. Per-module hyperparameter optimization: Extensive hyperparameter optimization for each target module configuration could unlock additional performance gains, though computational costs remain a consideration. Two-stage LoRA for early candidate identification: Leveraging two-stage LoRA approaches that use training dynamics, gradients, etc., to determine the importance of different modules/layers could help identify promising configurations early in training, reducing the cost of comprehensive hyperparameter searches. Layer pruning for latency reduction: Using two-stage training to identify and prune unused layers could further reduce inference latency while maintaining accuracy. Conclusion Our comprehensive study demonstrates that thoughtful target module selection in LoRA fine tuning can improve accuracy while preserving the efficiency advantages that make LoRA attractive for production deployments. The o_proj layer emerges as a remarkably robust single target, while o_proj + fc2 combinations offer the best accuracy for challenging tasks. On average, o_proj LoRA is within 2% of o_proj + fc2 in terms of accuracy but has 22.6% lower latency (TPOT p95 decreases from 10.085ms → 7.803ms). These findings provide a principled foundation for standardizing LoRA configurations across diverse customer use cases, balancing the competing demands of model performance and computational efficiency. 
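The averages quoted in the conclusion can be checked with a few lines of arithmetic. The sketch below copies the o_proj and o_proj + fc2 scores from the ablation table and the two TPOT p95 figures from the text; it assumes the "on average" gap is a simple unweighted mean over the seven benchmarks, which the article does not state explicitly.

```python
from statistics import mean

# (o_proj only, o_proj + fc2) scores in percent, copied from the ablation table.
RESULTS = {
    "Fin-COT": (68.30, 62.79),
    "CoCoHD": (65.88, 79.14),
    "GovReport": (41.25, 42.16),
    "Llava-COT": (64.26, 66.37),
    "Invoice OCR": (89.07, 89.47),
    "MedReason": (24.55, 20.36),
    "MedMCQA": (63.10, 64.95),
}

# TPOT p95 latencies in milliseconds, copied from the conclusion.
TPOT_P95_MS = {"o_proj + fc2": 10.085, "o_proj": 7.803}


def mean_accuracy_gap(results):
    """Unweighted mean of (o_proj + fc2) minus unweighted mean of o_proj, in points."""
    o_proj_mean = mean(o for o, _ in results.values())
    combo_mean = mean(c for _, c in results.values())
    return combo_mean - o_proj_mean


def relative_reduction(slow, fast):
    """Percentage by which `fast` is lower than `slow`."""
    return (slow - fast) / slow * 100


gap = mean_accuracy_gap(RESULTS)
latency_cut = relative_reduction(TPOT_P95_MS["o_proj + fc2"], TPOT_P95_MS["o_proj"])
print(f"mean accuracy gap: {gap:.2f} points; latency reduction: {latency_cut:.1f}%")
```

Under that assumption the mean gap comes out at roughly 1.26 points, and the latency figures reproduce the 22.6% reduction. The per-dataset gaps vary widely, though (o_proj + fc2 is well ahead on CoCoHD and behind on Fin-COT and MedReason), so the average is best read alongside the individual rows.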
Acknowledgements: Kevin Rondinone, Kevin Chen, Nicole Ding, Sebastian Massella, Andy Li

How agentic AI helps heal the systems we can’t replace

Mon, 16 Mar 2026 13:00:00 GMT

Many of the world’s most important systems — the ones that move money, book flights, issue licenses, and process claims — are slow, brittle, and deeply outdated. Built decades ago and extended repeatedly, they now sit at the center of workflows too vital to pause, take offline, rebuild, or replace. Inside Amazon’s Artificial General Intelligence (AGI) Lab, teams train agents not on idealized interfaces but on high-fidelity simulations of such legacy systems. Learning the real behaviors of these systems — the quirks, delays, error states, and invisible dependencies — makes possible a different kind of innovation, one that grows from the systems we have instead of requiring their replacement. And by managing the idiosyncrasies of legacy systems behind the scenes, the agent effectively becomes a universal API — a single interface that the customer can use to perform a wide range of special-purpose tasks. The legacy systems that power everyday life Step behind the scenes of any large institution — a bank, an insurer, a hospital, a state agency — and you’ll find the same thing: an invisible layer of human labor holding software together. People know which buttons must be clicked in which order, which warnings can be ignored, which fields must be entered twice, and which screens must never be refreshed. The institutional knowledge required to navigate these eccentricities is passed down like the oral traditions of legacy systems. Much of the infrastructure beneath these workflows is older than the people managing it. The software backbone of modern finance, insurance, travel, scientific research, and public services took shape in the 1960s and ’70s, built on mainframe architectures and written in languages like COBOL and FORTRAN — designed for stability rather than adaptability. When the web arrived, institutions didn’t rebuild. They wrapped. 
Web forms fed mainframe jobs, middleware translated modern inputs into decades-old formats, and enterprise portals accumulated atop business rules that were never rewritten. Over time, modernization settled into layers: a mainframe instruction set at the bottom; a 1990s database above it; a 2000s portal above that; and a modern web interface masking everything beneath. A single transaction today might pass through all these layers, with scripts, connectors, and integrations holding them together in ways no one fully understands. Attempts to replace these systems routinely stall. Dependencies that no one knew existed surface, migrations fail, budgets spiral, and public-sector modernization efforts collapse under their own complexity. These systems cannot be taken offline, which means institutions must keep operating them no matter how brittle they become. For Amazon, this is one of the most compelling applications of agentic AI: navigating not the polished surfaces of web-era consumer apps but the deep, slow-moving architectures that keep institutions running.

Learning the bad to heal the bad

The hardest part of training an AI agent is not teaching it what a successful workflow looks like; it’s teaching it why workflows fail. The logic behind legacy systems reveals itself most clearly through friction: the modal (mandatory) window that appears late because it encodes a sequencing rule; the field that refuses input until another value is saved; the form that resets because a backend job restarted midflow. These behaviors aren’t glitches. They are the real semantics of the system. Researchers at Amazon’s AGI Lab seek this friction out. To surface failure modes safely and repeatedly, Amazon trains agents inside reinforcement learning (RL) gyms, synthetic environments designed to reproduce the quirks, delays, and ordering rules embedded in real workflows. 
These include synthetic web environments that simulate systems like state agencies, airline bookings, and specialized tax- and benefits-processing, among hundreds of others. Jason Laster, an AGI software engineer who works on agent training and replay systems, puts it plainly: “I want to push our RL training gyms to have all of the warts, all of the issues.” This is what “learning the bad to heal the bad” means: training an agent on the full spectrum of a system’s true behavior, including flaws, inconsistencies, delays, and all the embedded histories humans have quietly adapted to. By exposing agents to the same brokenness people navigate every day, Amazon trains them to move beyond surface correctness and understand the deeper logic beneath the interface. Agents as a new interface layer Once an agent can reliably navigate the idiosyncrasies of legacy interfaces, something more interesting begins to happen. Researchers have observed agents inferring not just what to click next but why — the latent workflow the interface expresses. In one simulated benefits application environment, an agent that realized it had added only one dependent was able to navigate back, correct the omission, and resume the flow without restarting — an early sign of understanding the nature of the system. For lab members, this marks an architectural turning point. Many institutional systems simply don’t expose APIs that reflect how real workflows behave; the only faithful expression of the logic is the interface itself. As Laster puts it, “the UI was designed to be discoverable, learnable — even if it’s bad.” When agents learn that layer deeply enough to predict outcomes and recover from failures, they begin to function as a kind of synthetic API — a stable, programmatic surface over infrastructure that can’t be changed. 
That shift enables new architectural possibilities: Stable semantics over unstable UIs: Agents turn inconsistent behaviors — delays, re-renders, partial saves — into predictable patterns. Cross-system abstraction: Because the agent reasons about the workflow rather than the application, it can bridge systems never designed to interoperate. Incremental modernization: Institutions can update components gradually without breaking workflows; the agent absorbs transitional fragility. Preservation of institutional logic: Agents retain operational knowledge otherwise stored only in human memory — rules, sequences, dependencies no one has documented. This is not workflow automation. It is a new interface layer for the world’s oldest systems — an upgrade path that doesn’t require turning anything off. The work ahead Agentic AI will not replace the administrative tasks that structure daily life — booking vacations, renewing licenses, scheduling medical appointments — but it can help make them more efficient by allowing the evolution of systems once too fragile to change. That fragility is becoming more acute. The programmers who built the institutional backbone of the 1960s and ’70s — COBOL batch jobs, FORTRAN routines, mainframe integrations — are retiring. Few younger developers learn these languages, and the knowledge embedded in those systems grows harder to access each year. Critical workflows now run atop software whose inner workings fewer and fewer people understand. Agents offer a different form of continuity. By learning how these systems behave — not from lost documentation but from the systems themselves — they can preserve operational logic that would otherwise disappear. They can stabilize workflows sitting atop code no one can safely rewrite and carry forward institutional knowledge that would otherwise age out of the workforce. In that sense, “the work ahead” is twofold. 
There is the technical work of building agents that can meet the reliability these environments demand. And there is the human work that becomes newly possible when people are no longer trapped inside brittle interfaces — work grounded in judgment, coordination, empathy, and design rather than memorizing which field must be entered twice. Agents will not rebuild the foundations of our digital world. But they may rebuild something else: our notion that innovation comes only from replacement. By turning brittle systems into stable platforms, agents offer a new model of progress — one that grows from what already works.
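To make the idea of a training gym "with all of the warts" more concrete, here is a deliberately tiny sketch of a simulated legacy form that encodes two of the quirks described earlier: a field that refuses input until another value is saved, and a modal window that appears late and blocks everything until dismissed. The class, field names, actions, and reward values are all hypothetical; this is not Amazon's environment or API, only an illustration of the kind of behavior such a simulation might reproduce.

```python
class LegacyFormGym:
    """Toy stand-in for a legacy form with an ordering rule and a late modal."""

    REQUIRED = ("applicant_id", "dependents")  # hypothetical field names

    def __init__(self):
        self.reset()

    def reset(self):
        self.saved = {}
        self.modal_open = False
        self.submitted = False
        return self.observe()

    def observe(self):
        return {
            "saved": dict(self.saved),
            "modal_open": self.modal_open,
            "submitted": self.submitted,
        }

    def step(self, action, field=None, value=None):
        """Apply one UI action and return (observation, reward, done)."""
        # Quirk 1: an open modal silently blocks every other action.
        if self.modal_open and action != "dismiss_modal":
            return self.observe(), -1.0, False
        if action == "dismiss_modal":
            self.modal_open = False
            return self.observe(), 0.0, False
        if action == "save":
            # Quirk 2: 'dependents' refuses input until 'applicant_id' is saved.
            if field == "dependents" and "applicant_id" not in self.saved:
                return self.observe(), -1.0, False
            self.saved[field] = value
            if field == "dependents":
                self.modal_open = True  # the modal appears only after this save
            return self.observe(), 0.0, False
        if action == "submit":
            complete = all(name in self.saved for name in self.REQUIRED)
            self.submitted = complete
            return self.observe(), (10.0 if complete else -1.0), complete
        raise ValueError(f"unknown action: {action}")


env = LegacyFormGym()
_, first_reward, _ = env.step("save", "dependents", 2)   # rejected: wrong order
env.step("save", "applicant_id", "A-1")
env.step("save", "dependents", 2)                         # accepted, opens the modal
_, blocked_reward, blocked_done = env.step("submit")      # blocked by the modal
env.step("dismiss_modal")
_, final_reward, final_done = env.step("submit")
print(first_reward, blocked_reward, final_reward, final_done)
```

An agent trained only on the happy path would never see the rejected save or the blocked submit; an environment like this makes those failure states part of what it has to learn.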

Designing AI agents that know when to step back

Wed, 11 Mar 2026 16:00:00 GMT

Agentic AI is taking off, and for good reason. AI agents can now write code, conduct research, plan travel, handle customer service, and more. Yet amid the excitement about what AI agents can do, a key question has been neglected: how do we design the human side of the equation? That question is critical, because agentic AI isn’t just another feature to bolt onto existing products. It’s a fundamentally different kind of software that demands fresh thinking. Unlike traditional software, agentic AI can be proactive and conversational, sometimes even anthropomorphic. It doesn’t just respond to commands; it initiates actions and makes decisions autonomously. This capability is what makes agentic AI so useful, but it’s also what makes effective interactions hard to design. A central user-experience (UX) challenge is coordination: the interplay between what users do, what they experience, and what the AI is doing, both visibly and behind the scenes. Trust, control, and transparency are essential to the agentic-AI user experience, and they all depend on getting this coordination right. Here, we introduce a framework for thinking about human-AI coordination. We also offer a vocabulary for characterizing agentic experiences, including when the AI feels too absent, too intrusive, or appropriately calibrated. A framework for human-AI coordination One of the most critical decisions in AI UX design is how visible and interactive AI capabilities should be. Should users direct the agent step by step, let it act autonomously, or work somewhere in-between? And how should this change based on the task, the user’s expertise, and the current context? 
You can think of coordination along these three dimensions:
Human involvement: how much effort and attention the user invests in directing or monitoring the AI;
AI salience: how prominent the AI feels in the experience (for example, a conversational chatbot with a name and persona has high salience, autocomplete suggestions have lower salience, and AI-generated navigation menus and backend optimizations have little or none);
AI activity: what the AI is doing, whether or not the user sees it.
Coordination is about aligning these dimensions. When human involvement and AI salience are both low, coordination is light-touch. When they are high, coordination is more hands-on. The right balance is often somewhere in between, with an awareness of what the AI is doing in the background.

Three zones of coordination

Rather than treating agent autonomy as a binary choice (a fully autonomous system or one with a human in the loop), it is practical to consider three “zones” of coordination.
Done with me (mutually collaborative): User and AI work closely together across multiple phases: initiation, monitoring, updating, and completion. Imagine collaborating with an AI assistant on a complex document or research project, with frequent back-and-forth. AI salience and human involvement are both high. The user is very in the loop.
Done for me (heavily automated): Tasks are handled by AI with minimal user input and oversight. The user initiates the task and reviews the output; most of the work happens out of view. An example is an agent that researches competitors and delivers a summary report. The user is barely in the loop.
Done under me (discreetly assisted): AI works in the background without announcing itself. The user may not even notice the assistance. Smart sorting, predictive text, and intelligently personalized content and navigation menus fall into this category. The AI quickly delivers outcomes users can assess and act on. The user is implicitly in the loop. 
These aren’t rigid categories but calibration points for designing and delivering the right level of coordination to users. The goal is to match coordination intensity to the specific user, task, and context, rather than defaulting to a single mode everywhere or assuming that an autonomous agentic system eliminates the need for thoughtful coordination. The rhythm of human-AI coordination Because both agents and users can work independently, coordination cannot be static. Workflows often move through multiple zones: high involvement during initiation, perhaps defining goals and constraints; lower involvement during execution; and then a spike at review and next steps. We visualize these shifts as “coordination curves” — a variation of user-journey mapping that shows how human involvement and AI salience rise and fall across a workflow. High-level curves reveal the overall shape of an experience. Looking beneath the surface exposes specific AI touchpoints, handoffs, and decision points, helping UX design teams collaborate on bringing adaptive agentic systems to life. As multiagent applications become more sophisticated, they enable longer, computationally intensive work such as research projects, complex analyses, and multistep workflows. These create valleys in the coordination curve: stretches where the AI operates independently and the user is minimally involved. These valleys require thoughtful design around notification, approval, monitoring, and auditing. More broadly, the UX layer must provide the transparency and controls needed to build trust, support adaptation and course correction, and ultimately deliver value. Case study: Adaptive coordination in practice We developed an approach called “responsive salience”, whereby an AI agent automatically adjusts its visibility and interaction intensity to match the context. The core insight is simple: in traditional software, most of the interface is static or deterministic. 
With agentic AI, behavior is nondeterministic, so a user’s needs for oversight can change moment to moment. A user who trusts an agent on a familiar task may prefer to be largely hands-off. In unfamiliar or high-stakes work, that same user may want more transparency, checkpoints, and tighter control. Rather than forcing users to toggle settings, responsive salience lets the system adapt automatically. In our prototype, a monitoring agent continuously evaluates signals including task complexity, perceived risk, and user comfort level. When trust appears low — for example, when the user is a beginner, or the workflow involves sensitive data — the system increases salience. It could do this by providing richer explanations, additional approval gates, and expanded transparency features. The user may then be notified of the change and, if needed, can override the agent’s choice. Once confidence recovers or the task ends, the salience settings quietly revert. Over time, the system can learn from user behavior through user feedback loops, refining how quickly salience adapts and how far it goes. The result is autonomy that stays aligned with context. Early testing with users validated the idea while revealing some clear tradeoffs. Preferences diverged sharply: some found high-salience modes exhausting (“I felt visually fatigued by the large amount of communication”), while others appreciated the guidance (“It gave me options for what I might want to ask next”). One participant expressed the desire for a middle ground: “I want some oversight on what the agent is planning before execution. … The high setting was too annoying because I had to approve everything.” These results underline that user preferences for autonomy versus control can vary substantially, even in similar tasks. Responsive salience offers a solution by dynamically adjusting whether a given task is done-with-me, done-for-me, or done-under-me. 
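The adaptive behavior described in this case study could be sketched roughly as follows. The signal names mirror the ones mentioned above (task complexity, perceived risk, user comfort), but the 0-to-1 scales, the scoring rule, the thresholds, and the per-level settings are all illustrative assumptions rather than details of the actual prototype.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical monitoring signals, each normalized to the range 0..1."""
    task_complexity: float
    perceived_risk: float
    user_comfort: float


# Hypothetical UX settings per salience level.
SETTINGS = {
    "low": {"explanations": "brief", "approval_gates": False},
    "medium": {"explanations": "plan summary", "approval_gates": False},
    "high": {"explanations": "detailed", "approval_gates": True},
}


def choose_salience(signals, user_override=None):
    """Pick a salience level; an explicit user override always wins."""
    if user_override is not None:
        return user_override
    # Either high risk or low comfort alone is enough to raise salience;
    # complexity contributes at half weight (an arbitrary choice).
    concern = max(
        signals.perceived_risk,
        1.0 - signals.user_comfort,
        0.5 * signals.task_complexity,
    )
    if concern >= 0.7:
        return "high"
    if concern >= 0.4:
        return "medium"
    return "low"


familiar = choose_salience(Signals(task_complexity=0.2, perceived_risk=0.1, user_comfort=0.9))
sensitive = choose_salience(Signals(task_complexity=0.3, perceived_risk=0.9, user_comfort=0.9))
overridden = choose_salience(
    Signals(task_complexity=0.3, perceived_risk=0.9, user_comfort=0.9),
    user_override="medium",
)
print(familiar, sensitive, overridden, SETTINGS[sensitive])
```

Re-evaluating the signals at each step of a workflow, rather than once at the start, is what would let salience rise for a sensitive step and fall back afterward without the user touching a settings panel.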
Tellingly, several participants did not notice responsive salience until we pointed it out. That suggests that when the system is well calibrated, dynamic coordination can feel seamless rather than intrusive. Coevolution with agentic AI Agentic AI represents a genuine shift in what software can do, but realizing its potential depends just as much on what humans do alongside it. The frameworks, protocols, and infrastructures for building agents are maturing fast. The UX layer needs to catch up. Coordination isn’t a one-and-done problem but a moving target. As users gain expertise, tasks change, and AI capabilities evolve, the optimal balance of user involvement and AI salience will change too. So the goal isn’t to find the perfect static design, as it might have been before generative AI, but to build systems and a shared vocabulary that evolve as we learn what works in practice. Agentic AI makes this both necessary and possible: its behavior can be unpredictable, so users and designers must adjust, yet the technology itself can also learn, adapt, and course-correct proactively. Teams that get this right won’t simply build more capable agents. They will build agents that people trust, adopt, and even enjoy collaborating with.

How AI is changing the nature of mathematical research

Mon, 09 Mar 2026 17:55:47 GMT

Modern AI coding tools have revolutionized software engineering, with developers now using AI assistants to write a substantial fraction of their code across a range of applications. As scientists studying the theory of machine learning, we’re already seeing a similar transformation in basic scientific methodology, especially for research of a mathematical nature. More precisely, AI tools are now able to develop and write rigorous mathematical proofs only from prompts providing high-level proof sketches. These proofs are written in longstanding “languages” for detailing mathematical arguments, in the same way that code is written in formal programming languages like Python. AI seems to have become proficient in both kinds of languages and their underlying logics. We came to this realization during a three-week period last summer, when we used agentic AI tools to write a mathematical paper that normally would have taken months. The 50-page paper describes and solves an optimization problem based on concepts from graph theory and machine learning. A typical prompt we would give the AI to set up the general framework for our paper looked like this: “Imagine a directed acyclic network of linear least-squares learning agents, each of which shares a common dataset but each of which sees only a different subset of the features.” A typical prompt for a theorem statement and proof went “We believe that if the network contains a sufficiently long chain of agents whose features cover the entire dataset, some agent in the chain should rapidly converge to the globally optimal linear model. 
The proof should use the fact that error monotonically decreases in the chain, which forces long sequences of agents to be multi-accurate with respect to each other’s features.” While incantations like these might be opaque to the casual reader, they all have precise, standard mathematical interpretations that the AI was aware of, due to its training, and it proceeded to translate informal intuitions into precise definitions and statements. This translation was imperfect (as discussed below) but resulted in a great first draft that could then be corrected and smoothed. To be clear, for this specific paper, we already knew the general outline of the proofs we had in mind. What AI did was to automate and dramatically speed up the process of filling in the missing details and writing them with formal precision. But more recently, we’ve written papers that we believe are substantially different and better than what we would have produced without AI assistance — in which the AI contributed key ideas that were crucial to the final results. It’s important to note that AI tools are advancing quickly, which makes the future difficult to predict. While their use has shown potential to produce faster and better research, it has also generated serious questions for those who care about the future of science and its relationship to the broader world. AI is changing research norms and workflows. This raises concerns about how to train future generations of scientists. Specifically, how can intuition and “good taste” in scientific research be developed when AI automates many of the steps that have historically been used to train young researchers? Peer review is another challenge: AI-generated research papers, quickly churned out at scale, highlight the limitations of peer review and modern-day publishing structures and also exacerbate already emerging challenges to incentives for scientific success. 
Without claiming to have answers or solutions to these concerns, we are personally living through them and will discuss each in turn. New ways of doing research One of our major takeaways from our summer research project was that working with proof-based AI tools is akin to collaborating with a smart, broadly educated but occasionally error-prone colleague. One can verbally sketch a mathematical argument to an AI agent as you might to a human collaborator, and the agent can turn that sketch into a formally written lemma or theorem along with its proof. Increasingly, AI agents can find proofs themselves without a sketch, especially when those proofs are "standard" in some areas of mathematics. This is more useful than it sounds: many kinds of arguments are "standard" in some field, but often one in which you, the human author, are not an expert. An advantage of AI tools is that they are conversant in an enormous number of areas of mathematics and other scientific disciplines. For example, in our case, along the way to proving one of our main results from the sketch we provided incrementally, the AI spontaneously proved a simple but useful lemma we were not aware of, which meaningfully simplified the argument we had in mind. The implications of this sort of creativity are exciting, especially for lowering the barrier to discovery: scientists without access to a diverse community of collaborators could also participate in cutting-edge research in ways that were previously impossible. Using these tools still requires caution and expertise, however. The proofs they generate are correct perhaps only three-quarters of the time. But when they’re wrong, if you can identify the errors, it is often possible to iterate to correctness and then continue along a promising path. If the errors remain uncorrected, trying to continue often takes you down a dead end. 
A 25% error rate is low enough to make the tools extremely useful to experts but high enough to sometimes devolve into "AI research slop" — polished-looking but ultimately flawed or uninteresting work — when used without care or discernment. The models, after all, still don’t know what is “interesting” or “useful.” We also noticed some recurring failure modes or “rabbit holes” that come from using the AI tools. While writing our paper, we asked the AI to generate a small, self-contained result, which it did perfectly in a matter of minutes, at which point we told it this subproject was completed. Nevertheless, during the coming days, the AI would spontaneously take the initiative to suggest returning to the topic, despite being repeatedly told not to do so unless asked. This was an irritating reminder that generative AI does not have perfect recall but only an incomplete summary or embedding of the context. While working on the code for the experiments to illustrate our theoretical findings, we found that the AI could alternate between writing large amounts of rather complex working code very rapidly and getting lost for hours on something trivial, like simply printing out which iteration of a loop was being executed. Training the next generation Historically, people earn expertise in the mathematical sciences through struggle as junior researchers. PhD students spend years working through the details of technical arguments to gain hard-won intuitions about when a proof approach is promising, when they are being led astray by a problem, or what constitutes a novel and interesting research direction. But these aspects of being a researcher are exactly what AI tools are “giving away”. If doctoral students can simply ask AI for proofs — which is extremely tempting, especially when it is in service of advancing research — how do they develop the experience and skill that, for now at least, are required to use AI tools productively in the first place? 
We may need to be more intentional about teaching these foundational skills to young researchers, perhaps adopting an advanced version of teaching arithmetic in grade school without the use of calculators. The straightforward recommendation is to require junior researchers to write papers “the old-fashioned way”, even when their work could be sped up by AI. Perhaps in a separate track, students would be trained to understand and work with emerging AI tools. This is an area of increasing importance that will likely require creative solutions. While we are strong believers that AI tools will do astounding things for science, it may be important to deliberately moderate their use in order to build researchers up to the point at which they can use them wisely and tastefully, not simply as short cuts to second-rate (or worse) research. These next-generation training challenges aren’t unique to scientists using AI. We see them across myriad fields, including engineering, customer service, law, writing, and design — really, any industry in which entry-level tasks, previously used to introduce young workers to a field, are now done using AI. To find creative solutions to this skills-training challenge, or to just better anticipate the changes at hand, it might be helpful to look at analogies across fields or over time. After high-level programming languages and compilers were widely introduced in the early 1960s, most software engineers no longer wrote machine code or assembly language, which provided direct instructions to the underlying hardware but were tedious to program. But the best programmers still understood enough about how compilers translated high-level languages into machine code to reason about correctness and performance. We hope that making it easier to construct and check technical arguments will let all researchers operate at a higher level of abstraction and “think bigger thoughts”. 
The culture we envision would emphasize taste, problem selection, and modeling skills and devalue technical wizardry for its own sake. Breaking and remaking peer review From our perspective, peer review is not only, or even primarily, a process to verify the correctness and quality of research. Rather, its purpose is to focus a scarce resource — the attention of the research community — in the right places. Science progresses as researchers build on each other’s work, but there is already too much work out there for anyone to keep up with. The publication process should help identify the most interesting and promising directions, so they can be more efficiently and thoroughly developed. How does AI influence this focusing of communal attention? AI tools make it much easier to produce work that looks polished and correct, dramatically lowering the barrier to generating “papers” that can be submitted to journals and conferences. Many of these papers are neither interesting nor actually correct — but discovering this requires significant effort from reviewers. This is straining an already overburdened machine learning publishing ecosystem struggling with tens of thousands of submissions per venue. We have seen that reducing the time and effort needed to produce "a paper" — not necessarily a good paper — is beginning to destabilize our existing institutions for peer review. The most recent iterations of AI and ML conferences have seen the number of submissions growing by large multiples, with a significant number of papers polished by AI, but ultimately of low quality, making it surprisingly far through the review process before being noticed and called out. This is a problem across research fields, partially because it’s creating a market for AI-generated papers. 
This has in turn engendered a countermarket for AI-assisted detection of AI-generated papers — much like the familiar technological arms races around things like spam and its detection, but with the integrity of scientific publication at stake, not just the filtration of annoying or fraudulent e-mails. As a short-term fix, AI-driven automated correctness checks (e.g., formal verification of mathematical proofs), tools for which are already being deployed in major conferences, could be valuable. Think of this as a form of unit testing for math instead of code. The aim is to filter out papers that have nontrivial errors, while focusing the job of the human reviewer on the important parts of science that they are best suited to evaluate: determining what we learn about the world from a new result, and how useful and interesting it is, rather than being drowned in the monotony of checking countless papers for technical correctness. Without a serious, community-wide re-evaluation of peer review, AI threatens to arrest scientific progress at the community level even as it accelerates it at the level of individual researchers. Looking ahead We think AI is bringing a sea change in scientific research methodology, training, and peer review; there is no hiding from what is coming. But there are opportunities to proactively adapt and ensure that AI-assisted research fulfills its promise. What will research look like at the end of next year? The year after that? We’ve seen more change in the past year than in the previous decade, so all we can confidently predict is "different". Our scientific institutions — peer review, publishing, graduate education — evolved over decades to match the constraints of human cognition and effort. Those constraints are shifting rapidly, and our institutions will need to shift with them. 
Our goal should be to steer toward a world where AI amplifies human creativity and insight, accelerates discovery, and expands who can participate in the research enterprise — while preserving the joy and rigor that make science worthwhile.

Intelligence isn’t about parameter count. It’s about time.

Wed, 25 Feb 2026 13:59:12 GMT

When we prompt a large language model (LLM) to solve a complex polynomial equation, it does not just return an answer but uses its “chain of thought” to work through a solution. In a sense, the LLM behaves like a computer, a machine that computes the solution. But this machine is quite unlike what Alan Turing described as a universal model of computation almost 90 years ago. In what sense can an LLM be thought of as a computer? Can it be universal, that is, able to solve any computable task, as a Turing machine does? If so, how does it learn this ability from finite data? Current theories of machine learning are of little help in answering these questions, so we need new tools. In an earlier Amazon Science post, we argued that AI agents and the LLMs that power them are transductive-inference engines, despite being trained inductively in the mold of classical machine learning theory. Induction seeks generalization, or the ability to behave on future data as one did on past data. To achieve generalization, one must avoid memorization, i.e., overfitting the training data. This works in theory, under the condition that both past and future data are drawn from the same distribution. In practice, however, such a condition cannot be verified, and in general, it doesn’t apply to high-value data in business, finance, climate science, and even language. That leaves us with no handle to explain how an LLM might learn how to verifiably solve a general computable task. With transduction, by contrast, one seeks to reason through past data to craft solutions to new problems. Transduction is not about applying past solutions in the hope that they generalize; rather, it is about being able to retrieve portions of memory that matter when reasoning through new solutions. In transduction, memorization is not a stigma but a value. 
Using the test data, along with memory, to craft a solution during transductive inference is not overfitting but adaptive, query-specific computation, i.e., reasoning. Inductive generalization is the kind of behavior one is forced to adopt when pressed for time. Such automatic, reactive behavior is sometimes referred to as “system-1” in cognitive psychology. Transduction instead requires looking at all data and performing query-specific, variable-length inference-time computation: chain-of-thought reasoning in an LLM, whose length depends on the complexity of the query. Such deliberative behavior is often referred to as “system-2” and is what we wish to foster through learning. In this sense, transductive learning is a particular form of meta-learning, or learning to reason. In 1964, Ray Solomonoff described a universally optimal algorithm for solving any problem through transductive inference, if we assume that memory and time are unbounded: execute all programs through a Turing machine, then average the outcome of those that reproduce the observed data. That will give the universally optimal answer, but it will generally take forever. What if we want not just a universally optimal but a universally fast algorithm? In 1973, in the same paper where he introduced the notion of NP-completeness, Leonid Levin derived such an algorithm. Unfortunately, Levin’s so-called universal search is not viable in practice, nor does it help us understand LLMs; for one thing, it involves no learning. Nonetheless, Levin pointed to the critical importance of time when solving computational tasks. Later, in 1986, Solomonoff hinted at how learning can help reduce time. In a new paper, we expand on these ideas and show how reducing inference time induces a trained model to operate transductively, i.e., to reason. In striving to reduce inference time, the model learns not just the statistical structure of the training data but also its algorithmic structure. 
It can then recombine algorithmic methods it’s learned in an infinite number of ways to address arbitrary new problems. This insight has implications for how AI models are designed and trained. In particular, they should be designed to predict the marginal value of additional costs at inference time, and their training targets should include complexity costs, to force them to minimize time during inference. This approach to learning turns classical statistical learning theory on its head. In classical statistical learning theory, the great danger is overfitting, so the goal is to regularize the solution, i.e., to minimize the information that the trained model retains from past data (beyond what matters for reducing the training loss). With transductive inference, on the other hand, the goal is to maximize the information retained, as it may come in handy for solving future problems. The inversion of scaling laws LLMs’ performance gains in the past few years have come mostly from scaling: increasing the number of model parameters has improved accuracy on benchmark datasets. This has led many to speculate that further increasing the models’ parameter counts could usher in an age of “superintelligence”, where the cognitive capacities of AI models exceed those of their human creators. In our paper, we argue the opposite: beyond a certain complexity, AI models enter what we call the savant regime, where learning becomes unnecessary, and better performance on the benchmarks comes with decreased “insight”. At the limit is the algorithm Solomonoff described in 1964, where any task can be solved by brute force. If scale does not lead to intelligence, what does? We argue that the answer is time. It’s an answer with some intuitive appeal. The concept of intelligence is fundamentally subjective and environment dependent. But while intelligence is hard to characterize, its absence is less so. 
Being unable to adapt to the speed of the environment is one among many behaviors that we call traits of non-intelligence (TONIs). TONIs are behaviors whose presence negates intelligence however one wishes to define it. Many TONIs are timebound. Taking the same amount of (non-minimal) time and energy to solve repeated instances of the same task, to no better outcome, is a TONI. So is the inability to allocate resources commensurate to the goal, thus spending the same effort for a trivial task as for a complex one. Starting a task that is known to take longer than the lifetime of the universe to render any usable answer would be another TONI. Given this intuition, how do we quantify the relationship between intelligence and time in AI models? The first step is to assess the amount of information contained in the models’ parameters; then we can see how it’s affected by the imposition of time constraints. Algorithmic information The standard way to measure information was proposed by Claude Shannon in a landmark 1948 paper that essentially created the field of information theory. Shannon defined the information content of a random variable as the entropy of its distribution. The more uncertainty about its value, the higher the information content. On this definition, however, a given data sample’s information content is not a property of the sample itself; it’s a property of the distribution it was drawn from. For any given sample, however, there are infinitely many distributions from which it could have been drawn. If all you have is a sample — say, a string of ones and zeroes — how do you compute its information content? In the 1960s, Solomonoff and, independently, Andrey Kolmogorov, addressed this problem, with an alternative notion of information, algorithmic information, which can be used to characterize the information content of arbitrary binary strings. For a given string, one can write a program that, when run through some computer, outputs that string. 
In fact, one can write infinitely many such programs and run each through many computers. The shortest possible program that, run through a universal Turing machine, outputs the specific datum is a property of that datum. That program is the algorithmic minimal sufficient statistic, and its length is the algorithmic information (Kolmogorov-Solomonoff complexity) of that datum. In his 1948 paper, Shannon also defined a metric called mutual information, which quantifies the information that can be inferred about the value of one variable by observing a correlated variable. This concept, too, can be extended to algorithmic information theory: the algorithmic mutual information between two data strings measures how much shorter the program for generating one string will be if you have access to the other. Time is information If we don’t know the distribution from which a model’s training data was drawn, and we don’t know whether the model’s future inputs will be drawn from the same distribution, how can we quantify the model’s future performance? In our paper, we assume that most tasks can be solved by combining and transforming — in infinitely many possible ways — some ultimately finite, but a priori unknown, collection of methods. In that case, we can show that optimizing performance is a matter of maximizing the algorithmic mutual information between the model’s training data and future tasks. Finding the shortest possible algorithm for generating a particular binary string is, however, an intractable problem (for all but the shortest strings). So computing the algorithmic mutual information between a model’s training data and future tasks is also intractable. Nonetheless, in our paper, we prove that there is a fundamental relation between the speed with which a model can find a solution to a new task and the algorithmic mutual information between the solution and the training data. 
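The definitions of algorithmic information and algorithmic mutual information above can be illustrated with a general-purpose compressor standing in for the shortest program. This is only a rough sketch and not the construction used in the paper: it assumes that zlib-compressed length is a usable upper-bound proxy for algorithmic information, and the helper names (`compressed_len`, `mutual_info_proxy`, `random_bytes`) are hypothetical.

```python
import random
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of zlib-compressed data: a crude upper-bound proxy for K(data)."""
    return len(zlib.compress(data, 9))


def mutual_info_proxy(x: bytes, y: bytes) -> int:
    """Estimate how much shorter a description of x becomes when y is available.

    The conditional cost of x given y is approximated by the extra bytes
    needed to compress x after y: C(y + x) - C(y).
    """
    conditional = compressed_len(y + x) - compressed_len(y)
    return compressed_len(x) - conditional


def random_bytes(rng: random.Random, n: int) -> bytes:
    """Return n pseudo-random bytes (poorly compressible on their own)."""
    return bytes(rng.randrange(256) for _ in range(n))


rng = random.Random(0)  # fixed seed so the example is reproducible
x = random_bytes(rng, 2000)
y_related = x[:1000] + random_bytes(rng, 1000)  # shares its first half with x
y_unrelated = random_bytes(rng, 2000)           # shares nothing with x by design

related_score = mutual_info_proxy(x, y_related)
unrelated_score = mutual_info_proxy(x, y_unrelated)
print(related_score, unrelated_score)
```

With these inputs, the string that shares half its bytes with `x` scores far higher than the unrelated one, which matches the intuition that access to related data shortens the description of `x`. A real compressor only sees literal repetition, so it misses most of the algorithmic structure that the theory is concerned with.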
Specifically, the relation we prove ties the model’s inference time to I(h : D), where h is the solution to the new task, D is the dataset the model was trained on, and I(h : D) is the algorithmic mutual information between the data and the solution. This means that, during training, minimizing the time the model takes to perform an inference task will maximize the algorithmic information encoded in its weights. Reducing inference time ensures that, even as models’ parameter counts increase, they won’t descend into the savant regime, where they solve problems through brute force, without any insight or learning.

The value of time

You may have noticed that the equation relating inference time to algorithmic information doesn’t specify any units of measure. That’s because even the value of “time” is subjective. A zebra drinking from a pond does not know a priori how long it will take to be spotted by a predator. If it lingers too long, it ends up prey; if it panics and leaves, it ends up dehydrated. Similarly, for an AI model, there is no single cost of time to train for and correspondingly no unique scale beyond which LLMs enter the savant regime. For some tasks, such as scientific discovery, the time constant is centuries, while for others, such as algorithmic trading, it’s milliseconds. We expect agents to be able to adapt to their environment, in some cases spawning smaller specialized models for specific classes of tasks, and even then, to provide users (who are part of an agent’s environment) with controls to adjust the cost of time depending on the context and domain of application. The cost of time is already (partially and implicitly) factored into the process of training LLMs. During pretraining, the cost of time is effectively set to a minimum value, as the model is scored on the output of a single forward pass through the training data. Fine-tuning the model for chain-of-thought reasoning requires annotated data, whose high cost imposes a bias toward shorter “ground truth” reasoning traces. 
Thus, LLMs already reflect the subjective cost of time to the annotators who assemble the training sets. However, to enable the user to modulate resources at inference time, depending on the cost of the environment, models should be trained to predict the marginal value of one more step of computation relative to the expected final return. Furthermore, they need to be trained to condition on a target complexity, in order to learn how to provide an answer within a customer-specified cost or bound. There are growing efforts to teach models the value of time, so they can adapt to the tasks at hand (with or without human supervision). These are certain to yield a better bang-to-buck ratio, but the theory predicts that, at some point, factoring in the cost of time will actually improve absolute performance in new tasks. For verifiable tasks, learning to reason comes from seeking the shortest chain of thought that yields a correct (verified) answer. Ultimately, imposing a cost on time should not impair reasoning performance. A new paradigm for AI coding Connecting these ideas to modern AI requires rethinking what computation means. LLMs are stochastic dynamical systems whose computational elements (context, weights, activations, chain of thought) do not resemble the “programs” in classical, minimalistic models of computation, such as universal Turing machines. Yet LLMs are models of computation — maximalist models. They’re universal, like Turing machines, but in many ways, they’re antithetical, and they operate through entirely different mechanisms. It’s possible to “program” such stochastic dynamical systems using a two-level control strategy: high-level, open-loop, global planning and low-level, closed-loop feedback control. That strategy can be realized with AI Functions, an open-source library released this week as part of Amazon’s Strands Labs, a GitHub repository for building AI agents. 
An existing programming language can be augmented with functions from the library. These are ordinary functions, in the syntax of the language, but their bodies are written in natural language instead of code, and they’re governed by pre- and post-conditions. These enable high-level, open-loop planning and verification, before a single line of code is written by AI, and they engender an automatic local feedback loop if the AI-generated code fails to clear all conditions. Minimizing time, which translates into cost, is at the core of the design and evaluation of the resulting agents.
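The pattern of pre- and post-conditions with an automatic local feedback loop can be sketched in plain Python. This is not the AI Functions API, which is not shown in this article: every name below is hypothetical, and the two hand-written `attempt_*` functions merely stand in for successive AI-generated implementations of a function whose natural-language body might read "return the numbers in ascending order".

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence, Tuple

Impl = Callable[[List[int]], List[int]]


def is_int_list(xs: List[int]) -> bool:
    """Precondition: the input must be a list of integers."""
    return isinstance(xs, list) and all(isinstance(x, int) for x in xs)


def is_sorted_permutation(xs: List[int], out: List[int]) -> bool:
    """Postcondition: the output is in ascending order and keeps every input element."""
    in_order = all(a <= b for a, b in zip(out, out[1:]))
    return in_order and Counter(out) == Counter(xs)


def run_with_conditions(
    attempts: Sequence[Impl], xs: List[int]
) -> Tuple[Optional[List[int]], int]:
    """Check the precondition once, then try attempts until one clears the postcondition.

    Returns the accepted output (or None if every attempt fails) and the
    number of attempts that were tried.
    """
    if not is_int_list(xs):
        raise ValueError("precondition failed: expected a list of integers")
    tried = 0
    for attempt in attempts:
        tried += 1
        try:
            out = attempt(list(xs))  # pass a copy so a faulty attempt cannot alter the input
            accepted = is_sorted_permutation(xs, out)
        except Exception:
            accepted = False  # a crashing attempt counts as a failed attempt
        if accepted:
            return out, tried
    return None, tried


# Hypothetical stand-ins for successive generated implementations.
def attempt_drops_duplicates(xs: List[int]) -> List[int]:
    return sorted(set(xs))  # bug: loses repeated values


def attempt_correct(xs: List[int]) -> List[int]:
    return sorted(xs)


result, tried = run_with_conditions(
    [attempt_drops_duplicates, attempt_correct], [3, 1, 3, 2]
)
print(result, tried)  # prints: [1, 2, 3, 3] 2
```

The conditions are written before any implementation exists, which is the open-loop planning step; the retry loop is the closed-loop part. In a real system the loop would presumably feed the failed condition back to the generator instead of iterating over a fixed list, and the number of attempts is one simple measure of the time cost discussed above.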

Why a 12-year-old forecasting paper has stood the test of time

Tue, 17 Feb 2026 14:00:00 GMT

Amazon Scholar Aravind Srinivasan coauthored a 2014 paper about forecasting civil unrest in Latin America, which won a test-of-time award at KDD 2025.

How academic collaboration delivers real-world security to Amazon customers

Wed, 04 Feb 2026 14:00:00 GMT

On July 16, 2018, Amazon distinguished scientist Byron Cook was giving a keynote at the Federated Logic Conference (FLoC) at the University of Oxford, a computer logic gathering held every four years since 1996. In the keynote, Cook described how his team was using an open-source software tool called cvc (cooperating validity checker) to identify logic problems in code and fix them. Sitting in the audience was Stanford University professor Clark Barrett, who had been working on cvc for almost 20 years. Cvc had been developed to analyze verification problems encoded as satisfiability modulo theories (SMT) problems. SMT is a mainstay of formal methods, the use of automated reasoning to prove that a program or system will behave as intended. By applying SMT at scale, cvc can detect logical errors in code and in systems such as those used for authentication and access management. “I was kind of stunned. It was really exciting,” Barrett says. “And this really started with this exciting moment of realizing, Hey, our work is being used by Amazon.” The encounter between Cook and Barrett ultimately led to a years-long research collaboration that culminated in Barrett’s becoming an Amazon Scholar in 2023. Initially, Amazon provided small grants to Barrett’s lab at Stanford’s School of Engineering through the Amazon Research Awards program; those grew into larger funding commitments as the research progressed. This funding supported foundational research that, together with deep technical collaboration between the two teams, enabled the development of cvc5, the latest version of the open-source software. Cvc5 has delivered significant value for both Amazon customers and the broader industry, while simultaneously advancing academic research. As one example, cvc5 is used in Automated Reasoning checks, a new Amazon Bedrock feature that verifies natural-language content against organizational policies. 
It powers access-policy analysis tools, including Identity and Access Management (IAM) Access Analyzer, a service that helps customers securely manage access to AWS resources. More recently, Amazon has begun deploying cvc5 for specification analysis and test generation in Kiro, a new agentic development environment. Across these applications, cvc5 now processes approximately one billion solver calls every day, enhancing security, reliability, and durability for AWS customers.

A meeting of minds

Working with Barrett on the project is Robert Jones, a senior principal applied scientist at AWS who shared an advisor with Barrett when both were Stanford PhD students. Also involved in the project over the years were many students and postdocs keen to test their skills. More than a few have since joined Amazon to develop new implementations and applications, extending work that began when they were student researchers. “What's really fun about it is that people who have just finished their PhD, for example, often bring fresh insight to long-standing research challenges because they're thinking about them in a different way,” Jones says. “And I find that the best part of collaboration is that different people tend to build different mental models for the same problem. When those come together, you often have new insight into how to think about the problem or how to map it to a different problem you already know how to solve.” A successful coupling of academic research and commercial funding can have great impact, but as Barrett points out, there needs to be a focus on achievable goals. It’s easy to get caught up in an interesting project idea that leads to a practical dead end, Barrett says. “If you're in your ivory tower, building your tools, and you don't have access to the real problems, it's very easy to build the wrong tool. And I've actually made this mistake,” he says. 
“You build a hammer, and then you go around looking for a nail, and you can't quite find anything that fits. You get excited about a particular approach but don't think about what that approach could be good for. So I actually now much prefer the opposite, where I go find a real problem, and then I take a step back and say, ‘What approach can we actually use to solve that?’” When you change code, he says, “Eighty percent of the time it does better, and 20 percent of the time it does worse. This is actually not so great in some contexts.” Sorting the wheat from the chaff is essential to producing robust and scalable code, he adds, and large-scale testing is needed to find and fix issues that can be inadvertently introduced as the code changes. Analyzing interactions at that level requires multiple minds, and the more the merrier, Jones says. The old adage “many hands make light work” is particularly useful when mixing public research and practical applications. “I really like to work on hard problems that require multiple people to solve. I enjoy the collaboration involved in science,” he says. “I've always found that more minds working on the same problem together are better than one.” Barrett and Jones agree that what makes this work is a willingness to see from both points of view — the scholastic and the commercial. Sometimes a pure research goal can have very beneficial results, sometimes not, but melding these two approaches together to address serious issues can deliver huge benefits. And communication is key, both agree. “One of the hard things about academia is knowing which problems are the most important to work on and how those problems might impact the real-world problems that are being encountered in industry,” Jones says. “Having the ability to be much more open about the kinds of problems that we're struggling with and Clark telling us about his research agenda helps both of us. 
It enables Amazon to indicate areas of interest, and it helps Clark understand concrete problems that we encounter day to day as we try to apply these tools and techniques in practice.”