arXiv ML/AI/CV papers summary
Theme 1: Efficient Attention and Sequence Modeling
The quest to scale Transformers to longer contexts while maintaining computational efficiency is a central challenge in modern AI. Standard dot-product attention, whose cost grows as $\mathcal{O}(N^2)$ in the sequence length $N$, remains a significant bottleneck. Recent work replaces this dense interaction with more efficient probabilistic or structured alternatives:
- Gaussian Mixture Attention: Linear-Time Sequence Mixing via Probabilistic Latent Routing introduces a probabilistic approach that maps queries and keys to a shared latent space, reducing memory scaling to $\mathcal{O}(NK)$, where $K$ presumably denotes the number of latent mixture components.
- Hierarchical Attention via Domain Decomposition draws inspiration from numerical analysis, specifically Schwarz domain decomposition, to create a hierarchical attention mechanism that balances local and global information propagation.
- Attention as Frustrated Synchronization offers a theoretical perspective that frames attention as a system of frustrated oscillators, in which the “computation” occurs in the structured departures from perfect synchronization.
- Beyond Similarity: Temporal Operator Attention for Time Series Analysis argues that standard softmax attention is ill-suited for time-series dynamics because it relies on convex combinations, and proposes “Temporal Operator Attention” to allow for signed, oscillatory transformations.
- Ternary Mamba: Grouped Quantization-Aware Training of W1.58A16 State Space Models and S4oP: Operator-level Pruning of Structured State Space Models for Resource-Constrained Devices demonstrate that state-space models can be aggressively compressed for edge deployment without sacrificing the long-range dependency capabilities that make them competitive with Transformers.
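The convex-combination point raised for Temporal Operator Attention above can be checked numerically. The following is a minimal sketch in plain Python, not code from any of the listed papers: the toy keys, values, and the helper names `softmax` and `attend` are assumptions made for illustration. It shows that softmax weights are non-negative and sum to one, so the attended output always lies between the smallest and largest value and cannot flip sign or overshoot.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-query scaled dot-product attention over scalar values."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    output = sum(w * v for w, v in zip(weights, values))
    return weights, output

# Hypothetical toy inputs: three 2-D keys and strictly positive scalar values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [2.0, 3.0, 5.0]
weights, output = attend([4.0, -1.0], keys, values)

# The weights form a probability distribution, so the output is a
# convex combination of the values and stays inside [min, max].
print(weights, output)
```

Whatever query is supplied, the output stays within the range spanned by `values`. A signed or oscillatory response, such as mapping a positive history to a negative forecast, would need weights that leave the probability simplex, which is the limitation the bullet describes.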
Theme 2: Agentic World Modeling and Embodied Reasoning
The frontier of AI is shifting from passive text generation to active, goal-oriented interaction with the physical and digital world. This shift depends on “world models”, which serve as the central substrate for agents that must navigate, manipulate, and reason about their environments.
- Cosmos 3: Omnimodal World Models for Physical AI introduces a unified mixture-of-transformers architecture that subsumes vision-language models, video generators, and world simulators into a single backbone.
- Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond provides a comprehensive synthesis of over 400 works, establishing a “levels × laws” taxonomy for future AGI.
- MagicSim: A Unified Infrastructure for Executable Embodied Interaction and Kairos: A Native World Model Stack for Physical AI emphasize the need for deterministic, batched runtimes that allow agents to simulate and evaluate their own experiences.
- DeepInsight: A Unified Evaluation Infrastructure Across the Physical AI Stack addresses the “systems problem” of agent performance, providing a single runtime that spans from foundation-model decoding to physics-based control.
- ERQA-Plus: A Diagnostic Benchmark for Reasoning in Embodied AI serves as a vital reality check, moving beyond “shortcut learning” to evaluate genuine spatial reasoning and causal understanding in embodied agents.
- Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion, ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model, and FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs represent a shift toward models that understand underlying physical dynamics rather than just generating pixels.
- Any2Any: Efficient Cross-Embodiment Transfer for Humanoid Whole-Body Tracking and DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation demonstrate how foundation models can be adapted to new physical bodies with minimal data.
Theme 3: Agentic Reasoning, Planning, and Optimization
As agents take on complex, multi-step tasks, the focus has moved toward long-horizon planning, self-evolution, and inference-time efficiency.
- Position: Modular Memory is the Key to Continual Learning Agents and From Agent Traces to Trust: A Survey of Evidence Tracing and Execution Provenance in LLM Agents explore how agents accumulate knowledge and maintain auditability.
- Parallelizing Tool Execution and LLM Generation for Low-Latency Agent Serving (PASTE) and CEO-Bench: Can Agents Play the Long Game? address the challenges of latency and sustained, adaptive progress in long-horizon tasks.
- CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly demonstrates how agents can iteratively revise their own scaffolds based on execution feedback.
- How Inference Compute Shapes Frontier LLM Evaluation argues that capability is a function of inference-time compute, while Models Take Notes at Prefill: KV Cache Can Be Editable and Composable and PreAct: Computer-Using Agents that Get Faster on Repeated Tasks introduce methods to compile agent runs into state-machine programs for faster execution.
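The latency idea behind overlapping tool execution with generation can be sketched with `asyncio`. This is a generic illustration under stated assumptions, not the mechanism of PASTE or any other paper above: `generate_tokens` and `run_tool` are hypothetical stand-ins that simulate work with `asyncio.sleep`, and the delays are arbitrary.

```python
import asyncio

async def generate_tokens(n_tokens, delay=0.01):
    """Hypothetical stand-in for LLM decoding; yields control per token."""
    tokens = []
    for i in range(n_tokens):
        await asyncio.sleep(delay)
        tokens.append(f"tok{i}")
    return tokens

async def run_tool(name, delay=0.03):
    """Hypothetical stand-in for a slow tool call (search, code run, ...)."""
    await asyncio.sleep(delay)
    return f"{name}:done"

async def serve_step():
    # Start the tool call as soon as it is known, then keep decoding
    # while it runs instead of waiting for it to finish first.
    tool_task = asyncio.create_task(run_tool("search"))
    tokens = await generate_tokens(5)
    tool_result = await tool_task
    return tokens, tool_result

tokens, tool_result = asyncio.run(serve_step())
print(tokens, tool_result)
```

Run sequentially, the step would take roughly the tool delay plus the decoding time. With the task launched up front, the two overlap and the step takes about as long as the slower of the two.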
Theme 4: Reinforcement Learning and Post-Training Optimization
The shift toward Reinforcement Learning with Verifiable Rewards (RLVR) is essential for eliciting reasoning, though it introduces challenges like policy entropy collapse and credit assignment.
- Self-CTRL: Self-Consistency Training with Reinforcement Learning aligns self-explanations with behavior.
- Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards and STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability focus gradients on pivotal reasoning steps.
- DiPOD: Diffusion Policy Optimization without Drifting Apart ensures stability in diffusion policy gradients.
- LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents automates hyperparameter tuning using agentic search.
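Two quantities recur in the RLVR discussion above: a reward that can be checked mechanically, and the policy entropy that is at risk of collapsing. The sketch below illustrates both under simplifying assumptions. The exact-match checker, the group-mean baseline, and the toy distributions are illustrative choices and do not reproduce the method of any listed paper.

```python
import math

def verifiable_reward(answer, reference):
    """Reward 1.0 when the final answer matches the reference exactly."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def group_advantages(rewards):
    """Center rewards on the group mean (one simple baseline choice)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical sampled answers to a question whose reference answer is "42".
samples = ["42", "41", " 42 ", "7"]
rewards = [verifiable_reward(s, "42") for s in samples]
advantages = group_advantages(rewards)

# A near-deterministic next-token distribution has far lower entropy
# than a uniform one; "entropy collapse" is a drift toward the former.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
print(rewards, advantages, entropy(uniform), entropy(peaked))
```

Because the reward here is assigned to the whole answer, every token in a sample shares the same advantage. That coarse signal is the credit-assignment problem that token-level reweighting schemes aim to refine.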
Theme 5: Trustworthy AI, Formal Verification, and Safety
Ensuring reliability in high-stakes environments requires moving from post-hoc patching to design-time verification and robust safety protocols.
- TLA-Prover: Verifiable TLA+ Specification Synthesis via Preference-Optimized Low-Rank Adaptation, Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory, and EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning demonstrate the power of formal languages in verifying agent workflows.
- Decidable By Construction: Design-Time Verification for Trustworthy AI and Learning-Infused Formal Reasoning: From Contract Synthesis to Artifact Reuse and Formal Semantics propose frameworks for correctness before training.
- The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs and Guidelines for the Annotation and Visualization of Legal Argumentation Structures in Chinese Judicial Decisions address alignment misfires and transparency.
- Rift: A Conflict Signature for Deception in Language Models, An Evaluation of Data Leakage Risks in Tool-Using LLM Agents in Realistic Scenarios, BadScientist: Can a Research Agent Write Convincing but Unsound Papers that Fool LLM Reviewers?, First, do NOHARM: towards clinically safe large language models, and Mental Health AI Safety Claims Must Preserve Temporal Evidence highlight the critical need for robust safety evaluations that account for deception and temporal accumulation of harm.
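To make design-time verification of an agent workflow concrete, here is a minimal sketch of explicit-state checking in plain Python. It is only in the spirit of tools such as TLA+ model checkers and is not taken from any paper above: the workflow steps, the `approved` flag, and the invariant are all hypothetical.

```python
from collections import deque

def reachable_states(initial, next_states):
    """Breadth-first exploration of every state reachable from `initial`."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        for nxt in next_states(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def next_states(state):
    """Hypothetical workflow: plan -> review -> (replan | execute) -> done."""
    step, approved = state
    if step == "plan":
        return [("review", False)]
    if step == "review":
        return [("plan", False), ("execute", True)]
    if step == "execute":
        return [("done", approved)]
    return []

def invariant(state):
    """Safety property: execution never happens without approval."""
    step, approved = state
    return step not in ("execute", "done") or approved

states = reachable_states(("plan", False), next_states)
violations = [s for s in states if not invariant(s)]
print(sorted(states), violations)
```

The check runs before any agent is deployed: if a transition were added that reached `execute` without approval, `violations` would list the offending state. Exhaustive search of this kind only works while the state space stays small, which is one reason real specifications rely on dedicated tools.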
Theme 6: Mechanistic Interpretability and Model Merging
Understanding the internal geometry of neural networks is critical for controlling models and combining their capabilities.
- Sparsity Curse: Understanding RLVR Model Parameter Space from Model Merging and PACT: Preserving Anchored Cores in Task-vectors for Model Merging explore the difficulties of merging RL-trained models.
- Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns, From Mechanistic to Compositional Interpretability, Jacobian Scopes: token-level causal attributions in LLMs, and Rational Sparse Autoencoder provide tools for interpreting model activations.
- SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior warns that clamping features is often insufficient for safety.
- LLMCodec: Adapting Video Codecs for Efficient Weight Compression of Large Language Models and Beyond MACs: Hardware Efficient Architecture Design for Vision Backbones focus on efficient model compression and hardware-aware design.
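The task vectors mentioned in the merging papers above are commonly defined as the difference between fine-tuned and base weights. The sketch below shows that definition and the simplest merge, adding scaled task vectors back to the base. It is a generic illustration of task arithmetic with hypothetical toy weights, not the procedure of PACT or of any other listed paper.

```python
def task_vector(finetuned, base):
    """Per-parameter difference between fine-tuned and base weights."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }

def merge(base, task_vectors, scale=1.0):
    """Add each scaled task vector to the base weights."""
    merged = {name: list(weights) for name, weights in base.items()}
    for tv in task_vectors:
        for name in merged:
            merged[name] = [
                w + scale * d for w, d in zip(merged[name], tv[name])
            ]
    return merged

# Hypothetical toy checkpoints with one parameter tensor named "w".
base = {"w": [1.0, 2.0, 3.0]}
model_a = {"w": [1.5, 2.0, 3.0]}  # fine-tuned on task A
model_b = {"w": [1.0, 2.0, 2.0]}  # fine-tuned on task B

tv_a = task_vector(model_a, base)
tv_b = task_vector(model_b, base)
merged = merge(base, [tv_a, tv_b])
print(tv_a, tv_b, merged)
```

In this toy case the two task vectors touch different coordinates, so the merge keeps both updates intact. Interference arises when task vectors modify the same coordinates in conflicting directions.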
Theme 7: Scientific Machine Learning and Physics-Informed Models
Integrating physical laws into machine learning architectures enables better generalization and interpretability in scientific discovery.
- ThousandWorlds: A benchmark for climate emulation of potentially habitable exoplanets and Optimal scenario design for climate emulation provide physically-grounded benchmarks.
- Trainable Photonic Measurement for Physics-Informed PDE Learning and KANELÉ: Kolmogorov-Arnold Networks for Efficient LUT-based Evaluation explore hardware-aware architectures for solving PDEs.
- A Link between Shock-wave Theory and Symmetry-reduced Stochastic Gradient Descent for Artificial Neural Networks bridges fluid mechanics and neural network optimization.
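As a rough illustration of how a physical law can enter a training objective, the sketch below scores candidate functions against the ODE $u'(t) = -u(t)$ with $u(0) = 1$. Everything here is an assumption chosen for brevity: the equation, the collocation points, and the use of central finite differences. Physics-informed models typically differentiate a neural network with automatic differentiation instead, and none of this is taken from the papers above.

```python
import math

def physics_loss(u, ts, h=1e-4):
    """Mean squared ODE residual plus a boundary-condition penalty."""
    # Residual of u'(t) + u(t) = 0, with u' from a central difference.
    residuals = [(u(t + h) - u(t - h)) / (2 * h) + u(t) for t in ts]
    ode_term = sum(r * r for r in residuals) / len(residuals)
    boundary_term = (u(0.0) - 1.0) ** 2
    return ode_term + boundary_term

# Hypothetical collocation points on (0, 1].
ts = [0.1 * i for i in range(1, 11)]

def exact(t):
    """True solution of the ODE."""
    return math.exp(-t)

def wrong(t):
    """Candidate that meets the boundary condition but not the ODE."""
    return 1.0 - t

loss_exact = physics_loss(exact, ts)
loss_wrong = physics_loss(wrong, ts)
print(loss_exact, loss_wrong)
```

The exact solution drives the loss to nearly zero, while the linear candidate is penalized even though it matches the boundary value. This is how a residual term steers a model toward physically consistent outputs without labeled solution data.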