arXiv ML/AI/CV papers summary

Theme 1: Geometric & Spectral Foundations of Learning

The frontier of machine learning is shifting from treating neural networks as “black-box” function approximators to viewing them as geometric systems. By leveraging spectral theory and topology, we can now analyze the “geometry of the latent space” to understand how structure emerges during training.

Spectral Dynamics: Spectral Asymptotics of Neural Network Loss Landscapes: An Exact Decomposition of the Curvature Exponent and Neural Networks Provably Learn Spectral Representations for Group Composition provide rigorous mathematical foundations for how architecture dictates curvature and how networks naturally converge to irreducible representations.
Topological Interpretability: Learning Coherent Representations: A Topological Approach to Interpretability and IdEst: Assessing Self-Supervised Learning Representations via Intrinsic Dimension utilize geometric constraints and intrinsic dimensionality to ensure that neural representations mirror the underlying structure of the data.
Geometric Constraints: Representational Capacity: Geometric Limits on Feature Representation in Transformer Language Models and The Shape of Addition: Geometric Structures of Arithmetic in Large Language Models reveal how abstract concepts like arithmetic are encoded as continuous, structured trajectories within the latent space.
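To make "intrinsic dimensionality" concrete, here is a minimal sketch of the TwoNN estimator, a generic nearest-neighbour method. It is an assumption that this is a useful illustration; it is not taken from IdEst or any other paper listed above. The function name two_nn_intrinsic_dimension and the synthetic data are hypothetical.

```python
import numpy as np


def two_nn_intrinsic_dimension(points: np.ndarray) -> float:
    """Maximum-likelihood TwoNN estimate of intrinsic dimension.

    Generic estimator, shown for illustration only; not the method of any
    paper in this digest.
    """
    # Dense pairwise Euclidean distances (fine for a few thousand points).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # After partitioning with kth=1, columns 0 and 1 hold the nearest and
    # second-nearest neighbour distances of each point.
    nearest = np.partition(dists, 1, axis=1)[:, :2]
    r1, r2 = nearest[:, 0], nearest[:, 1]

    keep = r1 > 0  # drop exact duplicates, whose ratio is undefined
    mu = r2[keep] / r1[keep]

    # Under the TwoNN model mu follows a Pareto law with exponent d,
    # so the maximum-likelihood estimate is N / sum(log mu).
    return float(keep.sum() / np.sum(np.log(mu)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(800, 2))           # 2-D latent factors
    embedded = latent @ rng.normal(size=(2, 10))  # linear embedding in 10-D
    print(two_nn_intrinsic_dimension(embedded))
```

On data like the synthetic example, the estimate should land near the latent dimension (2) rather than the ambient dimension (10), which is the sense in which a representation can be "lower-dimensional" than the space it lives in.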

Theme 2: Physics-Informed & Scientific AI

We are moving toward a “physics-to-physics” paradigm where AI architectures respect the conservation laws and continuous-time dynamics of the physical world, rather than merely predicting snapshots of data.

Symmetry & Equivariance: Exact equivariance, kept through training, buys zero-shot generalisation across the symmetry group, EqGINO: Equivariant Geometry-Informed Fourier Neural Operators for 3D PDEs, and Efficient Prediction of SO(3)-Equivariant Hamiltonian Matrices via SO(2) Local Frames bake physical symmetries directly into the operator learning process.
Continuous Dynamics: Physics-informed diffusion models in spectral space, Flow Learners for PDEs: Toward a Physics-to-Physics Paradigm for Scientific Computing, and Martingale Neural Operators: Learning Stochastic Marginals via Doob-Meyer Factorization align model architectures with the continuous-time evolution of physical systems.
Scientific Discovery & Surrogates: A Geometric Lens on Physics-Aligned Data Compression, Will Accurate Fields Mislead Photonic Design? From Global Accuracy to Port Readout, Fast Organic Crystal Structure Prediction with Unit Cell Flow Matching, TadA-Bench: A Million-Variant Benchmark for Future-Round Discovery Toward Agentic Protein Engineering, From Holo Pockets to Electron Density: GPT-style Drug Design with Density, and Probe Before You Edit: Probing-Guided Molecular Optimization for LLM Agents in Structure-Based Drug Design demonstrate the transition from passive prediction to active, agentic discovery in the natural sciences.

Theme 3: Agentic Reasoning, Reliability, and Governance

As AI agents transition from passive assistants to active participants in high-stakes environments, the focus has shifted from raw capability to reliability, safety, and the “process-level” supervision of reasoning.

Reliability & Governance: Toward a Science of AI Agent Reliability, Glass Box at Orbit: A Constitutional AI Verification Framework for Trustworthy Autonomous CubeSat Intelligence, Overlaying Governance: A Compositional Authorization Framework for Delegation and Scope in Agentic AI, and What Benchmarks Don’t Measure: The Case for Evaluating Abstention Competence in Autonomous Agents establish frameworks for verifiable, auditable, and safe agentic behavior.
Process Supervision & Reasoning: AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation, Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories, UR²: Unify RAG and Reasoning through Reinforcement Learning, InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning, CAPER: Clause-Aligned Process Supervision for Text-to-SQL, X-RAY: Mapping LLM Reasoning Capability via Formalized and Calibrated Probes, and Evaluating Relational Reasoning in LLMs with REL move beyond outcome-only evaluation to audit the reasoning process itself.
Skill Evolution: Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward, SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision, From Context to Skills: Can Language Models Learn from Context Skillfully?, Co-evolving Agent Architectures and Interpretable Reasoning for Automated Optimization, PEAM: Parametric Embodied Agent Memory through Contrastive Internalization of Experience in Minecraft, ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning, SkillPyramid: A Hierarchical Skill Consolidation Framework for Self-Evolving Agents, EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning, and Inducing Reasoning Primitives from Agent Traces explore how agents can autonomously refine their own procedural knowledge and reasoning paths.
Tool Use & Safety: Synthesize and Reward – Reinforcement Learning for Multi-Step Tool Use in Live Environments, Tool-Aware Optimization with Entropy Guidance for Efficient Agentic Reinforcement Learning, Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning, AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations, BraveGuard: From Open-World Threats to Safer Computer-Use Agents, Narrow Secret Loyalty Dodges Black-Box Audits, and Mechanism Design Is Not Enough: Prosocial Agents for Cooperative AI address the complex threat surface of tool-using, agentic systems.

Theme 4: Efficiency, Optimization, and Deployment

As models scale, efficiency is increasingly a matter of “hardware-aware” algorithm design, where the mathematical structure of the model is tailored to the specific constraints of the compute fabric.

Inference & Memory: KnapSpec: Self-Speculative Decoding via Adaptive Layer Selection as a Knapsack Problem, WaterSIC: Information-Theoretically (Near) Optimal Linear Layer Quantization, FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs, Forget Attention: Importance-Aware Attention Is All You Need, Value-Aware Stochastic KV Cache Eviction for Reasoning Models, Distribution-Calibrated Inference Time Compute for Thinking LLM-as-a-Judge, DTop-p MoE: Sparsity-Controlled Dynamic Top-p MoE for Foundation Model Pre-training, vLLM Semantic Router: Signal Driven Decision Routing for Mixture-of-Modality Models, AdapShot: Adaptive Many-Shot In-Context Learning with Semantic-Aware KV Cache Reuse, TGV-KV: Text-Grounded KV Eviction for Vision-Language Models, Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling, and Predicting Inference-Time Scaling Gains from Labeled Validation-Set Output Statistics optimize the compute-accuracy trade-off.
Quantization, Pruning, & Merging: Qift: Shift-Friendly No-Zero W2 Post-Training Quantization for Rotated W2A4/KV4 LLM Inference, KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks, Pruning Deep Neural Networks via the Marchenko–Pastur Distribution, Compress then Merge: From Multiple LoRAs into One Low-Rank Adapter, GFFMERGE: Efficient Merging of Graph Neural Force Fields and Beyond, How Much of a Model Do We Need? Redundancy and Slimmability in Remote Sensing Foundation Models, and Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning demonstrate that we can achieve significant performance gains by pruning redundancy and merging specialized adapters.
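As a rough illustration of how the Marchenko–Pastur distribution can separate "signal" from "noise" in a weight matrix, the sketch below counts singular directions whose eigenvalues exceed the Marchenko–Pastur bulk edge. This is a generic random-matrix heuristic, not the algorithm of the pruning paper listed above. The function names, the known noise scale sigma, and the safety margin are all hypothetical choices made for the example.

```python
import numpy as np


def mp_upper_edge(n_rows: int, n_cols: int, sigma: float = 1.0) -> float:
    """Upper edge of the Marchenko-Pastur bulk for the eigenvalues of
    (1 / n_rows) * W.T @ W, where W has i.i.d. entries of variance sigma**2."""
    q = n_cols / n_rows
    return sigma**2 * (1.0 + np.sqrt(q)) ** 2


def count_signal_directions(weights: np.ndarray, sigma: float = 1.0,
                            margin: float = 1.1) -> int:
    """Count singular directions that rise above the noise bulk.

    `sigma` (noise scale) is assumed known here; `margin` is a hypothetical
    safety factor that absorbs finite-size fluctuations at the bulk edge.
    """
    n_rows, n_cols = weights.shape
    singular_values = np.linalg.svd(weights, compute_uv=False)
    eigenvalues = singular_values**2 / n_rows
    threshold = margin * mp_upper_edge(n_rows, n_cols, sigma)
    return int(np.sum(eigenvalues > threshold))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 1000, 200
    noise = rng.normal(size=(n, p))

    # Plant a strong rank-3 component on top of the noise.
    u, _ = np.linalg.qr(rng.normal(size=(n, 3)))
    v, _ = np.linalg.qr(rng.normal(size=(p, 3)))
    signal = 10.0 * np.sqrt(n) * u @ v.T

    print(count_signal_directions(noise))           # pure noise
    print(count_signal_directions(noise + signal))  # noise plus rank-3 signal
```

Directions below the threshold are statistically indistinguishable from noise under these assumptions, which makes them natural candidates for pruning or low-rank truncation. In practice the noise scale would have to be estimated rather than assumed.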

Theme 5: Alignment, Robustness, and Causal Discovery

Moving beyond correlation, researchers are using causal discovery and preference alignment to ensure models are robust to distribution shifts and grounded in human values.

Alignment & Safety: Constitutional On-Policy Safe Distillation, P²-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization, Video-OPD: Efficient Post-Training of Multimodal Large Language Models for Temporal Video Grounding via On-Policy Distillation, TextAlign: Preference Alignment for Text Rendering with Hierarchical Rewards, and Plan-R1: Safe and Feasible Trajectory Planning as Language Modeling refine how we align models with complex, multi-faceted human intent.
Causal Discovery & Robustness: Your Autoregressive Model Already Reveals the Causal Graph, CauTion: Knowing When to Trust LLMs for Ensemble Causal Discovery, Outsmarting the Chameleon: Counterfactual Decoupling for Tactical OOD Shifts in Live Streaming Risk Assessment, and Mitigating Spurious Correlations with Memorization-Guided Dataset De-Biasing ensure models rely on causally relevant features.
Continual Learning: Forgetting is Not Erasure: Recovering Latent Knowledge via Transport Keys and Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories address catastrophic forgetting through better memory indexing and consolidation.

Theme 6: Multimodal Understanding, Embodied AI, and Benchmarking

The integration of visual, audio, and physical world models is enabling agents to “see” and “act” in real-world environments, supported by increasingly sophisticated scientific benchmarks.

World Models & Embodiment: OrthoPhys: Physically Plausible Video Generation with Orthogonal-View Geometry Guidance, MIND: Multi-Scale Intent Diffusion for Text-Driven Physics-Based Humanoid Control, From None to All: Self-Supervised 3D Reconstruction via Novel View Synthesis, Coupled Local and Global World Models for Efficient First Order RL, See, Infer, Intervene: Proactive World Modeling for Goal-Oriented Social Intelligence, The Agent’s First Day: Benchmarking Learning, Exploration, and Scheduling in the Workplace Scenarios, and Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents push the boundaries of spatial and social intelligence.
Specialized Benchmarking: Cryo-Bench: Benchmarking Foundation Models for Cryosphere Applications, Uncertainty-Calibrated Explainable Artificial Intelligence for Fetal Ultrasound Plane Classification: A Systematic Review, ChartArena: Benchmarking Chart Parsing across Languages, Scenarios, and Formats, and MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation provide the rigorous evaluation tools necessary to measure progress in high-stakes, niche domains.