arXiv ML/AI/CV papers summary

Theme 1: Mechanistic Interpretability and Structural Governance

We are moving away from treating neural networks as inscrutable black boxes. The field is developing “schema infrastructure” to make internal model states queryable, actionable, and transparent. This shift emphasizes that understanding a model requires moving beyond simple proxies toward rigorous, deterministic analysis of its internal architecture.

Representation as a Bottleneck: Representation as a Bottleneck for Mechanistic Interpretability: The Manifestation Unit Protocol organizes component-level statistics into structured fields, allowing natural language queries of what a model “knows.”
Decomposing Singularities: Measuring Dead Directions: Decomposing and Classifying Singular Structure off Canonical Alignment provides a deterministic way to read the “singular structure” of a network, distinguishing between genuine architectural bottlenecks and mere gauge symmetries.
Interpretability Proxies: The Model Organism Lottery: Model Organism Interpretability Strongly Depends on Training Methodology serves as a cautionary tale, demonstrating that how we train “model organisms” for interpretability research significantly biases the results, casting doubt on current interpretability proxies.
Explainability and Logic: Caption Bottleneck Models and GRAPE: Graph-Augmented Prototype Explanations for Interactive Medical Image Diagnosis provide a path toward interpretability, ensuring that the model’s decisions are routed through human-understandable concepts or clinical graphs rather than opaque latent vectors.

Theme 2: Physics-Constrained and Embodied World Models

The integration of physical laws into generative AI is no longer an afterthought; it is a foundational requirement. We are seeing the rise of World Action Models (WAMs) that treat the world as a dynamic, 3D-consistent environment, moving beyond 2D pattern matching to simulate the physical and causal fabric of reality.

Physics-Constrained Generative Modeling: SNAP-FM: Sparse Nonlinear Accelerated Projection for Physics-Constrained Generative Modeling exploits block-sparse Jacobian structures for real-time physics-constrained sampling, while Scaling Up Thermodynamic AI Models bridges Gibbs-sampled Ising systems with deep networks for hardware-native inference. TRIE: An Evaluation Framework for Stochastic PDE Surrogates provides a benchmark for models capturing invariant measures and predictive uncertainty.
Embodied Intelligence: ABot-M0.5: Unified Mobility-and-Manipulation World Action Model, Structured 4D Latent Predictive Model for Robot Planning, and 3D Point World Models: Point Completion Enables More Accurate Dynamics Learning highlight the shift toward 3D-structured latent spaces. ASPIRE: Agentic /Skills Discovery for Robotics and EgoSim: Egocentric World Simulator for Embodied Interaction Generation further enable robots to learn reusable skills.
Geometric Grounding: TetraSDF: Analytic Isosurface Extraction with Multi-resolution Tetrahedral Grid, 2DGH: 2D Gaussian-Hermite Splatting for High-quality Rendering and Better Geometry Features, MonoMSK: Monocular 3D Musculoskeletal Dynamics Estimation, PoseShield: Neural Collision Fields for Human Self-Collision Resolution, and GimbalDiffusion: Gravity-Aware Camera Control for Video Generation ensure models respect biomechanical causality, gravity, and spatial orientation.
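
The "projection" idea behind physics-constrained sampling can be illustrated with a toy example. The sketch below is not SNAP-FM's algorithm: it handles a single scalar constraint with plain Gauss-Newton steps and does not exploit any block-sparse Jacobian structure. The constraint (unit energy), the function names, and the sample values are all assumptions made for illustration.

```python
def project_to_constraint(x, g, grad_g, tol=1e-10, max_iter=50):
    """Pull x toward the set g(x) = 0 with Gauss-Newton steps (scalar constraint)."""
    x = list(x)
    for _ in range(max_iter):
        residual = g(x)
        if abs(residual) < tol:
            break
        grad = grad_g(x)
        norm_sq = sum(gi * gi for gi in grad)
        if norm_sq == 0.0:
            raise ValueError("constraint gradient vanished; cannot project")
        # Minimum-norm correction that cancels the linearized residual.
        step = residual / norm_sq
        x = [xi - step * gi for xi, gi in zip(x, grad)]
    return x


# Hypothetical "physical law": the state must have unit energy, x . x = 1.
def energy_gap(x):
    return sum(xi * xi for xi in x) - 1.0


def energy_grad(x):
    return [2.0 * xi for xi in x]


raw_sample = [0.9, 1.3, -0.4]  # stand-in for an unconstrained generative sample
projected = project_to_constraint(raw_sample, energy_gap, energy_grad)
```

A generative sampler would apply a correction of this kind to each raw sample so that the output satisfies the constraint to within the tolerance. With many coupled constraints the scalar division becomes a linear solve against the constraint Jacobian, which is presumably where sparsity structure starts to matter.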

Theme 3: Reasoning, Verification, and Agentic Reliability

As models transition from passive predictors to active participants in scientific discovery and software engineering, the field is shifting toward verification-based training and structured agentic workflows.

Reasoning Foundations: GRPO, Dr. GRPO, and DAPO Are Three Operations on One Number: The Group-Standard-Deviation Identity demystifies reasoning training methods. Verifiable Rewards for Calibrated Probabilistic Forecasting, Active-GRPO: Adaptive Imitation and Self-Improving Reasoning for Molecular Optimization, and Theoria: Rewrite-Acceptability Verification over Informal Reasoning States demonstrate how to train models to reason without human labels by surfacing hidden premises and licensing every step.
Agentic Science and Engineering: EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale and DiscoPER: An autonomous large language model-powered framework enable autonomous hypothesis generation. Cheap Code, Costly Judgment: A Case Study on Governable Agentic Software Engineering and SWE-Doctor: Guiding Software Engineering Agents with Runtime Diagnosis from Multi-Faceted Bug Reproduction Tests focus on governing agentic output through runtime diagnostics.
Causality and Multimodal Grounding: Are Text-to-Image Models Inductivist Turkeys? A Counterfactual Benchmark for Causal Reasoning, Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning, EFlow: Learning Evidence Flow for Long-Video Reasoning with Adaptive Reflection, PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking, and What’s Hidden Matters: Identifying Planning-Critical Occluded Agents using Vision-Language Models emphasize the need to decouple perception from reasoning to achieve precision.
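
The group-standard-deviation framing in the Reasoning Foundations entry can be made concrete with a small sketch. It rests on assumptions about the three methods as commonly described, not on the paper itself: GRPO divides mean-centered rewards by the group standard deviation, Dr. GRPO drops that division, and DAPO's dynamic sampling discards groups whose rewards are all identical. The function names are hypothetical, and the population standard deviation is used for simplicity.

```python
from statistics import mean, pstdev


def group_advantages(rewards, divide_by_std=True, eps=1e-6):
    """Center rewards on the group mean; optionally scale by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    centered = [r - mu for r in rewards]
    if divide_by_std:
        # GRPO-style (assumed): normalize by the group standard deviation.
        return [c / (sigma + eps) for c in centered]
    # Dr. GRPO-style (assumed): keep the centered rewards unscaled.
    return centered


def keep_group(rewards):
    """DAPO-style filter (assumed): drop groups with zero reward spread."""
    return pstdev(rewards) > 0.0


rewards = [1.0, 0.0, 0.0, 1.0]  # hypothetical pass/fail rewards for one prompt
grpo_style = group_advantages(rewards, divide_by_std=True)
dr_grpo_style = group_advantages(rewards, divide_by_std=False)
```

In all three cases the quantity being acted on is the same `sigma`: it is divided by, ignored, or compared against zero. A group such as `[1.0, 1.0, 1.0, 1.0]` has no spread, so its centered rewards are all zero and it carries no learning signal under any of the variants.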

Theme 4: Efficiency, Adaptation, and Federated Learning

To sustain the growth of AI, researchers are moving away from “brute force” scaling toward adaptive, resource-efficient inference and distributed training.

Adaptive Inference: SpiralFovea: Input-Adaptive Foveated Tokenization as a Third Lever of Resource-Adaptive Inference, MVPruner: Dynamic Token Pruning for Accelerating Multi-view Vision-Language Models in Autonomous Driving, SkipGS: Post-Densification Backward Skipping for Efficient 3DGS Training, and Towards Memory-Efficient Autoregressive Video Generation via Instance-Specific Parametric Absorption optimize computational budgets through token pruning and memory-efficient context handling.
Domain Adaptation: Domain Arithmetic: One-Shot VLA Adaptation under Environmental Shifts and Fora: From Weight-Space to Function-Space Protection in Capability-Preserving Fine-Tuning allow for task adaptation while protecting existing capabilities.
Federated Efficiency: TallyTrain: Communication-Efficient Federated Distillation and FED-FSTQ: Fisher-Guided Token Quantization for Communication-Efficient Federated Fine-Tuning of LLMs on Edge Devices reduce bandwidth requirements for distributed training.
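
Token pruning, mentioned under Adaptive Inference, reduces compute by dropping low-importance tokens before the expensive layers run. The following is a generic top-k sketch, not the method of MVPruner or any other paper listed here. The importance scores are hypothetical; real systems typically derive them from attention weights or a learned predictor.

```python
import math


def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring tokens, preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    k = max(1, math.ceil(len(tokens) * keep_ratio))
    # Indices of the k largest scores; ties go to the earlier position.
    top = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))[:k]
    return [tokens[i] for i in sorted(top)]


tokens = ["sky", "car", "road", "tree", "pedestrian", "cloud"]
scores = [0.05, 0.80, 0.40, 0.10, 0.95, 0.02]  # hypothetical importance scores
kept = prune_tokens(tokens, scores, keep_ratio=0.5)
```

Here `kept` is `["car", "road", "pedestrian"]`. Making `keep_ratio` depend on the input or on the available budget is what turns a fixed pruning rule into an adaptive one.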

Theme 5: Robustness, Safety, and Domain Specialization

The community is increasingly aware that current benchmarks often hide systematic failures, necessitating a shift toward regime-stratified evaluation and structural safety.

The Evaluation Illusion: The Illusion of High Utility in Safety Alignment of Text-to-Image Diffusion Models, Tail-Shape Estimation in LLM Evaluation Is Fragile: A Protocol for Diagnosing False Positives, and Do Time Series Foundation Model Benchmarks Hide Regime-Dependent Failures? Evidence from Traffic Speed Forecasting warn against coarse metrics that mask catastrophic failures in critical regimes.
Structural Safety: Mechanical Conscience: A Mathematical Framework for Dependability of Machine Intelligence and ChainCaps: Composition-Safe Tool-Using Agents via Monotonic Capability Attenuation treat safety as a structural property of the agent’s reasoning and tool-use process.
Domain-Specific Foundation Models: Specialized work such as BrainFIBRE, HOMIE, and Semantic-Guided Reading Order Reconstruction in Historical Armenian Newspapers with LLMs shows the foundation-model paradigm being adapted to high-stakes scientific, clinical, and archival settings.
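
The regime-stratified evaluation called for at the start of this theme can be sketched in a few lines. The data below are invented for illustration (hypothetical traffic speeds in km/h, with hypothetical regime labels) and do not come from any of the papers above.

```python
from collections import defaultdict


def stratified_mae(y_true, y_pred, regimes):
    """Return (overall MAE, {regime: MAE}) for three aligned lists."""
    errors = defaultdict(list)
    for t, p, r in zip(y_true, y_pred, regimes):
        errors[r].append(abs(t - p))
    all_errors = [e for errs in errors.values() for e in errs]
    overall = sum(all_errors) / len(all_errors)
    per_regime = {r: sum(errs) / len(errs) for r, errs in errors.items()}
    return overall, per_regime


# Eight free-flow points predicted well, two congested points predicted badly.
y_true = [60, 62, 61, 59, 60, 63, 58, 61, 20, 15]
y_pred = [61, 61, 60, 60, 61, 62, 59, 60, 45, 40]
regimes = ["free_flow"] * 8 + ["congested"] * 2
overall, per_regime = stratified_mae(y_true, y_pred, regimes)
```

The overall error is 5.8 km/h, which looks acceptable, while the congested regime alone has an error of 25.0 km/h. Reporting only the aggregate hides exactly the regime where a forecast matters most.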