arXiv ML/AI/CV papers summary

This collection of research marks a transition in artificial intelligence: away from the “black-box” era of pattern matching and next-token prediction and toward Agentic, Simulative, and Verifiable AI. The focus shifts from raw scale to the structural, geometric, and physical grounding of intelligence, so that models operate with logical and physical integrity and not merely with statistical plausibility.

Theme 1: Mechanistic Interpretability and Structural Alignment

To move beyond surface-level performance, we must understand the “internal life” of neural networks. These papers probe the internal geometry of models to ensure they are not just guessing, but operating on stable, interpretable representations.

Why Do Few-Step Text Latents Fail When Image Latents Work? Non-Commitment at Sharp Categorical Readouts identifies “sharpness” in categorical readouts as the geometric culprit behind text generation collapse.
Resolving superposition in AI for interpretability and cross-modal alignment in patient-neuronal images utilizes sparse autoencoders to untangle concepts forced into the same latent space, recovering geometric fidelity.
SemRF: A Semantic Reference Frame for Residual-Stream Dynamics in Language Models provides a formal framework to track how computation evolves across layers, preventing measurement drift from obscuring the model’s true logic.
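
As a rough illustration of the sparse-autoencoder idea mentioned above, the sketch below computes the usual objective (reconstruction error plus an L1 penalty on the hidden code) for a single activation vector. It is not the method of any paper listed here. The function names `sae_forward` and `sae_loss`, the weight layout, and the penalty coefficient `l1_coeff` are all illustrative assumptions.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]  # row-major: one row per output unit


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (rows = output units) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def sae_forward(
    x: Vector, w_enc: Matrix, b_enc: Vector, w_dec: Matrix, b_dec: Vector
) -> Tuple[Vector, Vector]:
    """Encode x into a non-negative hidden code, then decode it back."""
    pre = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    code = [max(0.0, p) for p in pre]  # ReLU keeps the code non-negative
    recon = [r + b for r, b in zip(matvec(w_dec, code), b_dec)]
    return code, recon


def sae_loss(x: Vector, code: Vector, recon: Vector, l1_coeff: float = 0.1) -> float:
    """Mean squared reconstruction error plus an L1 sparsity penalty on the code."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    l1 = sum(abs(c) for c in code)
    return mse + l1_coeff * l1
```

The hidden layer is typically wider than the input, and the L1 term pushes most hidden units to zero for any given input, which is what lets individual units line up with individual concepts.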

Theme 2: Agentic Reasoning and Self-Evolution

The next generation of AI is not merely “trained”; it is “coached” through iterative loops that prioritize reasoning quality over data volume. These systems are designed to plan, verify, and improve themselves autonomously.

From Search to Synthesis: Training LLMs as Zero-Shot Workflow Generators treats workflow generation as a meta-learning problem.
INFUSER: Influence-Guided Self-Evolution Improves Reasoning uses an “influence score” to ensure a generator-solver curriculum actually improves reasoning.
Learning from Failure: Inference-Time Self-Improvement for Computer-Use Agents demonstrates that diagnosing failed trajectories is more efficient than success-only training.
ReGRPO: Reflection-Augmented Policy Optimization for Tool-Using Agents and One Reflection Is Not Enough: Self-Correcting Autonomous Research via Multi-Hypothesis Failure Attribution highlight the necessity of self-correction and multi-hypothesis testing.
The Consistency Dilemma in LLMs: Generator-Evaluator Agreement and Vulnerability to Mistakes warns that internal coherence must not be mistaken for truth.

Theme 3: Embodied Intelligence and Physical Grounding

AI is stepping out of the server room and into the physical world. This requires “World Models” that understand 3D space, temporal dynamics, and the laws of physics.

Orca: The World is in Your Mind shifts the paradigm toward “Next-State-Prediction.”
ForgeDrive: Bidirectional Cross-Conditioning for Unified Visual-Action Generation in Autonomous Driving and StemVLA: An Open-Source Vision-Language-Action Model with Future 3D Spatial Geometry Knowledge and 4D Historical Representation intertwine action generation with visual prediction.
3D HAMSTER: Bridging Planning and Control in Hierarchical Vision Language Action Models through 3D Trajectory Guidance and MIRTH: Mutual-Information Reasoning with Temporal Hubs for Vision-Language-Action Agents emphasize the need for 3D spatial awareness.
EgoVITA: Learning to Plan and Verify for Egocentric Video Reasoning introduces a “plan-then-verify” paradigm to reduce hallucinations.
OpenLife: Toward Open-World Artificial Life with Autonomous LLM Agents explores emergent social structures in autonomous agents.
LabGuard: Grounding Natural-Language Laboratory Rules into Runtime Guards for Embodied Laboratory Agents and A Self-Evolving Agentic System for Automated Generation and Execution of Biological Protocols bring agents into the physical laboratory, the first by turning natural-language rules into runtime safety guards and the second by automating protocol generation and execution.

Theme 4: Scientific Machine Learning and Inverse Rendering

By integrating physical laws into machine learning, we are transforming AI from a statistical curve-fitter into a simulator of physical systems.

ReactionAtlas: Ab origine exploration of chemical reaction networks with machine learning and Joint discovery of governing partial differential equations from multi-source datasets by competitive optimization automate scientific discovery.
The HydroGym Reinforcement Learning Platform for Fluid Dynamics demonstrates RL agents discovering robust physical principles.
Diffusion-Based Material Regularization for Physics-Based Inverse Rendering and AEGIR: Modeling Area Emitters for Indoor Inverse Rendering using Gaussian Splatting address physics-based inverse rendering, the first with a diffusion-based material regularizer and the second by modeling area emitters with Gaussian splatting.
PIAvatar: Physically Interactive Avatars via Deformation Gradient Decoupling and OVOW: One Video, One World: Turning Monocular Video into Physical 4D Scenes create simulation-ready 4D assets.
TerraDiT-Ω: Unified Spatial Control for Satellite Image Synthesis with Any Geospatial Primitive and AC3S: Adaptive Conditioning for 3D-Aware Synthetic Data Generation inject geometric priors into generative models.
GeoNVS: Geometry Grounded Video Diffusion for Novel View Synthesis and WarpHammer: Densifying Scene Warps with 3D Object Priors for Extreme View Synthesis use 3D primitives for stable view synthesis.

Theme 5: Governance, Safety, and Verifiable AI

As systems gain autonomy, safety must be a structural constraint rather than an afterthought. We are moving toward “compliance-by-design” and formal verification.

AgentBound: Verifiable Behavioral Governance for Autonomous AI Agents uses cryptographic receipts to bind actions to policies.
ComplianceGate: Classifier-Gated Multi-Tier LLM Routing for Inference in Regulated Industries enforces compliance at the architectural level.
Incentive Aware AI Regulations: A Credal Characterisation provides a framework for market-based regulation.
Beyond Compilation: Evaluating Faithful Natural-Language-to-Lean Statement Formalization and Beyond the Library: An Agentic Framework for Autoformalizing Research Mathematics bridge natural language and formal mathematical verification.
AI Transparency: Governance Compliance or Stakeholder Requirements? critiques the “Transparency Illusion.”
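Security–Fidelity Tradeoffs: The Hidden Cost of Prompt Injection Defense and Containment Verification: AI Safety Guarantees Independent of Alignment examine what prompt-injection defenses cost in output fidelity and how containment guarantees can hold independent of alignment.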
RigorBench: Benchmarking Engineering Process Discipline in Autonomous AI Coding Agents and Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework shift evaluation toward process discipline.
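
To make the idea of binding an action to a policy concrete, here is a minimal sketch of a signed receipt: an HMAC computed over the action together with a hash of the policy text. This is a generic illustration, not AgentBound's actual scheme, which this summary does not describe. The field names, the helpers `issue_receipt` and `verify_receipt`, and the use of a shared secret key are all hypothetical.

```python
import hashlib
import hmac
import json


def _canonical(obj: dict) -> bytes:
    """Serialize deterministically so the same content always yields the same bytes."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def issue_receipt(key: bytes, action: dict, policy_text: str) -> dict:
    """Return a receipt tying `action` to the exact policy text in force."""
    body = {
        "action": action,
        "policy_sha256": hashlib.sha256(policy_text.encode("utf-8")).hexdigest(),
    }
    tag = hmac.new(key, _canonical(body), hashlib.sha256).hexdigest()
    return {"body": body, "tag": tag}


def verify_receipt(key: bytes, receipt: dict, policy_text: str) -> bool:
    """Check that the receipt is untampered and refers to `policy_text`."""
    body = receipt["body"]
    expected_policy = hashlib.sha256(policy_text.encode("utf-8")).hexdigest()
    if not hmac.compare_digest(body["policy_sha256"], expected_policy):
        return False
    expected_tag = hmac.new(key, _canonical(body), hashlib.sha256).hexdigest()
    return hmac.compare_digest(receipt["tag"], expected_tag)
```

A shared-key HMAC keeps the sketch short, but it only convinces parties who hold the key. A system that needs third-party auditability would more plausibly use asymmetric signatures.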

Theme 6: Efficiency and Domain-Specific Intelligence

The path to super-intelligence requires extreme efficiency and specialized knowledge, moving away from brute-force computation toward “slow thinking” and vertical foundation models.

Hierarchical Global Attention (HGA), Brevity is the Soul of Inference Efficiency: Inducing Concision in VLMs via Data Curation, and RhinoVLA Technical Report optimize models for edge hardware.
Reasoning-aware Speculative Decoding for Efficient Vision-Language-Action Models in Autonomous Driving, CoLT: Teaching Multi-Modal Models to Think with Chain of Latent Thoughts, and Reasoning in machine vision by learning fast and slow thinking implement dual-process “fast and slow” reasoning.
TotalFM: An Organ-Separated 3D-CT Foundation Model, M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding, MuSViT: A Foundation Vision Model for Sheet Music Representation, and Learning to Decipher from Pixels: A Case Study of Copiale demonstrate the power of domain-specific foundation models.
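
The speculative decoding named in one of the titles above can be sketched in its simplest greedy form: a cheap draft model proposes a few tokens, and the expensive target model keeps them only while it agrees. This is the generic technique, not the reasoning-aware variant from the paper, whose details this summary does not give. The names `speculative_decode`, `draft`, `target`, and the block size `k` are illustrative assumptions, and both models are stand-in functions that map a token list to the next token.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_decode(
    draft: NextToken, target: NextToken, prompt: List[int], max_new: int, k: int = 4
) -> List[int]:
    """Greedy draft-then-verify decoding; the output matches target-only greedy decoding."""
    if k < 1:
        raise ValueError("k must be at least 1")
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # Fast path: the draft model proposes up to k tokens.
        proposal: List[int] = []
        ctx = list(out)
        for _ in range(min(k, limit - len(out))):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # Slow path: the target checks each proposed token in order.
        # (A real system would score all k positions in one batched pass.)
        for tok in proposal:
            expected = target(out)
            out.append(expected)
            if expected != tok:
                break  # keep the target's token, drop the rest of the draft
    return out
```

Because every emitted token is the target's own choice, the draft model affects only speed, never the result. The savings come from verifying several drafted tokens in one target pass, which this sequential sketch does not model.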