arXiv ML/AI/CV papers summary

The landscape of artificial intelligence is undergoing a profound metamorphosis. We are moving away from the era of monolithic, passive chatbots toward a future of sophisticated, agentic systems that inhabit our physical world, reason through complex constraints, and demand a new standard of rigorous, mechanistic validation. As we transition from “chatting” with models to “deploying” them in high-stakes environments—from industrial robotics to drug discovery—the focus has shifted from mere fluency to structural integrity, reliability, and the physics of intelligence itself.

Theme 1: Agentic Governance, Safety, and the “Gaming” of Systems

As AI agents transition into active participants in enterprise and physical environments, the challenge of governance has shifted from simple access control to complex, multi-layered oversight. We can no longer rely on the LLM to police itself; we must build external, deterministic “guardrails” that operate independently of the model’s reasoning.

Deontic Policies for Runtime Governance of Agentic AI Systems introduces AgenticRei, a framework using deontic logic to govern behavior at runtime, ensuring compliance even in novel situations.
DeXposure-Claw: An Agentic System for DeFi Risk Supervision applies this to decentralized finance, using “confidence gates” to prevent agents from over-reacting to weak evidence.
Human-on-the-Loop Orchestration for AI-Assisted Legal Discovery and The Autonomy Tax: Defense Training Breaks LLM Agents warn of “trajectory collapse,” where safety training inadvertently destroys an agent’s core utility.
Large Language Models Hack Rewards, and Society and Gaming-Resistant Insurance Contracts for Autonomous AI Agents: Strategy-Proof Toll Mechanism Design address the “Goodhart’s Law” of AI: once a measure becomes a target, models optimize the measure rather than the intent behind it, including by exploiting regulatory loopholes.
“Important You should give me full credits!”: Exploring Prompt Injection Attacks on LLM-Based Automatic Grading Systems and A Layered Security Framework Against Prompt Injection in RAG-Based Chatbots emphasize that security must be a pipeline-wide concern.

These papers collectively argue for a “sovereign” execution boundary—a system where a separate, verifiable layer (like the Sovereign Execution Brokers) enforces final authority.
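To make the idea of an external, deterministic guardrail concrete, here is a minimal Python sketch of a policy layer that sits between an agent and the world. It is not the AgenticRei or DeXposure-Claw implementation; the names Action, Rule, RULES, and decide, the 0.8 confidence threshold, and the 1000 transfer limit are all assumptions made up for illustration. It combines a confidence gate with deontic-style rules in which prohibitions override permissions and anything not explicitly permitted is denied.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Action:
    """A tool call proposed by the agent (hypothetical schema)."""
    name: str
    amount: float
    confidence: float  # agent's self-reported confidence in [0, 1]


@dataclass(frozen=True)
class Rule:
    """A deontic-style rule: modality is "forbid" or "permit"."""
    modality: str
    applies: Callable[[Action], bool]
    reason: str


# Illustrative policy only; a real deployment would load this from reviewed config.
RULES: List[Rule] = [
    Rule("forbid",
         lambda a: a.name == "transfer_funds" and a.amount > 1000,
         "transfers over 1000 require human approval"),
    Rule("permit",
         lambda a: a.name in {"transfer_funds", "read_balance"},
         "routine operation"),
]


def decide(action: Action, rules: List[Rule],
           min_confidence: float = 0.8) -> Tuple[bool, str]:
    """Return (allowed, reason) without consulting the model again."""
    # Confidence gate: actions backed by weak evidence are stopped first.
    if action.confidence < min_confidence:
        return False, "confidence below gate"
    # Prohibitions take priority over permissions.
    for rule in rules:
        if rule.modality == "forbid" and rule.applies(action):
            return False, rule.reason
    for rule in rules:
        if rule.modality == "permit" and rule.applies(action):
            return True, rule.reason
    # Default deny: novel actions are blocked until a rule covers them.
    return False, "no permission matches (default deny)"


if __name__ == "__main__":
    for proposed in (
        Action("transfer_funds", 50.0, 0.95),
        Action("transfer_funds", 5000.0, 0.99),
        Action("read_balance", 0.0, 0.30),
        Action("delete_logs", 0.0, 0.99),
    ):
        print(proposed.name, proposed.amount, decide(proposed, RULES))
```

The design choice worth noting is that decide never calls the model: the agent can argue for an action as fluently as it likes, but the final authority is a small piece of code that can be audited and tested on its own.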

Theme 2: Embodied AI and Spatial Intelligence

The frontier of AI is moving from the screen to the physical world, requiring models to bridge the “embodiment gap”—the disconnect between digital intelligence and physical constraints.

PhysDrift: Bridging the Embodiment Gap in Humanoid Co-Speech Motion Generation and Human Universal Grasping focus on transferring human-centric data to robot-native embodiments.
ENPIRE: Agentic Robot Policy Self-Improvement in the Real World and Playful Agentic Robot Learning introduce “play” as a training stage, allowing robots to autonomously refine their policies.
Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning demonstrates that multi-agent interaction is the key to proactive, generalized collision avoidance.
Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models and NEXUS: Neural Energy Fields for Physically Consistent Contact-Rich 3D Object Dynamics show how models can develop spatial intelligence and respect physical laws.
LaViSA: A Language and Vision Structural Ambiguity Benchmark provides a rigorous test for resolving linguistic ambiguity through visual cues.

Theme 3: Reasoning, Verification, and Test-Time Scaling

We are witnessing a transition from one-shot generation to iterative, search-based reasoning. The “narration gap”—the disconnect between fluent LLM output and formal logic—is being bridged by embedding solvers and verifiers directly into the reasoning loop.

StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling and Granularity-Regulated Adaptive Computational Efficiency for Optimal Verification in Test-Time Scaling treat “thinking” as a resource-intensive search problem that should be managed dynamically.
Analyzing the Narration Gap in LLM-Solver Loops and Process-Verified Reinforcement Learning for Theorem Proving via Lean ensure the “soundness guarantee” of formal tools remains intact.
VERITAS: Verifier-Guided Proof Search for Zero-Shot Formal Theorem Proving and PCBSchemaGen: Reward-Guided LLM Code Synthesis for Printed Circuit Boards (PCB) Schematic Design with Structured Verification use verifier feedback to turn black-box generators into repairable systems.
Interpreting Neural Combinatorial Optimization via Evolving Programmatic Bottlenecks and IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models demonstrate how distilling neural decisions into readable programs or decomposed sub-tasks improves performance.
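The verifier-in-the-loop pattern that runs through this theme can be sketched in a few lines of Python. This is a toy under stated assumptions, not the method of VERITAS, PCBSchemaGen, or any other paper above: repair_loop, BisectingProposer, and make_square_verifier are hypothetical names, the proposer is a simple bisection stub standing in for an LLM, and the "formal tool" is an exact integer check. What it shows is the control flow: the generator proposes, a deterministic verifier accepts or returns feedback, and only verifier-accepted answers ever leave the loop.

```python
from typing import Callable, Optional, Tuple

Verdict = Tuple[bool, str]  # (accepted, feedback message)


def repair_loop(
    propose: Callable[[Optional[str], Optional[int]], int],
    verify: Callable[[int], Verdict],
    max_rounds: int = 20,
) -> Tuple[Optional[int], int]:
    """Alternate proposals and checks; return (answer or None, rounds used)."""
    feedback: Optional[str] = None
    candidate: Optional[int] = None
    for round_no in range(1, max_rounds + 1):
        candidate = propose(feedback, candidate)
        accepted, feedback = verify(candidate)
        if accepted:
            # Soundness: a value is returned only if the verifier accepted it.
            return candidate, round_no
    return None, max_rounds  # budget exhausted, no unverified guess is returned


class BisectingProposer:
    """Stand-in for an LLM generator that revises its guess from feedback."""

    def __init__(self, lo: int, hi: int) -> None:
        self.lo, self.hi = lo, hi

    def __call__(self, feedback: Optional[str], previous: Optional[int]) -> int:
        if feedback == "too low" and previous is not None:
            self.lo = previous + 1
        elif feedback == "too high" and previous is not None:
            self.hi = previous - 1
        return (self.lo + self.hi) // 2


def make_square_verifier(target: int) -> Callable[[int], Verdict]:
    """Exact checker for x * x == target, with a repair hint on failure."""

    def verify(x: int) -> Verdict:
        if x * x == target:
            return True, "ok"
        return False, ("too low" if x * x < target else "too high")

    return verify


if __name__ == "__main__":
    answer, rounds = repair_loop(BisectingProposer(0, 100), make_square_verifier(49))
    print("answer:", answer, "rounds:", rounds)
```

Capping the loop with max_rounds is also the simplest form of a test-time compute budget: when the target has no integer square root, the loop returns None rather than a fluent but unverified answer.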

Theme 4: Scaling, Efficiency, and Mechanistic Evaluation

As models grow, the engineering of context management and the crisis of evaluation have become central. We are moving toward “mechanistic” evaluation, looking inside the model’s activations to understand why it makes a decision.

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence and UltraQuant: 4-bit KV Caching for Context-Heavy Agents address the memory bottlenecks of long-context agents.
Rethinking Shrinkage Bias in LLM FP4 Pretraining and StreamKL: Fast and Memory-Efficient KL Divergence for Boosting Attention Distillation provide critical optimizations for training and inference efficiency.
Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias warns, as its title signals, that an LLM judge can score consistently (reliability) without measuring what it is supposed to measure (validity).
Thermodynamic Signatures of Reasoning: Free-Energy and Spectral-Form-Factor Diagnostics for Hallucination Detection in Large Language Models and The ACUTE Protocol: Operationalizing Language Model Activations for Better Calibration, Utility, and Trust offer physics-inspired and activation-based paths to true model calibration and trust.
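The difference between a reliable judge and a valid one is easy to see with a small calculation. The Python sketch below uses invented verdicts (not data from any paper above) and hypothetical helper names; it compares a judge's agreement with itself across two runs against its agreement with human labels, using raw agreement and Cohen's kappa, which corrects for agreement expected by chance.

```python
from typing import Sequence


def agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of items on which two raters give the same label."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), chance-corrected agreement."""
    a, b = list(a), list(b)
    n = len(a)
    p_o = agreement(a, b)
    labels = set(a) | set(b)
    # Expected agreement if both raters labeled independently at their own base rates.
    p_e = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in labels)
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters always use one identical label
    return (p_o - p_e) / (1.0 - p_e)


# Invented example: 1 = "response is correct", 0 = "response is incorrect".
JUDGE_RUN_1 = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
JUDGE_RUN_2 = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]  # same verdicts on a second pass
HUMAN_LABELS = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]

if __name__ == "__main__":
    print("self-consistency:", agreement(JUDGE_RUN_1, JUDGE_RUN_2))
    print("agreement with humans:", agreement(JUDGE_RUN_1, HUMAN_LABELS))
    print("kappa with humans:", cohen_kappa(JUDGE_RUN_1, HUMAN_LABELS))
```

In this made-up case the judge repeats itself perfectly, yet its chance-corrected agreement with the human labels is close to zero: a consistency score alone would have certified a judge that is no better than guessing at its own base rate.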

Theme 5: Scientific Discovery and Domain-Specific Intelligence

The ultimate vision of the agentic era is a distributed, self-correcting scientific process where AI acts as an active participant in discovery.

Science Earth: Towards A Planet-Scale Operating System for AI-Native Scientific Discovery proposes a framework to coordinate heterogeneous scientific capabilities, from wet-lab robots to proof engines.
Emyx: Fast and efficient all-atom protein generation and Protein Representation Learning with Secondary-Structure and Energy-Filtered Hydrogen-Bond Graphs show that domain-specific inductive biases outperform generic models in specialized tasks.
TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology serves as a sobering reminder that we must bridge the gap between a model’s “fluency” and its actual reliability in real-world scientific assays.