arXiv ML/AI/CV papers summary

Theme 1: Agentic Architectures & Long-Horizon Reasoning

The frontier of AI is shifting from static “chatting” to dynamic “doing.” We are moving toward agentic systems capable of planning, executing, and refining their own skills over extended time horizons. The challenge lies in maintaining state, managing memory, and recovering from errors without human intervention.

Memory Management: Moving beyond flat text retrieval, researchers are adopting graph-based and adaptive structures. Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents and TokenMizer: Graph-Structured Session Memory for Long-Horizon LLM Context Management utilize relational context, while AdaMEM: Test-Time Adaptive Memory for Language Agents allows memory to evolve during inference. Vision Hopfield Memory Networks and The Topological Trouble With Transformers further suggest that recurrent, brain-inspired architectures are necessary to overcome the state-tracking bottlenecks of standard Transformers.
Planning and Reliability: To prevent early commitment failures, DiG-Plan: Mitigating Early Commitment for Tool-Graph Planning via Diffusion Guidance and When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents emphasize dynamic replanning. For complex tasks, LeanMarathon: Toward Reliable AI Co-Mathematicians through Long-Horizon Lean Autoformalization advocates for a “blueprint” approach using formal proof skeletons.
Skill Distillation: Agents are learning to improve themselves. Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills and Evidence Over Plans: Online Trajectory Verification for Skill Distillation provide frameworks for distilling robust skills from actual environment interactions rather than just prior plans.
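
To make the contrast between flat text retrieval and relational context in the Memory Management item concrete, here is a minimal sketch of a graph-structured memory. It is not the method of any paper named above: the class `GraphMemory`, its methods `add_fact` and `recall`, and the example facts are all hypothetical names chosen for illustration. The idea shown is only that recalling an entity also pulls in facts connected to it within a few hops.

```python
from collections import defaultdict, deque


class GraphMemory:
    """Toy relational memory (hypothetical): entities are nodes, facts are labeled edges."""

    def __init__(self):
        # entity -> set of (subject, relation, object) triples that mention it
        self._facts = defaultdict(set)

    def add_fact(self, subject, relation, obj):
        triple = (subject, relation, obj)
        self._facts[subject].add(triple)
        self._facts[obj].add(triple)

    def recall(self, entity, max_hops=2):
        """Return facts reachable from `entity` by expanding entities up to `max_hops` deep."""
        seen_entities = {entity}
        facts = []
        queue = deque([(entity, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth == max_hops:
                continue  # do not expand beyond the hop budget
            # sorted() keeps the output order deterministic
            for triple in sorted(self._facts.get(node, ())):
                if triple not in facts:
                    facts.append(triple)
                for other in (triple[0], triple[2]):
                    if other not in seen_entities:
                        seen_entities.add(other)
                        queue.append((other, depth + 1))
        return facts


memory = GraphMemory()
memory.add_fact("deploy_script", "depends_on", "api_key")
memory.add_fact("api_key", "stored_in", "vault")
memory.add_fact("unrelated_note", "mentions", "lunch")

for fact in memory.recall("deploy_script", max_hops=2):
    print(fact)
```

A flat keyword lookup on "deploy_script" would return only the first fact; the two-hop traversal also surfaces where the key is stored, while the unconnected note stays out of the context.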

Theme 2: Optimization, Efficiency, and Architectural Innovation

As models scale, the cost of inference and training has become a primary constraint. The field is moving toward “architectural precision,” where we optimize not just for output, but for resource awareness and internal geometric integrity.

Compression and Quantization: Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models and Beyond Output Matching: Preserving Internal Geometry in NVFP4 LLM Distillation highlight the importance of preserving internal model geometry. LLMCodec: Adapting Video Codecs for Efficient Weight Compression of Large Language Models introduces cross-disciplinary compression techniques.
Inference-Time Efficiency: GITCO: Gated Inference-Time Context Optimization in TSFMs and QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving optimize performance by selectively managing context and caches. Furthermore, Exact Linear Attention offers a path to linear computational complexity for ultra-long sequences.
Geometric Inductive Bias: HypRAG: Hyperbolic Dense Retrieval for Retrieval Augmented Generation demonstrates that using hyperbolic space to capture the hierarchical nature of language significantly improves retrieval performance.
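
For intuition on why hyperbolic space suits hierarchies, the sketch below computes the standard Poincaré-ball distance, d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))). This is an assumption made for illustration: the summary does not say which hyperbolic model HypRAG uses, and the function name `poincare_distance` and the sample points are hypothetical.

```python
import math


def poincare_distance(u, v):
    """Distance between two points inside the unit ball (Poincare model)."""
    sq_norm_u = sum(c * c for c in u)
    sq_norm_v = sum(c * c for c in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


# The same Euclidean gap of 0.1 costs more distance near the boundary than near the origin.
near_origin = poincare_distance((0.0, 0.0), (0.1, 0.0))
near_boundary = poincare_distance((0.8, 0.0), (0.9, 0.0))
print(near_origin, near_boundary)
```

Because distances grow rapidly toward the boundary, there is room to place many leaf-level items far apart from each other while keeping each close to a shared ancestor nearer the origin, which is the property usually cited for tree-like data.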

Theme 3: Safety, Alignment, and Enterprise Reliability

As AI enters high-stakes environments, we face a “Safety Paradox”: our attempts to align models can inadvertently create new vulnerabilities.

The Safety Paradox: Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack reveals that models with high safety-judgment capabilities are paradoxically more susceptible to attacks that exploit that very knowledge.
Guardrails and Verification: From Risk Classification to Action Plan Remediation: A Guardrail Feedback Driven Framework for LLM Agents and GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection propose multi-layered defenses. Self-Commitment Latency: A Reward-Free Probe for Prompted Implicit Hacking offers a way to audit agents for implicit reward hacking.
Neurosymbolic Grounding: To ensure reliability in regulated industries, Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents and Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification ground LLMs in formal ontologies. SAGE: Scalable AI Governance & Evaluation and Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation provide the necessary statistical toolkits for trust certification.
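
As background for the prediction-powered inference mentioned above, here is a minimal sketch of the basic prediction-powered mean estimator: average the model's predictions on a large unlabeled set, then correct that average with the mean residual measured on a small labeled set. This is not the GLIDE library's API; the function name `ppi_mean` and the sample numbers are hypothetical, and the scores are assumed to be, for example, automated judge ratings with a few human labels.

```python
from statistics import fmean, variance


def ppi_mean(unlabeled_preds, labeled_preds, labels):
    """Prediction-powered estimate of a mean, with a plug-in standard error.

    unlabeled_preds: model predictions on examples that have no label
    labeled_preds:   model predictions on the labeled examples
    labels:          trusted labels for those same examples
    Each input needs at least two values so that a sample variance exists.
    """
    if len(labeled_preds) != len(labels):
        raise ValueError("labeled_preds and labels must have the same length")
    # Rectifier: how far the model is off, on average, where the truth is known.
    residuals = [y - f for y, f in zip(labels, labeled_preds)]
    estimate = fmean(unlabeled_preds) + fmean(residuals)
    std_err = (
        variance(unlabeled_preds) / len(unlabeled_preds)
        + variance(residuals) / len(residuals)
    ) ** 0.5
    return estimate, std_err


estimate, std_err = ppi_mean(
    unlabeled_preds=[0.8, 0.6, 0.7, 0.9],
    labeled_preds=[0.9, 0.5],
    labels=[1.0, 0.0],
)
print(estimate, std_err)
```

The correction term is what keeps the estimate honest when the automated scorer is biased: if the scorer is accurate the residuals are small and the large unlabeled set dominates, and if it is not, the labeled set pulls the estimate back.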

Theme 4: The Science of Evaluation & Benchmarking

We are currently in a “crisis of measurement,” where static benchmarks fail to capture the true capabilities of modern agents.

Pipeline Evaluation: PortBench: A Correlation-Aware, Full-Pipeline Benchmark for LLM-Driven Portfolio Management and MUSE: Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation argue for evaluating the entire task pipeline.
Addressing Bias: Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation warns against evaluating outdated models, while CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks proposes using model ensembles to create contamination-free, task-specific benchmarks.

Theme 5: Embodied AI, RL, and Autonomous Engineering

The integration of AI into the physical and scientific world requires grounding reasoning in spatial, temporal, and physical reality.

Embodied Reasoning: WorldFly: A World-Model-Based Vision-Language-Action Model for UAV Navigation and Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models focus on 3D spatial understanding. What Objects Enable, Not What They Are: Functional Latent Spaces for Affordance Reasoning shifts focus toward “affordance reasoning”: understanding what an object does rather than just how it looks.
Reinforcement Learning: Reward Learning through Ranking Mean Squared Error and Mutual Information Preference Optimization (MIPO) improve upon standard RLHF. GIPO: Gaussian Importance Sampling Policy Optimization and Reflex: Reinforcement Learning with Reflection Symmetry Exploitation in State-Based Continuous Control target training stability and sample efficiency, with Reflex doing so by exploiting reflection symmetries in state-based continuous control.
Autonomous Systems: RAT: RunAnyThing via Fully Automated Environment Configuration, CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe, and AutoDFT: A Closed-Loop Multi-Agent Framework for Autonomous DFT Calculations demonstrate that agents can now automate complex engineering and scientific discovery tasks.
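
One generic way a reflection symmetry can improve sample efficiency, as referenced in the Reinforcement Learning item, is data augmentation: every recorded transition is mirrored to yield a second, equally valid transition. The sketch below is an assumption made for illustration and is not claimed to be the Reflex algorithm. The names `Transition`, `mirror`, and `augment` are hypothetical, as is the premise that the environment's dynamics and reward are unchanged when the chosen state and action coordinates flip sign.

```python
from typing import List, NamedTuple, Set, Tuple


class Transition(NamedTuple):
    state: Tuple[float, ...]
    action: Tuple[float, ...]
    reward: float
    next_state: Tuple[float, ...]


def reflect(vec: Tuple[float, ...], flip: Set[int]) -> Tuple[float, ...]:
    """Negate the coordinates whose indices are in `flip`; keep the rest."""
    return tuple(-x if i in flip else x for i, x in enumerate(vec))


def mirror(t: Transition, state_flip: Set[int], action_flip: Set[int]) -> Transition:
    """Mirror image of a transition; the reward is assumed invariant under reflection."""
    return Transition(
        reflect(t.state, state_flip),
        reflect(t.action, action_flip),
        t.reward,
        reflect(t.next_state, state_flip),
    )


def augment(batch: List[Transition], state_flip: Set[int], action_flip: Set[int]) -> List[Transition]:
    """Return the batch followed by its mirrored copies, doubling the training data."""
    return list(batch) + [mirror(t, state_flip, action_flip) for t in batch]


# Example: state = (lateral position, height), action = (lateral force,).
batch = [Transition((1.0, 2.0), (0.5,), 1.0, (1.5, 2.0))]
for t in augment(batch, state_flip={0}, action_flip={0}):
    print(t)
```

The augmentation is only sound if the symmetry truly holds for the task; an asymmetric obstacle or reward would make the mirrored transitions incorrect.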