Tag

#benchmarking

14 articles

Validating Distributed LLM Serving Benchmarks with NVIDIA srt-slurm, SLURM Recipes, Parameter Sweeps, and Pareto Analysis

NVIDIA's srt-slurm framework simplifies distributed LLM serving benchmarking using SLURM, enabling reproducible workflows and advanced performance analysis.

Jul 21

UK's AI Security Institute finds standard benchmarks systematically underestimate what AI agents can actually do

Learn how to simulate and evaluate AI agent performance under varying compute budgets to better assess true capabilities, inspired by findings from the UK's AI Security Institute.

Jul 3

Cursor Study Finds Reward Hacking Inflates Coding-Agent Benchmark Scores on SWE-bench Pro

Learn to build a code agent evaluation system that detects reward hacking in benchmarking, where agents retrieve known fixes instead of deriving solutions.

Jun 26

AI search agents often confirm what they already know instead of actually researching the web

Learn to build a time-based benchmarking tool to evaluate whether AI search agents actually research the web or just confirm pre-trained knowledge.

May 30

Best AI Agents for Software Development Ranked: A Benchmark-Driven Look at the Current Field

Learn how to set up a benchmarking framework to evaluate AI coding agents like Claude Code and GPT-5.5, similar to industry benchmarks used in 2026.

May 14

Meta AI Releases NeuralBench: A Unified Open-Source Framework to Benchmark NeuroAI Models Across 36 EEG Tasks and 94 Datasets

Meta AI has launched NeuralBench, a unified open-source framework for benchmarking NeuroAI models using the largest EEG benchmark to date, covering 36 tasks and 94 datasets.

May 9

Top 7 Benchmarks That Actually Matter for Agentic Reasoning in Large Language Models

Learn how to set up and run a basic benchmark test for agentic reasoning using Python and Hugging Face Transformers. This tutorial teaches you to evaluate AI agents' ability to handle real-world tasks.

Apr 25

Agent skills look great in benchmarks but fall apart under realistic conditions, researchers find

Learn how to build AI agents with modular skills using LangChain and OpenAI, and understand why these agents often fail in realistic conditions despite strong benchmark performance.

Apr 12

An Implementation Guide to Running NVIDIA Transformer Engine with Mixed Precision, FP8 Checks, Benchmarking, and Fallback Execution

This article explains how to implement NVIDIA's Transformer Engine with mixed-precision, FP8 support, benchmarking, and fallback execution for optimizing transformer model performance.

Apr 6

AI benchmarks systematically ignore how humans disagree, Google study finds

This article explains how human disagreement in AI benchmarking can lead to unreliable performance metrics and why current practices need to evolve to account for annotation variability.

Apr 4

Nvidia sets new MLPerf records with 288 GPUs while AMD and Intel focus on different battles

Nvidia sets new MLPerf records with 288 GPUs while AMD and Intel pursue different strategic paths in AI hardware competition.

Apr 2

AI models confidently describe images they never saw, and benchmarks fail to catch it

AI models like GPT-5 and Gemini 3 Pro can confidently describe images they've never seen, and current benchmarks fail to detect this issue. A Stanford study highlights the dangers of AI hallucinations and calls for new evaluation methods.

Mar 30