The Hidden Cost of AI Reasoning Models: Why Benchmarking Is Getting More Expensive