ARC-AGI benchmark nears solution, but reveals critical flaws in AI testing