All posts tagged: openai o3

AI Benchmark Discrepancy Reveals Gaps in Performance Claims

FrontierMath accuracy for OpenAI’s o3 and o4-mini compared to leading models. Image: Epoch AI

The latest results from FrontierMath, a benchmark test for generative AI on advanced math problems, show OpenAI’s o3 model performed worse than OpenAI originally stated. While newer OpenAI models now outperform o3, the discrepancy highlights the need to scrutinize AI benchmarks closely. Epoch AI, the research institute that created and administers the test, released its latest findings on April 18.

OpenAI claimed 25% completion of the test in December

Last year, the FrontierMath score for OpenAI o3 was one of the nearly overwhelming number of announcements and promotions released during OpenAI’s 12-day holiday event. The company claimed OpenAI o3, then its most powerful reasoning model, had solved more than 25% of problems on FrontierMath. In comparison, most rival AI models scored around 2%, according to TechCrunch.

On April 18, Epoch AI released test results showing OpenAI o3 scored closer to 10%. So, why is there such a big difference? …

OpenAI’s New AI Models o3 and o4-mini Can Now ‘Think With Images’

OpenAI’s CEO Sam Altman. Image: Creative Commons

OpenAI has rolled out two new AI models, o3 and o4-mini, that can “think with images,” marking a big step forward in how machines understand pictures. These models, announced in an OpenAI press release, can reason about images the same way they do about text: cropping, zooming, and rotating photos as part of their internal thought process.

At the heart of this update is the ability to blend visual and verbal reasoning. “OpenAI o3 and o4-mini represent a significant breakthrough in visual perception by reasoning with images in their chain of thought,” the company said in its press release. Unlike past versions, these models don’t rely on separate vision systems. Instead, they natively mix image tools and text tools for richer, more accurate answers.

How does ‘thinking with images’ work?

The models can crop, zoom, rotate, or flip an image as part of their thinking process, just as humans would. They’re not just recognizing what’s in a photo but working with it to draw conclusions. …

OpenAI’s o3 Model Claims Human-Level Intelligence on Benchmark, But It Might Not Be That Smart

OpenAI unveiled the reasoning-focused o3 series of artificial intelligence (AI) models last month. During a live stream, the company shared the model’s benchmark scores based on internal testing. While all of the shared scores were impressive and highlighted the improved capabilities of the successor to o1, one benchmark score stood out. On the ARC-AGI benchmark, the large language model (LLM) scored 85 percent, beating the previous best score by a 30 percent margin. Interestingly, this score is also on par with what an average human scored on the test.

OpenAI Scores 85 Percent on ARC-AGI Benchmark

However, just because o3 scored so high on the test, does it mean its intelligence is equal to that of an average human? This would be easier to answer if the AI model were released publicly and we could test it out. Since OpenAI has not disclosed anything about the model’s architecture, training techniques, or datasets, it is difficult to conclusively claim anything. There are certain things that we do know about the …

o3 Model Wraps 12 Days of Announcements

The next step for OpenAI’s reasoning models is o3, a model previewed on Dec. 20. o3 and its smaller cousin, o3-mini, outperformed o1 in coding, math, science, and ‘conceptual reasoning’ tests designed to assess human-like intelligence and research applications. ‘Reasoning’ includes a safety feature called deliberative alignment, in which the model uses a “chain of thought” to prevent users from jailbreaking or tricking it into bypassing safety measures. Meanwhile, Google’s Gemini 2.0 Flash Thinking Experimental model treads similar ground to OpenAI o1’s reasoning capabilities.

‘12 Days of OpenAI’ brings new tools and new generative AI functionality

The o3 announcement came at the close of OpenAI’s “12 Days of OpenAI” campaign, a holiday season series of product updates. These announcements, from Dec. 5 to Dec. 20 (excluding weekends), showcased new features for OpenAI’s generative AI tools, with some available now and others still in testing.

Day 1: The $200 ChatGPT Pro and o1 updates

On Dec. 5, OpenAI introduced a new subscription tier for ChatGPT: the Pro plan. For $200 per month, …