Comparing Performance of OpenAI GPT-4 and Microsoft Azure GPT-4

Russell Proud

Co-Founder Decided.AI

Introduction

If you follow online discussions around OpenAI’s API and GPT models, you have undoubtedly come across people discussing the response times / performance of the OpenAI API and that during peak times, performance takes a hit. If you’re building a business that is using any of the OpenAI models, variations in performance and just overall low performance will likely be a concern for you.

OpenAI and Microsoft both provide API services to utilize OpenAI’s GPT models. There are variations in their terms of services, and for most businesses, utilizing GPT via Microsoft Azure provides additional privacy controls that aren’t available when utilizing OpenAI’s API. Beyond that, the models are meant to behave the same way (assuming you’re using the same version of the model on both platforms). Though the biggest question in my mind, and one that I’ve seen anecdotal evidence for, is, do they provide the same performance.

Within this article (short as it is), we will be evaluating the performance of OpenAI’s vs Microsoft’s API when utilizing GPT-4.

Testing Methodology

We will be following the below conditions for testing;

We will use OpenAI’s Python SDK. Specifically, version 0.27.8
All tests will be run from the same machine, WSL2 via Windows 11
We will measure and account for latency to each endpoint and remove this as a factor in the results
For Azure, we will be using the Australia region
For OpenAI, we cannot elect a region and will be using their standard endpoint
We will execute each test 10 times against each providers API
We will record the individual results of each test run and aggregate the results for measurement
The model temperature will be set to 0.0 for both services

NOTE: This is not a scientific approach to testing. It may not account for all variations between the services and as such, this should be used as a guide only. Your experiences may differ.

We will complete a number of tests against each service to measure performance of different types of prompts and message chains:

Test Name	Description
No system prompt, single message	No custom system prompt will be sent, just a single basic message payload
No system prompt, 10 message thread	No custom system prompt will be sent, a thread of 9 messages will be sent
Custom system prompt, single message, technical question	Customised system prompt for a technical role and a single message thread
Custom system prompt, 10 message thread, technical question	Customized system prompt for a technical role and a thread of 9 messages
Data Extraction and JSON Generation	No system prompt, a single message asking GPT-4 to extract items from the text and return a JSON list

The Results

Pictures are worth a thousand words.

As is clearly shown in the image above (and further highlighted in the below), the performance difference is quite extreme between Azure and OpenAI. On average, GPT-4 via OpenAI API was 2.8 times slower than Microsoft Azure.

The amount of tokens that needed to be generated in the response can impact the time to respond, if one service was providing significantly longer responses, that will result in a longer completion time. The below image displays the average tokens generated per second for each service. Again, Azure significantly outperforms OpenAI by a factor of 2.77 times.

One item of note here was, even with temperature of 0 on both services, OpenAI returned variations in the response and number of tokens generated. Azure was far more consistent in its response being the same each time (this in itself is worthy of note).

So, we’ve looked at the aggregate of all results and compared, what about the specific tests, what impact do the variances in the messages sent, instructions, response format and length have?

The variance in tokens generated per second is most pronounced in the test “Custom system prompt, 10 message thread, technical” with Azure generating tokens per second at a rate of 3.54 times more than that of OpenAI.

The smallest variance was in the “Custom system prompt, single message, technical” test, with Azure generating tokens at a rate of 2.03 times faster than that of OpenAI.

This article isn’t intended to go into why this variation may exist and why we see different variations in different tests, and honestly, it’s probably beyond the capability of us to be able to decipher the reasoning. The end result is, there are variations in completion times for each service, which differ depending on the length, content, response request etc.

The aggregate results of each test are below if you’re interested in the details.

File could not be opened. Check the file's permissions to make sure it's readable by your server.

Conclusion

We were taken back by the performance difference between OpenAI and Azure. I suspected from our experience that there were variations, but not to the level that we have seen here. For any business that is looking to use GPT-4 or other OpenAI models internally or as part of a product, it is clear that if performance is a consideration, then Azure is the better service to use.

The Details

All test prompts and results can be found on GitHub.

About the Author

Russell Proud is the co-founder of Decided.ai, an Conversational Search and Discovery platform for eCommerce retailers. Russell has spent the last 20 years building technology and services and provides consulting services for companies in AI, Search, Discovery, LLMs and other technical areas.

Recent Insights

Spotlight: Search Modernization Improves Agentic AI Platforms like CrewAI and Copilot

Explore how search modernization is transforming agentic AI platforms like CrewAI and Copilot — boosting performance, accuracy, and agent autonomy

Introducing Quintus: AI-Native Investigative Intelligence

Modern Investigations Require Modern Tools From courtrooms to compliance offices to police precincts, investigative professionals are drowning in digital evidence. Documents, emails, chats, images, and recordings pile up across disconnected systems, while the demand for speed, accuracy, and defensibility only grows. Traditional review tools can’t keep up. They were built for smaller data sets and linear workflows, not the scale

Why PostgreSQL Search Isn’t Enough: A Case for Purpose-Built Retrieval Systems

Instacart’s recent blog posts and InfoQ coverage paint a picture of a simplified, cost-effective search architecture built entirely on PostgreSQL. It’s a clever consolidation — but also a cautionary tale. For most organizations, especially those with complex catalogs, high query diversity, or real-time ranking needs, this approach is not just suboptimal — it’s misleading. Postgres is a relational database, not

Trusted Advisor

Go Further with Expert Consulting

Launch your technology project with confidence. Our experts allow you to focus on your project’s business value by accelerating the technical implementation with a best practice approach. We provide the expert guidance needed to enhance your users’ search experience, push past technology roadblocks, and leverage the full business potential of search technology.