Table of contents
- Generating end-to-end tests with LLMs isn't straightforward
- Rethinking the problem of end-to-end test generation
- How the Model Context Protocol (MCP) changes the end-to-end testing game
- What is Playwright MCP?
- How to transform your LLM into a Playwright test generator
- Should you generate tests with Codegen or AI?
- Are AI-generated end-to-end tests the holy grail?
When I started using Playwright, there was a single command that blew me away. I immediately became (and still am) a huge Playwright Codegen fanboy. Playwright's `codegen` command opens up a browser window, and whatever you do in this window will be recorded. Navigating URLs, clicking links, and filling out form elements—the Playwright inspector records all your actions and generates a Playwright test for you. Magic!
Of course, relying on the Playwright recorder isn't perfect. The tool records actions line by line, so there won't be any extracted variables, and your test script probably won't follow the DRY principle. It also doesn't know about your existing utility methods, your page object models, or your overall project structure, so it can't reuse existing code, and you must tweak and adjust the generated Playwright tests.
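To make that concrete, here's a rough sketch of what a raw Codegen recording typically looks like (the URL, labels, and flow are made up for illustration): every action becomes its own line, with nothing extracted or reused.

```ts
import { test, expect } from '@playwright/test';

test('test', async ({ page }) => {
  // Each recorded action becomes one literal line: no variables, no page objects, no helpers
  await page.goto('https://example.com/login');
  await page.getByLabel('Email').fill('jane@example.com');
  await page.getByLabel('Password').fill('super-secret');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await page.getByRole('link', { name: 'Account settings' }).click();
  await expect(page.getByRole('heading', { name: 'Account settings' })).toBeVisible();
});
```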
However, I still think `npx playwright codegen` gives you a massive head start when you want to create a new Playwright script. But is manually clicking around to generate a test automation script still the best approach in 2025?
Isn't it time that AI robots take over all our dirty work? The answer to this question is the usual "it depends". Personally, I'm not convinced that AI can or should take on the massive responsibility of testing products entirely yet. However, AI has come a long way toward solving the test generation problem, and it can now generate useful Playwright end-to-end testing scripts for us.
Have you tried generating Playwright scripts with the new Playwright MCP server? If your answer is "NO", this post is for you. Let's go! 🫵
If you prefer video content, this post is also available as a YouTube video.
Generating end-to-end tests with LLMs isn't straightforward
I always liked the idea of "just" talking to an LLM to create Playwright scripts. It would be very nice if LLMs could figure things out on their own, but I remained skeptical, thinking "Nah! This can't work, can it?". After reading multiple tutorials advising to ask ChatGPT to generate Playwright scripts, I gave it a try…
And what can I say? After trying two different approaches, I figured that ChatGPT and friends cannot "just" generate Playwright scripts because the web is messy and complex.
So, what should you avoid when trying to generate Playwright tests with AI?
Failed attempt no. 1: generating end-to-end tests without additional context
Let me be clear: simply opening ChatGPT, telling it to create a test, and expecting a working result doesn't work. And I'm not saying it doesn't work well; it doesn't work at all.
Without additional context, the LLM will hallucinate its way forward and "create" tests by combining online tutorials, the scraped Playwright documentation, and its overall training data. This approach can't work because the LLM simply can't know how the target site is structured. It cannot know what it's dealing with.
The AI confidently makes things up on the fly. Whenever you read an article or documentation telling you to "just ask ChatGPT" to write your Playwright tests, close the tutorial and run. Either the post is written by a human who doesn't know what they're talking about, or it's ironically written by a robot trying to appear more helpful than it is. In both cases, you'll end up following advice that leads to hallucinated tests that simply won't work!
AI-driven test generation is like all the AI tasks we're experimenting with today: providing valuable context is the secret sauce.
If you would like to learn more about the failed "just asking ChatGPT" approach, I have shared my findings on YouTube.
But what if you provide source code powering your application? Would this help?
Failed attempt no. 2: Adding source code to the context to generate end-to-end tests
With GitHub Copilot, Cursor, and Windsurf, it became fairly easy to feed LLMs with source code. You can manually add files to the chat conversation, and the mighty AI agents can even scan your project to add files autonomously to provide a good result.
This approach looked promising at first, and indeed, the results are a bit better when you provide the source code that powers the UI to the LLM. However, when you try to generate end-to-end tests this way, it quickly becomes obvious that the approach is still little more than a guessing game.
The agent will still make stupid mistakes and go in many unnecessary circles. Sometimes your generated test will work; sometimes it won't. Making the connection between source code and the rendered result is simply too tough for the statistical LLM brain.
Additionally, this approach can't help at all if your frontend code renders data coming from databases or content management systems. Feeding source code to the LLM often means instructing it to fly blind and make up end-to-end tests anyway.
If you would like to learn more about providing source code to generate tests, I have shared my findings on YouTube.
Rethinking the problem of end-to-end test generation
Websites today are built with React, Angular, Vue, or the new JS framework being released tomorrow. Interactions are driven by hundreds of kilobytes of JavaScript. The web isn't a collection of static files anymore; the web of today consists of interactive apps, rich experiences and data coming from who knows where.
Do you care what's powering Figma, Google Docs, or YouTube? No, you don't. Would you read the source code powering these apps to create a user-first end-to-end test? No, you wouldn't.
And that's the thing! Treat the AI the way you'd want to be treated, and let it approach the problem the same way you would.
Reading source code to create a test would only waste time and, in the worst case, confuse you. And if you're honest with yourself, the source code doesn't matter if you want to test an application, does it? The only thing that matters is the core functionality. You click a button and expect an element to show up. You fill out a form with an incorrect value, and you expect an error pop-up to scream at you.
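To illustrate this user-first mindset with a tiny, hypothetical Playwright example (the page and locators are invented), the test only encodes the interaction and the visible outcome, never the implementation behind it:

```ts
import { test, expect } from '@playwright/test';

test('shows an error for an invalid email', async ({ page }) => {
  // Hypothetical signup page used only for illustration
  await page.goto('https://example.com/signup');

  // Fill out the form with an incorrect value...
  await page.getByLabel('Email').fill('not-an-email');
  await page.getByRole('button', { name: 'Sign up' }).click();

  // ...and expect the error pop-up to scream at you
  await expect(page.getByRole('alert')).toBeVisible();
});
```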
End-to-end testing typically involves chained "if this, then that…" instructions, and the underlying implementation doesn't matter. LLMs don't need source code as context! LLMs need to be aware of the site's rendered HTML, visited URLs, redirects, cookie values, and snapshots that show how interactions have changed the website. The LLM needs to know what it's supposed to test.
Is there a way to provide all this context to your favorite AI agent?
How the Model Context Protocol (MCP) changes the end-to-end testing game
To retrieve all this browser state, the AI agent needs to be able to control a browser in some way. And we're not talking about making a curl request to fetch some HTML. We're talking the real deal: telling the LLM to navigate to a specific URL so that it can open a browser and read the page, letting it interact with the page to click something, and having it check the HTML afterwards to see what happened before performing the next action. Is this somehow possible?
The popular LLM providers tried different approaches to integrate "tool calls" into chat conversations; however, the industry eventually landed on adopting the Model Context Protocol (MCP) from Anthropic as the de facto standard.
If you're using a client that supports MCP, you can connect to an MCP server, which enables your AI agent to communicate with remote resources such as your file system or email inbox. Connected MCP servers give your favorite LLM the tools you always wished it had.
And the best part is that the community loves this idea so much that there are already hundreds of MCP servers ready for you to connect to and try out. And of course, the Playwright team joined the party, too.
If you haven't used MCP, it's understandable that this all sounds a bit magical. Let me show you how it works, using Playwright MCP as the example.
What is Playwright MCP?
The official Playwright MCP server enables AI agents to control a real browser using the Playwright APIs. Let's go over the core concepts using GitHub Copilot in VS Code.
After you've installed the MCP server (find the installation instructions for your AI client in the GitHub repository), you'll have a new set of tools available in the chat UI. Whenever you now interact with the LLM, it will ask the connected servers about the available tools and decide whether and when to call a tool.
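For reference, the setup usually boils down to a small config entry that tells your client how to start the server. Here's a minimal sketch assuming the standard `npx @playwright/mcp@latest` launch command; the exact file location and top-level key (`mcpServers` vs. `servers`) depend on your client, so double-check the repository's instructions.

```json
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    }
  }
}
```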
If you're using VS Code, you can also nudge the AI to use a specific tool by referencing it with `#` (e.g. `#browser_click`).
And because Copilot knows that Playwright is available, it will decide to use Playwright when you want to navigate to a specific URL.
And this is already pretty cool. Above, you can see that when I asked Copilot to navigate to a URL, it decided to use Playwright's `browser_navigate` tool to open a browser with the desired URL. You can then ask it to perform actions and interact with the page in plain English. Wild!
That's all well and good, but does controlling a browser from within an LLM conversation help with generating end-to-end tests? To understand why LLMs can now magically generate end-to-end tests when you use the Playwright MCP server, you need to know what happens when your AI agent calls a provided Playwright tool such as `browser_navigate`.
The Playwright MCP tools provide the required context to generate end-to-end tests
The trick that makes all this possible is that whenever the AI decides to use a Playwright tool, such as `browser_navigate` or `browser_click`, the Playwright MCP server responds with a lot of useful context that enters your chat conversation.
Let's look at the example of `browser_click`.
If you ask the LLM to click an element using Playwright, you can inspect the tool call right in the Copilot UI. The first part is the request made to the MCP server.
{
  "ref": "e150",
  "element": "Customers link in the navigation"
}
The LLM calls the MCP server's `browser_click` tool with an object that includes a `ref` and an `element` property. But how does the LLM know about this `ref`?
Every Playwright tool call responds with additional information and context that will be used throughout the conversation. Here's an example response from the `browser_click` tool.
```js
// Click Customers link in the navigation
await page.getByRole('link', { name: 'Customers' }).nth(1).click();
```
- Page URL: https://www.checklyhq.com/customers/
- Page Title: Customers
- Page Snapshot
```yaml
- generic [ref=e1]:
  - banner [ref=e4]:
    - navigation [ref=e5]:
      - link "Checkly - Home" [ref=e6] [cursor=pointer]:
        - /url: /
        - generic [ref=e7] [cursor=pointer]: Checkly - Home
        - img [ref=e8] [cursor=pointer]
      - generic [ref=e17]:
        - button "Product" [ref=e19] [cursor=pointer]
        - heading "DETECT" [level=3]
        - link "Uptime Monitoring Fast, reliable availability and performance monitoring of URLs, TCP,
…
```
The MCP server responds with the Playwright code it ran (`await page.getByRole(/* … */).nth(1).click()`), the resulting page URL, the page title, and a snapshot of the page after the action was executed. This is a lot of valuable context!
And if you inspect the page snapshot closely, you'll discover that every element in it includes an element reference. This is how the AI can figure out which reference to click when you tell it to "Click the 'ABC' link".
All this context is then embedded in your conversation with the AI agent.
USER: Please navigate to checklyhq.com.
---
AI: Called `browser_navigate`:
```js
await page.goto('https://checklyhq.com');
```
// Additional context (page URL, page title and the page snapshot)
// ...
// ...
---
USER: Now, please click on "Login".
---
AI: Called `browser_click`:
```js
await page.getByRole('link', { name: 'Login' }).click();
```
// Additional context (page URL, page title and the page snapshot)
// ...
// ...
And when you add more browser instructions to your AI conversation, the LLM will have all the context it needs. It knows the required Playwright code, the current page URL, and the page's state thanks to all the snapshots.
And with this additional context, we're almost there. Now we only need to convince the LLM to become a test generator.
How to transform your LLM into a Playwright test generator
By using the Playwright MCP server, you're now able to enrich a conversation with the context required to create Playwright tests. And if you connect Playwright MCP to a coding assistant like Copilot in VS Code, Cursor, or Claude Code, the agent can also read and write your test files. What's missing, then?
To make an AI agent generate tests and write spec files, we need to apply some prompt engineering (the cool kids have started calling it "context engineering") to put the LLM in the mood to generate tests.
Let me show you a prompt that I have had good results with so far.
## Instructions
You are a Playwright test generator and an expert in TypeScript, Frontend development, and Playwright end-to-end testing.
- You are given a scenario, and you need to generate a Playwright test for it.
- If you're asked to generate or create a Playwright test, use the tools provided by the Playwright MCP server to navigate the site and generate tests based on the current state and page snapshots.
- Do not generate tests based on assumptions. Use the Playwright MCP server to navigate and interact with sites.
- Access a new page snapshot before interacting with the page.
- Only after all steps are completed, emit a Playwright TypeScript test that uses @playwright/test based on the message history.
- When you generate the test code in the 'tests' directory, ALWAYS follow Playwright best practices.
- When the test is generated, always test and verify the generated code using `npx playwright test` and fix it if there are any issues.
With these few lines of plain English, you can set a new AI baseline to make the LLM accept a test case scenario, run it, and generate a test based on the executed actions. There are two things to know about this prompt.
First, you have to find a way to embed this prompt into your chat conversation. Of course, you could just paste it into your little chat window, but that isn't a scalable way of working. In Cursor, you could use Cursor rules. In VS Code, you could use custom instructions or prompt files. I recommend having a standard place for your test generation prompt so that everyone can reuse it.
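As a concrete example, the prompt could live in one of these commonly used locations (the file names are hypothetical; check each tool's docs for the exact conventions):

```
.github/copilot-instructions.md              # Copilot custom instructions, applied to every chat
.github/prompts/generate-e2e-test.prompt.md  # VS Code prompt file you invoke on demand
.cursor/rules/playwright-test-gen.mdc        # Cursor project rule
```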
And second, a good AI prompt is never finished. You must continue to work on it and tweak it whenever you're unhappy with the AI-generated results. Don't check it in and forget about it. Read it, use it and iterate every time the LLM frustrates you!
But does this approach work?
Let me tell you, the results can be very impressive when you use cutting-edge models like Claude Sonnet 4. Here are some example test generation instructions:
- Navigate to `https://checklyhq.com`
- Find and click the "documentation" link in the top navigation
- Check that the URL is `/docs`
- Search for "Playwright test suite"
- Click the first search result regardless of its content
- Check that the resulting URL is `/docs/playwright-checks/`
With the additional context provided by Playwright MCP, Claude Sonnet 4 is very capable of generating a valuable and working end-to-end test.
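To give you an idea of the output, here's a sketch of what a generated spec for these instructions could look like. The locators (navigation link, search box, result list) are assumptions on my part; the real generated test will use whatever the LLM derives from the page snapshots.

```ts
import { test, expect } from '@playwright/test';

test('docs search leads to the Playwright checks guide', async ({ page }) => {
  // Navigate to the homepage
  await page.goto('https://checklyhq.com');

  // Find and click "documentation" in the top navigation
  await page.getByRole('navigation').getByRole('link', { name: /documentation/i }).click();
  await expect(page).toHaveURL(/\/docs/);

  // Search for "Playwright test suite" (the search UI locators are assumptions)
  await page.getByRole('button', { name: /search/i }).click();
  await page.getByRole('searchbox').fill('Playwright test suite');

  // Click the first search result regardless of its content (assumed to be a listbox option)
  await page.getByRole('option').first().click();

  // Check the resulting URL
  await expect(page).toHaveURL(/\/docs\/playwright-checks\//);
});
```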
With the provided base prompt, the LLM will navigate the web and perform your defined tasks in a real browser. Thanks to Playwright MCP, it will then slurp in all the resulting page context. When it succeeds in doing your dirty work, it will analyze the conversation to generate a Playwright test with all the page and Playwright context. And if you instruct it to, it will also open up a terminal and run `npx playwright test` to check if the generated tests pass.
This flow is kinda magical if it works. Give it a try!
To see this approach in action, the post is available as a video on YouTube.
Should you generate tests with Codegen or AI?
Does this now mean that you can leave the traditional Playwright `codegen` command behind and hand off all the tasks to your favorite LLM? I don't think so. Let's compare the two approaches and look at their advantages and disadvantages.
Playwright's traditional codegen method is precise! If you think about your test cases and know what you want to test, you're responsible for performing the actions in the provided browser window. There are no surprises or misunderstandings about which elements should be clicked and filled. You're in the driver's seat, and Playwright codegen will transform your actions into code.
Additionally, using Playwright's codegen to generate Playwright code is fast. Recording a new test case barely takes more than a few minutes. However, in most cases, and especially when you're working on a large-scale project, you must tweak the generated code to extract variables and utilize existing project functionality. The command's purpose is to give you a head start, not to provide production-ready results.
Using AI to generate end-to-end tests is definitely slower than "clicking together" a test with old-school code generation. It often takes dozens of LLM calls and minutes of waiting to reach a working end-to-end test. This waiting time might be worth it, though, because the AI knows your code base and can happily bring in your utility methods, saving you manual work. AI code generation is smarter than Playwright's code generation, after all.
However, there's no guarantee that the AI test generation will work or, even more critically, that it will create a test that does what you instructed. More than once, the LLM happily clicked a link in the footer even though I instructed it to click a header navigation link. It still succeeded at the task at hand, but it sometimes took different paths or found different ways to perform the browser actions.
Reviewing AI-generated code is always critical, and this rule applies to AI-generated end-to-end tests, too!
Are AI-generated end-to-end tests the holy grail?
I know that many people are incredibly excited about AI test generation and I am, too. However, I'm also concerned. I had multiple conversations with folks telling me that they don't even look at Playwright code anymore and write their tests entirely in English.
Create a test that:
- navigates to a product page
- adds a product to the cart
- purchases it with the invoice payment method
I might be wrong, but today, this approach feels way too risky to me. Just because an AI can succeed at a task with some vague instructions, it doesn't mean that the application is working. Maybe the "add to cart" button is broken on one page but works on another. I bet the AI will happily discover the working path, ignoring that something is off. Vague instructions quickly lead to vague test cases, and that can put you in a very risky situation, because automated testing is more important than ever today. If everybody's vibe coding in your product department, automated tests are the non-negotiable safety net for all code entering the code base. Are you really ready to hand over your quality control and monitoring to AI? I'm not.
Of course, you could also avoid giving the AI all the responsibility and write very detailed and precise prompts. However, what's the point in writing technical and very detailed instructions in English prose when you could also record real actions in a minute?
I don't have all the answers and I don't think there's a right or wrong right now. The entire world is playing with this new technology and we're all figuring it out together. However, I am excited about all the things the Playwright MCP server enables us to do.
And while I'll still remain an AI skeptic for a bit longer, I started to lean into AI test code generation and even vibe coded some things already. And I can't lie, when I controlled my Checkly end-to-end monitoring with a few prompts, I was pretty blown away…
I'll keep you posted about my AI journey here on the blog!