
Effectiveness of ChatGPT 5 vs. ChatGPT 4.1 in Lidana

  • Writer: Anagha Suvarna
  • Sep 10
  • 3 min read

Updated: Sep 11

We recently introduced GPT-5 models in Lidana and wanted to see how GPT-5 stacks up against GPT-4.1, especially when it comes to generating test cases.


To do that, I gave both models the same scenario, same set of screenshots, and same instructions. Here’s what happened.


The Prompt

Generate test cases for this scenario: 

While generating test cases using AI, users can attach external links such as JIRA or Figma. 


Also account for:

  1. For first-time users who haven’t yet connected their JIRA or Figma accounts, attempting to add one of these links should trigger a warning prompting them to connect their account. Once the account is successfully connected, the Figma link is embedded with details.

  2. Users can add a maximum of 20 links. These can be added either individually or by pasting multiple links at once.

  3. Link validation is performed within the popup modal itself.

  4. When hovering over an added link, action buttons and tooltips appear.

    1. Note: These tooltip texts differ for JIRA and Figma links.


The UI for these screens and the complete flow can be seen in the attached screenshots.

I want two types of test cases:

  1. A flow validation test case that covers only the positive path.

  2. An error validation test case that covers all negative scenarios.
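As a concrete reference for the rules above (not part of the prompt sent to the models), here is a minimal, hypothetical Python sketch of the link validation described in points 2 and 3. The URL patterns and error strings are assumptions for illustration; the actual Lidana implementation is not shown in this post.

```python
import re

# Hypothetical rules inferred from the scenario; real patterns may differ.
MAX_LINKS = 20
JIRA_PATTERN = re.compile(r"^https://[\w.-]+\.atlassian\.net/browse/[A-Z]+-\d+$")
FIGMA_PATTERN = re.compile(r"^https://(www\.)?figma\.com/(file|design)/[\w-]+")


def classify_link(url: str) -> str:
    """Return 'jira', 'figma', or 'invalid' for a pasted URL."""
    if JIRA_PATTERN.match(url):
        return "jira"
    if FIGMA_PATTERN.match(url):
        return "figma"
    return "invalid"


def validate_links(existing: list, pasted: str) -> dict:
    """Validate one or more pasted links against the scenario's rules.

    Mirrors the modal behavior described above: links can be pasted in
    bulk (one per line), and the total may not exceed 20.
    """
    candidates = [line.strip() for line in pasted.splitlines() if line.strip()]
    if len(existing) + len(candidates) > MAX_LINKS:
        return {"ok": False, "error": f"Maximum of {MAX_LINKS} links allowed."}
    if any(classify_link(u) == "invalid" for u in candidates):
        # Per the scenario, the UI does not itemize which links failed;
        # it simply keeps the 'Add Link(s)' button disabled with an inline error.
        return {"ok": False, "error": "One or more links are invalid."}
    return {"ok": True, "links": existing + candidates}
```

A well-specified prompt like this gives the model concrete, checkable behavior (limits, error surfaces, validation location) to anchor its test steps against.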


What Happened with GPT-4.1: Close, But Not Complete

GPT-4.1 gave me a usable test case: clean, structured, and functional, but covering just the essentials. Here’s a snippet of what it returned:


Positive Flow Validation Test Case:


Positive test cases generated by Lidana AI (GPT-4.1)

It sounds fine, but it lacks specificity. 


For example: “Click ‘Add Link(s)’.” — Where? In the modal or the panel? There’s no mention of this button appearing in the previous step.

These missing UI references mean a tester would need to check the Figma or screenshots to know what’s really going on.


Error Validation Test Case:


Negative test cases generated by Lidana AI (GPT-4.1)

This is where issues start showing up.


Ambiguous outcomes: "A warning or prompt appears" — which is it? In our UI, it's a banner, not a prompt or warning modal. So the model misrepresents the actual component, leading to inaccurate test expectations.


Hallucinations: In reality, mixed links don’t show which ones are invalid. Instead, the “Add Link(s)” button just remains disabled with an inline error. GPT-4.1 imagined a UI feature that doesn’t exist.


In short, the GPT-4.1 output read more like a checklist than a robust, QA-proof test case. Usable, but risky if taken at face value.


What Happened with GPT-5: Detailed and Accurate

GPT-5’s test cases not only captured the exact steps but also included component state transitions, UI element behavior, and accurate interpretations of the attached screenshots.


Positive Flow Validation Test Case:


Positive test cases generated by Lidana AI (GPT-5)

GPT-5 really got into the details—what happens where and after what. It accounted for subtle flows like the redirection to Figma in a new browser tab—something GPT-4.1 completely overlooked. It also correctly placed the Figma account connection and OAuth authentication in the positive test flow, while GPT-4.1 mistakenly listed them as a negative scenario.


Error Validation Test Case:


Negative test cases generated by Lidana AI (GPT-5)

It captured the flows and component states along with the exact error messages displayed in the UI. That means users don’t have to cross-check Figma screens to verify wording—every string, tooltip, and validation message is embedded right into the test case.


Most importantly, GPT-5 significantly reduced hallucinations and, in my testing, consistently produced accurate outputs—even in cases where the expected behavior wasn’t explicitly specified. From what I observed, GPT-5 relied strictly on the provided artifacts and avoided making assumptions, which led to far more reliable and trustworthy test cases.


GPT-5 also accounted for error validations specific to the OAuth window—something GPT-4.1 completely missed. It didn’t just outline what actions to take; it clearly described what to expect at each interaction point, including UI transitions, fallback states, and retry flows.


Conclusion

GPT-5’s test cases were detailed, context-aware, and avoided inventing UI behavior. That alone reduced the manual overhead of verifying each step, saving hours of QA prep time.


It didn’t just say what to do—it told me what to expect at every interaction point, making the output bulletproof for QA.


GPT-4.1 still holds value, especially for quick sanity checks, lightweight outlines, or when speed matters more than depth. With the right prompts or additional artifacts, it can produce solid test cases. That said, GPT-5 tends to reach the same level of depth and accuracy much faster, often on the first try and with minimal input—making it better suited for more complex or detail-heavy scenarios.
