Web app design

Benchify

Automating software testing using techniques from rocket science and chip design.

client

Benchify (YC S24)

timeline

September - November 2024 (6 weeks)

CONTEXT

Benchify (Y Combinator, Summer 2024), founded by two Dartmouth ’19 graduates, is a testing-as-a-service platform that streamlines software testing so developers can code faster with fewer bugs. By using LLMs to automate tedious testing tasks, it reduces time spent debugging and frees engineers to focus on building new features.

After receiving over $500k in initial funding that summer, Benchify partnered with Dartmouth designers and developers at DALI to build out its product. As a product designer on the project, I designed and prototyped a conceptual web-app dashboard to make workflows more efficient for developers using Benchify.

IMPACT

Our design work helped Benchify envision the initial look and function of their product! Previously, it was only a GitHub tool. Today, Benchify is expanding into new use cases and features as a web platform.

Role

Product designer

Team

3 designers

4 engineers

1 project manager

Tools

Figma

Polymet

Notion

Contributions

Research

Ideation

Design

Prototyping

problem discovery pt. 1

The current state of software development is inefficient.

50%

of a developer's workday consists of testing and debugging code

45%

of projects often face delays or setbacks

34%

of companies experience unplanned downtime each month

Many engineers hate writing unit tests for backend code, and don't test their code in reliable ways to find bugs. While AI copilots can speed up boilerplate coding, they struggle to implement new ideas and can even introduce subtle bugs of their own.

Benchify was created to address this gap, using math to make code provably correct.

While most AI tools write code through pattern recognition, Benchify takes a more rigorous approach. It extracts formal properties that define how the code is meant to behave, builds a precise model of that behavior, and then proves that those properties hold. This process ensures that functionality isn’t just assumed but mathematically verified, producing reliable code in less than a minute.
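To make the idea of a "property" concrete, here is a minimal sketch in Python. It is not Benchify's implementation: it checks properties by random sampling rather than by proof, and every name in it (`dedupe`, `no_duplicates`, `preserves_members`, `check_property`) is hypothetical, chosen only for illustration.

```python
import random
from typing import Callable, List, Optional


def dedupe(items: List[int]) -> List[int]:
    """Hypothetical function under test: drop duplicates, keep first-seen order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def no_duplicates(items: List[int]) -> bool:
    """Property: the output never contains the same value twice."""
    output = dedupe(items)
    return len(output) == len(set(output))


def preserves_members(items: List[int]) -> bool:
    """Property: the output contains exactly the distinct input values."""
    return set(dedupe(items)) == set(items)


def check_property(
    prop: Callable[[List[int]], bool], trials: int = 200, seed: int = 0
) -> Optional[List[int]]:
    """Try random inputs; return a counterexample if one is found, else None.

    Assumption: random sampling stands in for the proof step described above.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        size = rng.randint(0, 10)
        candidate = [rng.randint(-5, 5) for _ in range(size)]
        if not prop(candidate):
            return candidate
    return None
```

A property that holds returns `None`; a property that fails returns the input that broke it, which is the kind of failing input a test report would need to surface to the developer.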

problem discovery pt. 2

However, Benchify’s existing user experience was overwhelming.

When Benchify was used as a GitHub extension, each pull request generated lengthy comment chains, making it difficult to track changes and identify issues, especially in larger projects with frequent commits.

1.0 Example of an existing flow using Benchify as a GitHub extension.

VIDEO

The existing summary report was too dense, and could be simplified to help developers take action more easily.

1.1 Example of a Benchify summary report and table.

IMAGE

Bullet summary

A high-level summary of the overall issues found in the code, but with no view of the code itself.

Successful tests

Less important than issues and failed tests, so they should sit lower in the hierarchy. They are also repeated in the table.

Tests table

Organized in a more digestible format, but static: it doesn’t link the developer to the specific issues in the code.

Plus, the unit tests that Benchify provided were hard to read and organize.

1.2 Example of a code block of unit tests.

IMAGE

Properties

The definition of a "property" is unclear. Multiple properties are listed for the same function, which makes the code block dense.

Failing tests

Only a few tests and their inputs are listed. Additional failing tests found by Benchify are not linked for viewing.

Passing tests

Same issue as with failing tests. Additional passing tests are mentioned, but not linked.

Commands added more insight, but they could only be used in the same pull request chain.

1.3 Example of a Benchify command prompt and response.

IMAGE

Benchify commands

Useful for developers, but they add more bulk to the comment chain, making it difficult to check history.

Unit tests

Written in a much more digestible format, but the explanation is still wordy and not entirely glanceable.

Overall, we saw that users needed a clearer, more streamlined way to understand test results and improve workflow efficiency.

Guiding question

How might we… organize complex technical data into a more intuitive workflow for developers?

user research

We interviewed software developers to gather additional pain points from the development process.

We initially planned to interview developers from small companies that were using Benchify, but due to logistical constraints we weren’t able to meet with them directly, so we adapted our research approach. We interviewed experienced student developers at the DALI Lab, who had similar collaborative workflows, and ran an informal heuristic evaluation within our project team to uncover potential usability issues.

Dev teams move quickly, but often skip tests. Tools like Benchify don't always keep up.

Bugs also get lost on task boards, so clearer communication would help small teams.

Long comment chains, slow runs with no progress updates, and confusing pull requests make it harder to act fast.

We also asked them to walk through the existing Benchify experience and share their initial thoughts.

These insights gave us information on how to best structure an improved workflow, and also reiterated the existing pain points that we had experienced ourselves.

"It doesn't tell the developer where to start looking for the error"

“There’s not an easy way to look through the current report -- lots of back and forth.”

"This would become complicated in more complex projects, especially when functions and files rely on each other."

We then tackled the overall data hierarchy structure and user flow. We did this concurrently with initial designs, since the startup was moving fast.

1.4 A diagram I created for the Benchify wireflow.

IMAGE

competitive analysis

To better understand data visualization, we researched similar code review apps that showed cleaner ways to organize workflows and files.

Notable patterns we found included a clear, linear structure (repositories → files → code snippets), breadcrumbs and file paths for navigation, and progress indicators showing test completion.

ideation

We first created a simpler, more condensed mockup of the summary report.

It contained only the necessary information, in addition to helpful links. We also mocked up an initial progress indicator.

2.1 Benchify GitHub summary report, mocked up in Notion.

IMAGE

2.2 Benchify progress indicator for GitHub, mocked up in Notion.

IMAGE

Next up was the dashboard flow — the main experience.

We went through countless iterations, trying to follow the flows of existing workspaces like GitHub and CircleCI, and asked our clients lots of questions throughout.

2.3 Some grayscale iterations of the dashboard flow.

IMAGE

2.4 Grayscale iterations we selected to develop into hi-fi designs.

IMAGE

design

We went through tons of design restarts as Benchify moved in many directions.

Many features we proposed weren't technically feasible at Benchify's early stage, so we went back to the drawing board often. Eventually, we settled on a simple but effective solution that showcased all the data hierarchies and outputs from property-based testing.

For proof-of-concepts that required more complex interactions, we also used Polymet.ai to quickly generate coded prototypes of our screens for our developers to reference.

3.1 Benchify dashboard flow.

VIDEO

Benchify's LLM generates descriptions of properties within the code. An "advanced editing" feature would allow the user to prompt a rewrite of the property description, which in turn would regenerate code in the form of examples and unit tests.

3.2 Advanced editing feature example.

VIDEO

Additional features for ease of access and quickness:

We incorporated a breadcrumb / filepath feature so developers can navigate between data hierarchies (repository, pull request, file, and property).

3.3 Breadcrumb example.

VIDEO
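The four-level hierarchy behind the breadcrumb can be sketched as a chain of parent links. This is only an illustrative model, not the shipped implementation: the `Node` class, the `breadcrumb` function, and the sample labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """One level of the hierarchy: repository, pull request, file, or property."""
    kind: str
    label: str
    parent: Optional["Node"] = None


def breadcrumb(node: Node) -> List[str]:
    """Walk up the parent links and return the labels from the root down."""
    trail = []
    current: Optional[Node] = node
    while current is not None:
        trail.append(current.label)
        current = current.parent
    return list(reversed(trail))


# Hypothetical sample data, one node per level.
repo = Node("repository", "acme/api")
pr = Node("pull request", "PR 12", parent=repo)
source_file = Node("file", "billing.py", parent=pr)
prop = Node("property", "total is never negative", parent=source_file)

print(" > ".join(breadcrumb(prop)))
# prints "acme/api > PR 12 > billing.py > total is never negative"
```

Because each node only knows its parent, the same function produces a shorter trail at any higher level, which matches how a breadcrumb shrinks as the developer navigates upward.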

We envisioned CLI interactions where users could type keyboard shortcuts to access Benchify commands. I also prototyped a modal shortcut where users could access those commands and search for files.

3.4 Command modal and functions.

VIDEO

Properties and their statuses are determined by Benchify's LLM, but LLMs can make mistakes. We added an archive feature so developers can move inaccurately classified properties out of the way.

3.5 Property archiving (and unarchiving).

VIDEO

impact

Benchify continued development through the winter of 2025, and is now expanding into more features!

On the backend, our developers on the DALI team added support for additional languages, including TypeScript, and expanded the web platform workflow. Most recently, Benchify released a public update in April 2025, and its usage continues to grow weekly, with thousands of interactions by users.

final thoughts

My takeaways

Working fast in a startup-like environment means being ready to scrap ideas

We cycled back to the drawing board more times than expected — sometimes because of technical hurdles, other times due to shifting client goals. It wasn’t always easy, but quick iteration and persistence helped us uncover what really worked.

When data is really technically complex, it’s up to us as designers to make it approachable

To really understand the product, we spent countless hours mapping and discussing how Benchify's backend processed data and surfaced it to users. I learned how crucial information hierarchy and visual design are in helping users interpret technically dense systems without losing trust or context.

Sometimes design contribution means defining the vision, even if you’re not there for the final implementation

When the team continued without designers after the first term, I learned the importance of clarity, documentation, and alignment — making sure the “why” behind each decision carried forward during implementation.

What could've been improved

With more time, I would have loved to design with more granularity, such as refining edge cases, adding more microinteractions, and designing to scale for large teams. We also had ideas for “stretch” features that could have expanded the product’s potential beyond the MVP, such as a control flow graph that would visually depict how different areas of a user's database interacted with each other.

thanks for being here.
let's connect!

Twitter

Resumé
