problem discovery pt. 1

The current state of software development is inefficient.

50% of a developer's workday consists of testing and debugging code

45% of projects face delays or setbacks

34% of companies experience unplanned downtime each month

Many engineers hate writing unit tests for backend code, and don't test their code in reliable ways to find bugs. While AI copilots can speed up boilerplate coding, they struggle to implement new ideas and can even introduce subtle bugs of their own.

Benchify was created to address this gap, using math to make code provably correct.

While most AI tools write code through pattern recognition, Benchify takes a more rigorous approach. It extracts formal properties that define how the code is meant to behave, builds a precise model of that behavior, and then proves that those properties hold. This process ensures that functionality isn't just assumed; it's mathematically verified, producing reliable code in less than a minute.
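To make the idea of a "property" concrete, here is a minimal sketch of property-based checking using only the Python standard library. It is not Benchify's implementation: the function under test (`dedupe_sorted`), the two properties, and the `find_counterexample` helper are all hypothetical names invented for illustration. Note that random checking like this only searches for counterexamples; it does not prove anything the way the formal approach described above does.

```python
import random


def dedupe_sorted(values):
    """Hypothetical function under test: unique values in ascending order."""
    return sorted(set(values))


# Properties are statements that must hold for every input,
# not just for one hand-picked example.
def prop_output_is_ascending(values):
    out = dedupe_sorted(values)
    return all(a < b for a, b in zip(out, out[1:]))


def prop_same_elements(values):
    return set(dedupe_sorted(values)) == set(values)


def find_counterexample(prop, trials=500, seed=0):
    """Try random integer lists; return the first input that breaks `prop`, else None."""
    rng = random.Random(seed)
    for _ in range(trials):
        values = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        if not prop(values):
            return values
    return None


results = {
    prop.__name__: find_counterexample(prop)
    for prop in (prop_output_is_ascending, prop_same_elements)
}
print(results)  # {'prop_output_is_ascending': None, 'prop_same_elements': None}
```

A result of `None` means no failing input was found for that property. A returned list would be the kind of failing input a report could surface to the developer.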

problem discovery pt. 2

However, Benchify’s existing user experience was overwhelming.

When Benchify was used as a GitHub extension, each pull request generated lengthy comment chains, making it difficult to track changes and identify issues, especially in larger projects with frequent commits.

1.0 Example of an existing flow using Benchify as a GitHub extension.

VIDEO

The existing summary report was too dense, and could be simplified to help developers take action more easily.

1.1 Example of a Benchify summary report and table.

IMAGE

1

Bullet summary

A high-level summary of the overall issues found in the code, but it shows none of the code itself.

2

Successful tests

Less important than issues and failed tests, so they should sit lower in the hierarchy. They are also repeated in the table.

3

Tests table

Organized in a more digestible format, but it is static and doesn't link the developer to the specific issues in the code.

Plus, the unit tests that Benchify provided were hard to read and organize.

1.2 Example of a code block of unit tests.

IMAGE

4

Properties

The definition of a "property" is unclear. Multiple properties for the same function also make the code block dense.

5

Failing tests

Only a few tests and their inputs are listed. Additional failing tests found by Benchify are not linked for viewing.

6

Passing tests

Same issue as with failing tests: additional passing tests are mentioned but not linked.

Commands added more insight, but they could only be used in the same pull request chain.

1.3 Example of a Benchify command prompt and response.

IMAGE

7

Benchify commands

Useful for developers, but they add more bulk to the comment chain, making it difficult to check history.

8

Unit tests

Written in a much more digestible format, but the explanation is still wordy and not entirely glanceable.

Overall, we saw that users needed a clearer, more streamlined way to understand test results and improve workflow efficiency.

Guiding question

How might we organize complex technical data into a more intuitive workflow for developers?

user research

We interviewed software developers to gather additional pain points from the development process.

We initially planned to interview developers from small companies that were using Benchify, but logistical constraints kept us from meeting with them directly, so we adapted our research approach. We interviewed experienced student developers at the DALI Lab, who had similar collaborative workflows, and ran an informal heuristic evaluation within our project team to uncover potential usability issues.

Dev teams move quickly, but often skip tests. Tools like Benchify don't always keep up.

Bugs also get lost on task boards, so clearer communication would help small teams.

Long comment chains, slow runs with no progress updates, and confusing pull requests make it harder to act fast.

We also asked them to walk through the existing Benchify experience and share their initial thoughts.

These insights showed us how best to structure an improved workflow and confirmed the pain points we had experienced ourselves.

"It doesn't tell the developer where to start looking for the error."

"There's not an easy way to look through the current report -- lots of back and forth."

"This would become complicated in more complex projects, especially when functions and files rely on each other."

We then tackled the overall data hierarchy structure and user flow. We did this concurrently with initial designs, since the startup was moving fast.

1.4 A diagram I created for the Benchify wireflow.

IMAGE

competitive analysis

To better understand data visualization, we researched similar code review apps that showed cleaner ways to organize workflows and files.

Notable patterns included a clear, linear structure (repositories → files → code snippets), breadcrumbs and file paths for navigation, and progress indicators showing test completion.

ideation

We first created a simpler, more condensed mockup of the summary report.

It contained only the necessary information, in addition to helpful links. We also mocked up an initial progress indicator.

2.1 Benchify GitHub summary report, mocked up in Notion.

IMAGE

2.2 Benchify progress indicator for GitHub, mocked up in Notion.

IMAGE

Next up was the dashboard flow — the main experience.

We went through countless iterations, modeling our flows on existing workspaces like GitHub and CircleCI, and asked our clients many questions along the way.

2.3 Some grayscale iterations of the dashboard flow.

IMAGE

2.4 Grayscale iterations we selected to develop into hi-fi designs.

IMAGE

design

We went through tons of design restarts as Benchify moved in many directions.

Many features we proposed weren't feasible to build at Benchify's early stage, so we often went back to the drawing board. Eventually, we settled on a simple but effective solution that showcased all the data hierarchies and outputs from property-based testing.

For proofs of concept that required more complex interactions, we also used Polymet.ai to quickly generate coded prototypes of our screens for our developers to reference.

3.1 Benchify dashboard flow.

VIDEO

Benchify's LLM generates descriptions of properties within the code. An "advanced editing" feature would allow the user to prompt a rewrite of the property description, which in turn would regenerate code in the form of examples and unit tests.

3.2 Advanced editing feature example.

VIDEO

Additional features for ease of access and quickness:

We incorporated a breadcrumb and file path feature so developers can navigate between levels of the data hierarchy (repository, pull request, file, and property).

3.3 Breadcrumb example.

VIDEO
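For developers referencing the design, the breadcrumb can be thought of as a walk up a parent-linked hierarchy. The sketch below is an assumption about how that might be modeled, not Benchify's actual data model: the `Node` class, the `breadcrumb` function, and the sample repository, pull request, file, and property names are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """One level of the hierarchy: repository, pull request, file, or property."""
    kind: str
    label: str
    parent: Optional["Node"] = None


def breadcrumb(node):
    """Walk up the parent links and return the trail from the root down to `node`."""
    trail = []
    current = node
    while current is not None:
        trail.append(current)
        current = current.parent
    return list(reversed(trail))


# Hypothetical example data; every name here is made up.
repo = Node("repository", "acme/api")
pr = Node("pull request", "#12", repo)
file = Node("file", "utils/dates.py", pr)
prop = Node("property", "output is sorted", file)

trail_text = " / ".join(n.label for n in breadcrumb(prop))
print(trail_text)  # acme/api / #12 / utils/dates.py / output is sorted
```

Because each crumb is a full `Node`, clicking any segment can jump straight to that level without re-deriving the path.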

We envisioned CLI interactions in which users could type keyboard shortcuts to access Benchify commands. I also prototyped a command modal where users could run those commands and search for files.

3.4 Command modal and functions.

VIDEO

Properties and their statuses are determined by Benchify's LLM, but LLMs can make mistakes. We added an archive feature so developers can move inaccurately classified properties out of the way.

3.5 Property archiving (and unarchiving).

VIDEO

impact

Benchify continued development through the winter of 2025 and is now expanding into more features!

On the backend, our developers on the DALI team added support for additional languages, including TypeScript, and expanded the web platform workflow. Most recently, Benchify released a public update in April 2025, and its usage continues to grow weekly, with thousands of interactions by users.

final thoughts

My takeaways

  1. Working fast in a startup-like environment means being ready to scrap ideas

We cycled back to the drawing board more times than expected — sometimes because of technical hurdles, other times due to shifting client goals. It wasn’t always easy, but quick iteration and persistence helped us uncover what really worked.

  2. When data is really technically complex, it’s up to us as designers to make it approachable

To really understand the product, we spent countless hours mapping and discussing how Benchify's backend processed data and surfaced it to users. I learned how crucial information hierarchy and visual design are in helping users interpret technically dense systems without losing trust or context.

  3. Sometimes design contribution means defining the vision, even if you’re not there for the final implementation

When the team continued without designers after the first term, I learned the importance of clarity, documentation, and alignment — making sure the “why” behind each decision carried forward during implementation.

What could've been improved

With more time, I would have loved to design with more granularity, such as refining edge cases, adding more microinteractions, and designing for scalability. We also had ideas for “stretch” features that could have expanded the product’s potential beyond the MVP, such as a control flow graph that would visually depict how different areas of a user's database interacted with each other.

thanks for being here.
let's connect!

Rachael Huang © 2026
