problem discovery pt. 1
Many engineers dislike writing unit tests for backend code and don't test their code reliably enough to find bugs. While AI copilots can speed up boilerplate coding, they struggle to implement new ideas and can even introduce subtle bugs of their own.
Benchify was created to address this gap, using math to make code provably correct.
While most AI tools write code through pattern recognition, Benchify takes a more rigorous approach. It extracts formal properties that define how the code is meant to behave, builds a precise model of that behavior, and then proves that those properties hold. This process ensures that functionality isn’t just assumed but mathematically verified, producing reliable code in less than a minute.
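To make the idea of a "property" concrete, here is a minimal Python sketch. It is not Benchify's implementation: the function `dedupe`, the two properties, and the `check_property` helper are all hypothetical names chosen for illustration. The sketch only samples random inputs to look for a counterexample, which is weaker than the proof described above, but it shows what it means to state a rule about behavior rather than a single expected output.

```python
import random


def dedupe(items):
    """Hypothetical function under test: drop duplicates, keep first-seen order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def prop_no_duplicates(items):
    """Property: the output never contains the same value twice."""
    out = dedupe(items)
    return len(out) == len(set(out))


def prop_same_elements(items):
    """Property: the output contains exactly the distinct values of the input."""
    return set(dedupe(items)) == set(items)


def check_property(prop, trials=200, seed=0):
    """Try random integer lists; return the first counterexample, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        size = rng.randint(0, 10)
        items = [rng.randint(-5, 5) for _ in range(size)]
        if not prop(items):
            return items
    return None


if __name__ == "__main__":
    for prop in (prop_no_duplicates, prop_same_elements):
        counterexample = check_property(prop)
        status = "held on all samples" if counterexample is None else f"failed on {counterexample}"
        print(f"{prop.__name__}: {status}")
```

A unit test would check one input, such as `dedupe([3, 1, 3])`. A property describes every input at once, which is why a property's pass or fail status is a natural unit to surface in a report.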
problem discovery pt. 2
1.0 Example of an existing flow using Benchify as a GitHub extension.
VIDEO
1.1 Example of a Benchify summary report and table.
IMAGE

1.2 Example of a code block of unit tests.
IMAGE

6
Same issue with failing tests. Additional passing tests are mentioned, but not linked.
1.3 Example of a Benchify command prompt and response.
IMAGE
7
Useful for developers, but adds more bulk to the comment chain, making it difficult to check history.
8
Written in a much more digestible format. Still a wordy explanation that isn't entirely glanceable.
Guiding question
user research
We initially planned to interview developers at small companies that were using Benchify, but logistical constraints kept us from meeting with them directly, so we adapted our research approach. We interviewed experienced student developers at the DALI Lab, who had similar collaborative workflows, and ran an informal heuristic evaluation within our project team to uncover potential usability issues.
Dev teams move quickly, but often skip tests. Tools like Benchify don't always keep up.
Bugs also get lost on task boards, so clearer communication would help small teams.
Long comment chains, slow runs with no progress updates, and confusing pull requests make it harder to act fast.
These insights showed us how best to structure an improved workflow and confirmed the pain points that we had experienced ourselves.
1.4 A diagram I created for the Benchify wireflow.
IMAGE

competitive analysis
Notable patterns we found included a clear, linear structure (repositories → files → code snippets), breadcrumbs and file paths for navigation, and progress indicators showing test completion.


ideation
We mocked up a revised summary report that contained only the necessary information, along with helpful links. We also mocked up an initial progress indicator.
2.1 Benchify GitHub summary report, mocked up in Notion.
IMAGE

2.2 Benchify progress indicator for GitHub, mocked up in Notion.
IMAGE

We went through countless iterations, trying to follow the flows of existing tools like GitHub and CircleCI, and asked our clients many questions along the way.
2.3 Some grayscale iterations of the dashboard flow.
IMAGE

2.4 Grayscale iterations we selected to develop into hi-fi designs.
IMAGE

design
Many features we proposed weren't feasible to build at Benchify's early stage, so we went back to the drawing board often. Eventually, we settled on a simple but effective solution that showcased all the data hierarchies and outputs from property-based testing.
For proofs of concept that required more complex interactions, we also used Polymet.ai to quickly generate coded prototypes of our screens for our developers to reference.
3.1 Benchify dashboard flow.
VIDEO
Benchify's LLM generates descriptions of properties within the code. An "advanced editing" feature would allow the user to prompt a rewrite of the property description, which in turn would regenerate code in the form of examples and unit tests.
3.2 Advanced editing feature example.
VIDEO
We incorporated a breadcrumb and file path feature so developers can navigate between data hierarchies (repository, pull request, file, and property).
3.3 Breadcrumb example.
VIDEO
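As a rough illustration of the hierarchy behind the breadcrumb, the sketch below models the four levels named above (repository, pull request, file, property) as linked nodes and walks from the current node up to the root. This is an assumption about one way to represent the data, not Benchify's actual data model; the `Node` class, the `breadcrumb` function, and every example name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """One level of the hierarchy; `parent` is None for the repository."""
    level: str  # "repository", "pull request", "file", or "property"
    name: str
    parent: Optional["Node"] = None


def breadcrumb(node: Node, separator: str = " / ") -> str:
    """Build the trail from the root down to `node` by following parent links."""
    parts = []
    current: Optional[Node] = node
    while current is not None:
        parts.append(current.name)
        current = current.parent
    return separator.join(reversed(parts))


# Hypothetical example data, invented purely for illustration.
repo = Node("repository", "example-repo")
pull_request = Node("pull request", "PR 12", repo)
source_file = Node("file", "billing.py", pull_request)
prop = Node("property", "total is never negative", source_file)

if __name__ == "__main__":
    print(breadcrumb(prop))
```

Because each node knows only its parent, the same function produces the trail at any depth, so a file-level view and a property-level view can share one breadcrumb component.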
We envisioned CLI interactions where users could type keyboard shortcuts to access Benchify commands. I also prototyped a command modal where users could access those same commands and search for files.
3.4 Command modal and functions.
VIDEO
Properties and their statuses are determined by Benchify's LLM, but LLMs can make mistakes. We added an archive feature so developers can move inaccurately classified properties out of the way.
3.5 Property archiving (and unarchiving).
VIDEO
impact
On the backend, our developers on the DALI team added support for additional languages, including TypeScript, and expanded the web platform workflow. Most recently, Benchify released a public update in April 2025, and its usage continues to grow weekly, with thousands of user interactions.
final thoughts
Working fast in a startup-like environment means being ready to scrap ideas
We cycled back to the drawing board more times than expected, sometimes because of technical hurdles and other times because of shifting client goals. It wasn’t always easy, but quick iteration and persistence helped us uncover what really worked.
When data is really technically complex, it’s up to us as designers to make it approachable
To really understand the product, we spent countless hours mapping and discussing how Benchify's backend processed data and surfaced it to users. I learned how crucial information hierarchy and visual design are in helping users interpret technically dense systems without losing trust or context.
Sometimes design contribution means defining the vision, even if you’re not there for the final implementation
When the team continued without designers after the first term, I learned the importance of clarity, documentation, and alignment: making sure the “why” behind each decision carried forward during implementation.
With more time, I would have loved to design with more granularity, such as refining edge cases, adding more microinteractions, and designing for scalability. We also had ideas for “stretch” features that could have expanded the product’s potential beyond the MVP, such as a control flow graph that would visually depict how different areas of a user's database interacted with each other.




