Discussion: We need evals or something #68

Tiberriver256 · 2025-07-06T01:36:46Z

Tiberriver256
Jul 6, 2025

There's a flood of these coming in. How do we know they work? How do we know which ones are better than others?

If not a hard eval script to run, could we have a community rating system that would at least crowd source these things.

If someone wants to make a PR to change one though... How do we know the change is an improvement other than just vibes?

Really curious to hear others thoughts on this. It's something we're struggling to figure out internally at my company with internal prompt/instructions/modes too.

PlagueHO · 2025-07-07T21:49:12Z

PlagueHO
Jul 7, 2025

I agree with this (I raised in #30).

I believe a long-term solution is needed that:

Marketplace (like extensions): A marketplace with community ratings, details, examples, sample outputs, use cases, what MCP servers/tools are required (e.g. a prompt file that creates a GH issue will need GitHub Issue).
Discoverable: right from Copilot (I made it slightly easier with the Suggest Awesome GitHub Copilot Chatmodes prompts, but you still first need those ... and to know to use them.
Ability to create "packs" of prompts/chatmodes - many of my prompts are interrelated and go together.
Automated Test and validation suites: How are these validated for safety, reliability, recommendations on models to use with.
Ability to create an organizational prompt files/chatmode marketplace (almost like GitHub packages).

2 replies

craigbekker Jul 16, 2025

On the packs. Could this just be a folder called packs and then a folder for the pack with its own set of folders, mirroring the folders in the root folder structure + a Readme?

I have a set of related prompts I would love to contribute here. I'll try to put up a PR when I get a gap this evening for context.

I have created a set of prompts I am using to help me with a side project. The idea is that I wanted to break down multiple feature sets into small enough issues so I can pawn some work with GitHub copilot by just assigning the work.

It, however, "draws" on context I created by documenting my architecture/mission statement/standards docs, etc. (It's been about 2 months of me playing, planning, and creating context to try to get as specific an outcome as possible when giving the work to an agent)

The idea was that the prompts would be a bit more generic, and the context would be specific to the project, allowing me to reuse this workflow on a different project.(reading the contribution guidelines, this may have been a mistake)

I have unfortunately marked my repo as private as it's still very much a WIP, and I am not sure if this will stay a learning project or where I'm going with it. I would, however, happy add someone from this community if they were interested in taking a peak

I had quite a few ideas, so I built a bit of a workflow out of prompt files, mapped to a "way or work," where I do all the breakdowns within the repo.

On a high level, It follows the workflow

Idea > Epic > High level architecture > Feature PRd > Feature level Techical breakdown > test plan > githib issue planner.

I then have a set of "persona promps" for implementing that im still refining heavily as I move out of the ideation phase into building

I am busy with my first big feature and petty happy with where I got to so far

Each step allows me to stop, review, and tweak the plan, the same way I would at my day job, before moving on, breaking something down further, or building. I'm also able to build and test small iterations at a time, allowing me to stay on track, follow a structured process and make good enough progress, when I actually get a few hours to cook.

My concern with putting up a PR, however, is that this is really a set of prompts, a workflow, and requires quite a bit of supporting docs that are specific to your project. It would clutter the current Readme on this repo quite a bit, as it is more geared towards individual files. I'm pretty sure my PR would be rejected 🙃 but I'm finding it really useful and hoping someone else could benefit too

craigbekker Jul 30, 2025

I have created the above-mentioned PR, #135

I added in the ability to have a collection, to better group the set, and a set of planning prompts I'm using on my projects to load issues on github. The intention was to break down an idea small enough to the point where some work can just be assigned to github copilot.

nullchimp · 2025-07-08T22:27:12Z

nullchimp
Jul 8, 2025

How about the following idea?

Provide a script that converts prompts / chat modes / instructions into a *.prompt.yml (similar to update-readme.js)
Extend the contribution guidelines to enforce evaluators on them (a couple could be automated I guess, but probably not all)
Provide an action that runs on PR to test it with GitHub Models (gh models eval *.prompt.yml)

That gives us at least some sort of automation and could probably be done quite easy.

0 replies

aaronpowell · 2025-07-09T02:01:49Z

aaronpowell
Jul 9, 2025
Maintainer

I've converted this to a discussion (and obviously turned on discussions 🤣) as it's more of a logical place for this rather than in an issue - although it's possible that issues may be extracted from the ideas that come up here.

As we're ticking over the 100 files mark, the points raised here are something that is top of mind for the team, and I look at it from two perspectives, discoverability and confidence.

Discoverability

At the time of writing, there are 98 files in the repo with another 10 PR's in review/awaiting action, and this is only one week since we launched the repo. It's safe to say that this has exceeded our expectations in growth, and I'm so glad to see that we've have something that the community is finding valuable.

But this comes with a challenge in discoverability. The README we have is fine, but it's ultimately not scalable, as it's already becoming very long to scan through and find the files you want in your repo.

This is going to be the next major thing we tackle.

Confidence

@Tiberriver256 makes a good point in how can we be confident in the files in the repo, particularly if they are iterated upon, how confident are we that the new revision is proving to be of a net benefit over the existing revision.

There is more to it though. Models aren't equal, so how do we know what the "right" model is to use? Take the 4.1 Beast Mode chat mode, it's optimisations for gpt-4.1, and while you can use it with any model, it's not optimised to work elsewhere so you may get a poor outcome from, say, Claude Sonnet 3.7. This doesn't mean that the chat mode is bad, it just means that a combination of factors have resulted in a less-optimal outcome.

Then there is evaluations, and while there are a bunch of evaluation approaches that can be undertaken, including using GitHub models as @nullchimp points out, we need to take into consideration the scope of what GitHub Copilot does for context, such as adding tools, files, workspace, etc. I'm by no means an expert on evaluations, but I would be cautious that just taking a *.prompt.md and running it as an evaluation without the context is going to be indicative of the actual outcome, and "correctness" can be even more opaque here.

It is something that we'll be diving into more, but they are larger and more complex challenges to solve, so we're open to suggestions and feedback as we iterate.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Discussion: We need evals or something #68

Uh oh!

{{title}}

Uh oh!

Replies: 3 comments 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Discussion: We need evals or something #68

Uh oh!

Tiberriver256 Jul 6, 2025

Replies: 3 comments · 2 replies

Uh oh!

PlagueHO Jul 7, 2025

Uh oh!

craigbekker Jul 16, 2025

Uh oh!

craigbekker Jul 30, 2025

Uh oh!

Uh oh!

nullchimp Jul 8, 2025

Uh oh!

aaronpowell Jul 9, 2025 Maintainer

Discoverability

Confidence

Tiberriver256
Jul 6, 2025

Replies: 3 comments 2 replies

PlagueHO
Jul 7, 2025

nullchimp
Jul 8, 2025

aaronpowell
Jul 9, 2025
Maintainer