Discussion: We need evals or something #68
Replies: 3 comments 2 replies
-
I agree with this (I raised this in #30). I believe a long-term solution is needed that:
-
How about the following idea?
That would give us at least some sort of automation and could probably be done quite easily.
-
I've converted this to a discussion (and obviously turned on discussions 🤣) as it's a more logical place for this than an issue - although it's possible that issues may be extracted from the ideas that come up here. As we tick over the 100-file mark, the points raised here are top of mind for the team, and I look at them from two perspectives: discoverability and confidence.

Discoverability

At the time of writing, there are 98 files in the repo, with another 10 PRs in review/awaiting action, and this is only one week since we launched the repo. It's safe to say that this has exceeded our expectations in growth, and I'm so glad to see that we have something the community is finding valuable. But this comes with a challenge in discoverability. The README we have is fine, but it's ultimately not scalable, as it's already becoming very long to scan through to find the files you want for your repo. This is going to be the next major thing we tackle.

Confidence

@Tiberriver256 makes a good point about how we can be confident in the files in the repo - particularly when they are iterated upon, how confident are we that the new revision is a net benefit over the existing revision? There is more to it, though. Models aren't equal, so how do we know what the "right" model is to use? Take the 4.1 Beast Mode chat mode: it's optimised for gpt-4.1, and while you can use it with any model, it's not optimised to work elsewhere, so you may get a poor outcome from, say, Claude Sonnet 3.7. This doesn't mean the chat mode is bad, it just means that a combination of factors has resulted in a less-optimal outcome.

Then there are evaluations, and while there are a bunch of evaluation approaches that can be undertaken, including using GitHub Models as @nullchimp points out, we need to take into consideration the scope of what GitHub Copilot adds for context, such as tools, files, workspace, etc. I'm by no means an expert on evaluations, but I would be cautious that just taking a

It is something that we'll be diving into more, but these are larger and more complex challenges to solve, so we're open to suggestions and feedback as we iterate.
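As a rough illustration of the revision-comparison question raised above - not the team's approach, just a sketch - the snippet below asks an LLM judge to compare two revisions of a customization file through an OpenAI-compatible endpoint such as GitHub Models. The endpoint URL, model name, rubric, and file paths are assumptions/placeholders, and a single judge call like this deliberately ignores the Copilot context (tools, files, workspace) flagged in the comment, so it is at best a partial signal.

```python
# Sketch: LLM-as-judge comparison of two revisions of a prompt/chat mode file.
# Endpoint, model, rubric, and paths are placeholders, not a documented workflow.
from pathlib import Path
from openai import OpenAI

client = OpenAI(
    base_url="https://models.inference.ai.azure.com",  # assumed GitHub Models endpoint
    api_key="<GITHUB_TOKEN>",                          # placeholder token
)

JUDGE_MODEL = "gpt-4o"  # assumption: any capable judge model would work

RUBRIC = (
    "You are reviewing two revisions of a GitHub Copilot customization file. "
    "Score each revision 1-5 for clarity, specificity, and likely usefulness, "
    "then state which revision you prefer and why."
)

def compare_revisions(old_path: str, new_path: str) -> str:
    """Return the judge model's written comparison of the two revisions."""
    old_text = Path(old_path).read_text()
    new_text = Path(new_path).read_text()
    response = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Revision A:\n{old_text}\n\nRevision B:\n{new_text}"},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # Placeholder file names for illustration only.
    print(compare_revisions("chatmodes/example.old.md", "chatmodes/example.chatmode.md"))
```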
-
There's a flood of these coming in. How do we know they work? How do we know which ones are better than others?
If not a hard eval script to run, could we have a community rating system that would at least crowdsource these things?
If someone wants to make a PR to change one, though, how do we know the change is an improvement other than just vibes?
Really curious to hear others' thoughts on this. It's something we're struggling to figure out internally at my company with internal prompts/instructions/modes too. A crude starting point is sketched below.
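One very rough proxy for the "community rating" idea - an illustration, not a proposal from the thread - is to rank files by 👍 reactions on the pull requests that added them, via the public GitHub REST API. OWNER/REPO and the PR numbers below are placeholders; reaction counts are a popularity signal, not an evaluation.

```python
# Sketch: rank PRs (and by extension the files they added) by 👍 reactions
# using the GitHub REST API. OWNER/REPO and PR_NUMBERS are placeholders.
import requests

OWNER, REPO = "OWNER", "REPO"          # placeholder repository
PR_NUMBERS = [12, 34, 56]              # placeholder PR numbers to score
HEADERS = {"Accept": "application/vnd.github+json"}  # add an Authorization header for higher rate limits

def thumbs_up_count(pr_number: int) -> int:
    """Count '+1' reactions on a PR (PRs share the issues reactions endpoint)."""
    url = f"https://api.github.com/repos/{OWNER}/{REPO}/issues/{pr_number}/reactions"
    reactions = requests.get(url, headers=HEADERS, timeout=10).json()
    return sum(1 for r in reactions if r.get("content") == "+1")

if __name__ == "__main__":
    for pr in sorted(PR_NUMBERS, key=thumbs_up_count, reverse=True):
        print(f"#{pr}: {thumbs_up_count(pr)} 👍")
```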