Farsight uses the Prometheus prompting methodology to automatically evaluate system prompts.
To begin creating your rubric, we suggest:
1) Writing a short description of your use case. For example:
To develop a secure and efficient internal HR chatbot for a financial
institution that assists employees with HR-related queries while ensuring the
protection of private personal information.
2) Prompting ChatGPT to generate your evaluation rubric by inserting your use case into the Prometheus prompt. To do so, send the following prompt:
Given this use case: To develop a secure and efficient internal HR chatbot for a financial
institution that assists employees with HR-related queries while ensuring the
protection of private personal information.
I would like to create an evaluation rubric to effectively evaluate chatbot
responses. Can you provide an example {instruction} and an example
{reference_answer}, and fill in the {criteria_description} and the five
{score_descriptions} for my use case? Please keep the rest of the format exactly
the same. Please create one evaluation rubric from 1 to 5 with no subcategories.
### Reference Answer (Score 5):
{reference_answer}
### Score Rubric:
[{criteria_description}]
Score 1: {score1_description}
Score 2: {score2_description}
Score 3: {score3_description}
Score 4: {score4_description}
Score 5: {score5_description}
Please provide a single, consolidated rubric for evaluating these criteria.
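If you generate rubrics for several use cases, it may help to build the prompt programmatically. The following is a minimal sketch, not part of Farsight: the names `PROMPT_TEMPLATE` and `build_rubric_prompt` and the `__USE_CASE__` sentinel are hypothetical, and the template text simply follows the prompt above. Plain string replacement is used so the `{...}` placeholders reach ChatGPT untouched.

```python
# Hypothetical helper (not a Farsight API): fills a use case into the
# rubric-generation prompt while leaving the {placeholders} literal.
PROMPT_TEMPLATE = """Given this use case: __USE_CASE__
I would like to create an evaluation rubric to effectively evaluate chatbot
responses. Can you provide an example {instruction} and an example
{reference_answer}, and fill in the {criteria_description} and the five
{score_descriptions} for my use case? Please keep the rest of the format exactly
the same. Please create one evaluation rubric from 1 to 5 with no subcategories.
### Reference Answer (Score 5):
{reference_answer}
### Score Rubric:
[{criteria_description}]
Score 1: {score1_description}
Score 2: {score2_description}
Score 3: {score3_description}
Score 4: {score4_description}
Score 5: {score5_description}
Please provide a single, consolidated rubric for evaluating these criteria."""


def build_rubric_prompt(use_case: str) -> str:
    """Return the rubric-generation prompt for the given use case."""
    # Collapse line breaks and repeated spaces so the use case reads as one sentence.
    normalized = " ".join(use_case.split())
    if not normalized:
        raise ValueError("use_case must not be empty")
    # str.replace (rather than str.format) keeps the curly-brace placeholders intact.
    return PROMPT_TEMPLATE.replace("__USE_CASE__", normalized)
```

Calling `build_rubric_prompt("To develop a secure and efficient internal HR chatbot ...")` returns text you can paste into ChatGPT as-is.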
Example Response:
Below is an example response from ChatGPT. To evaluate, pass the reference answer and score rubric from the response to our get_best_prompt function, along with a few candidate prompts.
Certainly! Here's an example evaluation rubric for the given use case:
### Reference Answer (Score 5):
The chatbot response addresses the HR-related query accurately, provides relevant
information, and ensures the protection of private personal information by following
security protocols and compliance standards.
### Score Rubric:
**[{criteria_description}]**
Score 1: The response does not address the HR-related query accurately and fails to
provide relevant information. It does not ensure the protection of private personal
information.
Score 2: The response partially addresses the HR-related query but lacks accuracy and
relevant information. There are concerns about the protection of private personal
information.
Score 3: The response adequately addresses the HR-related query with some accuracy
and relevant information. It demonstrates basic efforts to protect private personal
information.
Score 4: The response accurately addresses the HR-related query, provides relevant
information, and demonstrates a good level of protection for private personal
information.
Score 5: The response precisely addresses the HR-related query, offers highly
relevant information, and ensures an exemplary level of protection for private
personal information. It adheres to the highest security standards and compliance
requirements.
This rubric aims to assess the accuracy and relevance of the chatbot's responses to
HR-related queries while emphasizing the importance of ensuring the security of
private personal information. The scores progressively reflect the chatbot's
performance, with a score of 5 representing an exemplary response that meets the
highest standards of accuracy and security.
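Before handing the rubric to get_best_prompt, you may want to separate the reference answer and score descriptions from ChatGPT's surrounding commentary. The sketch below is one way to do that, assuming the response keeps the `### Reference Answer (Score 5):` and `### Score Rubric:` headings and that a blank line (or the next `Score N:` line) ends each score description. `parse_rubric` is a hypothetical helper, not a Farsight function, and the sketch deliberately stops short of calling get_best_prompt because its signature is not shown in this section.

```python
import re


def parse_rubric(response: str) -> dict:
    """Extract the reference answer, criteria, and score descriptions."""
    # The reference answer sits between the two "###" headings.
    ref_match = re.search(
        r"### Reference Answer \(Score 5\):\s*(.*?)\s*### Score Rubric:",
        response,
        flags=re.DOTALL,
    )
    if ref_match is None:
        raise ValueError("response does not contain the expected headings")

    rubric_text = response[ref_match.end():]

    # The criteria description is the first bracketed span after the heading.
    criteria_match = re.search(r"\[(.*?)\]", rubric_text, flags=re.DOTALL)
    criteria = criteria_match.group(1).strip() if criteria_match else ""

    # Each description runs until the next "Score N:" line, a blank line, or the end.
    scores = {}
    pattern = r"Score ([1-5]):\s*(.*?)(?=\n\s*Score [1-5]:|\n\s*\n|\Z)"
    for match in re.finditer(pattern, rubric_text, flags=re.DOTALL):
        scores[int(match.group(1))] = " ".join(match.group(2).split())

    if set(scores) != {1, 2, 3, 4, 5}:
        raise ValueError("expected descriptions for scores 1 through 5")

    return {
        "reference_answer": " ".join(ref_match.group(1).split()),
        "criteria": criteria,
        "scores": scores,
    }
```

The returned dictionary keeps the three pieces separate, so you can reassemble them in whatever form get_best_prompt expects. If the model omits a score level or renames a heading, the helper raises `ValueError` rather than returning a partial rubric.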