Improve and evaluate your LLM applications with a few simple steps
Get evaluating now! Follow a few simple steps to improve your LLMs.
Want to integrate even quicker? Try out Farsight AI on a Colab notebook here.
Note: While you can evaluate the outputs of any large language model (LLM), Farsight AI's evaluation functions use OpenAI under the hood. To use our package, you must have access to an OpenAI API key.
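If you prefer not to hard-code the key (the snippets below do so only for brevity), you can load it from an environment variable. This is a minimal sketch, and the variable name OPENAI_API_KEY is just a common convention, not something the package requires:

import os
from farsightai import FarsightAI

# Load the key from an environment variable instead of hard-coding it.
# "OPENAI_API_KEY" is an assumed variable name; use whatever your environment defines.
OPEN_AI_KEY = os.environ["OPENAI_API_KEY"]
farsight = FarsightAI(openai_key=OPEN_AI_KEY)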
Use our evaluation suite to score your LLM outputs. Below is an example of evaluating a query against an output using the Farsight quality metric.
from farsightai import FarsightAI

# Replace with your OpenAI credentials
OPEN_AI_KEY = "<openai_key>"
query = "Who is the president of the United States"

farsight = FarsightAI(openai_key=OPEN_AI_KEY)

# Replace this with the actual output of your LLM application
output = "As of my last knowledge update in January 2022, Joe Biden is the President of the United States. However, keep in mind that my information might be outdated as my training data goes up to that time, and I do not have browsing capabilities to check for the most current information. Please verify with up-to-date sources."

quality_score = farsight.quality_score(query, output)
print("score: ", quality_score)
# score: 4
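Because quality_score returns a numeric score, you can also use it to compare alternative outputs for the same query. Here is a minimal sketch that builds on the example above; the candidate strings are made up for illustration, not real model outputs:

# Score several candidate outputs for the same query and keep the best one.
# The candidate strings below are illustrative placeholders.
candidate_outputs = [
    "Joe Biden is the President of the United States.",
    "I'm not sure, but you should check an up-to-date source.",
]
scored = [(farsight.quality_score(query, candidate), candidate) for candidate in candidate_outputs]
best_score, best_output = max(scored)
print("best output: ", best_output)
print("best score: ", best_score)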
Create Your First Custom Metric
Generate a custom metric to evaluate your LLM outputs. A custom metric returns True if the provided constraint is violated, and False if the output complies with it.
Below is an example of defining and evaluating a custom metric that checks two independent constraints.
from farsightai import FarsightAI

# Replace with your OpenAI credentials
OPEN_AI_KEY = "<openai_key>"
query = "Who is the president of the United States"

farsight = FarsightAI(openai_key=OPEN_AI_KEY)

# Replace this with the actual output of your LLM application
output = "As of my last knowledge update in January 2022, Joe Biden is the President of the United States. However, keep in mind that my information might be outdated as my training data goes up to that time, and I do not have browsing capabilities to check for the most current information. Please verify with up-to-date sources."

# Replace these with the constraints you want to check your LLM output against
constraints = ["do not mention Joe Biden", "do not talk about alcohol"]
custom_metric = farsight.custom_metrics(constraints, output)
print("score: ", custom_metric)
# score: [True, False]
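Since custom_metrics returns one boolean per constraint (True meaning the constraint was violated), you can pair the flags back up with the constraints to produce a readable report. A minimal sketch, continuing from the example above:

# Pair each constraint with its violation flag (True = violated, as described above).
for constraint, violated in zip(constraints, custom_metric):
    status = "violated" if violated else "ok"
    print(f"{status}: {constraint}")
# violated: do not mention Joe Biden
# ok: do not talk about alcohol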
Auto-Generate High-Quality Potential System Prompts
Don't want to waste time trying out different system prompts? Farsight automatically generates great candidates and allows you to quantitatively compare them using standard and custom metrics with minimal effort. Simply describe the use case of your application and seamlessly generate multiple system prompts to evaluate.
from farsightai import FarsightAI

# Replace this with your OpenAI credentials
OPEN_AI_KEY = "<openai_key>"

# Replace this with your use case details
num_prompts = 2
task = 'You are a conversational chatbot'
context = 'The year is 2012'
guidelines = ["Don't answer questions about Britney Spears"]

farsight = FarsightAI(openai_key=OPEN_AI_KEY)
generated_prompts = farsight.generate_prompts(num_prompts, task, context, guidelines)
print("prompts: ", generated_prompts)
# prompts: [
#   "You are a conversational chatbot from the year 2012. Your goal is to answer questions
#   based on your knowledge but without answering questions about Britney Spears. I'm here to help! Please
#   provide me with a question and the specific knowledge you want me to use for
#   answering it.",
#   "As a conversational chatbot in the year 2012, your goal is to answer questions
#   accurately and concisely. Your guidelines are to not answer any questions about Britney Spears. Please
#   provide the necessary information for me to generate a response."
# ]
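The guidelines you pass to generate_prompts can also double as custom-metric constraints when you later score outputs produced with those prompts. A minimal sketch, assuming you have already generated an output with one of the prompts above:

# Reuse a guideline as a custom-metric constraint (True in the result = violated).
# The output string is a placeholder; replace it with a real response from your LLM.
constraints = ["Don't answer questions about Britney Spears"]
output = "<your LLM output here>"  # hypothetical placeholder
violations = farsight.custom_metrics(constraints, output)
print("violations: ", violations)
# e.g. [False] if the output complies with the guideline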
Prompt Optimization
For prompt optimization, we offer two distinct approaches: one with manual oversight and one with full automation. Choose the one that best fits your use case and workflow:
Step by Step Approach: Generate multiple system prompts for evaluation and testing purposes. Tailor them based on context and optional system guidelines.
Fully Automated Approach: Leverage our comprehensive automated prompt optimization function. This feature not only generates prompts but also evaluates and iteratively improves them. It operates based on your shadow traffic, evaluation rubric, and optional ground truth outputs.
Want to integrate quickly? Try out Farsight AI on a Colab notebook for automated prompt optimization here and step by step here.
We've included an example of the fully automated approach below:
from farsightai import FarsightAI

# Replace this with your OpenAI credentials
OPEN_AI_KEY = "<openai_key>"

# Replace this with a sample of real user queries (shadow traffic) from your application
shadow_traffic = [
    "What are the current job openings in the company?",
    "How can I apply for a specific position?",
    "What is the status of my job application?",
    "Can you provide information about the company's benefits and perks?",
    "What is the company's policy on remote work or flexible schedules?",
    "How do I update my personal information in the HR system?",
    "Can you explain the process for employee onboarding?",
    "What training and development opportunities are available for employees?",
    "How is performance evaluation conducted in the company?",
    "Can you assist with information about employee assistance programs or wellness initiatives?",
]

farsight = FarsightAI(openai_key=OPEN_AI_KEY)
farsight.get_best_system_prompts(shadow_traffic, gpt_optimized=True)
# Result:
# [
#   PromptEvaluation(
#     score=4.666666666666667,
#     system_prompt="<SYS> Thank you for reaching out to our HR chatbot. How can I assist you with your HR-related queries while ensuring the protection ...",
#     test_results=[
#       TestResult(
#         score=5,
#         input="What is the company's policy on remote work or flexible schedules?",
#         output="Our company recognizes the importance of work-life balance and understands that remote work or flexible schedules can contribute ..."
#       ),
#       TestResult( ... ),
#       TestResult( ... ),
#     ]),
#   PromptEvaluation( ... ),
#   PromptEvaluation( ... )
# ]
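Once the call returns, you can post-process the evaluations yourself. A minimal sketch, assuming you assign the return value and that each PromptEvaluation exposes score and system_prompt fields as in the sample result above:

# Pick the highest-scoring prompt from the returned evaluations.
# Assumes each PromptEvaluation has .score and .system_prompt, as shown in the sample output.
prompt_evaluations = farsight.get_best_system_prompts(shadow_traffic, gpt_optimized=True)
best = max(prompt_evaluations, key=lambda evaluation: evaluation.score)
print("best score: ", best.score)
print("best system prompt: ", best.system_prompt)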
Suggested Starter Workflow
We suggest generating a few system prompts via our generate prompts functionality, then evaluating outputs using the standard Farsight metrics. Follow the steps below:
(a) Generate a reasonable number of system prompts (we suggest 5 to start)
(b) Generate outputs using an LLM of your choice (Mistral, OpenAI, Anthropic)
(c) Evaluate the outputs using our metrics suite. We suggest starting with the standard Farsight metrics and adding custom metrics as needed.
We've provided an example of this suggested generation and evaluation process below.
from farsightai import FarsightAI
from openai import OpenAI

# Replace this with your OpenAI credentials
OPEN_AI_KEY = "<openai_key>"

# Specify your use case parameters
num_prompts = 2
task = 'You are a financial chatbot'
context = 'The year is 2008'
guidelines = ["Don't answer questions about the housing market."]

# Generate a few system prompts to evaluate
farsight = FarsightAI(openai_key=OPEN_AI_KEY)
generated_prompts = farsight.generate_prompts(num_prompts, task, context, guidelines)
print("generated_prompts: ", generated_prompts)

client = OpenAI(api_key=OPEN_AI_KEY)

# Test a specific input
input = "What happened to the market in 2012"
for system_prompt in generated_prompts:
    # For each system prompt, generate an output
    chatCompletion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": input},
        ],
    )
    output = chatCompletion.choices[0].message.content
    knowledge = None

    print("input: ", input)
    print("output: ", output)
    print("---------------metrics---------------")

    # Evaluate the output
    factuality_score = farsight.factuality_score(input, output, knowledge)
    print("factuality_score: ", factuality_score)
    # factuality_score: True

    consistency_score = farsight.consistency_score(input, output)
    print("consistency_score: ", consistency_score)
    # consistency_score: 1.0

    quality_score = farsight.quality_score(input, output)
    print("quality_score: ", quality_score)
    # quality_score: 3

    conciseness_score = farsight.conciseness_score(input, output)
    print("conciseness_score: ", conciseness_score)
    # conciseness_score: 4
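To turn the printed scores into a decision, you can keep a score per system prompt and pick the best performer at the end. This is a minimal sketch that reuses the variables from the workflow above; it re-queries the model for simplicity and uses only quality_score, but in practice you would record the scores inside the evaluation loop itself:

# Track the quality score for each system prompt and pick the best performer.
prompt_scores = {}
for system_prompt in generated_prompts:
    chatCompletion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": input},
        ],
    )
    output = chatCompletion.choices[0].message.content
    prompt_scores[system_prompt] = farsight.quality_score(input, output)

best_prompt = max(prompt_scores, key=prompt_scores.get)
print("best prompt: ", best_prompt)
print("best quality score: ", prompt_scores[best_prompt])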