Getting Started

Improve and evaluate your LLM applications with a few simple steps

Start evaluating now! Follow a few simple steps to improve your LLMs.

Want to integrate even more quickly? Try out Farsight AI in a Colab notebook here.

Note: While you can evaluate the outputs of any large language model (LLM), Farsight AI's evaluation functions are powered by OpenAI. To use our package, you must have access to an OpenAI API key.
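For example, you can keep the key out of your source code by reading it from an environment variable (a minimal sketch; the OPENAI_API_KEY variable name is a common convention we assume here, not a package requirement):

import os

# Assumes you exported the key in your shell first, e.g.:
#   export OPENAI_API_KEY="sk-..."
OPEN_AI_KEY = os.environ["OPENAI_API_KEY"]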

Set Up a Python Environment

Go to the root directory of your project and create a virtual environment (if you don't already have one). In the CLI, run:

python3 -m venv env
source env/bin/activate
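If you are on Windows, activate the environment with the equivalent command instead:

env\Scripts\activate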

Installation

Install our library by running:

pip install farsightai
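To confirm the installation succeeded, you can try importing the package from the CLI:

python3 -c "import farsightai"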

Evaluate Your First Metric

Utilize our evaluation suite to score your LLM outputs. Below is an example of evaluating a query against an output using the Farsight quality metric.

from farsightai import FarsightAI

# Replace with your OpenAI credentials
OPEN_AI_KEY = "<openai_key>"

query = "Who is the president of the United States"
farsight = FarsightAI(openai_key=OPEN_AI_KEY)

# Replace this with the actual output of your LLM application
output = "As of my last knowledge update in January 2022, Joe Biden is the President of the United States. However, keep in mind that my information might be outdated as my training data goes up to that time, and I do not have browsing capabilities to check for the most current information. Please verify with up-to-date sources."

quality_score = farsight.quality_score(query, output)
print("score: ", quality_score)
# score: 4

Create Your First Custom Metric

Generate a custom metric to evaluate your LLM outputs. A custom metric returns True if the provided constraint is violated, and False if the output complies with the provided constraint.

Below is an example of defining and evaluating a custom metric that checks two independent constraints.

from farsightai import FarsightAI

# Replace with your OpenAI credentials
OPEN_AI_KEY = "<openai_key>"

query = "Who is the president of the United States"
farsight = FarsightAI(openai_key=OPEN_AI_KEY)

# Replace this with the actual output of your LLM application
output = "As of my last knowledge update in January 2022, Joe Biden is the President of the United States. However, keep in mind that my information might be outdated as my training data goes up to that time, and I do not have browsing capabilities to check for the most current information. Please verify with up-to-date sources."
# Replace this with the actual constraints you want to check your LLM output for
constraints = ["do not mention Joe Biden", "do not talk about alcohol"]

custom_metric = farsight.custom_metrics(constraints, output)

print("score: ", custom_metric)
# score:  [True, False]
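The flags appear to be returned in the same order as the constraints (in the example above, the output mentions Joe Biden but not alcohol), so you can pair them up for readability. This loop is an illustrative sketch, not part of the Farsight API:

# Pair each constraint with its violation flag
for constraint, violated in zip(constraints, custom_metric):
    print(constraint, "->", "violated" if violated else "ok")
# do not mention Joe Biden -> violated
# do not talk about alcohol -> ok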

Auto-Generate High-Quality Potential System Prompts

Don't want to waste time trying out different system prompts? Farsight automatically generates high-quality candidates and allows you to quantitatively compare them using standard and custom metrics with minimal effort. Simply describe your application's use case and seamlessly generate multiple system prompts to evaluate.

from farsightai import FarsightAI

# Replace this with your OpenAI credentials
OPEN_AI_KEY = "<openai_key>"

# Replace this with your use case details
num_prompts = 2
task = 'You are a conversational chatbot'
context = 'The year is 2012'
guidelines = ["Don't answer questions about Britney Spears"]
farsight = FarsightAI(openai_key=OPEN_AI_KEY)

generated_prompts = farsight.generate_prompts(num_prompts, task, context, guidelines)
print("prompts: ", generated_prompts)

# prompts: [
#    "You are a conversational chatbot from the year 2012. Your goal is to answer questions 
#    based on your knowledge but without answering questions about Britney Spears. I'm here to help! Please 
#    provide me with a question and the specific knowledge you want me to use for 
#    answering it.",
#    "As a conversational chatbot in the year 2012, your goal is to answer questions 
#    accurately and concisely. Your guidelines are to not answer any questions about Britney Spears. Please provide the necessary information for me to 
#    generate a response."
# ]

Prompt Optimization

For prompt optimization, we offer two distinct approaches: one with manual oversight and one with full automation. Choose the one that best fits your use case, workflows, and anticipated functionality:

  1. Step-by-Step Approach: Generate multiple system prompts for evaluation and testing. Tailor them based on context and optional system guidelines. (This approach is demonstrated in the Suggested Starter Workflow below.)

  2. Fully Automated Approach: Leverage our comprehensive automated prompt optimization function. This feature not only generates prompts but also evaluates and iteratively improves them. It operates based on your shadow traffic, evaluation rubric, and optional ground truth outputs.

Want to integrate quickly? Try out Farsight AI in a Colab notebook for automated prompt optimization here and step by step here.

We included the fully automated approach below:

from farsightai import FarsightAI

# Replace this with your OpenAI credentials
OPEN_AI_KEY = "<openai_key>"
farsight = FarsightAI(openai_key=OPEN_AI_KEY)

shadow_traffic = [
    "What are the current job openings in the company?",
    "How can I apply for a specific position?",
    "What is the status of my job application?",
    "Can you provide information about the company's benefits and perks?",
    "What is the company's policy on remote work or flexible schedules?",
    "How do I update my personal information in the HR system?",
    "Can you explain the process for employee onboarding?",
    "What training and development opportunities are available for employees?",
    "How is performance evaluation conducted in the company?",
    "Can you assist with information about employee assistance programs or wellness initiatives?",
]

farsight.get_best_system_prompts(shadow_traffic, gpt_optimized=True)
# Result:

# [
#     PromptEvaluation(
#         score=4.666666666666667,
#         system_prompt="<SYS> Thank you for reaching out to our HR chatbot. How can I assist you with your HR-related queries while ensuring the protection ...",
#         test_results=[
#               TestResult(
#                     score=5,
#                     input="What is the company's policy on remote work or flexible schedules?",
#                     output="Our company recognizes the importance of work-life balance and understands that remote work or flexible schedules can contribute ...",
#               ),
#               TestResult( ... ),
#               TestResult( ... ),
#           ]),
#     PromptEvaluation( ... ),
#     PromptEvaluation( ... )
# ]
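Assuming the return value matches the PromptEvaluation structure shown above (an assumption based on this example output), you can select the top-scoring prompt programmatically:

evaluations = farsight.get_best_system_prompts(shadow_traffic, gpt_optimized=True)

# Pick the evaluation with the highest average score across its test results
# (assumes the PromptEvaluation structure shown above)
best = max(evaluations, key=lambda evaluation: evaluation.score)
print("best prompt: ", best.system_prompt)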

Suggested Starter Workflow

We suggest you start by generating a few system prompts via our generate_prompts functionality, then evaluate the resulting outputs using standard Farsight metrics. Follow the steps below:

(a) Generate a reasonable number of system prompts (we suggest 5 to start)

(b) Generate outputs using an LLM of your choice (Mistral, OpenAI, Anthropic)

(c) Finally, evaluate using our metrics suite. We suggest starting with the standard Farsight metrics and implementing custom metrics as needed.

We've provided an example of this suggested generation and evaluation process below.

from farsightai import FarsightAI
from openai import OpenAI

# Replace this with your OpenAI credentials
OPEN_AI_KEY = "<openai_key>"
farsight = FarsightAI(openai_key=OPEN_AI_KEY)

# specify your use case parameters
num_prompts = 2
task = 'You are a financial chatbot'
context = 'The year is 2008'
guidelines = ["Don't answer questions about the housing market."]

# generate a few system prompts to evaluate
generated_prompts = farsight.generate_prompts(num_prompts, task, context, guidelines)
print("generated_prompts: ", generated_prompts)

client = OpenAI(api_key=OPEN_AI_KEY)
# test a specific input
input = "What happened to the market in 2012"
for system_prompt in generated_prompts:
    # for each system prompt generate an output 
    chatCompletion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": input},
        ],
    )
    output = chatCompletion.choices[0].message.content
    knowledge = None  # no reference knowledge supplied to the factuality metric
    print("input: ", input)
    print("output: ", output)
    print("---------------metrics---------------")

    # evaluate the output 
    factuality_score = farsight.factuality_score(input, output, knowledge)
    print("factuality_score: ", factuality_score)
    # factuality_score: True
    consistency_score = farsight.consistency_score(input, output)
    print("consistency_score: ", consistency_score)
    # consistency_score: 1.0
    quality_score = farsight.quality_score(input, output)
    print("quality_score: ", quality_score)
    # quality_score: 3
    conciseness_score = farsight.conciseness_score(input, output)
    print("conciseness_score: ", conciseness_score)
    # conciseness_score: 4
