Getting Started
Improve and evaluate your LLM applications with a few simple steps
Start evaluating now! Follow the steps below to improve your LLM applications.
Note: While you can evaluate the outputs of any large language model (LLM), Farsight AI uses OpenAI for its evaluation functions. To use our package, you must have an OpenAI API key.
Set Up a Python Environment
Go to the root directory of your project and create a virtual environment (if you don't already have one). In the CLI, run:
python3 -m venv venv
source venv/bin/activate
Installation
Install our library by running:
pip install farsightai
Evaluate Your First Metric
Utilize our evaluation suite to score your LLM outputs. Below is an example of evaluating a query against an output using the Farsight quality metric.
from farsightai import FarsightAI
# Replace with your OpenAI credentials
OPEN_AI_KEY = "<openai_key>"
query = "Who is the president of the United States?"
farsight = FarsightAI(openai_key=OPEN_AI_KEY)
# Replace this with the actual output of your LLM application
output = "As of my last knowledge update in January 2022, Joe Biden is the President of the United States. However, keep in mind that my information might be outdated as my training data goes up to that time, and I do not have browsing capabilities to check for the most current information. Please verify with up-to-date sources."
quality_score = farsight.quality_score(query, output)
print("score: ", quality_score)
# score: 4
Create Your First Custom Metric
Generate a custom metric to evaluate your LLM outputs. A custom metric returns True if the provided constraint is violated, and False if the output complies with it.
Below is an example of defining and measuring a custom metric that checks two independent constraints.
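The sketch below assumes a custom_metrics(constraints, output) method that returns one boolean per constraint (True if the constraint is violated); the method name and signature shown here are illustrative assumptions, so check the API reference for the exact call.
from farsightai import FarsightAI

farsight = FarsightAI(openai_key="<openai_key>")

# Two independent constraints to check against the output.
constraints = [
    "The response must not mention a knowledge cutoff date.",
    "The response must be three sentences or fewer.",
]

# Replace this with the actual output of your LLM application
output = "As of my last knowledge update in January 2022, Joe Biden is the President of the United States."

# Hypothetical call: custom_metrics is an assumed name, not the confirmed API.
# Expected to return one boolean per constraint (True = violated, False = compliant).
violations = farsight.custom_metrics(constraints, output)
print("violations: ", violations)
# e.g. [True, False]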
Auto-Generate High-Quality Potential System Prompts
Don't want to waste time trying out different system prompts? Farsight automatically generates great candidates and allows you to quantitatively compare them using standard and custom metrics with minimal effort. Simply describe the use case of your application and seamlessly generate multiple system prompts to evaluate.
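As a sketch of how this might look, the example below assumes a generate_prompts(task_description, num_prompts) method; the method name and parameters are illustrative assumptions, so consult the API reference for the exact signature.
from farsightai import FarsightAI

farsight = FarsightAI(openai_key="<openai_key>")

# Describe what your application does; Farsight uses this to propose candidate system prompts.
task = "A customer-support assistant that answers billing questions for a SaaS product."

# Hypothetical call: the method name and parameters are assumptions, not the confirmed API.
candidate_prompts = farsight.generate_prompts(task_description=task, num_prompts=5)
for i, prompt in enumerate(candidate_prompts, start=1):
    print(f"prompt {i}: {prompt}")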
Prompt Optimization
For prompt optimization, we offer two distinct approaches: one with manual oversight and one with full automation. Choose the one that aligns best with your use case, workflows, and anticipated functionality:
Step by Step Approach: Generate multiple system prompts for evaluation and testing purposes. Tailor them based on context and optional system guidelines.
Fully Automated Approach: Leverage our comprehensive automated prompt optimization function. This feature not only generates prompts but also evaluates and iteratively improves them. It operates based on your shadow traffic, evaluation rubric, and optional ground truth outputs.
We included the fully automated approach below:
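The sketch that follows illustrates what the fully automated approach could look like, assuming a hypothetical optimize_prompt method that accepts shadow traffic, an evaluation rubric, and optional ground-truth outputs; the method name and parameters are assumptions, so refer to the API reference for the actual function.
from farsightai import FarsightAI

farsight = FarsightAI(openai_key="<openai_key>")

# Shadow traffic: representative queries your application receives in production.
shadow_traffic = [
    "How do I update my billing address?",
    "Why was I charged twice this month?",
]

# Evaluation rubric describing what a good response looks like.
rubric = "Responses should be accurate, polite, and no longer than three sentences."

# Optional ground-truth outputs, aligned with shadow_traffic.
ground_truth = [
    "You can update your billing address under Settings > Billing.",
    "Duplicate charges are usually pending holds and drop off within a few business days.",
]

# Hypothetical call: the method name and parameters are assumptions, not the confirmed API.
best_prompt = farsight.optimize_prompt(
    shadow_traffic=shadow_traffic,
    rubric=rubric,
    ground_truth=ground_truth,
)
print("optimized system prompt: ", best_prompt)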
Suggested Starter Workflow
We suggest you start by generating a few system prompts via our generate prompts functionality, then start evaluating outputs using standard Farsight metrics. Follow the steps below:
(a) Generate a reasonable number of system prompts (we suggest 5 to start)
(b) Start to generate outputs using an LLM of your choice (Mistral, OpenAI, Anthropic)
(c) Finally, evaluate using our metrics suite. We suggest starting with the standard Farsight metrics and implementing custom metrics as needed for your evaluation.
We've provided an example of this suggested generation and evaluation process below.
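The minimal end-to-end sketch below reuses the hypothetical generate_prompts call from above, uses the OpenAI chat completions client for step (b), and scores each output with the standard quality_score metric shown earlier; the prompt-generation call is an assumption, so check the API reference for the exact name.
from farsightai import FarsightAI
from openai import OpenAI

OPEN_AI_KEY = "<openai_key>"
farsight = FarsightAI(openai_key=OPEN_AI_KEY)
client = OpenAI(api_key=OPEN_AI_KEY)

query = "Who is the president of the United States?"

# (a) Generate candidate system prompts (generate_prompts is a hypothetical name; see above).
system_prompts = farsight.generate_prompts(
    task_description="A general question-answering assistant.", num_prompts=5
)

# (b) Generate an output for each candidate prompt with an LLM of your choice (OpenAI here).
# (c) Score each output with a standard Farsight metric.
for system_prompt in system_prompts:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
        ],
    )
    output = response.choices[0].message.content
    score = farsight.quality_score(query, output)
    print(f"score {score}: {system_prompt[:60]}...")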