Evaluate your LLM outputs with standard Farsight metrics
Metrics are the basis of our open-source starter SDK.
Farsight AI's evaluation suite consists of 4 standard metrics: Factuality, Consistency, Quality, and Conciseness.
You can test your LLM systems with these metrics individually or in combination. See the examples below for how to set up each metric.
Make sure you have your OpenAI API Key before you begin.
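A minimal setup sketch, assuming the package is published under the same name as the import used in the examples below (farsightai); the client only needs your OpenAI key:

# Assumed install command; the package name matches the import below
# pip install farsightai

from farsightai import FarsightAI

OPEN_AI_KEY = "<openai_key>"  # replace with your OpenAI credentials
farsight = FarsightAI(openai_key=OPEN_AI_KEY)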
Factuality
factuality_score()
Evaluate the factuality of your LLM's response and get a result of true (factual) or false (hallucination or otherwise unsubstantiated).
Param       Type   Description
query       str    instruction given to LLM
output      str    response from LLM
knowledge   str    optional: additional context to help evaluate factuality
Output Type   Output Definition
boolean       factuality of the given query based on knowledge (True means the output is correct)
from farsightai import FarsightAI

# Replace with your openAI credentials
OPEN_AI_KEY = "<openai_key>"
query = "Who is the president of the United States"

farsight = FarsightAI(openai_key=OPEN_AI_KEY)

# Replace this with the actual output of your LLM application
output = "As of my last knowledge update in January 2022, Joe Biden is the President of the United States. However, keep in mind that my information might be outdated as my training data goes up to that time, and I do not have browsing capabilities to check for the most current information. Please verify with up-to-date sources."

knowledge = None  # optional param to include additional knowledge

factuality_score = farsight.factuality_score(query, output, knowledge)
print("output: ", factuality_score)
# output: true
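If you have trusted reference context, you can supply it through the optional knowledge parameter instead of leaving it as None. A small sketch reusing the variables above; the knowledge string is illustrative:

# Pass ground-truth context so the check is grounded in your own knowledge
knowledge = "Joe Biden was inaugurated as the 46th President of the United States on January 20, 2021."
factuality_score = farsight.factuality_score(query, output, knowledge)
print("output: ", factuality_score)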
Consistency
consistency_score()
Evaluate your LLM's response consistency on a scale of zero to one, with zero being entirely inconsistent and one being entirely consistent.
Param         Type   Description
instruction   str    instruction given to LLM
response      str    response from LLM
num_samples   int    optional: number of samples to evaluate the LLM's consistency against; increasing num_samples increases latency. The default value is 3.
Output Type   Output Definition
float         score from zero to one, with zero being entirely inconsistent and one being entirely consistent
from farsightai import FarsightAI

# Replace with your openAI credentials
OPEN_AI_KEY = "<openai_key>"
query = "Who is the president of the United States"

farsight = FarsightAI(openai_key=OPEN_AI_KEY)

# Replace this with the actual output of your LLM application
output = "As of my last knowledge update in January 2022, Joe Biden is the President of the United States. However, keep in mind that my information might be outdated as my training data goes up to that time, and I do not have browsing capabilities to check for the most current information. Please verify with up-to-date sources."

consistency_score = farsight.consistency_score(query, output)
print("score: ", consistency_score)
# score: 1.0
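To trade latency for a stricter check, you can raise num_samples from its default of 3. This sketch assumes the method accepts it as a keyword argument, as the parameter table above suggests:

# Assumption: num_samples can be passed as a keyword argument
consistency_score = farsight.consistency_score(query, output, num_samples=5)
print("score: ", consistency_score)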
Quality
quality_score()
Evaluate your LLM's response quality on a scale of 1 to 5, with 1 being subpar / unusable quality and 5 being exemplary, human-like quality.
Param         Type   Description
instruction   str    instruction given to LLM
response      str    response from LLM
Output Type   Output Definition
int           score from 1 to 5, 1 being subpar / unusable quality and 5 being exemplary, human-like quality
from farsightai import FarsightAI

# Replace with your openAI credentials
OPEN_AI_KEY = "<openai_key>"
query = "Who is the president of the United States"

farsight = FarsightAI(openai_key=OPEN_AI_KEY)

# Replace this with the actual output of your LLM application
output = "As of my last knowledge update in January 2022, Joe Biden is the President of the United States. However, keep in mind that my information might be outdated as my training data goes up to that time, and I do not have browsing capabilities to check for the most current information. Please verify with up-to-date sources."

quality_score = farsight.quality_score(query, output)
print("score: ", quality_score)
# score: 4
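Because the result is an integer from 1 to 5, it slots naturally into a test harness as a threshold check. A sketch reusing the variables above; the minimum score of 4 is an illustrative choice:

MIN_QUALITY = 4  # illustrative threshold, tune for your application
quality_score = farsight.quality_score(query, output)
assert quality_score >= MIN_QUALITY, f"quality regression: scored {quality_score}/5"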
Conciseness
conciseness_score()
Evaluate your LLM's response conciseness on a scale of 1 to 5, with 1 being extremely verbose and 5 being very concise. Note that conciseness captures not only word count but also how information-rich each incremental word is.
Param    Type   Description
query    str    instruction given to LLM
output   str    response from LLM
Output Type   Output Definition
int           score from 1 to 5, 1 being extremely verbose and 5 being very concise
from farsightai import FarsightAI

# Replace with your openAI credentials
OPEN_AI_KEY = "<openai_key>"
query = "Who is the president of the United States"

farsight = FarsightAI(openai_key=OPEN_AI_KEY)

# Replace this with the actual output of your LLM application
output = "As of my last knowledge update in January 2022, Joe Biden is the President of the United States. However, keep in mind that my information might be outdated as my training data goes up to that time, and I do not have browsing capabilities to check for the most current information. Please verify with up-to-date sources."

conciseness_score = farsight.conciseness_score(query, output)
print("score: ", conciseness_score)
# score: 3
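To evaluate a response with the metrics in combination, you can call each method on the same query/output pair and collect the scores. A sketch using only the methods documented above; the report dictionary is just one way to organize the results:

from farsightai import FarsightAI

OPEN_AI_KEY = "<openai_key>"  # replace with your OpenAI credentials
farsight = FarsightAI(openai_key=OPEN_AI_KEY)

query = "Who is the president of the United States"
output = "Joe Biden is the President of the United States."
knowledge = None  # optional extra context for the factuality check

# Run all four standard metrics on the same query/output pair
report = {
    "factuality": farsight.factuality_score(query, output, knowledge),
    "consistency": farsight.consistency_score(query, output),
    "quality": farsight.quality_score(query, output),
    "conciseness": farsight.conciseness_score(query, output),
}
print(report)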