Standard Metrics

Evaluate your LLM outputs with standard Farsight metrics

Metrics are the basis of our open-source starter SDK.

Farsight AI's evaluation suite consists of 4 standard metrics: Factuality, Consistency, Quality, and Conciseness.

You can test your LLM systems with these metrics individually or in combination. The examples below show how to set up each metric.

Make sure you have your OpenAI API Key before you begin.
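One way to avoid hard-coding the key in source files is to read it from an environment variable. The sketch below assumes the variable is named `OPENAI_API_KEY` (a common convention, not something this SDK is known to require). `load_openai_key` is a hypothetical helper and is not part of `farsightai`.

```python
import os


def load_openai_key(env=None, var_name="OPENAI_API_KEY"):
    """Return the OpenAI key from the environment, or raise a clear error.

    Hypothetical helper; `var_name` is an assumed naming convention.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key
```

The returned string can then be passed as `FarsightAI(openai_key=...)`, in place of the placeholder used in the examples on this page.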

Factuality

factuality_score()

Evaluate the factuality of your LLM's response. The result is True (factual) or False (a hallucination or a generally unsubstantiated claim).

| Param | Type | Description |
| --- | --- | --- |
| query | str | Instruction given to the LLM |
| output | str | Response from the LLM |
| knowledge | str | Optional. Additional context to help evaluate factuality |

| Output Type | Output Definition |
| --- | --- |
| boolean | Factuality of the output for the given query, based on knowledge if provided (True means the output is correct) |

from farsightai import FarsightAI

# Replace with your OpenAI credentials
OPEN_AI_KEY = "<openai_key>"

query = "Who is the president of the United States"
farsight = FarsightAI(openai_key=OPEN_AI_KEY)

# Replace this with the actual output of your LLM application
output = "As of my last knowledge update in January 2022, Joe Biden is the President of the United States. However, keep in mind that my information might be outdated as my training data goes up to that time, and I do not have browsing capabilities to check for the most current information. Please verify with up-to-date sources."
knowledge = None # optional param to include additional knowledge

factuality_score = farsight.factuality_score(query, output, knowledge)
print("output: ", factuality_score)
# output: True
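Because the result is a boolean, it can act as a simple gate in front of downstream code. The sketch below is illustrative only: `keep_factual` is a hypothetical helper, and `score_fn` stands in for any callable with the same `(query, output, knowledge)` call shape as `farsight.factuality_score` above. A stub scorer is used so the snippet runs without an API key.

```python
def keep_factual(cases, score_fn):
    """Return only the cases that score_fn judges factual.

    cases: iterable of (query, output, knowledge) tuples.
    score_fn: callable returning True/False, e.g. farsight.factuality_score.
    """
    return [case for case in cases if score_fn(*case)]


def stub_score(query, output, knowledge=None):
    # Stand-in for the real metric: "factual" only if the knowledge text
    # appears in the output. The real metric does not work this way.
    return knowledge is not None and knowledge in output


cases = [
    ("Capital of France?", "The capital is Paris.", "Paris"),
    ("Capital of France?", "The capital is Lyon.", "Paris"),
]
factual_cases = keep_factual(cases, stub_score)
print(len(factual_cases))  # 1
```

With the real SDK, passing `farsight.factuality_score` as `score_fn` would make one metric call per case.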

Consistency

consistency_score()

Evaluate your LLM's response consistency on a scale of zero to one, with zero being entirely inconsistent and one being entirely consistent.

| Param | Type | Description |
| --- | --- | --- |
| instruction | str | Instruction given to the LLM |
| response | str | Response from the LLM |
| num_samples | int | Optional. Number of samples to evaluate the LLM's consistency against. Increasing num_samples increases latency. The default value is 3. |

| Output Type | Output Definition |
| --- | --- |
| float | Score from zero to one, with zero being entirely inconsistent and one being entirely consistent |

from farsightai import FarsightAI

# Replace with your OpenAI credentials
OPEN_AI_KEY = "<openai_key>"

query = "Who is the president of the United States"
farsight = FarsightAI(openai_key=OPEN_AI_KEY)

# Replace this with the actual output of your LLM application
output = "As of my last knowledge update in January 2022, Joe Biden is the President of the United States. However, keep in mind that my information might be outdated as my training data goes up to that time, and I do not have browsing capabilities to check for the most current information. Please verify with up-to-date sources."

consistency_score = farsight.consistency_score(query, output)

print("score: ", consistency_score)
# score: 1.0
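The optional `num_samples` parameter from the table is not used in the example above. The sketch below shows one way to pass it, assuming it is accepted as a keyword argument under that name (the table lists the name, but the keyword form is an assumption). `StubEvaluator` and `score_consistency` are hypothetical; the stub lets the snippet run without an API key, and with the real SDK you would pass the `farsight` object instead.

```python
class StubEvaluator:
    """Hypothetical stand-in for a FarsightAI instance; records num_samples."""

    def __init__(self):
        self.seen_num_samples = []

    def consistency_score(self, instruction, response, num_samples=3):
        self.seen_num_samples.append(num_samples)
        return 1.0  # fixed placeholder, not a real consistency score


def score_consistency(evaluator, query, output, num_samples=3):
    """Call consistency_score with an explicit sample count.

    A larger num_samples means more samples to compare against,
    and therefore higher latency.
    """
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    return evaluator.consistency_score(query, output, num_samples=num_samples)


stub = StubEvaluator()
score = score_consistency(stub, "Who wrote Hamlet?", "Shakespeare.", num_samples=5)
print(score, stub.seen_num_samples)  # 1.0 [5]
```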

Quality

quality_score()

Evaluate your LLM's response quality on a scale of 1 to 5, with 1 being subpar or unusable quality and 5 being exemplary, human-like quality.

| Param | Type | Description |
| --- | --- | --- |
| instruction | str | Instruction given to the LLM |
| response | str | Response from the LLM |

| Output Type | Output Definition |
| --- | --- |
| int | Score from 1 to 5, with 1 being subpar or unusable quality and 5 being exemplary, human-like quality |

from farsightai import FarsightAI

# Replace with your OpenAI credentials
OPEN_AI_KEY = "<openai_key>"

query = "Who is the president of the United States"
farsight = FarsightAI(openai_key=OPEN_AI_KEY)

# Replace this with the actual output of your LLM application
output = "As of my last knowledge update in January 2022, Joe Biden is the President of the United States. However, keep in mind that my information might be outdated as my training data goes up to that time, and I do not have browsing capabilities to check for the most current information. Please verify with up-to-date sources."

quality_score = farsight.quality_score(query, output)
print("score: ", quality_score)
# score: 4
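A single 1 to 5 score is often easier to interpret when aggregated over several test cases. The following sketch averages quality scores across a small batch. `mean_quality` is a hypothetical helper, `score_fn` stands in for `farsight.quality_score`, and the scores in `fake_scores` are made up so the snippet runs offline.

```python
from statistics import mean


def mean_quality(cases, score_fn):
    """Average a 1-5 quality score over (query, output) pairs."""
    scores = [score_fn(query, output) for query, output in cases]
    for score in scores:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range 1-5: {score}")
    return mean(scores)


# Stub scorer with made-up values; replace with farsight.quality_score.
fake_scores = {"short answer": 2, "detailed answer": 4}
cases = [("q1", "short answer"), ("q2", "detailed answer")]
average = mean_quality(cases, lambda query, output: fake_scores[output])
print(average)
```

The range check is a defensive assumption based on the documented 1 to 5 output; it is not something the SDK is known to enforce.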

Conciseness

conciseness_score()

Evaluate your LLM's response conciseness on a scale of 1 to 5, with 1 being extremely verbose and 5 being very concise. Note that conciseness captures not only word count but also how information-rich each additional word is.

| Param | Type | Description |
| --- | --- | --- |
| query | str | Instruction given to the LLM |
| output | str | Response from the LLM |

| Output Type | Output Definition |
| --- | --- |
| int | Score from 1 to 5, with 1 being extremely verbose and 5 being very concise |

from farsightai import FarsightAI

# Replace with your OpenAI credentials
OPEN_AI_KEY = "<openai_key>"

query = "Who is the president of the United States"
farsight = FarsightAI(openai_key=OPEN_AI_KEY)

# Replace this with the actual output of your LLM application
output = "As of my last knowledge update in January 2022, Joe Biden is the President of the United States. However, keep in mind that my information might be outdated as my training data goes up to that time, and I do not have browsing capabilities to check for the most current information. Please verify with up-to-date sources."

conciseness_score = farsight.conciseness_score(query, output)
print("score: ", conciseness_score)
# score: 3
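To run the metrics in combination, as mentioned at the top of this page, the four calls can be wrapped in one helper. This is a minimal sketch: `evaluate_all` and `FixedEvaluator` are hypothetical names, and only the four method names and their call shapes are taken from the examples above. The stub returns fixed placeholder values so the snippet runs without an API key.

```python
def evaluate_all(evaluator, query, output, knowledge=None):
    """Run the four standard metrics and collect the results in a dict."""
    return {
        "factuality": evaluator.factuality_score(query, output, knowledge),
        "consistency": evaluator.consistency_score(query, output),
        "quality": evaluator.quality_score(query, output),
        "conciseness": evaluator.conciseness_score(query, output),
    }


class FixedEvaluator:
    """Hypothetical stand-in for FarsightAI; returns fixed placeholders."""

    def factuality_score(self, query, output, knowledge=None):
        return True

    def consistency_score(self, query, output):
        return 1.0

    def quality_score(self, query, output):
        return 4

    def conciseness_score(self, query, output):
        return 3


report = evaluate_all(FixedEvaluator(), "example query", "example output")
print(report)
```

With the real SDK, passing the `farsight` object as `evaluator` would issue all four metric calls for the same query and output, so expect the latency of each call to add up.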
