Generate Optimized Prompts

opro.generate_optimized_prompts(dataset, test_dataset=None, num_iterations=None, num_prompts_per_iteration=None, custom_score_function=None, sample_evaluations=None)

This function generates, evaluates, and iteratively refines system prompts against your dataset, returning system prompts optimized for your specific requirements.

Evaluation

By default, generate_optimized_prompts scores each output against its target by prompting GPT-3.5-Turbo as follows:

You are an LLM evaluator, given this input "{input}" and this target output
"{target}", is this output "{output}" correct?  Please answer yes or no.

To replace this evaluation method, use the custom_score_function parameter (see the example below).

Make sure you have your OpenAI API Key before you begin.
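For example, here is a minimal setup sketch, assuming the library reads the standard OPENAI_API_KEY environment variable used by the OpenAI client:

import os

# Assumption: the key is picked up from the standard OPENAI_API_KEY environment variable.
os.environ["OPENAI_API_KEY"] = "sk-..."  # replace with your own key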

Parameters

dataset (list[dict])
Required. This should include pairs of "input" (the query or request) and "target" (the desired output). We recommend at least 50 samples for robust results; a minimum of 3 samples is required. See Dataset Configuration for more detail.

test_dataset (list[dict])
Same format as dataset. These samples are used only for testing the efficacy of the generated prompts (validation step).

num_iterations (int)
Total number of learning iterations for prompt optimization. Default is 40.

num_prompts_per_iteration (int)
The number of different prompts generated per learning iteration. Default is 8.

sample_evaluations (boolean)
If True, the results include the evaluation of every sample in each iteration. Default is False.

custom_score_function (function)
A custom scoring function with the signature custom_score_function(input, output, target). It evaluates the output's accuracy and returns a score between 0 and 1, where 1 is entirely accurate and 0 is entirely inaccurate.
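As an illustration, a minimal call might look like the sketch below. The top-level opro import and the toy dataset are assumptions made for this example; adjust them to your installation and data.

import opro  # assumed import path; adjust to your installation

# A small "input"/"target" dataset; at least 3 samples are required,
# and 50 or more are recommended for robust results.
dataset = [
    {"input": "What is the capital of France?", "target": "Paris"},
    {"input": "What is 2 + 2?", "target": "4"},
    {"input": "Who wrote Hamlet?", "target": "Shakespeare"},
]

results = opro.generate_optimized_prompts(
    dataset,
    num_iterations=10,
    num_prompts_per_iteration=4,
)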

Output

list[dict]
This includes at most the top 20 system prompts, depending on the iteration settings. Each entry contains:

  • "prompt": always included. The generated system prompt.

  • "score": always included. The average performance score of the prompt across the dataset.

  • "test_score": included only if a test_dataset is provided. Reflects the prompt's efficacy on the test set.

  • "sample_evals": included only if sample_evaluations is set to True. Contains evaluation results as a list[dict] with "sample", "target", "output", and the corresponding "score".

Custom Score Function Example:

The following example illustrates a straightforward way to check whether a particular string appears in the output. For instance, in a dataset of multiple-choice questions and answers, the target string might be "(a)" while the output could be "(a) Joe Biden." To score the output, we simply check whether the target string is present within it.

We offer the option to provide a custom scoring function because accuracy validation is often more complex and specific to your requirements.

def custom_score_function(input, output, target):
    # Score 1 if the target string appears anywhere in the output, 0 otherwise.
    if target in output:
        return 1
    else:
        return 0
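
To use it, pass the function to generate_optimized_prompts via the custom_score_function parameter, for example (reusing the dataset from the earlier sketch):

results = opro.generate_optimized_prompts(
    dataset,
    custom_score_function=custom_score_function,
)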
