BBQ

BBQ, or the Bias Benchmark for QA, evaluates an LLM's ability to generate unbiased responses across a range of attested social biases. It consists of 58K unique three-choice questions spanning bias categories such as age, race, gender, religion, and more. You can read more about the BBQ benchmark and its construction in this paper.

info

BBQ evaluates model responses for bias at two levels:

  1. Whether responses reflect attested social biases when the context is ambiguous (i.e. insufficient to determine the answer).
  2. Whether the model's biases override the correct choice when the context is disambiguated (i.e. sufficient to determine the answer).

Arguments

There are two optional arguments when using the BBQ benchmark:

  • [Optional] tasks: a list of tasks (BBQTask enums) specifying the subject areas to evaluate the model on. By default, this is set to all tasks. The full list of BBQTask enums can be found under BBQ Tasks below.
  • [Optional] n_shots: the number of examples for few-shot learning. This is set to 5 by default and cannot exceed 5.

Example

The code below assesses a custom mistral_7b model (click here to learn how to use ANY custom LLM) on age and gender-related biases using 3-shot prompting.

from deepeval.benchmarks import BBQ
from deepeval.benchmarks.tasks import BBQTask

# Define benchmark with specific tasks and shots
benchmark = BBQ(
    tasks=[BBQTask.AGE, BBQTask.GENDER_IDENTITY],
    n_shots=3
)

# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)

The overall_score for this benchmark ranges from 0 to 1, where 1 signifies perfect performance and 0 indicates no correct answers. The score is computed using exact matching: the proportion of questions for which the model produces the exact correct multiple-choice answer (e.g. 'A' or 'C') out of the total number of questions.
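As a rough illustration, the exact-match calculation boils down to the following (a sketch only; the predictions and gold_answers lists below are hypothetical and not part of the deepeval API):

# Sketch of exact-match scoring, not deepeval's actual implementation
predictions = ["A", "C", "B", "A"]   # hypothetical model outputs
gold_answers = ["A", "C", "C", "A"]  # hypothetical correct choices

overall_score = sum(p == g for p, g in zip(predictions, gold_answers)) / len(gold_answers)
print(overall_score)  # 0.75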

tip

As a result, using more few-shot examples (n_shots) can greatly improve the model's robustness in producing answers in the exact expected format, and thus boost the overall score.
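For instance, you could sketch a quick comparison across shot counts (reusing the hypothetical mistral_7b custom model from the example above):

from deepeval.benchmarks import BBQ
from deepeval.benchmarks.tasks import BBQTask

# Compare overall scores at different shot counts
# (mistral_7b is the custom model placeholder from the earlier example)
for shots in (1, 5):
    benchmark = BBQ(tasks=[BBQTask.AGE], n_shots=shots)
    benchmark.evaluate(model=mistral_7b)
    print(f"n_shots={shots}: {benchmark.overall_score}")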

BBQ Tasks

The BBQTask enum classifies the diverse range of bias categories covered in the BBQ benchmark.

from deepeval.benchmarks.tasks import BBQTask

bbq_tasks = [BBQTask.AGE]
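This list can then be passed to the tasks argument when constructing the benchmark, for example:

# Evaluate only the selected tasks
benchmark = BBQ(tasks=bbq_tasks)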

Below is the comprehensive list of available tasks:

  • AGE
  • DISABILITY_STATUS
  • GENDER_IDENTITY
  • NATIONALITY
  • PHYSICAL_APPEARANCE
  • RACE_ETHNICITY
  • RACE_X_SES
  • RACE_X_GENDER
  • RELIGION
  • SES
  • SEXUAL_ORIENTATION