Diversity Quality (DQ) score

The Diversity Quality (DQ) score measures how close the diversity on the in-distribution (ID) and out-of-distribution (OOD) datasets is to the ideal diversity. On the ID set, diversity should be small, while on the OOD set, diversity should be large. Writing IDD for the ID diversity and OODD for the OOD diversity, both in [0, 1], the \(DQ_1\) score is the harmonic mean of \(1 - IDD\) and \(OODD\):

\[DQ_1 = 2 \cdot \frac{(1 - IDD) \cdot OODD}{(1 - IDD) + OODD}\]

The \(DQ_1\) score can be generalized to a \(DQ_\beta\) score that values one of ID diversity or OOD diversity more than the other. With \(\beta\) a positive real factor, OODD is considered \(\beta\) times as important as IDD; setting \(\beta = 1\) recovers \(DQ_1\):

\[DQ_\beta = (1 + \beta^2) \cdot \frac{(1 - IDD) \cdot OODD}{\beta^2 \cdot (1 - IDD) + OODD}\]
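The formula above can be sketched in plain Python. This is a minimal illustration, not the `reject` implementation: the function name `dq_beta` and the handling of a zero denominator are assumptions made here.

```python
def dq_beta(idd: float, oodd: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of (1 - IDD) and OODD.

    Hypothetical helper for illustration; not part of the reject API.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    id_term = 1.0 - idd
    denom = beta**2 * id_term + oodd
    if denom == 0:
        # Both terms are zero; returning 0.0 here is an assumption.
        return 0.0
    return (1 + beta**2) * id_term * oodd / denom


# Ideal case: no diversity on ID, full diversity on OOD gives a score of 1.0.
print(dq_beta(0.0, 1.0))
# With beta = 1 and equal terms (1 - 0.2 = 0.8), the harmonic mean is about 0.8.
print(dq_beta(0.2, 0.8))
```

Because the score is a harmonic mean, it is pulled toward the smaller of the two terms, so a model cannot reach a high score by doing well on only one of the two datasets.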

import reject
from reject.utils import generate_synthetic_output
from reject.diversity import diversity_quality_score, diversity_score

print(reject.__version__)

0.3.2

Generate synthetic NN output

In this example, we generate synthetic outputs of a NN with multiple samples of the predictive distribution. The output predictions have shape (n_observations, n_samples, n_classes) and the true labels have shape (n_observations,). The data generation function uses 10 output classes.

NUM_SAMPLES = 10
NUM_OBSERVATIONS = 1000

(y_pred_id, y_true_id), (y_pred_ood, y_true_ood) = generate_synthetic_output(
    NUM_SAMPLES, NUM_OBSERVATIONS, concat=False
)
print(y_pred_id.shape, y_true_id.shape)
print(y_pred_ood.shape, y_true_ood.shape)

(1000, 10, 10) (1000,)
(1000, 10, 10) (1000,)

Diversity score

We first calculate the diversity scores on the ID and OOD sets. The diversity score is calculated as the fraction of test data points on which predictions of ensemble members disagree. As the base model for diversity computation, we average the output distributions over the members and determine the resulting predicted label.
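The description above can be sketched with NumPy as follows. This is one reading of the text, not the library's code: the helper name `diversity_sketch` is hypothetical, and `reject.diversity.diversity_score` may differ in detail (for example in tie handling).

```python
import numpy as np


def diversity_sketch(y_pred: np.ndarray, average: bool = True):
    """Fraction of observations where a member disagrees with the base model.

    y_pred has shape (n_observations, n_samples, n_classes).
    Hypothetical helper; not part of the reject API.
    """
    # Base model: average the member distributions, then take the arg max.
    base_labels = y_pred.mean(axis=1).argmax(axis=-1)  # (n_observations,)
    # Label predicted by each individual member.
    member_labels = y_pred.argmax(axis=-1)  # (n_observations, n_samples)
    # Per-member disagreement rate with the base model.
    per_member = (member_labels != base_labels[:, None]).mean(axis=0)
    return per_member.mean() if average else per_member


# Toy input: 2 observations, 3 members, 2 classes.
y_toy = np.array([
    [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]],  # member 2 disagrees with the base label 0
    [[0.6, 0.4], [0.7, 0.3], [0.9, 0.1]],  # all members agree on label 0
])
print(diversity_sketch(y_toy, average=False))  # per-member rates: 0, 0, 0.5
print(diversity_sketch(y_toy, average=True))   # mean of the per-member rates
```

In the toy input only the third member disagrees with the averaged prediction, and only on one of the two observations, so its disagreement rate is 0.5 while the other two members have a rate of 0.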

The diversity_score function directly takes the predictions. You can choose to get the diversity for each member or the average diversity.

# ID set - diversity for each member
div_score = diversity_score(y_pred=y_pred_id, average=False)
print(div_score)

# ID set - average diversity
div_score = diversity_score(y_pred=y_pred_id, average=True)
print(div_score)

[0.404 0.401 0.444 0.446 0.438 0.44  0.442 0.4   0.398 0.452]
0.4265

# OOD set - diversity for each member
div_score = diversity_score(y_pred=y_pred_ood, average=False)
print(div_score)

# OOD set - average diversity
div_score = diversity_score(y_pred=y_pred_ood, average=True)
print(div_score)

[0.671 0.648 0.659 0.676 0.637 0.651 0.672 0.675 0.662 0.625]
0.6576

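We observe that the diversity scores on the OOD set (0.6576 on average) are higher than the diversity scores on the ID set (0.4265 on average). This is expected and is desired in real-life applications: ensemble members should agree on ID data and disagree on OOD data.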

\(DQ_1\)-score

Based on the ID and OOD diversities, we calculate the \(DQ_1\)-score.

The diversity_quality_score function directly takes in the ID and OOD predictions. You can choose to get the diversity quality for each member or the average diversity quality. By default, the \(DQ_1\)-score is calculated, which gives equal weight to the ID and OOD diversity.

# diversity quality for each member
dq_score = diversity_quality_score(
    y_pred_id=y_pred_id, y_pred_ood=y_pred_ood, average=False
)
print(dq_score)

# average diversity quality
dq_score = diversity_quality_score(
    y_pred_id=y_pred_id, y_pred_ood=y_pred_ood, average=True
)
print(dq_score)

[0.63128019 0.62253729 0.60313416 0.60894959 0.5971543  0.60208092
 0.60971707 0.63529412 0.63057595 0.58397272]
0.6126774429372105
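As a quick check, the average score can be reproduced by hand from the average diversities reported above (IDD = 0.4265, OODD = 0.6576), with intermediate values rounded to four decimals:

\[DQ_1 = 2 \cdot \frac{(1 - 0.4265) \cdot 0.6576}{(1 - 0.4265) + 0.6576} = \frac{0.7543}{1.2311} \approx 0.6127\]

This matches the printed average. The plain mean of the ten per-member scores is slightly different (about 0.6125), which suggests that, with average=True, the score is computed from the averaged diversities rather than by averaging the per-member scores. This is an inference from the printed numbers, not a statement about the library's internals.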

\(DQ_\beta\)-score

By adjusting the beta_ood parameter, we can assign a higher weight to either ID or OOD diversity. For example, for beta_ood=2, the OOD diversity is considered twice as important as the ID diversity. Conversely, for beta_ood=0.5, the ID diversity is considered twice as important as the OOD diversity.

# beta = 2.0
dq_score = diversity_quality_score(
    y_pred_id=y_pred_id, y_pred_ood=y_pred_ood, beta_ood=2.0, average=True
)
print(dq_score)

# beta = 0.5
dq_score = diversity_quality_score(
    y_pred_id=y_pred_id, y_pred_ood=y_pred_ood, beta_ood=0.5, average=True
)
print(dq_score)

0.6388629895649817
0.5885539498735916
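Both values can be checked by hand by plugging the average diversities reported earlier (IDD = 0.4265, OODD = 0.6576) into the \(DQ_\beta\) formula, with intermediate values rounded to four decimals:

\[DQ_2 = (1 + 2^2) \cdot \frac{(1 - 0.4265) \cdot 0.6576}{2^2 \cdot (1 - 0.4265) + 0.6576} = \frac{1.8857}{2.9516} \approx 0.6389\]

\[DQ_{0.5} = (1 + 0.5^2) \cdot \frac{(1 - 0.4265) \cdot 0.6576}{0.5^2 \cdot (1 - 0.4265) + 0.6576} = \frac{0.4714}{0.8010} \approx 0.5886\]

Here the OOD term (0.6576) is larger than the ID term (1 - 0.4265 = 0.5735), so weighting OOD diversity more heavily raises the score above \(DQ_1 \approx 0.6127\), while weighting ID diversity more heavily lowers it.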