Adversarial Examples
An adversarial example is an input that has been deliberately modified so that a model makes a wrong prediction, while the modification is small enough that a human observer still sees the original object.
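As a concrete illustration, the sketch below applies the fast gradient sign method (FGSM), one standard way to construct adversarial examples. The tiny classifier, input image, label, and perturbation budget are all placeholder assumptions for the example, not anything defined in this text.

```python
# A minimal FGSM sketch on a stand-in classifier; model, input, label,
# and epsilon are illustrative assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # stand-in classifier
model.eval()

x = torch.rand(1, 1, 28, 28, requires_grad=True)  # stand-in image in [0, 1]
y = torch.tensor([3])                              # assumed true label
epsilon = 0.03                                     # perturbation budget

loss = nn.functional.cross_entropy(model(x), y)
loss.backward()

# Step each pixel slightly in the direction that increases the loss,
# then clamp back to the valid pixel range.
x_adv = (x + epsilon * x.grad.sign()).clamp(0.0, 1.0).detach()

print("clean prediction:", model(x).argmax(dim=1).item())
print("adversarial prediction:", model(x_adv).argmax(dim=1).item())
```

The sign of the gradient keeps every pixel change within the small budget epsilon, which is what keeps the perturbation imperceptible to a human while still moving the model's decision.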
Distribution Shift
A distribution shift occurs when the data encountered at deployment is drawn from a different distribution than the data used during training.
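One simple way to notice such a shift in practice is to compare feature distributions between a training sample and a deployment sample. The sketch below does this with a per-feature two-sample Kolmogorov-Smirnov test; the synthetic data and the significance threshold are illustrative assumptions.

```python
# A minimal sketch of detecting distribution shift with per-feature
# Kolmogorov-Smirnov tests; data and threshold are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=(5000, 4))    # "training" features
deploy = rng.normal(loc=0.5, scale=1.0, size=(5000, 4))   # shifted "deployment" features

for j in range(train.shape[1]):
    stat, p_value = ks_2samp(train[:, j], deploy[:, j])
    flag = "SHIFT" if p_value < 0.01 else "ok"
    print(f"feature {j}: KS statistic={stat:.3f}, p={p_value:.3g} -> {flag}")
```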
Saliency Maps
A saliency map is a visualization that assigns an importance score to each part of an input, such as each pixel of an image, indicating how strongly that part influenced the model's output.
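A minimal sketch of one common recipe, gradient-based saliency, is shown below: take the gradient of the predicted class score with respect to the input and use its magnitude as the per-pixel importance. The stand-in model and image are assumptions for the example.

```python
# A minimal gradient-saliency sketch; the model and input are placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in classifier
model.eval()

x = torch.rand(1, 3, 32, 32, requires_grad=True)  # stand-in image

score = model(x).max()   # score of the highest-scoring class
score.backward()

# Importance of each pixel: gradient magnitude, taking the max over channels.
saliency = x.grad.abs().max(dim=1).values.squeeze(0)  # shape (32, 32)
print(saliency.shape, saliency.max().item())
```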
Attribution Methods
Attribution methods assign credit or blame for a model's output to parts of the input, to hidden representations, to individual neurons or features, or to training examples.
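The sketch below illustrates one widely used input-attribution method, integrated gradients, which averages gradients along a straight path from a baseline to the input. The model, baseline, target class, and number of steps are illustrative assumptions.

```python
# A minimal integrated-gradients sketch; model, baseline, target class,
# and step count are illustrative assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in classifier
model.eval()

x = torch.rand(1, 3, 32, 32)       # input to explain
baseline = torch.zeros_like(x)     # all-black reference image
target = 3                         # class whose score we attribute
steps = 50

total_grad = torch.zeros_like(x)
for alpha in torch.linspace(0.0, 1.0, steps):
    point = (baseline + alpha * (x - baseline)).requires_grad_(True)
    model(point)[0, target].backward()
    total_grad += point.grad

# Average gradient along the path, scaled by the input-baseline difference.
attributions = (x - baseline) * total_grad / steps
print(attributions.shape)
```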
Mechanistic Interpretability
Mechanistic interpretability studies neural networks by treating them as learned computational systems, aiming to reverse-engineer their internal components, such as neurons, weights, and circuits, into human-understandable algorithms.
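A basic building block of this kind of analysis is inspecting what a network computes internally rather than only its final output. The sketch below records an intermediate activation with a forward hook; the two-layer model and input are placeholder assumptions.

```python
# A minimal sketch of recording an internal activation with a forward hook;
# the model and input are placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))

captured = {}

def save_activation(module, inputs, output):
    # Store the hidden activation so it can be analysed after the forward pass.
    captured["hidden"] = output.detach()

model[1].register_forward_hook(save_activation)  # hook the ReLU layer

x = torch.randn(4, 8)
logits = model(x)

hidden = captured["hidden"]  # shape (4, 16)
print("neurons active on each input:", (hidden > 0).sum(dim=1).tolist())
```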
Model Editing
Model editing modifies a trained model so that it changes a specific behavior while preserving most other behaviors.
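As a rough illustration of the trade-off involved, the sketch below applies one simple editing strategy: fine-tune only the final layer on a single correction example, then measure how much the outputs on unrelated inputs have drifted. The model, data, and training loop are illustrative assumptions, not a specific published editing method.

```python
# A minimal model-editing sketch: update one layer on one correction example
# and check drift on unrelated inputs. All values are illustrative.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3))

x_edit = torch.randn(1, 8)      # input whose behavior we want to change
y_edit = torch.tensor([2])      # desired new label for that input
x_other = torch.randn(32, 8)    # unrelated inputs that should stay stable

with torch.no_grad():
    before = model(x_other)

# Update only the final layer so the edit stays localized.
optimizer = torch.optim.SGD(model[2].parameters(), lr=0.1)
for _ in range(100):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x_edit), y_edit)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    after = model(x_other)
    drift = (after - before).abs().mean().item()
    print("edited prediction:", model(x_edit).argmax(dim=1).item())
    print("mean change on unrelated inputs:", round(drift, 4))
```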