2.3 Interpretability Landscape
2.3.3 Taxonomy of Interpretability Methods
Interpretability techniques provide model interpretations that take the form of explanations; hence, in this thesis, the broader term “interpretability” refers to these interpretability techniques. Researchers differentiate between interpretations based on whether they describe the model behavior [19, 112, 115, 118, 121, 126, 129–131]. Interpretability encompasses different aspects such as the training algorithm, hyperparameter settings, weight representation, and the mathematical relationships between inputs and outputs [112, 115]. Figure 2.9 presents a taxonomy of interpretability techniques, which can be categorized along the following aspects: (i) how interpretations are generated; (ii) scope; (iii) level of evaluation; and (iv) nature of the technique.
Generating Interpretations: Interpretability techniques can be differentiated according to the interpretations they produce, which take the following forms:
Figure 2.9: Taxonomy of interpretability techniques.
1. Feature Summary Statistics: A statistical summary is reported for the input features, such as individual feature importance or pairwise feature correlation [125].
2. Feature Summary Visualization: The statistical summary can be visualized to provide better insight into the feature contributions responsible for the prediction, for instance, partial dependence plots [132], Shapley additive explanations [133], and local interpretable model-agnostic explanations [123].
3. Trained Model Weights: Intrinsically interpretable models fall in this category. Examples are linear model weights, the features and thresholds at the nodes of decision trees, and visualizations of convolution filter maps in a CNN [125].
4. Training Data Instances: These techniques produce data instances (either training data samples or combinations of training data samples) to make a model interpretable, for example, an explainable deep neural network that generates explainable prototypes [31, 134]. Such methods are suitable for the computer vision and natural language processing domains but provide less valuable insight for data with a large number of numerical and categorical features.
5. Intrinsically Interpretable: Black-box models can be interpreted through approximation with a piece-wise linear model or another interpretable model. The approximating model can then be interpreted through either its learned weights or summarised feature statistics [125], as sketched in the example after this list.
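To make the approximation idea concrete, the following is a minimal sketch, assuming scikit-learn and an illustrative dataset, model, and set of hyperparameters (none of which come from the cited works): a black-box random forest is distilled into a shallow decision tree whose learned splits serve as the interpretation.

```python
# Minimal global-surrogate sketch: approximate a black-box model with a
# shallow decision tree and read its splits. Dataset, depth, and model
# choices are illustrative assumptions, not prescriptions from the thesis.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Black-box model whose behaviour we want to approximate.
black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Surrogate: a depth-limited tree trained on the black box's predictions.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

# Fidelity: how often the surrogate agrees with the black box.
fidelity = (surrogate.predict(X) == black_box.predict(X)).mean()
print(f"surrogate fidelity: {fidelity:.2f}")

# The learned splits act as the summarised interpretation of the black box.
print(export_text(surrogate, feature_names=list(X.columns)))
```

The surrogate is only as trustworthy as its fidelity to the black box, which is why the agreement score is reported alongside the extracted rules.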
Scope of Technique: The interpretations provided by a technique can be local, global, or somewhere in between [19, 20, 29, 114, 120, 125, 135–137]. Global interpretations provide an understanding of the logic the model follows and the generalized reasoning leading to different outcomes [114]. They offer insight into how a model works, including training information, functional-level performance such as the F1 score, and failure modes, i.e., where the model fails to predict correctly [137]. Although such interpretations are complete, they lack comprehensibility. On the other hand, local or instance-level interpretations explain specific model outputs and are comprehensible but lack completeness [114, 125, 137]. In short, global interpretations describe model processes, whereas local interpretations describe model behavior on individual instances. Hall et al. [135] suggested that the best interpretations combine global and local interpretations. The following are the major scopes defined in the literature:
1. Algorithmic Transparency: This scope concerns the transparency of the model’s learning procedure on the training data and the relationships learned during training. In a CNN, convolution filter visualization explains the type of detectors learned but not how individual predictions are made. Linear models are transparent, whereas the gradient backpropagation of DLMs is less transparent [112].
2. Global, Holistic Model Interpretability: If the overall model decision-making is comprehensible through a holistic view of the input features, trained model weights, and model architecture, the model is interpretable at a global level [23]. This helps to understand the distribution of target outcomes based on the features. However, global interpretability is difficult to achieve, as a large number of model parameters spanning a high-dimensional space is beyond human comprehension [112].
3. Global Modular Interpretability: A large number of parameters can be interpreted by understanding individual weights or groups of weights at a modular level to achieve global model interpretability. However, not all models are interpretable at the parameter level. For instance, the split and leaf nodes are helpful for decision trees, while the weights are helpful for linear models [112].
4. Locally Interpretable for Single Prediction: A model can be interpreted locally for an individual prediction that may depend on only a few features, and local explanations can be more accurate than global explanations [112]. This thesis focuses on interpreting individual predictions with model-agnostic posthoc methods and model-specific implicit methods; a minimal local-surrogate sketch follows this list.
5. Locally Interpretable for Multiple Predictions: A group of predictions could also be explained with global and modular methods or instance-based explanations.
In the first case, the group of instances is treated as if it represents the complete dataset; in the second, individual explanation methods are applied to each instance and the resulting explanations are aggregated over the group [112].
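The sketch below illustrates the LIME-style idea behind explaining a single prediction under illustrative assumptions (a scikit-learn gradient boosting model as the black box, a Gaussian perturbation scale sigma, and n_samples perturbations, all chosen here for demonstration only): the black box is queried on perturbations around one instance and a proximity-weighted linear model is fitted to its responses, whose coefficients serve as the local explanation.

```python
# Minimal local-surrogate (LIME-style) sketch for a single prediction.
# All hyperparameters (sigma, n_samples, Ridge alpha) are illustrative.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import Ridge

X, y = load_breast_cancer(return_X_y=True)
black_box = GradientBoostingClassifier(random_state=0).fit(X, y)

rng = np.random.default_rng(0)
x0 = X[0]                      # the single instance to be explained
n_samples, sigma = 500, 0.5

# Perturb the instance (scaled per feature) and query the black box.
scale = X.std(axis=0)
Z = x0 + rng.normal(0.0, sigma, size=(n_samples, X.shape[1])) * scale
p = black_box.predict_proba(Z)[:, 1]

# Weight each perturbation by its proximity to the original instance.
dist = np.linalg.norm((Z - x0) / scale, axis=1)
weights = np.exp(-(dist ** 2) / 2.0)

# Local linear surrogate: its coefficients form the local explanation.
local_model = Ridge(alpha=1.0).fit(Z, p, sample_weight=weights)
top = np.argsort(np.abs(local_model.coef_))[::-1][:5]
print("most influential features for this prediction:", top)
```

Aggregating such per-instance coefficient vectors over a group of instances gives one simple route to the multiple-prediction scope described in item 5.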
Evaluating Interpretability Techniques: Although measuring interpretability is an active research field, there is still no real consensus on what interpretability in machine learning means or on an evaluation metric for it. However, a few evaluation approaches have been proposed in the literature [29, 125]:
1. Expert Validation: The technique, embedded in a product performing a real task, is validated by domain experts. For instance, an ECG classification system generates heatmaps highlighting the signal timestamps relevant to a diagnosis, and a cardiologist evaluates them.
2. Dilettante Validation: The technique is validated by dilettante/non-experts.
The advantage is that no domain expert is required and the technique can be validated by many lay users. For instance, users can choose the best explanation from several candidate explanations.
3. Proxy Model Validation: The technique adopts a proxy function/model class that has already been validated by non-experts. For instance, if users understand decision trees, tree depth can serve as a proxy for explanation quality, with shorter trees receiving a higher explainability score (see the sketch after this list).
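The following is a minimal sketch of proxy-model validation under the assumption stated above, namely that shallower decision trees are easier to explain; the inverse tree depth used as the proxy score is an illustrative choice, not a standard metric.

```python
# Proxy-model validation sketch: tree depth stands in as a proxy for
# explainability, traded off against predictive accuracy. The score
# 1/depth is an illustrative assumption.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for depth in (2, 4, 8, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    accuracy = cross_val_score(tree, X, y, cv=5).mean()
    # Shorter trees are assumed to be more explainable.
    proxy_score = 1.0 / tree.get_depth()
    print(f"max_depth={depth}: accuracy={accuracy:.3f}, "
          f"proxy explainability={proxy_score:.3f}")
```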
Nature of Interpretability Techniques: Techniques can be applied to a model either during or after model building. A technique is intrinsic if it is part of the model architecture and posthoc if it is applied later to a pretrained model. Intrinsic and posthoc techniques are the main focus of this thesis and are discussed in detail in Section 2.3.4.
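As a rough illustration of this distinction, and not of the specific methods developed later in the thesis, the sketch below contrasts an intrinsically interpretable model, whose own coefficients are the interpretation, with a posthoc technique (permutation importance) applied to an already trained black box; the dataset and model choices are illustrative assumptions.

```python
# Intrinsic vs. posthoc sketch: read coefficients directly from a linear
# model, versus estimating importances afterwards for a pretrained model.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
Xs = StandardScaler().fit_transform(X)

# Intrinsic: the model's own weights are the interpretation.
linear = LogisticRegression(max_iter=5000).fit(Xs, y)
print("intrinsic (index of largest |coefficient|):", abs(linear.coef_[0]).argmax())

# Posthoc: importance is estimated after training a black-box model.
black_box = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(black_box, X, y, n_repeats=5, random_state=0)
print("posthoc (index of largest permutation importance):",
      result.importances_mean.argmax())
```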