As machine learning models play increasingly important roles in our society, powerful interpretation tools are needed to open up their black boxes.
Moreover, psychological studies have shown that humans learn new concepts more readily when they are presented alongside contrastive instances.
Interpreting ML models through the contrast between an original data instance and its counterfactuals has therefore become a popular research problem.
Traditional counterfactual interpretation approaches tend to generate counterfactuals that are faithful to the ML model.
However, they place little or no constraint on the meaningfulness of the generated counterfactuals.
This thesis proposes an approach that generates meaningful counterfactual interpretations of text classification models by constraining token replacements with cosine similarity and part-of-speech (POS) properties.
In this thesis, I use a text CNN based on Kim's CNN\cite{KimsCnn} with a fine-tuned Word2Vec embedding layer as the model to be interpreted.
For counterfactual generation, I leverage token-level HotFlip\cite{hotflip} and replace tokens under several constraints.
Lastly, I show through several examples that my approach produces more meaningful counterfactual interpretations than the vanilla HotFlip approach.
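The constrained replacement step can be sketched as follows. This is a minimal illustration, not the thesis implementation: the vocabulary, the handcrafted 2-d embeddings, the POS table, and the `sim_threshold` parameter are all hypothetical, and the exact constraints used in the thesis may differ. It applies the first-order HotFlip criterion (maximize the estimated loss increase of an embedding swap) while filtering candidates by POS tag and embedding cosine similarity:

```python
import numpy as np

# Hypothetical toy vocabulary with handcrafted embeddings and POS tags.
VOCAB = ["good", "great", "bad", "terrible", "movie"]
POS = {"good": "ADJ", "great": "ADJ", "bad": "ADJ",
       "terrible": "ADJ", "movie": "NOUN"}
EMB = np.array([[1.0, 0.0],    # good
                [0.9, 0.1],    # great
                [-1.0, 0.0],   # bad
                [-0.9, -0.1],  # terrible
                [0.0, 1.0]])   # movie

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def constrained_hotflip(token, grad, sim_threshold=-1.0):
    """Pick the replacement maximizing the first-order loss increase
    (HotFlip criterion), keeping only candidates that share the
    original token's POS tag and whose embedding cosine similarity
    to it is at least `sim_threshold`."""
    i = VOCAB.index(token)
    best, best_score = None, -np.inf
    for j, cand in enumerate(VOCAB):
        if j == i or POS[cand] != POS[token]:
            continue  # POS constraint
        if cosine(EMB[j], EMB[i]) < sim_threshold:
            continue  # cosine-similarity constraint
        # First-order estimate of the loss change for swapping e_i -> e_j,
        # where `grad` is the loss gradient w.r.t. the token's embedding.
        score = grad @ (EMB[j] - EMB[i])
        if score > best_score:
            best, best_score = cand, score
    return best
```

With no similarity constraint, the flip that most increases the toy loss is chosen regardless of how far it moves in embedding space; raising `sim_threshold` restricts the search to nearby, same-POS tokens, which is the kind of meaningfulness constraint the approach targets.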