This article summarises Dataset Attribution methods listed here.
The original application of Influence Functions (IF) to tracing the predictions of black-box models (especially neural networks) back to their training data comes from [P&L], one of the best papers of ICML 2017.
In their paper, they 1) re-examine the definition of this old concept from robust statistics and derive its modern form in the context of neural networks; 2) propose methods to scale IF up; and 3) demonstrate its power in explaining linear models and convolutional neural networks.
As the paper’s abstract puts it: “We show that even on non-convex and non-differentiable models where the theory breaks down, approximations to influence functions can still provide valuable information”.
IF captures the core idea of studying models through the lens of their training data.
The motivations of [P&L] echo those of many follow-up interpretability works: understanding a model through its training data can a) improve the model, b) help discover new science, and c) help users trust the black-box model.
The authors distinguish this work from others that explain how a fixed model leads to particular predictions,
“e.g. by locally fitting a simpler model around the test point or by perturbing the test point to see how the prediction changes”.
Beyond that, the authors want to answer the research question — how can we explain where the model came from?
Formally, they attempt to measure how the model's weights change if a certain training point is removed or perturbed slightly.
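In the paper's notation, with training points $z_1, \dots, z_n$, per-example loss $L(z, \theta)$, and empirical risk minimizer $\hat{\theta}$, the leave-one-out question asks for the parameter change $\hat{\theta}_{-z} - \hat{\theta}$, where

$$
\hat{\theta} := \arg\min_{\theta} \frac{1}{n}\sum_{i=1}^{n} L(z_i, \theta),
\qquad
\hat{\theta}_{-z} := \arg\min_{\theta} \frac{1}{n}\sum_{z_i \neq z} L(z_i, \theta).
$$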
Since re-training the model after each such intervention is computationally expensive, the authors resort to IF, a classic technique from robust statistics, which measures how the model parameters change as we upweight a training point by an infinitesimal amount.
“This allows us to ‘differentiate through the training’ to estimate (in closed-form) the effect of a variety of training perturbations” e.g. removing a training point or perturbing a training point.
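Concretely, the paper considers the parameters obtained after upweighting a training point $z$ by a small $\epsilon$, and the classic influence-function result gives this derivative in closed form:

$$
\hat{\theta}_{\epsilon, z} := \arg\min_{\theta} \frac{1}{n}\sum_{i=1}^{n} L(z_i, \theta) + \epsilon L(z, \theta),
\qquad
\mathcal{I}_{\text{up,params}}(z) := \left.\frac{d\hat{\theta}_{\epsilon, z}}{d\epsilon}\right|_{\epsilon=0} = -H_{\hat{\theta}}^{-1} \nabla_{\theta} L(z, \hat{\theta}),
$$

where $H_{\hat{\theta}} := \frac{1}{n}\sum_{i=1}^{n} \nabla_{\theta}^{2} L(z_i, \hat{\theta})$ is the Hessian of the empirical risk. Removing $z$ corresponds to upweighting it by $\epsilon = -\tfrac{1}{n}$, so $\hat{\theta}_{-z} - \hat{\theta} \approx -\tfrac{1}{n}\mathcal{I}_{\text{up,params}}(z)$, and the influence of $z$ on the loss at a test point $z_{\text{test}}$ is $\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}}) = -\nabla_{\theta} L(z_{\text{test}}, \hat{\theta})^{\top} H_{\hat{\theta}}^{-1} \nabla_{\theta} L(z, \hat{\theta})$.

To make the recipe concrete, here is a minimal NumPy sketch for a toy logistic-regression model. The synthetic data, the fitting loop, and the damping value are illustrative assumptions, and the Hessian is formed and inverted explicitly; the paper's scalable variants instead approximate the inverse-Hessian-vector product.

```python
import numpy as np

# Hypothetical synthetic data for a toy logistic-regression model.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def grad_point(x, y_, theta):
    # Gradient of the log loss at a single example (x, y_).
    return (sigmoid(x @ theta) - y_) * x

def hessian_mean(X, theta, damping=1e-3):
    # Hessian of the mean log loss over the training set, plus a small
    # damping term for numerical stability (an assumption of this sketch).
    p = sigmoid(X @ theta)
    H = (X * (p * (1.0 - p))[:, None]).T @ X / len(X)
    return H + damping * np.eye(X.shape[1])

# Fit theta with plain gradient descent on the mean log loss.
theta = np.zeros(d)
for _ in range(2000):
    theta -= 0.1 * X.T @ (sigmoid(X @ theta) - y) / n

# Influence of each training point on the loss at a test point:
# I_up,loss(z, z_test) = -grad L(z_test)^T H^{-1} grad L(z).
x_test, y_test = rng.normal(size=d), 1.0   # a hypothetical test point
H_inv = np.linalg.inv(hessian_mean(X, theta))
g_test = grad_point(x_test, y_test, theta)
influences = np.array([
    -g_test @ H_inv @ grad_point(X[i], y[i], theta) for i in range(n)
])

# Scaling influences[i] by -1/n approximates the change in test loss
# if training point i were removed and the model re-trained.
order = influences.argsort()
print("most negative influence (indices):", order[:3])
print("most positive influence (indices):", order[-3:])
```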
In the original paper, [P&L] derive two principled analysis settings: