
SAM -- Identifying the Most Significant Parameters for Fine-Tuning

By Z.H. Fu
https://fuzihaofzh.github.io/blog/

Fine-tuning a pre-trained model is the dominant approach in contemporary natural language processing research. However, the choice of which parameters to adjust can significantly influence model performance, and the challenge lies in pinpointing the most important parameters to fine-tune. Numerous parameter-efficient methods, such as Adapter, LoRA, DiffPruning, and ChildPruning, employ different strategies for choosing and tuning a subset of the model's parameters. This blog post delves into our recent paper, “On the effectiveness of parameter-efficient fine-tuning” by Zihao Fu et al. [1], which introduces the Second-order Approximation Method (SAM), a novel way to identify the most critical parameters for fine-tuning.

The Second-order Approximation Method (SAM)

In our paper, we demonstrate the effectiveness of sparsity in fine-tuning, but the question of how to select the tunable parameters remains. Traditional approaches such as random and rule-based selection are robust to noise perturbation because the set of tunable parameters stays fixed during training; however, they do not make use of task-specific data and may therefore be suboptimal. Conversely, projection-based methods fully exploit the data but suffer from projection discontinuity, which leads to an unstable optimization process.

To overcome these challenges, we propose the Second-order Approximation Method (SAM). SAM uses task-specific data to guide the selection of the parameter mask while avoiding the projection discontinuity issue. Rather than selecting parameters randomly or by simple rules, SAM introduces a novel second-order approximation of the optimization problem that makes the target analytically solvable.

Previous research indicates that the fine-tuned parameters are close to the pre-trained parameters. Thus, we can approximate the loss function with its second-order Taylor expansion. However, calculating the Hessian matrix, which this approximation requires, is computationally expensive, especially for large neural models. To address this, we approximate the Hessian matrix with a diagonal matrix, assuming that it is positive semidefinite since the pre-trained weights are close to the global minimizer in each downstream task.
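Concretely, writing $\theta^0$ for the pre-trained parameters and $H = \nabla^2 L(\theta^0)$ for the Hessian of the loss at that point, the expansion around $\theta^0$ reads

$$
L(\theta^0 + \Delta\theta) \approx L(\theta^0) + \nabla L(\theta^0)^{\mathrm T}\Delta\theta + \frac{1}{2}\Delta\theta^{\mathrm T} H \Delta\theta.
$$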

Restricting the update to a masked direction $M\Delta\theta$, this approach leads us to a reformulated problem, in which the diagonal binary mask $M$ selects which of the $m$ parameters are tunable and $p$ is the target proportion of tunable parameters:

$$
\begin{aligned}
\min_{\Delta \theta}\ \ & L(\theta^0) + \nabla L(\theta^0)^{\mathrm T} M\Delta \theta + \frac{1}{2} (M\Delta \theta)^{\mathrm T} H M\Delta \theta\\
\text{s.t.}\ \ & \|M\|_0=\lfloor mp \rfloor;\quad M_{ij}=0,\ \forall i\ne j;\quad M_{ii}\in \{0,1\}.
\end{aligned}
$$

With this setup, we can derive the optimal parameter mask $M$ analytically: a theorem in the paper shows that selecting parameters according to its criterion attains the minimal value of the second-order approximation above.
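To see the intuition behind such a selection rule (a sketch of the underlying calculation rather than the theorem's exact statement), note that when $H$ is diagonal with positive entries the objective separates across coordinates, and minimizing over $\Delta\theta$ for a fixed mask gives

$$
L(\theta^0) - \frac{1}{2}\sum_{i:\,M_{ii}=1}\frac{\big(\nabla L(\theta^0)_i\big)^2}{H_{ii}},
$$

so the approximation is minimized by keeping the $\lfloor mp \rfloor$ coordinates with the largest $\big(\nabla L(\theta^0)_i\big)^2 / H_{ii}$.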

However, computing even the diagonal of the Hessian is as costly as computing the entire Hessian. To mitigate this, we propose to optimize an upper bound of the target function, obtained by replacing $H$ with $D=\operatorname{diag}\{|\lambda_{\max}|, |\lambda_{\max}|, \cdots, |\lambda_{\max}|\}$, where $\lambda_{\max}$ is the maximal eigenvalue of $H$:

$$
\begin{aligned}
\min_{\Delta \theta}\ \ & L(\theta^0) + \nabla L(\theta^0)^{\mathrm T} M\Delta \theta + \frac{1}{2} (M\Delta \theta)^{\mathrm T} D M\Delta \theta\\
\text{s.t.}\ \ & \|M\|_0=\lfloor mp \rfloor;\quad M_{ij}=0,\ \forall i\ne j;\quad M_{ii}\in \{0,1\}.
\end{aligned}
$$
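Because $D$ is simply $|\lambda_{\max}|$ times the identity, the same per-coordinate calculation gives

$$
\min_{\Delta\theta}\; L(\theta^0) + \nabla L(\theta^0)^{\mathrm T} M\Delta\theta + \frac{1}{2}(M\Delta\theta)^{\mathrm T} D M\Delta\theta \;=\; L(\theta^0) - \frac{1}{2|\lambda_{\max}|}\sum_{i:\,M_{ii}=1}\big(\nabla L(\theta^0)_i\big)^2,
$$

so the upper bound is minimized by keeping the parameters with the largest squared gradient magnitudes.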

This leads to a straightforward SAM algorithm. We first perform a full fine-tuning step, compute the gradient of the loss with respect to every parameter, take the square of each gradient's magnitude, and select the top-ranked parameters for optimization. We then fine-tune the model again, updating only the selected parameters during the new optimization procedure.
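As a rough illustration, the following PyTorch-style sketch implements this two-stage procedure under some simplifying assumptions: `loss_fn(model, batch)` is a placeholder for the task loss, `p` is the proportion of tunable parameters, and every parameter is assumed to receive a gradient. It mirrors the description above rather than the authors' released code.

```python
import torch

def select_sam_mask(model, loss_fn, batch, p):
    """Stage 1: rank parameters by squared gradient magnitude after one
    full fine-tuning-style backward pass and keep the top p fraction."""
    model.zero_grad()
    loss = loss_fn(model, batch)  # placeholder: task loss on one batch
    loss.backward()

    # Squared gradient of every scalar parameter, flattened into one vector.
    scores = torch.cat([p_.grad.detach().pow(2).flatten()
                        for p_ in model.parameters()])
    k = max(1, int(p * scores.numel()))              # number of tunable entries, ~ floor(m*p)
    threshold = torch.topk(scores, k).values.min()   # smallest score that is still kept

    # Per-tensor binary masks: the diagonal of M reshaped to each parameter's
    # shape (ties at the threshold may admit a few extra entries).
    return {name: (param.grad.detach().pow(2) >= threshold).float()
            for name, param in model.named_parameters()}

def sparse_finetune_step(model, loss_fn, batch, masks, optimizer):
    """Stage 2: one fine-tuning step that updates only the selected entries."""
    optimizer.zero_grad()
    loss = loss_fn(model, batch)
    loss.backward()
    with torch.no_grad():
        for name, param in model.named_parameters():
            param.grad.mul_(masks[name])  # zero the gradients of frozen entries
    optimizer.step()
    return loss.item()
```

With plain SGD, zeroing the masked gradients keeps the frozen entries exactly at their pre-trained values; optimizers that add weight decay or other parameter-dependent updates would need the mask applied more carefully.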

In Conclusion

In summary, our Second-order Approximation Method (SAM) offers a novel and effective technique to identify the most crucial parameters for fine-tuning in machine learning models. By leveraging data and avoiding projection discontinuity, SAM provides a promising direction for fine-tuning optimization. For a more detailed understanding of this concept, please refer to our original paper, “On the effectiveness of parameter-efficient fine-tuning” [1], or visit the corresponding GitHub homepage.


  1. Fu, Z., Yang, H., So, A. M., Lam, W., Bing, L., & Collier, N. (2023). On the effectiveness of parameter-efficient fine-tuning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 37, No. 11, pp. 12799-12807).