GmP CLIP MLP (CLIP fine-tuning)

Apparently there's something called the GmP CLIP MLP. I noticed some YouTubers using it as a text encoder, so I searched for it.

 

The standard CLIP MLP adjusts the weight values directly,

whereas GmP expresses each value as r and theta and adjusts those instead,

which reportedly improves 1. stability, 2. geometric information, and 3. generalization.

 

https://github.com/zer0int/CLIP-txt2img-diffusers-scripts

 

"Normal" CLIP MLP (multi-layer perceptron):

(mlp): Sequential(
  |-(c_fc): Linear(in_features=1024, out_features=4096, bias=True)
  | (gelu): QuickGELU()
|-}-(c_proj): Linear(in_features=4096, out_features=1024, bias=True)
| | 
| |-- visual.transformer.resblocks.0.mlp.c_fc.weight
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.weight
|---- visual.transformer.resblocks.0.mlp.c_proj.bias
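
You can list these parameter names yourself; a minimal sketch, assuming the openai/CLIP pip package (pip install git+https://github.com/openai/CLIP.git):

import clip

# ViT-L/14's vision tower has the 1024 -> 4096 MLP shown above
model, _ = clip.load("ViT-L/14", device="cpu")
for name, param in model.named_parameters():
    if "resblocks.0.mlp" in name:  # first residual block's MLP only
        print(name, tuple(param.shape))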


GmP CLIP MLP:

Weight decomposition into:
- radial component 'r' as norm of pre-trained weights
- angular component 'theta' as normalized direction
-> preserves weight vectors' directionality and magnitude

(mlp): Sequential(
  |-(c_fc): GeometricLinear()
  | (gelu): QuickGELU()
|-}-(c_proj): GeometricLinear()
| | 
| |-- visual.transformer.resblocks.0.mlp.c_fc.r
| |-- visual.transformer.resblocks.0.mlp.c_fc.theta
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.r
|---- visual.transformer.resblocks.0.mlp.c_proj.theta
|---- visual.transformer.resblocks.0.mlp.c_proj.bias

(Same thing for [text] transformer.resblocks)
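
The repo's exact GeometricLinear code isn't reproduced here, but a minimal PyTorch sketch of the idea, assuming a per-output-row decomposition of a pre-trained nn.Linear (with bias), could look like this:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLinear(nn.Module):
    # Stores a pre-trained Linear's weight as magnitude r and direction theta.
    def __init__(self, linear: nn.Linear):
        super().__init__()
        with torch.no_grad():
            w = linear.weight                  # (out_features, in_features)
            r = w.norm(dim=1, keepdim=True)    # per-row magnitude
            theta = w / r                      # per-row unit direction
        self.r = nn.Parameter(r)
        self.theta = nn.Parameter(theta)
        self.bias = nn.Parameter(linear.bias.detach().clone())

    def forward(self, x):
        # Re-normalize theta so it stays a pure direction during training,
        # then rescale by r to reconstruct the effective weight.
        weight = self.r * F.normalize(self.theta, dim=1)
        return F.linear(x, weight, self.bias)

Swapping a module like this in for mlp.c_fc and mlp.c_proj is what yields the .r / .theta / .bias parameter names listed above.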

ChatGPT said:

GmP CLIP MLP (Geometric Parameterization CLIP MLP) changes how the weights of the model's linear layers are represented and optimized, compared to the "Normal" CLIP MLP. Let's break down the key differences and the potential benefits of the GmP approach.

1. Normal CLIP MLP

In the normal CLIP MLP structure, weights are represented as standard matrices, and training optimizes these values directly. Each linear layer in the MLP has a weight matrix that transforms the input features:

  • c_fc layer: Transforms the input from a lower-dimensional space to a higher-dimensional space (1024 to 4096 dimensions).
  • c_proj layer: Projects the features back from the higher-dimensional space to the original lower-dimensional space (4096 to 1024 dimensions).

These linear transformations are accompanied by biases and a non-linear activation function (QuickGELU in CLIP). The key idea here is that the optimization happens on the weight values directly, as in the sketch below.
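
A minimal sketch of this "normal" block, using the ViT-L/14 vision-tower dimensions from the structure above:

from collections import OrderedDict
import torch
import torch.nn as nn

class QuickGELU(nn.Module):
    # CLIP's cheap sigmoid-based GELU approximation
    def forward(self, x):
        return x * torch.sigmoid(1.702 * x)

# Per-block MLP: 1024 -> 4096 -> 1024, weights optimized directly
mlp = nn.Sequential(OrderedDict([
    ("c_fc", nn.Linear(1024, 4096, bias=True)),
    ("gelu", QuickGELU()),
    ("c_proj", nn.Linear(4096, 1024, bias=True)),
]))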

2. GmP CLIP MLP (Geometric Parameterization)

In GmP CLIP, the approach to representing and updating weights is different. Instead of using standard weight matrices, GmP decomposes the weight matrix into two components:

  • Radial component (r): This represents the magnitude (norm) of the weights.
  • Angular component (theta): This represents the direction (unit-normalized direction of the weight vector).

So, instead of optimizing the weights directly, GmP optimizes these two components separately. Here's what happens:

  • Norm Preservation: The radial component 'r' controls the scale or magnitude of the weights, while the angular component 'theta' focuses on the directionality of the weight vectors.
  • Directionality Preservation: The direction (angular component) is maintained while the model trains, which allows GmP to preserve the geometric structure of the weight vectors in the weight space.

This method preserves the weight vectors' directionality and magnitude, which can improve the generalization ability of the model by focusing on geometric properties.
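
Concretely, each weight row w is stored as r = ||w|| and theta = w / ||w||, and the effective weight is reconstructed as r · theta. A tiny numeric check with made-up values:

import torch

w = torch.tensor([3.0, 4.0])          # one weight row
r = w.norm()                          # magnitude: 5.0
theta = w / r                         # direction: [0.6, 0.8]
assert torch.allclose(r * theta, w)   # reconstruction is exact

# During training, r and theta receive separate gradients:
# updating r only rescales the row; updating theta only rotates it.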

3. What are the benefits?

The GmP approach offers several potential advantages:

  • Improved Stability in Training: By separating the magnitude and direction of the weight vectors, GmP can make the optimization process more stable. The model learns the optimal direction and magnitude independently, which can avoid some of the instability that comes from updating both at once through a single weight matrix.
  • Preservation of Geometric Information: By maintaining the directionality of the weights, GmP can better capture the geometric relationships between features. This is particularly important in models like CLIP, which rely on aligning representations of images and text in a shared latent space.
  • Better Generalization: The radial/angular decomposition can result in better generalization performance because it prevents overfitting to the training data by ensuring that the model doesn't overly modify the directional structure of the weights. Instead of purely focusing on fitting the data, the model is encouraged to learn the right directions in the feature space, which often generalizes better to new data.
  • Potential for Sparser Representations: Because the radial component controls the magnitude, it's possible for GmP to result in sparser weight matrices where smaller magnitude directions (or irrelevant directions) are minimized, leading to a more compact and efficient representation.

4. In Summary:

  • Normal CLIP MLP: Optimizes the weight matrices directly (magnitude and direction together).
  • GmP CLIP MLP: Decomposes weights into two parts—magnitude and direction—and optimizes them separately, leading to more stable and geometrically meaningful weight updates.

The geometric preservation in GmP can result in better performance, particularly when it comes to tasks like multimodal representation learning, where alignment of different types of data (e.g., images and text) is crucial.
