ViT
- class lib.model.networks.clip.ViT(name: Literal['RN50', 'RN101', 'RN50x4', 'RN50x16', 'RN50x64', 'ViT-B-16', 'ViT-B-32', 'ViT-L-14', 'ViT-L-14-336px', 'FaRL-B-16-16', 'FaRL-B-16-64'], input_size: int | None = None, load_weights: bool = False)
Bases: object
Visual Transformer from CLIP
A Contrastive Language-Image Pre-Training (CLIP) model that encodes images and text into a shared latent space.
Reference
https://arxiv.org/abs/2103.00020
- param name:
The model configuration to use
- type name:
[“RN50”, “RN101”, “RN50x4”, “RN50x16”, “RN50x64”, “ViT-B-16”, “ViT-B-32”, “ViT-L-14”, “ViT-L-14-336px”, “FaRL-B-16-16”, “FaRL-B-16-64”]
- param input_size:
The required resolution size for the model. None for default preset size
- type input_size:
int, optional
- param load_weights:
True to load pretrained weights. Default: False
- type load_weights:
bool, optional
Methods Summary
__call__() Get the configured ViT model
Methods Documentation
- __call__() → Model
Get the configured ViT model
- Returns:
The requested Visual Transformer model
- Return type:
keras.models.Model
- Parameters:
name (TypeModels)
input_size (int | None)
load_weights (bool)
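Example
The following is a minimal usage sketch, not part of the library. It assumes the class is importable as lib.model.networks.clip.ViT and that calling the instance returns the Keras model, as documented above. The names VIT_NAMES, vit_kwargs and build_vit are hypothetical helpers introduced only for illustration. The configuration names are copied from the class signature.

```python
from typing import Optional

# Configuration names copied from the class signature documented above.
VIT_NAMES = (
    "RN50", "RN101", "RN50x4", "RN50x16", "RN50x64",
    "ViT-B-16", "ViT-B-32", "ViT-L-14", "ViT-L-14-336px",
    "FaRL-B-16-16", "FaRL-B-16-64",
)


def vit_kwargs(name: str,
               input_size: Optional[int] = None,
               load_weights: bool = False) -> dict:
    """Hypothetical helper: check the arguments before passing them to ViT."""
    if name not in VIT_NAMES:
        raise ValueError(f"Unknown configuration {name!r}; expected one of {VIT_NAMES}")
    if input_size is not None and input_size <= 0:
        raise ValueError("input_size must be a positive integer or None")
    return {"name": name, "input_size": input_size, "load_weights": load_weights}


def build_vit(name: str,
              input_size: Optional[int] = None,
              load_weights: bool = False):
    """Hypothetical wrapper: construct ViT, then call the instance to get the model."""
    # Imported lazily so the helper above can be used outside the project
    # (assumption: this import path matches the documented class location).
    from lib.model.networks.clip import ViT

    vit = ViT(**vit_kwargs(name, input_size, load_weights))
    return vit()  # __call__() returns the configured keras.models.Model
```

Inside the project, build_vit("ViT-B-16") would return the model at its default preset size with randomly initialised weights, while build_vit("ViT-L-14", load_weights=True) would request the pretrained weights. Validating the name first gives a clear error message before any model construction starts.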