ViT

class lib.model.networks.clip.ViT(name: Literal['RN50', 'RN101', 'RN50x4', 'RN50x16', 'RN50x64', 'ViT-B-16', 'ViT-B-32', 'ViT-L-14', 'ViT-L-14-336px', 'FaRL-B-16-16', 'FaRL-B-16-64'], input_size: int | None = None, load_weights: bool = False)

Bases: object

Visual Transformer from CLIP

A Contrastive Language-Image Pre-Training (CLIP) model that encodes images and text into a shared latent space.

Reference

https://arxiv.org/abs/2103.00020

param name:: The model configuration to use
type name:: [“RN50”, “RN101”, “RN50x4”, “RN50x16”, “RN50x64”, “ViT-B-16”, “ViT-B-32”, “ViT-L-14”, “ViT-L-14-336px”, “FaRL-B-16-16”, “FaRL-B-16-64”]
param input_size:: The required input resolution for the model. None to use the default preset size
type input_size:: int, optional
param load_weights:: True to load pretrained weights. Default: False
type load_weights:: bool, optional

Methods Summary

__call__()

Get the configured ViT model

Methods Documentation

__call__() → Model

Get the configured ViT model

Returns:: The requested Visual Transformer model
Return type:: keras.models.Model
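
A minimal usage sketch, based only on the signature and return type documented here: the class is constructed with a configuration name and the resulting instance is then called with no arguments to obtain the Keras model. The helper name build_clip_vit is hypothetical and not part of the library, and the TypeModels alias is redefined locally (copied from the signature above) so that the name check runs without the package installed. The import of ViT assumes the library's root is on the Python path.

```python
from typing import Literal, Optional, get_args

# Local copy of the ``name`` options from the class signature (assumption:
# redefined here rather than imported, so the sketch is self-contained).
TypeModels = Literal["RN50", "RN101", "RN50x4", "RN50x16", "RN50x64",
                     "ViT-B-16", "ViT-B-32", "ViT-L-14", "ViT-L-14-336px",
                     "FaRL-B-16-16", "FaRL-B-16-64"]


def build_clip_vit(name: str,
                   input_size: Optional[int] = None,
                   load_weights: bool = False):
    """Hypothetical helper: validate ``name``, then build the model."""
    if name not in get_args(TypeModels):
        raise ValueError(f"Unknown model configuration: {name!r}")
    # Imported lazily so the validation above works without the library.
    from lib.model.networks.clip import ViT
    # Calling the configured instance returns the model, per __call__() above.
    return ViT(name, input_size=input_size, load_weights=load_weights)()
```

With the library available, build_clip_vit("ViT-B-16") would be expected to return a keras.models.Model at the preset resolution, since input_size is left as None.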

Parameters:

name (TypeModels)
input_size (int | None)
load_weights (bool)