ViT

class lib.model.networks.clip.ViT(name: Literal['RN50', 'RN101', 'RN50x4', 'RN50x16', 'RN50x64', 'ViT-B-16', 'ViT-B-32', 'ViT-L-14', 'ViT-L-14-336px', 'FaRL-B-16-16', 'FaRL-B-16-64'], input_size: int | None = None, load_weights: bool = False)

Bases: object

Visual Transformer from CLIP

A Contrastive Language-Image Pre-Training (CLIP) model that encodes images and text into a shared latent space.

Reference

https://arxiv.org/abs/2103.00020

param name:

The model configuration to use

type name:

[“RN50”, “RN101”, “RN50x4”, “RN50x16”, “RN50x64”, “ViT-B-16”, “ViT-B-32”, “ViT-L-14”, “ViT-L-14-336px”, “FaRL-B-16-16”, “FaRL-B-16-64”]

param input_size:

The input resolution required for the model, or None to use the preset default for the chosen configuration. Default: None

type input_size:

int, optional

param load_weights:

True to load pretrained weights. Default: False

type load_weights:

bool, optional

Methods Summary

__call__()

Get the configured ViT model

Methods Documentation

__call__() Model

Get the configured ViT model

Returns:

The requested Visual Transformer model

Return type:

keras.models.Model
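
Example:

The following is a minimal usage sketch, not taken from the library itself. It assumes the faceswap repository root is on the Python path so that lib.model.networks.clip can be imported. The helper build_clip_vit, the alias ModelName and the constant VALID_NAMES are hypothetical names introduced for illustration only; the list of names is copied from the class signature above.

```python
from typing import Literal, Optional, get_args

# Hypothetical alias mirroring the names accepted by the class signature.
ModelName = Literal["RN50", "RN101", "RN50x4", "RN50x16", "RN50x64",
                    "ViT-B-16", "ViT-B-32", "ViT-L-14", "ViT-L-14-336px",
                    "FaRL-B-16-16", "FaRL-B-16-64"]
VALID_NAMES = get_args(ModelName)


def build_clip_vit(name: str = "ViT-B-16",
                   input_size: Optional[int] = None,
                   load_weights: bool = False):
    """Hypothetical helper: check ``name``, then build the Keras model."""
    if name not in VALID_NAMES:
        raise ValueError(f"Unknown CLIP configuration: {name!r}")
    # Imported lazily so this sketch can be loaded without faceswap present.
    from lib.model.networks.clip import ViT
    # Calling the ViT instance returns the keras.models.Model, as described
    # in the __call__ documentation above.
    return ViT(name, input_size=input_size, load_weights=load_weights)()
```

With faceswap available, build_clip_vit("ViT-B-32", load_weights=True) would be expected to return the Keras model for that configuration. Checking the name first is a design choice of this sketch, so that a mistyped configuration fails before any import or weight loading takes place.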

Parameters:
  • name (TypeModels)

  • input_size (int | None)

  • load_weights (bool)