AI Glossary

Inference Engine: The processor types used to run AI inference (generating predictions from a trained model), such as CPUs, GPUs, and dedicated accelerators.

GenAI Models (Large Language Models [LLMs]): AI models trained on extensive volumes of text data to understand and generate language.

  • First Token Latency: The time to generate the first token after receiving a prompt.
  • Second Token Latency: The time to generate each subsequent token after the first (see the timing sketch after this list).
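
A minimal sketch of how these two latencies might be measured for a streaming LLM. Here `stream_tokens` is a hypothetical generator that yields one output token at a time and stands in for whatever inference API is actually in use.

```python
import time

def measure_token_latencies(stream_tokens, prompt):
    """Return (first-token latency, average per-token latency after the first)."""
    start = time.perf_counter()
    first_token_latency = None
    later_token_times = []                        # per-token times after the first
    prev = start
    for _token in stream_tokens(prompt):
        now = time.perf_counter()
        if first_token_latency is None:
            first_token_latency = now - start     # first token latency
        else:
            later_token_times.append(now - prev)  # "second token" latency samples
        prev = now
    avg_later = sum(later_token_times) / len(later_token_times) if later_token_times else 0.0
    return first_token_latency, avg_later
```

In practice these figures are usually averaged over many prompts and reported alongside the prompt and output lengths.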

Traditional AI (Vision Models): AI models that use classical algorithms to interpret images.

  • Throughput (Frames per Second [FPS]): The number of frames (images) processed per second.
  • Latency (per Frame): The time to process each individual frame (see the sketch after this list).
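
As a rough illustration, both vision metrics can be derived from the same timing loop. `run_inference` and `frames` below are placeholders for the actual model call and input images.

```python
import time

def measure_vision_metrics(run_inference, frames):
    """Return (throughput in FPS, average per-frame latency in seconds)."""
    per_frame = []
    start = time.perf_counter()
    for frame in frames:                             # assumes at least one frame
        t0 = time.perf_counter()
        run_inference(frame)                         # process a single frame
        per_frame.append(time.perf_counter() - t0)   # latency for this frame
    total = time.perf_counter() - start
    fps = len(per_frame) / total                     # frames per second
    avg_latency = sum(per_frame) / len(per_frame)
    return fps, avg_latency
```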

Traditional AI (Natural Language Processing [NLP] Models): AI models that use rule-based or statistical methods for language tasks.

  • Latency: The time the model takes to process a single query.
  • Throughput (Queries per Second): The number of queries processed per second (see the batching sketch after this list).
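
Latency and throughput are listed separately because, with batching, they are not simply reciprocals of each other. The sketch below assumes a hypothetical `run_batch` function that processes a list of queries in a single inference call.

```python
import time

def measure_qps(run_batch, queries, batch_size=8):
    """Return (queries per second, average per-query latency in seconds)."""
    latencies = []
    start = time.perf_counter()
    for i in range(0, len(queries), batch_size):
        batch = queries[i:i + batch_size]
        t0 = time.perf_counter()
        run_batch(batch)                             # one call for the whole batch
        batch_time = time.perf_counter() - t0
        latencies.extend([batch_time] * len(batch))  # each query waits for its batch
    total = time.perf_counter() - start
    return len(queries) / total, sum(latencies) / len(latencies)
```

With a batch size greater than 1, queries per second can be much higher than 1 / latency.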

GenAI (Diffusion) Models: AI models that generate new data and content, such as images and text, from an input.

  • Image-Generation Latency: The time taken to generate an image from the input.
  • Throughput: The number of images generated per second.

FP32 (32-bit Floating Point): A high-precision format that uses 32 bits to represent real numbers; widely used in early AI models and in tasks requiring high accuracy.

FP16 (16-bit Floating Point): A lower-precision format (compared to FP32) that is often used to speed up computations and reduce memory use where the highest precision isn't needed.

BF16 (Bfloat16): A 16-bit floating-point variant with a dynamic range similar to FP32 but less precision than FP16. Common in modern training and inference due to its efficiency and suitability for large-scale models.
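
The range-versus-precision trade-off between FP16 and BF16 can be seen directly by casting a few values, as in this small sketch (assuming PyTorch is available; expected results are noted in comments).

```python
import torch

# Dynamic range: 70,000 overflows FP16 (max finite value is 65,504) but fits in
# BF16 and FP32, because BF16 keeps FP32's 8-bit exponent.
x = 70_000.0
print(torch.tensor(x, dtype=torch.float32))   # 70000.0
print(torch.tensor(x, dtype=torch.float16))   # inf, out of FP16 range
print(torch.tensor(x, dtype=torch.bfloat16))  # finite, coarsely rounded to 70144.0

# Precision: 1 + 1/256 is exact in FP16 (10 mantissa bits) but rounds back to 1.0
# in BF16 (7 mantissa bits).
y = 1.0 + 1.0 / 256.0
print(torch.tensor(y, dtype=torch.float16))   # 1.00390625, represented exactly
print(torch.tensor(y, dtype=torch.bfloat16))  # 1.0, the extra precision is lost
```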

Int8 (8-bit Integer): A lower-precision format typically applied in inference to significantly speed up computation while reducing memory and power requirements. Often used on edge and mobile devices; it requires quantizing the model from a floating-point format.
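
A minimal sketch of the kind of quantization Int8 inference relies on, using plain NumPy. The symmetric, per-tensor scale-and-round scheme shown here is one common choice among several, not a specific library's implementation.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization of FP32 weights to Int8."""
    scale = np.abs(weights).max() / 127.0         # map the largest magnitude to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate FP32 tensor for comparison."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)      # stand-in FP32 weights
q, scale = quantize_int8(w)
max_error = np.abs(w - dequantize(q, scale)).max()  # quantization error
print(q.dtype, float(scale), float(max_error))
```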

Int4 (4-bit Integer): A low-precision integer format sometimes used in lightweight inference applications where efficiency is prioritized over precision.
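
Because most hardware has no native 4-bit type, Int4 weights are commonly stored two to a byte. The NumPy sketch below shows one straightforward packing scheme purely for illustration; real inference kernels use their own layouts.

```python
import numpy as np

def pack_int4(values):
    """Pack pairs of signed 4-bit integers (-8..7) into single bytes."""
    v = np.asarray(values, dtype=np.int8)
    assert v.size % 2 == 0 and v.min() >= -8 and v.max() <= 7
    nibbles = (v & 0x0F).astype(np.uint8)           # two's-complement low nibble
    return (nibbles[0::2] << 4) | nibbles[1::2]     # two values per byte

def unpack_int4(packed):
    """Recover the signed 4-bit values from packed bytes."""
    hi = (packed >> 4).astype(np.int8)
    lo = (packed & 0x0F).astype(np.int8)
    hi = np.where(hi > 7, hi - 16, hi)              # restore the sign bit
    lo = np.where(lo > 7, lo - 16, lo)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = hi, lo
    return out

vals = np.array([-8, 7, 3, -1], dtype=np.int8)
assert np.array_equal(unpack_int4(pack_int4(vals)), vals)
```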