What happens to inference performance with LoRA?

Study for the Hugging Face Agent Certification. Prepare with interactive quizzes and multiple-choice questions, complete with explanations and hints. Ace your exam!

Multiple Choice

What happens to inference performance with LoRA?

LoRA freezes the original weights W and trains small, low-rank adapters alongside the model’s layers, producing a modest trainable adjustment ΔW (often written as ΔW = α A B, where A and B are low-rank factors and α is a scaling factor). The key idea is that this adjustment can be folded into the base weights for deployment: W becomes W' = W + α A B, and the separate adapter path disappears at inference time. Once merged, the forward pass uses a single weight matrix of the same shape as the original, so runtime latency is essentially the same as the base model's. This is why the best answer emphasizes no additional latency after merging. If the adapters are kept separate, each adapted layer performs extra matrix multiplications and incurs some overhead, which is why merging is the common choice for efficient inference.
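
To make the merge concrete, here is a minimal NumPy sketch of a single linear layer. It follows the notation above (ΔW = α A B); the shapes, the rank, the value of α, and the function names are illustrative assumptions, not taken from any particular library. It checks that the merged weight gives the same output as the separate adapter path while using one matrix of the original shape.

```python
import numpy as np

# Illustrative sizes only (assumptions): a 64x48 layer with rank-4 adapters.
rng = np.random.default_rng(0)
d_out, d_in, r = 64, 48, 4
alpha = 0.5  # scaling factor (assumed value)

W = rng.standard_normal((d_out, d_in))  # frozen base weight
A = rng.standard_normal((d_out, r))     # low-rank factor, d_out x r
B = rng.standard_normal((r, d_in))      # low-rank factor, r x d_in
x = rng.standard_normal(d_in)           # one input vector


def forward_unmerged(x):
    """Base path plus a separate adapter path: two extra matmuls per call."""
    return W @ x + alpha * (A @ (B @ x))


# Fold the adapter into the base weight once, before deployment.
W_merged = W + alpha * (A @ B)


def forward_merged(x):
    """Single matmul, same cost as the original layer."""
    return W_merged @ x


# Same output (up to floating-point error), same weight shape as the base layer.
assert np.allclose(forward_unmerged(x), forward_merged(x))
assert W_merged.shape == W.shape
```

In the Hugging Face PEFT library, this folding step is what `merge_and_unload()` on a LoRA model is meant to do; the sketch above only mirrors the arithmetic and is not the library's implementation.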
