<- Back to main pagexx/xx/2024

20% faster CPU based LLM interference with a Vector Index on the output embeddings.

This is an updated and expanded version of the blog post Accelerate GPT Output Embedding computations with a Vector Index.

Motivation

The input and output embeddings of transformer-based LLMs are massive. For example, GPT-2's embeddings make up 38.5M (50257 x 768) out of 124M total parameters (~30%).

To generate text with these models, these embeddings are used to calculate the probabilities from what we sample the next token. A popular sampling method is top-k sampling, which samples from the k most probable tokens, preventing the selection of highly unlikely tokens.

When doing interference with top-k sampling for GPT2, we're only intrested in the 50 (huggingface default for k) most likely tokens out of 50,257 possible tokens - less than 0.1%. That raises the question, is there a shortcut to get only these 50 tokens without calculating logits for all 50k tokens?

That's the same problem faced by vector databases, where we use vector indices to speed up the search process.

Vector Indices

Vector indices come in two main groups:

Clustering based: IVF (Inverted File Index)

IVF struggles with the high dimensionality of the embeddings and can fail when the most probable tokens are outside the searched cluster, leading to incorrect predictions.

Graph based: HNSW (Hierarchical Navigable Small World)

HNSW performs well with the high-dimensional embeddings and offers a higher throughput. The long index build times aren't an issue for inference and the performance can be tuned using the index parameters ef and M.

I tried both, but only got good results for HNSW.

HNSW Layer

The HNSW layer is built on top of hnswlib and outputs the indices and logits of the most probable k tokens, directly from the last hidden state.

The time complexity of search on a HNSW index is O(log(n)).

To integreate into the current ecosystem that expect the logits instead of the topk elements, we can allocate a tensor in the shape of the logits filled with -inf (because exp(-inf)=0) and replace the topk indices with their corresponding logits.

class HNSWIndexEmbedding():
    def __init__(self, weight, M=32, ef=100, ef_construction=100):
        self.index = hnswlib.Index(space='ip', dim=weight.shape[1])
        self.index.init_index(max_elements=weight.shape[0], M=M, ef_construction=ef_construction)
        self.index.add_items(weight)
        self.index.set_ef(ef)

    def query(self, x, k):
        indices, distances = self.index.knn_query(x, k)
        return 1 - distances, indices

Performance

Graph based vector indices aren't easily paralellizable on the GPU, so for now we focus on CPU interference. I tried CAGRA, a GPU index, but couldn't get a performance improvement for small to medium batch sizes (more work needs to be done here).

The main application for this is accelerating on Decive Interference on CPU based systems.

Output Layer Performance

The performance measurements are done on the CPU for GPT-2's output embeddings, comparing the HNSW layer (with parameters k = 50, M = 32, ef = 100/200 ef_construction = 150) to the matrix multiplication between the final hidden state and the output embedding matrix (torch/cpu on Intel Xeon @ 2.20GHz, Google Colab):

BSpeedup ef = 100Speedup ef = 200
141.2x23.0x
89.0x4.4x
644.9x2.5x
2563.9x2.0x

Full Model Performance

Average interference times and speedup for text generation from 8 runs (on Intel i7-10750H CPU @ 2.60GHz, M=32, ef=150, ef_construction=5000, on Llama 3B ef_construction=500):

ModelTokensVec timeRef timeSpeedup
gpt2641.88s2.52s1.34
gpt21283.78s4.96s1.31
Llama-3.2-1B6417.45s21.69s1.24
Llama-3.2-1B12835.36s43.44s1.23
Llama-3.2-3B6452.7s59.6s1.13
Llama-3.2-3B128100.1s114.4s1.15

The speedups are directly collelated to the size of the embedding matrix relative to the total parameter count:

ModelEmbedding paramsSpeedup
gpt230%32%
Llama-3.2-1B21%23%
Llama-3.2-3B12%14%

You can run this benchmark with this notebook.

Prediction Quality

Due to the approximate nature of HNSW, vector index-based embedding layer can't guarantee the exact top-k elements and may occasionally miss a token. However, since we randomly sample from these tokens, missing one token might not be a major issue, as long as the most important ones are included.

The accuracy of these tokens can be influenced by the HNSW parameters M and ef, where higher values correspond to better accuracy but also higher search times.

Indexes with a high ef_construction take a long time to generate, but that's not really a problem since they only need to be generated once and can be distributed with the model.

To quantify how far off the vector index is, we measure the percentage of tokens where the overlap of the exponential sum from the top-k probabilities predicted with the vector index (M=32) compared to the true top-k is under a threshhold. So a 1% error for the threshhold of 0.5 means that for every 100th token sampled from the vector index, the distribution the token is sampled from is off by over 50% from the true distribution.

ef_constef<0.50<0.75<0.90
2001001.30%3.00%9.31%
2001500.78%1.63%4.95%
2002000.55%1.09%3.38%
50001000.48%1.25%4.47%
50001500.18%0.45%1.66%
50002000.10%0.22%0.92%

I didn't notice a degradation in the quality of the generated text.

Training

Training on the top k tokens doesn't really work, because the softmax maximize the probability of the correct token, by also mimizing the probability of all other tokens.

Having only a few tokens to minimize results in slow training and higher loss. Only with k values of around 10% of the vocab size, the training works comparable well to the full vocab, but then there aren't any performance benifits using a vector index. The vector index also needs to be rebuilt periodically because it's not differentiable.

Conclusion

Using a vector index on the output embeddings can drastically speed up the interference on the CPU based systems. With the ever increasing size of LLM's and the benifits of larger vocabularities and an increased use of interference on the edge, the use of vector are a good way to increase interference perfomance.

To realize these performance gains, the next step is to make a hacky integration into llama.cpp.

Accelerate.


Source code for this blog post is in this git repo: martinloretzzz/vector-index-layer

You can comment on this blog post here.

For updates follow me on 𝕏/twitter: martinloretzzz

I'm available for opportunities focused on solving AGI and all the challenges along the way, especially interested in architectures, frameworks and performance optimizations like this one. Feel free to reach out: work@martinloretz.com.

© 2025 Martin Loretz - Imprint - Privacy