
Wang Shusen Recommender Systems Study Notes — Retrieval

Retrieval

Item-Based Collaborative Filtering (ItemCF)

Basic Idea

If a user likes $item_1$, and $item_1$ is similar to $item_2$, then the user is likely to also like $item_2$.

ItemCF Implementation

User's interest in an item: $like(user, item_j)$

Similarity between items: $sim(item_j, item)$

Estimated user interest in a candidate item: $\sum_j like(user, item_j) \times sim(item_j, item)$

Item Similarity

Computing item similarity (not considering degree of user preference)

Let users who like item $i_1$ form set $\mathcal{W}_1$

Let users who like item $i_2$ form set $\mathcal{W}_2$

Define intersection $\mathcal{V} = \mathcal{W}_1 \cap \mathcal{W}_2$

Similarity between the two items:

$$sim(i_1, i_2) = \frac{|\mathcal{V}|}{\sqrt{|\mathcal{W}_1| \cdot |\mathcal{W}_2|}}.$$

Computing item similarity (weighted by degree of user preference)

Let users who like item $i_1$ form set $\mathcal{W}_1$

Let users who like item $i_2$ form set $\mathcal{W}_2$

Define intersection $\mathcal{V} = \mathcal{W}_1 \cap \mathcal{W}_2$

Similarity between the two items:

$$sim(i_1, i_2) = \frac{\sum_{v \in \mathcal{V}} like(v, i_1) \cdot like(v, i_2)}{\sqrt{\sum_{u_1 \in \mathcal{W}_1} like^2(u_1, i_1)} \cdot \sqrt{\sum_{u_2 \in \mathcal{W}_2} like^2(u_2, i_2)}}$$
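Both similarity definitions can be sketched in a few lines. This is a toy illustration, not production code; the `like` dictionary and its values are made-up assumptions (with binary 1.0 scores, the two variants coincide).

```python
import math

# Hypothetical toy data: like[user][item] = interest score (1.0 = binary "like").
like = {
    "u1": {"i1": 1.0, "i2": 1.0},
    "u2": {"i1": 1.0},
    "u3": {"i1": 1.0, "i2": 1.0},
}

def sim_binary(i1, i2, like):
    """Overlap similarity: |W1 ∩ W2| / sqrt(|W1| * |W2|)."""
    w1 = {u for u, items in like.items() if i1 in items}
    w2 = {u for u, items in like.items() if i2 in items}
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / math.sqrt(len(w1) * len(w2))

def sim_weighted(i1, i2, like):
    """Cosine similarity over like-scores (the weighted variant)."""
    w1 = {u: items[i1] for u, items in like.items() if i1 in items}
    w2 = {u: items[i2] for u, items in like.items() if i2 in items}
    num = sum(w1[u] * w2[u] for u in w1.keys() & w2.keys())
    den = math.sqrt(sum(v * v for v in w1.values())) \
        * math.sqrt(sum(v * v for v in w2.values()))
    return num / den if den else 0.0
```

For the toy data, both functions return $2/\sqrt{6}$ for the pair $(i_1, i_2)$, since all like-scores are 1.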

Full ItemCF Retrieval Workflow

Offline Precomputation

Build "user → item" index

  • Record item IDs the user has recently clicked or interacted with.
  • Given any user ID, retrieve the list of items they have recently shown interest in.

Build "item → item" index

  • Compute pairwise similarity between items.
  • For each item, index the $k$ most similar items.
  • Given any item ID, quickly find its $k$ most similar items.

Online Retrieval

  1. Given a user ID, use the "user → item" index to find the list of items the user has recently interacted with (last-n).
  2. For each item in the last-n list, use the "item → item" index to find the top-k similar items.
  3. For the retrieved similar items (up to $nk$ total), use the formula to estimate user interest scores.
  4. Return the top 100 highest-scoring items as recommendation results.

Using indexes, offline computation is large but online computation is small.
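The four online steps can be sketched with plain dictionaries standing in for the two indexes. The data (`user_items`, `item_sims`, `like_score`) and the function name are made up for illustration; a real system would serve these from key-value stores.

```python
# Hypothetical precomputed indexes.
user_items = {"u1": ["i1", "i2"]}                   # user -> last-n items
item_sims = {                                       # item -> top-k (item, sim)
    "i1": [("i3", 0.9), ("i4", 0.5)],
    "i2": [("i3", 0.4), ("i5", 0.8)],
}
like_score = {("u1", "i1"): 1.0, ("u1", "i2"): 2.0}  # like(user, item_j)

def itemcf_retrieve(user, top=100):
    scores = {}
    for j in user_items.get(user, []):               # step 1: last-n items
        for item, sim in item_sims.get(j, []):       # step 2: top-k similar items
            # step 3: accumulate like(user, item_j) * sim(item_j, item)
            scores[item] = scores.get(item, 0.0) + like_score[(user, j)] * sim
    # step 4: highest-scoring items first
    return sorted(scores, key=scores.get, reverse=True)[:top]
```

Here `itemcf_retrieve("u1")` scores i3 at $1 \times 0.9 + 2 \times 0.4 = 1.7$, i5 at $1.6$, and i4 at $0.5$, returning them in that order.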

Summary

ItemCF Principle

User likes item $i_1$; then the user may also like item $i_2$, which is similar to $i_1$.

Item similarity:

  • If users who like $i_1$ and users who like $i_2$ have large overlap, then $i_1$ and $i_2$ are similar.

  • Formula:

$$sim(i_1, i_2) = \frac{|\mathcal{W}_1 \cap \mathcal{W}_2|}{\sqrt{|\mathcal{W}_1| \cdot |\mathcal{W}_2|}}$$

ItemCF Retrieval Channel

Maintain two indexes:

  • User → item list: the $n$ items the user has recently interacted with.
  • Item → item list: the $k$ items with highest similarity.

Online retrieval:

  • Use both indexes to retrieve up to $nk$ items per request.
  • Estimate user interest score for each item:

$$\sum_j like(user, item_j) \times sim(item_j, item).$$

  • Return the top 100 highest-scoring items as retrieval results.

Swing Retrieval Channel

Swing Model

Let items liked by user $u_1$ form set $\mathcal{J}_1$.

Let items liked by user $u_2$ form set $\mathcal{J}_2$.

Define the overlap between two users:

$$\text{overlap}(u_1, u_2) = |\mathcal{J}_1 \cap \mathcal{J}_2|$$

If users $u_1$ and $u_2$ have high overlap, they may come from the same small circle; their weight should be reduced.

Swing Model

Let users who like item $i_1$ form set $\mathcal{W}_1$.

Let users who like item $i_2$ form set $\mathcal{W}_2$.

Define intersection $\mathcal{V} = \mathcal{W}_1 \cap \mathcal{W}_2$.

Similarity between the two items:

$$sim(i_1, i_2) = \sum_{u_1 \in \mathcal{V}} \sum_{u_2 \in \mathcal{V}} \frac{1}{\alpha + \text{overlap}(u_1, u_2)}$$
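A direct toy sketch of the Swing formula, assuming binary likes. The `likes` data and `alpha` default are made-up assumptions; note the double sum runs over all user pairs in $\mathcal{V}$, including $u_1 = u_2$, exactly as the formula states.

```python
# Hypothetical data: likes[user] = set of items the user likes.
likes = {
    "u1": {"i1", "i2", "i3"},
    "u2": {"i1", "i2", "i3"},   # high overlap with u1 (same "small circle")
    "u3": {"i1", "i2"},
}

def swing_sim(i1, i2, likes, alpha=1.0):
    # V = users who like both items.
    v = [u for u, items in likes.items() if i1 in items and i2 in items]
    # Pairs of users with high overlap contribute less (same-circle downweighting).
    return sum(
        1.0 / (alpha + len(likes[u1] & likes[u2]))
        for u1 in v
        for u2 in v
    )
```

For this data, $\mathcal{V} = \{u_1, u_2, u_3\}$; the four pairs within $\{u_1, u_2\}$ each contribute $1/4$ and the five pairs involving $u_3$ each contribute $1/3$, giving $8/3$.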

Summary

  • The only difference between Swing and ItemCF is item similarity.
  • ItemCF: If the proportion of overlapping users between two items is high, the two items are considered similar.
  • Swing: Additionally considers whether overlapping users come from the same small circle.
    • Users who like both items form set $\mathcal{V}$.
    • For users $u_1$ and $u_2$ in $\mathcal{V}$, their overlap is $\text{overlap}(u_1, u_2)$.
    • High overlap between two users means they may come from the same small circle; their weight is reduced.

User-Based Collaborative Filtering (UserCF)

Basic Idea

If user $user_1$ is similar to user $user_2$, and $user_2$ likes a certain item, then $user_1$ is also likely to like that item.

UserCF Implementation

Similarity between users: $sim(user, user_j)$

User's interest in an item: $like(user_j, item)$

Estimated user interest in a candidate item: $\sum_j sim(user, user_j) \times like(user_j, item)$

User Similarity

Computing user similarity

Let items liked by user $u_1$ form set $\mathcal{J}_1$.

Let items liked by user $u_2$ form set $\mathcal{J}_2$.

Define intersection $I = \mathcal{J}_1 \cap \mathcal{J}_2$.

Similarity between the two users:

$$sim(u_1, u_2) = \frac{|I|}{\sqrt{|\mathcal{J}_1| \cdot |\mathcal{J}_2|}}$$

Downweighting popular items

Let items liked by user $u_1$ form set $\mathcal{J}_1$.

Let items liked by user $u_2$ form set $\mathcal{J}_2$.

Define intersection $I = \mathcal{J}_1 \cap \mathcal{J}_2$.

Similarity between the two users:

$$sim(u_1, u_2) = \frac{\sum_{l \in I} \frac{1}{\log(1 + n_l)}}{\sqrt{|\mathcal{J}_1| \cdot |\mathcal{J}_2|}}.$$

Here $n_l$ denotes the number of users who like item $l$, reflecting the item's popularity.
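The popularity-downweighted similarity can be sketched directly from these definitions. The `likes` data is a made-up toy example; $n_l$ is recomputed on the fly here, whereas a real system would precompute item popularity.

```python
import math

# Hypothetical data: likes[user] = set of items the user likes.
likes = {
    "u1": {"i1", "i2"},
    "u2": {"i1", "i3"},
    "u3": {"i1"},
}

def user_sim(u1, u2, likes):
    j1, j2 = likes[u1], likes[u2]
    # Each shared item l contributes 1/log(1 + n_l); popular items count less.
    num = sum(
        1.0 / math.log(1 + sum(l in items for items in likes.values()))
        for l in j1 & j2
    )
    return num / math.sqrt(len(j1) * len(j2))
```

For $u_1$ and $u_2$ the only shared item is i1, liked by all 3 users, so the similarity is $\frac{1/\log 4}{\sqrt{2 \cdot 2}}$; a less popular shared item would have contributed more.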

Full UserCF Retrieval Workflow

Offline Precomputation

Build "user → item" index

  • Record item IDs the user has recently clicked or interacted with.
  • Given any user ID, retrieve the list of items they have recently shown interest in.

Build "user → user" index

  • For each user, index the $k$ most similar users.
  • Given any user ID, quickly find the $k$ most similar users.

Online Retrieval

  1. Given a user ID, use the "user → user" index to find the top-k most similar users.
  2. For each of these top-k users, use the "user → item" index to find the list of items they have recently shown interest in (last-n).
  3. For the retrieved (up to $nk$) items, use the formula to estimate user interest scores.
  4. Return the top 100 highest-scoring items as retrieval results.

Summary

UserCF Principle

User $u_1$ is similar to $u_2$, and $u_2$ likes a certain item; then $u_1$ may also like that item.

User similarity:

  • If users $u_1$ and $u_2$ have large overlap in liked items, they are similar.
  • Formula:

$$sim(u_1, u_2) = \frac{|\mathcal{J}_1 \cap \mathcal{J}_2|}{\sqrt{|\mathcal{J}_1| \cdot |\mathcal{J}_2|}}.$$

UserCF Retrieval Channel

Maintain two indexes:

  • User → item list: the $n$ items the user has recently interacted with.
  • User → user list: the $k$ users with highest similarity.

Online retrieval:

  • Use both indexes to retrieve up to $nk$ items per request.
  • Estimate interest score of user $user$ for each item $item$:

$$\sum_j sim(user, user_j) \times like(user_j, item).$$
  • Return the top 100 highest-scoring items as retrieval results.

Discrete Feature Processing

  1. Build a dictionary: Map categories to indices.

    • China → 1
    • USA → 2
    • India → 3
  2. Vectorize: Map indices to vectors.

    • One-hot encoding: Map indices to high-dimensional sparse vectors.
    • Embedding: Map indices to low-dimensional dense vectors.

One-Hot Encoding

One-hot encoding for nationality features

Nationality: China, USA, India, etc. — 200 categories.

Dictionary: China → 1, USA → 2, India → 3, ...

One-hot encoding: Represent nationality as a 200-dimensional sparse vector.

  • Unknown → 0 → [0,0,0,0,...,0]
  • China → 1 → [1,0,0,0,...,0]
  • USA → 2 → [0,1,0,0,...,0]
  • India → 3 → [0,0,1,0,...,0]

Embedding

One-hot encoding can be mapped to embedding vectors.

Number of parameters: vector dimension × number of categories.

  • Suppose embedding vectors are 4-dimensional.
  • There are 200 nationalities.
  • Number of parameters = 4 × 200 = 800.

Implementation: TensorFlow and PyTorch provide embedding layers.

  • Parameters are stored as a matrix of size vector dimension × number of categories.
  • Input is an index, e.g., "USA" has index 2.
  • Output is a vector, e.g., "USA" corresponds to the 2nd column of the parameter matrix.
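The embedding layer described above is just a matrix lookup, which can be sketched in numpy. The values are random placeholders; note that the indices in these notes are 1-based ("USA" → 2, the 2nd column) while numpy is 0-based.

```python
import numpy as np

# Parameter matrix of shape (embedding dim, number of categories):
# 4 x 200 = 800 parameters, matching the count in the notes.
rng = np.random.default_rng(0)
dim, n_categories = 4, 200
emb = rng.standard_normal((dim, n_categories))

usa_index = 2                      # "USA" (0-based here; 1-based in the notes)
usa_vector = emb[:, usa_index]     # lookup = select one 4-d column

# Equivalent view: multiplying the matrix by a one-hot vector
# selects the same column, which is why one-hot + dense layer
# and an embedding lookup are mathematically the same.
one_hot = np.zeros(n_categories)
one_hot[usa_index] = 1.0
assert np.allclose(emb @ one_hot, usa_vector)
```

TensorFlow's `tf.keras.layers.Embedding` and PyTorch's `torch.nn.Embedding` implement exactly this lookup, just with trainable parameters.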

Summary

Discrete feature processing: one-hot encoding, embedding.

When the number of categories is large, use embedding.

  • Word embedding.
  • User ID embedding.
  • Item ID embedding.

Matrix Completion

Embedding matrix A outputs a column vector a for each user; the number of users equals the number of columns.

Embedding matrix B outputs a column vector b for each item; the number of items equals the number of columns.

Basic Idea

User embedding parameter matrix is $\mathbf{A}$. User $u$ corresponds to the $u$-th column, denoted vector $\mathbf{a}_u$.

Item embedding parameter matrix is $\mathbf{B}$. Item $i$ corresponds to the $i$-th column, denoted vector $\mathbf{b}_i$.

Inner product $\langle \mathbf{a}_u, \mathbf{b}_i \rangle$ estimates user $u$'s interest in item $i$.

The goal of training is to learn matrices A and B so that estimated values fit the observed true interest scores; A and B are the embedding layer parameters.

Dataset

Dataset: a collection of (user ID, item ID, interest score) triples, denoted $\Omega = \{(u, i, y)\}$.

Interest scores in the dataset are system-recorded, for example:

  • Exposed but not clicked → 0 points
  • Click, like, favorite, share → 1 point each
  • Minimum score is 0, maximum is 4

Training

Map user ID and item ID to vectors.

  • User $u$ → vector $\mathbf{a}_u$
  • Item $i$ → vector $\mathbf{b}_i$

Solve the optimization problem using gradient descent to obtain parameters A and B:

$$\min_{\mathbf{A}, \mathbf{B}} \sum_{(u,i,y) \in \Omega} \left( y - \langle \mathbf{a}_u, \mathbf{b}_i \rangle \right)^2.$$
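A minimal sketch of this optimization with stochastic gradient descent on toy data. The dimensions, learning rate, and observed triples are all made-up assumptions; real systems train the embedding layers with a framework optimizer rather than hand-written updates.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim = 5, 6, 3
A = 0.1 * rng.standard_normal((dim, n_users))   # user embeddings (columns)
B = 0.1 * rng.standard_normal((dim, n_items))   # item embeddings (columns)

# Observed (u, i, y) triples: user u's recorded interest y in item i.
omega = [(0, 1, 1.0), (0, 2, 0.0), (1, 1, 2.0), (2, 3, 1.0)]

lr = 0.1
for _ in range(500):
    for u, i, y in omega:
        err = y - A[:, u] @ B[:, i]             # residual of the inner product
        A[:, u] += lr * err * B[:, i]           # gradient step on a_u
        B[:, i] += lr * err * A[:, u]           # gradient step on b_i

loss = sum((y - A[:, u] @ B[:, i]) ** 2 for u, i, y in omega)
```

After training, the squared loss over the observed triples is close to zero, and inner products for unobserved $(u, i)$ pairs "complete" the matrix.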

Matrix Completion

After training, gray (missing) positions can be filled in, enabling recommendations based on interest scores.

Why It Performs Poorly in Practice...

Drawback 1: Only uses ID embedding; ignores item and user attributes.

  • Item attributes: category, keywords, location, author information.
  • User attributes: gender, age, geolocation, interested categories.
  • Two-tower models can be seen as an upgraded version of matrix completion.

Drawback 2: Negative sample selection is wrong.

  • Sample: a (user, item) pair, denoted $(u, i)$.
  • Positive sample: exposed and then clicked/interacted. (Correct approach)
  • Negative sample: exposed but not clicked/interacted. (Wrong approach)

Drawback 3: Training method is suboptimal.

  • Inner product $\langle \mathbf{a}_u, \mathbf{b}_i \rangle$ is inferior to cosine similarity.
  • Using squared loss (regression) is worse than cross-entropy loss (classification).

Model Storage

  1. Training yields matrices A and B.

    • Each column of A corresponds to one user.
    • Each column of B corresponds to one item.
  2. Store columns of matrix A in a key-value table.

    • Key is user ID; value is a column of A.
    • Given a user ID, return a vector (the user's embedding).
  3. Storage and indexing of matrix B is more complex.

Online Serving

  1. Use the user ID as key to query the key-value table and retrieve the user's vector, denoted $\mathbf{a}$.

  2. Nearest-neighbor search: find the $k$ items the user is most likely interested in, as retrieval results.

    • Item $i$'s embedding vector is $\mathbf{b}_i$.
    • Inner product $\langle \mathbf{a}, \mathbf{b}_i \rangle$ estimates user interest in item $i$.
    • Return the $k$ items with the largest inner product.

If we enumerate all items, time complexity is proportional to the number of items: to find the $k$ items of highest interest, an interest score must be computed for every item.

Systems: Milvus, Faiss, HnswLib, etc.

Nearest-neighbor criteria:

  • Smallest Euclidean distance (L2 distance)
  • Largest vector inner product (inner product similarity)
  • Largest cosine similarity (smallest angle between vectors)

$\mathbf{a}$ represents a user's embedding vector; the scatter points represent item embedding vectors. The goal is to find the nearest-neighbor vectors to $\mathbf{a}$ (in matrix completion, "nearest neighbor" means largest inner product).

Partition the scatter points into regions; represent each region with a centroid vector, replacing all item vectors within that region.

Find the region whose representative vector is nearest to $\mathbf{a}$, then compute the top-k nearest item vectors within that region only.
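The region-based search described above can be sketched in numpy: partition item vectors by nearest representative, then score only the items in the best-matching region. This resembles the IVF-style indexes in Faiss/Milvus, but everything here (data, 16 regions, using random items as representatives) is a toy assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
items = rng.standard_normal((1000, 8))                  # item vectors b_i
centroids = items[rng.choice(1000, 16, replace=False)]  # crude region reps

# Offline: assign each item to its nearest region representative (Euclidean).
assign = np.argmin(((items[:, None] - centroids[None]) ** 2).sum(-1), axis=1)

def approx_topk(a, k=10):
    region = np.argmax(centroids @ a)           # best region by inner product
    members = np.where(assign == region)[0]     # only items in that region
    scores = items[members] @ a                 # exact scores within the region
    return members[np.argsort(-scores)[:k]]

result = approx_topk(rng.standard_normal(8), k=5)
```

Only one region's items are scored per query, which is why the search is approximate but much cheaper than brute-force enumeration.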

Summary

Matrix Completion

  • Embed item IDs and user IDs, mapping them to vectors.
  • Inner product $\langle \mathbf{a}_u, \mathbf{b}_i \rangle$ serves as user $u$'s estimated interest in item $i$.
  • Fit $\langle \mathbf{a}_u, \mathbf{b}_i \rangle$ to observed true interest scores to learn the embedding layer parameters.
  • Matrix completion has many drawbacks and performs poorly.

Online Retrieval

  • Use user vector $\mathbf{a}$ as the query; find the items $i$ that maximize $\langle \mathbf{a}, \mathbf{b}_i \rangle$.
  • Brute-force enumeration is too slow. In practice, use approximate nearest-neighbor search.
  • Vector databases such as Milvus, Faiss, and HnswLib support approximate nearest-neighbor search.

Two-Tower Model: Architecture and Training

Two-Tower Model

Training the Two-Tower Model

  • Pointwise: Treat each positive and negative sample independently; do simple binary classification.
  • Pairwise: Each batch contains one positive sample and one negative sample.
  • Listwise: Each batch contains one positive sample and multiple negative samples.

Positive and Negative Sample Selection

  • Positive samples: Items the user clicked.
  • Negative samples [1,2]:
    • Items not retrieved?
    • Items retrieved but filtered by pre-ranking or full ranking?
    • Items exposed but not clicked?

Pointwise Training

  • Frame retrieval as a binary classification task.
  • For positive samples, encourage $\cos(\mathbf{a}, \mathbf{b})$ to approach $+1$.
  • For negative samples, encourage $\cos(\mathbf{a}, \mathbf{b})$ to approach $-1$.
  • Control the ratio of positive to negative samples at $1:2$ or $1:3$.

Pairwise Training

The two item towers share the same parameters.

Basic idea: Encourage $\cos(\mathbf{a}, \mathbf{b}^+)$ to be greater than $\cos(\mathbf{a}, \mathbf{b}^-)$.

  • If $\cos(\mathbf{a}, \mathbf{b}^+)$ exceeds $\cos(\mathbf{a}, \mathbf{b}^-) + m$, there is no loss; $m$ is a hyperparameter (the margin).
  • Otherwise, the loss is $\cos(\mathbf{a}, \mathbf{b}^-) + m - \cos(\mathbf{a}, \mathbf{b}^+)$.

Loss Functions

Triplet hinge loss:

$$L(\mathbf{a}, \mathbf{b}^+, \mathbf{b}^-) = \max \{ 0,\ \cos(\mathbf{a}, \mathbf{b}^-) + m - \cos(\mathbf{a}, \mathbf{b}^+) \}$$

Triplet logistic loss:

$$L(\mathbf{a}, \mathbf{b}^+, \mathbf{b}^-) = \log\left(1 + \exp\left[\sigma \cdot \left(\cos(\mathbf{a}, \mathbf{b}^-) - \cos(\mathbf{a}, \mathbf{b}^+)\right)\right]\right).$$
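Both pairwise losses operate only on the two cosine similarities, so they can be sketched as scalar functions. `cos_pos` and `cos_neg` stand for $\cos(\mathbf{a}, \mathbf{b}^+)$ and $\cos(\mathbf{a}, \mathbf{b}^-)$; the default hyperparameter values are arbitrary illustrations.

```python
import math

def triplet_hinge(cos_pos, cos_neg, m=1.0):
    # Zero loss once the positive beats the negative by margin m.
    return max(0.0, cos_neg + m - cos_pos)

def triplet_logistic(cos_pos, cos_neg, sigma=1.0):
    # Smooth alternative: always positive, shrinks as the gap grows.
    return math.log(1.0 + math.exp(sigma * (cos_neg - cos_pos)))
```

For example, with a comfortable gap (`cos_pos=0.9`, `cos_neg=-0.5`, `m=1.0`) the hinge loss is exactly 0, while the logistic loss is small but nonzero.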

Listwise Training

  • One data point contains:

    • One user with feature vector $\mathbf{a}$.
    • One positive sample with feature vector $\mathbf{b}^+$.
    • Multiple negative samples with feature vectors $\mathbf{b}_1^-, \dots, \mathbf{b}_n^-$.
  • Encourage $\cos(\mathbf{a}, \mathbf{b}^+)$ to be as large as possible.

  • Encourage $\cos(\mathbf{a}, \mathbf{b}_1^-), \dots, \cos(\mathbf{a}, \mathbf{b}_n^-)$ to be as small as possible.

Feed the similarities into a softmax; encourage the positive sample's softmax output to approach 1 and each negative sample's output to approach 0.

Loss function is cross-entropy loss.

Summary

Two-Tower Model

  • User tower and item tower each output a single vector.

  • Cosine similarity of the two vectors serves as the estimated interest score.

  • Three training approaches:

    • Pointwise: Each batch uses one user and one item (positive or negative).
    • Pairwise: Each batch uses one user, one positive sample, and one negative sample.
    • Listwise: Each batch uses one user, one positive sample, and multiple negative samples.

Models Not Suitable for Retrieval: Early Fusion Models

The two-tower model uses late fusion for retrieval — its advantage is that item representations can be precomputed via the item tower. Each time a user arrives, their representation is computed via the user tower; fast nearest-neighbor methods then retrieve the top-k items closest to the user representation.

If early fusion is used instead, item representations cannot be precomputed. To retrieve k items, the user's features must be fused with each item's features and passed through the neural network to get an interest score — requiring processing every item, losing the advantage of fast nearest-neighbor methods.

Two-Tower Model: Positive and Negative Samples

Positive Samples

  • Positive samples: (user, item) pairs that were exposed and then clicked. (User is interested in the item)

  • Problem: A small fraction of items account for most clicks, causing positive samples to be mostly popular items.

  • Solution: Oversample rare items, or downsample popular items.

  • Oversampling (up-sampling): A sample appears multiple times.

  • Downsampling (down-sampling): Some samples are discarded.

How to Select Negative Samples

Negative sample selection criteria differ for retrieval, pre-ranking, full ranking, and re-ranking.

Simple Negative Samples

Simple Negative Samples: Full Item Pool

  • Items not retrieved are very likely to be uninteresting to the user.
  • Items not retrieved ≈ the full item pool.
  • Sample from the full item pool as negative samples.
  • Uniform or non-uniform sampling?

Uniform sampling: Unfair to rare items.

  • Positive samples are mostly popular items.
  • If negative samples are uniformly sampled, negative samples are mostly rare items.

Non-uniform sampling: Aimed at suppressing popular items.

  • Negative sample probability proportional to popularity (click count).
  • $\text{sampling probability} \propto (\text{click count})^{0.75}$; 0.75 is an empirical value.
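This sub-linear popularity sampling is a one-liner in numpy. The click counts below are made-up toy values; the point is that raising counts to the 0.75 power flattens the distribution, so rare items are sampled more often than under raw click-proportional sampling.

```python
import numpy as np

# Hypothetical per-item click counts.
clicks = np.array([1000.0, 100.0, 10.0, 1.0])

p = clicks ** 0.75
p /= p.sum()                                      # normalized sampling probs

rng = np.random.default_rng(0)
negatives = rng.choice(len(clicks), size=5, p=p)  # sample 5 negative item IDs
```

Compare: under raw proportions item 0 gets $1000/1111 \approx 0.90$ of the samples, while under the 0.75 power it gets about 0.83, shifting mass toward rarer items.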

Simple Negative Samples: In-Batch Negative Samples

  • A batch contains $n$ positive samples.

  • Pairing each user with the other $n-1$ items in the batch yields negative samples.

  • The batch thus has $n(n-1)$ negative samples in total.

  • All are simple negative samples. (The first user presumably doesn't like the second user's item.)

  • The probability of an item appearing in the batch is $\propto$ click count, so more popular items appear more often.

  • The probability of an item becoming a negative sample should ideally be $\propto (\text{click count})^{0.75}$, but here it is $\propto \text{click count}$.

  • Popular items therefore have too high a probability of becoming negative samples.

Debiasing the in-batch sampling:

  • Sampling probability of item $i$: $p_i \propto \text{click count}$; more popular items have higher $p_i$.

  • Estimated user interest in item $i$: $\cos(\mathbf{a}, \mathbf{b}_i)$.

  • During training, replace it with $\cos(\mathbf{a}, \mathbf{b}_i) - \log p_i$. This debiases the loss and prevents over-suppression of popular items: when a popular item and a rare item have the same raw $\cos(\mathbf{a}, \mathbf{b}_i)$, the popular item's adjusted value is smaller, so after training the popular item's $\cos(\mathbf{a}, \mathbf{b}_i)$ ends up larger than the rare item's.

Hard Negative Samples

Hard Negative Samples

  • Items filtered out by pre-ranking (relatively hard).
  • Items ranked near the bottom by full ranking (very hard).

Binary classification of positive and negative samples:

  • Full item pool (easy): high classification accuracy.
  • Items filtered by pre-ranking (relatively hard): more likely to be misclassified.
  • Items ranked near the bottom by full ranking (very hard): even more likely to be misclassified.

Training Data

  • Mix multiple types of negative samples.

  • 50% of negative samples from the full item pool (simple negative samples).

  • 50% of negative samples from items that did not pass ranking (hard negative samples).

Common Mistakes

Items exposed but not clicked should not be used as negative samples for retrieval model training.

They can be used as negative samples for ranking model training.

Principles for Negative Sample Selection

Retrieval goal: Quickly find items the user may be interested in.

  • Full item pool (easy): Vast majority are completely uninteresting to the user.
  • Filtered by ranking (hard): User may be interested, but not enough.
  • Exposed but not clicked (unusable): User is interested but may not have clicked by chance.
    • Can be used as negative samples for ranking; cannot be used as negative samples for retrieval.

Summary

  • Positive samples: Exposed and then clicked.

  • Simple negative samples:

    • Full item pool.
    • In-batch negative samples.
  • Hard negative samples: Retrieved but filtered out by ranking.

  • Error: Using exposed-but-not-clicked items as negative samples for retrieval.

Two-Tower Model: Online Retrieval and Updates

Online Retrieval

Two-Tower Model Retrieval

Offline storage: Store item vectors $\mathbf{b}$ in a vector database.

  1. After training, use the item tower to compute the feature vector $\mathbf{b}$ for every item.
  2. Store the hundreds of millions of item vectors $\mathbf{b}$ in a vector database (e.g., Milvus, Faiss, HnswLib).
  3. The vector database builds an index to accelerate nearest-neighbor search.

Online retrieval: Find the top-k items the user is most interested in.

  1. Given the user ID and profile, compute the user vector $\mathbf{a}$ online using the neural network.

  2. Nearest-neighbor search:

    • Use vector $\mathbf{a}$ as the query; call the vector database for nearest-neighbor search.
    • Return the $k$ items with the highest cosine similarity as retrieval results.

Why store item vectors $\mathbf{b}$ offline but compute the user vector $\mathbf{a}$ online?

  • Each retrieval uses one user vector $\mathbf{a}$ but hundreds of millions of item vectors $\mathbf{b}$.
    (Computing item vectors online is too costly.)

  • User interests change dynamically, while item features are relatively stable.
    (Storing user vectors offline is possible but hurts recommendation quality.)

Model Updates

Full Update vs. Incremental Update

Full update: At midnight tonight, train the model using all of yesterday's data.

  • Start from yesterday's model parameters (not random initialization).
  • Train for 1 epoch on yesterday's data — each day's data is used only once.
  • Release new user tower neural network and item vectors for online retrieval.
  • Full update has lower requirements on data pipelines and systems.

Incremental update: Apply online learning to update model parameters.

  • User interests change in real time.
  • Collect online data in real time; apply stream processing to generate TFRecord files.
  • Apply online learning to incrementally update ID Embedding parameters.
  • Release updated user ID embeddings for the user tower to compute user vectors online.

Question: Can we only do incremental updates and skip full updates?

  • User behavior differs across time periods within a day; hourly data is biased; minute-level data is even more biased.
  • Full update: randomly shuffle one full day of data; train for 1 epoch.
  • Incremental update: train for 1 epoch in chronological order.
  • Random shuffling is better than sequential ordering; full training is better than incremental training.

Summary

Two-Tower Model

  • User tower and item tower each output a vector; cosine similarity of the two vectors estimates interest.

  • Three training approaches: pointwise, pairwise, listwise.

  • Positive samples: items the user clicked.

  • Negative samples: full item pool (simple), items filtered out by ranking (hard).

Retrieval

  • After training, store item vectors in a vector database for online nearest-neighbor search.

  • Online retrieval: given the user ID and profile, call the user tower to compute the user vector $\mathbf{a}$ in real time.

  • Use $\mathbf{a}$ as the query; search the vector database for the $k$ item vectors with highest cosine similarity; return those $k$ item IDs.

Model Updates

  • Full update: at midnight, use all of yesterday's data to train the full neural network for 1 epoch of SGD.

  • Incremental update: use real-time data to train the neural network; only update ID Embeddings; freeze fully connected layers.

  • Production systems:

    • Combine full updates and incremental updates.
    • Publish the latest user ID embeddings every few tens of minutes for the user tower to compute user vectors online.

Two-Tower Model + Self-Supervised Learning

Two-Tower Model Problem

  • Recommender systems exhibit strong head effects:
    • A small fraction of items account for most clicks.
    • Most items have low click counts.
  • High-click items learn good representations; long-tail items learn poor representations.
  • Self-supervised learning: apply data augmentation to better learn long-tail item vectors.

Review of Two-Tower Model

In-Batch Negative Samples

  • A batch has $n$ positive sample pairs.
  • Form $n$ lists; each list has 1 positive pair and $n-1$ negative pairs.

Listwise Training

  • A batch contains $n$ positive sample pairs (with clicks):

    $$(\mathbf{a}_1, \mathbf{b}_1), (\mathbf{a}_2, \mathbf{b}_2), \dots, (\mathbf{a}_n, \mathbf{b}_n).$$

  • Negative samples: $\{(\mathbf{a}_i, \mathbf{b}_j)\}$ for all $i \neq j$.

  • Encourage $\cos(\mathbf{a}_i, \mathbf{b}_i)$ to be as large as possible; encourage $\cos(\mathbf{a}_i, \mathbf{b}_j)$ to be as small as possible.

Loss Function

Debiasing

  • Sampling probability of item $j$:

    $$p_j \propto \text{click count}$$

  • Estimated interest of user $i$ in item $j$: $\cos(\mathbf{a}_i, \mathbf{b}_j)$

  • During training, replace $\cos(\mathbf{a}_i, \mathbf{b}_j)$ with:

    $$\cos(\mathbf{a}_i, \mathbf{b}_j) - \log p_j$$

Training the Two-Tower Model

  • Randomly sample $n$ (user, item) pairs from click data to form a batch.

  • Two-tower model loss function:

$$L_{\text{main}}[i] = -\log \left( \frac{\exp(\cos(\mathbf{a}_i, \mathbf{b}_i) - \log p_i)}{\sum_{j=1}^{n} \exp(\cos(\mathbf{a}_i, \mathbf{b}_j) - \log p_j)} \right)$$

  • Apply gradient descent to minimize the loss:

$$\frac{1}{n} \sum_{i=1}^{n} L_{\text{main}}[i]$$
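The loss above can be sketched as a batched numpy computation: rows are users, columns are items in the batch, the diagonal holds the positive pairs, and $-\log p_j$ is subtracted from each column. All vectors and the uniform $p$ here are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 4, 8
a = rng.standard_normal((n, dim))                 # user tower outputs
b = rng.standard_normal((n, dim))                 # item tower outputs
a /= np.linalg.norm(a, axis=1, keepdims=True)     # normalize so dot = cosine
b /= np.linalg.norm(b, axis=1, keepdims=True)
p = np.full(n, 1.0 / n)                           # toy sampling probs p_j

logits = a @ b.T - np.log(p)[None, :]             # cos(a_i, b_j) - log p_j
log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -np.mean(np.diag(log_softmax))             # (1/n) * sum_i L_main[i]
```

In a framework this is just softmax cross-entropy with target class $i$ for row $i$; the `- log p` column correction is the only non-standard piece.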

Self-Supervised Learning

  • The two vector representations $\mathbf{b}'_i$ and $\mathbf{b}''_i$ of item $i$ have high similarity.

  • The vector representations $\mathbf{b}'_i$ and $\mathbf{b}''_j$ of distinct items $i$ and $j$ have low similarity.

  • Encourage $\cos(\mathbf{b}'_i, \mathbf{b}''_i)$ to be as large as possible; encourage $\cos(\mathbf{b}'_i, \mathbf{b}''_j)$ to be as small as possible.

Feature transformation converts an item's feature vector into another vector.

For example, an item's feature vector might be (audience gender, category, city, occupation) = (0, 5, 1.5, 9). Feature transformation converts (0, 5, 1.5, 9) into, say, (0.9, 7.3, 10.8, 9.6) (a simplified example).

Feature Transformation: Random Mask

  • Randomly select some discrete features (e.g., category) and mask them.
  • Example:
    • An item's category feature is $\mathcal{U} = \{\text{electronics}, \text{photography}\}$.
    • After masking, the category feature becomes $\mathcal{U}' = \{\text{default}\}$, where default represents the missing value.
    • Masking means discarding all values of a feature.
    • For example, if electronics = 1 and photography = 2, with missing default = 0, the pre-mask category value might be 1.5; after masking, it becomes 0.
  • Random mask does not mask all features; only some, chosen at random, are masked.

Feature Transformation: Dropout (applies only to multi-valued discrete features)

  • An item can belong to multiple categories; category is a multi-valued discrete feature.
  • Dropout: randomly discard 50% of feature values.
  • Example:
    • An item's category feature is $\mathcal{U} = \{\text{beauty}, \text{photography}\}$.
    • After dropout, category becomes $\mathcal{U}' = \{\text{beauty}\}$.
    • For example, if beauty = 1 and photography = 2, the pre-dropout category value might be 1.5; post-dropout it becomes 1.

Feature Transformation: Complementary Features

  • Suppose an item has 4 features:

    ID, category, keywords, city

  • Randomly split them into two groups: $\{\text{ID}, \text{keywords}\}$ and $\{\text{category}, \text{city}\}$

  • $\{\text{ID}, \text{default}, \text{keywords}, \text{default}\} \Rightarrow$ item representation

  • $\{\text{default}, \text{category}, \text{default}, \text{city}\} \Rightarrow$ item representation

  • Since they represent the same item, encourage the two item representations to be similar.

Feature Transformation: Mask a Group of Correlated Features

  • A group of correlated features:

    • Audience gender: $\mathcal{U} = \{\text{male}, \text{female}, \text{neutral}\}$

    • Category: $\mathcal{V} = \{\text{beauty}, \text{electronics}, \text{football}, \text{photography}, \text{tech}, \dots\}$

    • $u = \text{female}$ and $v = \text{beauty}$ co-occur with high probability $p(u, v)$.

    • $u = \text{female}$ and $v = \text{electronics}$ co-occur with low probability $p(u, v)$.

  • $p(u)$: probability that a feature value is $u$.

    • $p(\text{male}) = 20\%$
    • $p(\text{female}) = 30\%$
    • $p(\text{neutral}) = 50\%$
  • $p(u, v)$: probability that one feature is $u$ and another is $v$ simultaneously.

    • $p(\text{female}, \text{beauty}) = 3\%$
    • $p(\text{female}, \text{electronics}) = 0.1\%$
  • Offline: compute pairwise feature correlations using mutual information (MI), e.g., between category and audience gender:

$$MI(\mathcal{U}, \mathcal{V}) = \sum_{u \in \mathcal{U}} \sum_{v \in \mathcal{V}} p(u, v) \cdot \log \frac{p(u, v)}{p(u) \cdot p(v)}$$

  • Suppose an item has $k$ features. Offline, compute pairwise MI for all features, producing a $k \times k$ matrix.

  • Randomly select one feature as the seed; find the $k/2$ features most correlated with the seed.

  • Mask the seed and its $k/2$ correlated features; keep the remaining features.

  • Example: An item has four features $\{\text{ID}, \text{category}, \text{keywords}, \text{city}\}$. The seed is keywords, and its most correlated feature is city. After masking: $\{\text{ID}, \text{category}, \text{default}, \text{default}\}$.

  • Pros and cons:

    • Pro: Outperforms random mask, dropout, and complementary features.
    • Con: Complex method, difficult to implement and maintain.
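The offline MI computation is straightforward once the marginal and joint probability tables exist. The sketch below reuses the gender marginals from the notes and invents the rest; when two features are independent, $p(u,v) = p(u)p(v)$ and the MI is exactly zero.

```python
import numpy as np

# Marginals: p(male), p(female), p(neutral) from the notes; the category
# marginals and joint table are made-up toy values.
p_u = np.array([0.2, 0.3, 0.5])
p_v = np.array([0.1, 0.2, 0.7])
p_uv = np.outer(p_u, p_v)          # independent joint, for illustration

def mutual_information(p_uv, p_u, p_v):
    mask = p_uv > 0                # skip impossible pairs to avoid log(0)
    ratio = p_uv[mask] / np.outer(p_u, p_v)[mask]
    return float((p_uv[mask] * np.log(ratio)).sum())

mi = mutual_information(p_uv, p_u, p_v)   # 0 for independent features
```

A perfectly correlated pair (joint mass only on the diagonal) would instead give a strictly positive MI, which is what flags the two features as belonging to the same correlated group.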

Training the Model

  • Uniformly sample mm items from all items to form a batch.

  • Apply two types of feature transformations; the item tower outputs two sets of vectors:

b1,b2,,bmandb1,b2,,bm\mathbf{b'_1}, \mathbf{b'_2}, \dots, \mathbf{b'_m} \quad \text{and} \quad \mathbf{b''_1}, \mathbf{b''_2}, \dots, \mathbf{b''_m}
  • Self-supervised learning loss for the ii-th item:
Lself[i]=log(exp(cos(bi,bi))j=1mexp(cos(bi,bj)))L_{\text{self}}[i] = -\log \left( \frac{\exp(\cos(\mathbf{b'_i}, \mathbf{b''_i}))}{\sum_{j=1}^{m} \exp(\cos(\mathbf{b'_i}, \mathbf{b''_j}))} \right)
  • Apply gradient descent to minimize self-supervised learning loss:
1mi=1mLself[i]\frac{1}{m} \sum_{i=1}^{m} L_{\text{self}}[i]
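The loss above can be sketched with NumPy, assuming the two sets of item-tower outputs are stacked as m×dm \times d matrices `B1` (the bi\mathbf{b'_i}) and `B2` (the bi\mathbf{b''_i}); `self_supervised_loss` is a hypothetical name.

```python
import numpy as np

def self_supervised_loss(B1, B2):
    """Mean of L_self[i] over the batch: for each item i, softmax over
    cos(b'_i, b''_j) across all j, with the diagonal (same item) as the positive."""
    B1n = B1 / np.linalg.norm(B1, axis=1, keepdims=True)
    B2n = B2 / np.linalg.norm(B2, axis=1, keepdims=True)
    S = B1n @ B2n.T                              # S[i, j] = cos(b'_i, b''_j), shape (m, m)
    logits = S - S.max(axis=1, keepdims=True)    # subtract row max for numerical stability
    P = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(np.diag(P)).mean()            # (1/m) * sum_i L_self[i]
```

When the two transformations of the same item produce similar vectors, the diagonal dominates each row and the loss is small; mismatched pairs drive it up.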

Summary

  • Two-tower model learns poor vector representations for low-exposure items.

  • Self-supervised learning:

    • Apply random feature transformations to items.
    • Feature vectors bi\mathbf{b'_i} and bi\mathbf{b''_i} have high similarity (same item).
    • Feature vectors bi\mathbf{b'_i} and bj\mathbf{b''_j} have low similarity (different items).
  • Experimental results: recommendations for low-exposure items and new items become more accurate.

  • Randomly sample nn (user, item) pairs from click data to form a batch.

  • Uniformly sample mm items from all items to form a batch.

  • Apply gradient descent to minimize loss:

1ni=1nLmain[i]+α1mj=1mLself[j].\frac{1}{n} \sum_{i=1}^{n} L_{\text{main}}[i] + \alpha \cdot \frac{1}{m} \sum_{j=1}^{m} L_{\text{self}}[j].
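The combined objective can be written as a one-line helper; a minimal sketch, assuming the per-sample main losses (from the nn click pairs) and self-supervised losses (from the mm sampled items) have already been computed, with α\alpha as a tuning hyperparameter.

```python
def total_loss(main_losses, self_losses, alpha=0.5):
    """Combined objective: mean main loss over n pairs plus
    alpha times the mean self-supervised loss over m items."""
    n, m = len(main_losses), len(self_losses)
    return sum(main_losses) / n + alpha * sum(self_losses) / m
```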

Deep Retrieval

  • The classical two-tower model represents users and items as vectors; online nearest-neighbor search is performed.

  • Deep Retrieval represents items as paths; online search finds the paths that best match the user.

  • Deep Retrieval is similar to Alibaba's TDM.

Outline

  1. Indexes:

    • path → List⟨item⟩
    • item → List⟨path⟩
  2. Estimation model: neural network estimates user interest in a path.

  3. Online retrieval: user → path → item.

  4. Training:

    • Learn neural network parameters.
    • Learn item representations (item → path).

Indexes

Items Represented as Paths

  • Depth: depth = 3 (number of layers).

  • Width: width = K.

  • An item is represented as a path, e.g., [2,4,1]\mathbf{[2,4,1]}.

  • An item can be represented as multiple paths, e.g., {[2,4,1],[4,1,1]}\{\mathbf{[2,4,1]}, \mathbf{[4,1,1]}\}.

Item-to-Path Index

Index: item → List⟨path⟩

  • One item corresponds to multiple paths.
  • A path is represented by 3 nodes: path = [a,b,c][a, b, c].

Index: path → List⟨item⟩

  • One path corresponds to multiple items.
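The two indexes are inverses of each other, so one can be built from the other. A minimal sketch with plain dictionaries; the item IDs and path assignments are made up for illustration.

```python
from collections import defaultdict

# Hypothetical item-to-path assignments: each item maps to multiple paths,
# and each path is a tuple of depth = 3 node indices.
item_to_paths = {
    "item_1": [(2, 4, 1), (4, 1, 1)],
    "item_2": [(2, 4, 1)],
}

# Build the inverse index path -> List<item> from item -> List<path>.
path_to_items = defaultdict(list)
for item, paths in item_to_paths.items():
    for path in paths:
        path_to_items[path].append(item)
```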

Estimation Model

Estimating User Interest in a Path

  • A path is represented by 3 nodes: path = [a,b,c][a, b, c].

  • Given user features x\mathbf{x}, estimate user interest in node aa: p1(ax)p_1(a | \mathbf{x}).

  • Given x\mathbf{x} and aa, estimate user interest in node bb: p2(ba;x)p_2(b | a; \mathbf{x}).

  • Given x,a,b\mathbf{x}, a, b, estimate user interest in node cc: p3(ca,b;x)p_3(c | a, b; \mathbf{x}).

  • Estimated user interest in path = [a,b,c][a, b, c]:

p(a,b,cx)=p1(ax)×p2(ba;x)×p3(ca,b;x)p(a, b, c | \mathbf{x}) = p_1(a | \mathbf{x}) \times p_2(b | a; \mathbf{x}) \times p_3(c | a, b; \mathbf{x})

User feature vector x\mathbf{x} is passed through a neural network and then a softmax activation to output the p1p_1 vector. p1p_1 represents the interest scores the network assigns to the KK nodes at layer L1; the higher the score, the more likely a node is selected. If L1 has KK nodes, p1p_1 is a KK-dimensional vector. Node aa is selected from L1's KK nodes based on p1p_1.

Node aa is embedded to get emb(aa). The original user feature vector x\mathbf{x} is concatenated with emb(aa) and fed into another neural network, followed by a softmax layer to output p2p_2. p2p_2 represents interest scores for the KK nodes at layer L2. Node bb is selected from L2.

Node cc is derived analogously.
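The chained estimation above can be sketched as follows. The single linear layers and random embeddings stand in for the trained networks, so the scores are meaningless, but the structure matches the description: a softmax over the KK nodes at each layer, conditioned on the prefix via concatenated node embeddings.

```python
import numpy as np

K = 8   # width: number of nodes per layer
D = 16  # user feature dimension
rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

# Single linear layers stand in for the three trained networks.
W1 = rng.normal(size=(K, D))            # input: x
W2 = rng.normal(size=(K, D + K))        # input: x concatenated with emb(a)
W3 = rng.normal(size=(K, D + 2 * K))    # input: x, emb(a), emb(b)
node_emb = rng.normal(size=(K, K))      # embedding table for the K node IDs

def path_prob(x, a, b, c):
    """p(a, b, c | x) = p1(a | x) * p2(b | a; x) * p3(c | a, b; x)."""
    p1 = softmax(W1 @ x)
    p2 = softmax(W2 @ np.concatenate([x, node_emb[a]]))
    p3 = softmax(W3 @ np.concatenate([x, node_emb[a], node_emb[b]]))
    return p1[a] * p2[b] * p3[c]
```

Because each factor is a softmax over the KK nodes of its layer, the probabilities of all K3K^3 paths sum to 1.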

Online Retrieval

Retrieval: user → path → item

  • Step 1: Given user features, use beam search to retrieve a set of paths.

  • Step 2: Use the index "path → List⟨item⟩" to retrieve a set of items.

  • Step 3: Score and rank the items; select a subset.

Beam Search

  • Suppose there are 3 layers, each with KK nodes; there are K3K^3 paths total.

  • Scoring all K3K^3 paths with the neural network is too expensive.

  • Beam search reduces computation.

  • Requires setting the hyperparameter beam size.

With beam size = 4 in this example: select the top 4 nodes at L1 based on the neural network's scores p1p_1 for the KK nodes.

For each selected node aa and each of the KK nodes bb at L2, compute user interest in path [a,b][a, b]:

p(a,bx)=p1(ax)×p2(ba;x)p(a,b | \mathbf{x})= p_1(a | \mathbf{x}) \times p_2(b | a; \mathbf{x})

Compute 4×K4 \times K scores, one for each path; select the top 4 paths.

For each retained path prefix [a,b][a, b] and each of the KK nodes cc at L3, compute user interest in path [a,b,c][a, b, c]:

p(a,b,cx)=p(a,bx)×p3(ca,b;x)p(a,b,c | \mathbf{x})= p(a,b | \mathbf{x}) \times p_3(c | a,b; \mathbf{x})

Again compute 4×K4 \times K scores; select the top 4 paths.
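The layer-by-layer expansion above can be sketched directly. As before, random single-layer networks stand in for the trained p1,p2,p3p_1, p_2, p_3; the point is the search structure: keep only the top beam-size partial paths at each layer instead of scoring all K3K^3 paths.

```python
import numpy as np

K = 8   # nodes per layer
D = 16  # user feature dimension
rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

# Stand-in networks: layer l's scores depend on x and the chosen prefix nodes.
Ws = [rng.normal(size=(K, D + l * K)) for l in range(3)]
node_emb = rng.normal(size=(K, K))

def layer_probs(x, prefix):
    """Softmax scores for the next layer's K nodes, given the prefix so far."""
    inp = np.concatenate([x] + [node_emb[a] for a in prefix])
    return softmax(Ws[len(prefix)] @ inp)

def beam_search(x, beam_size=4):
    beams = [(1.0, ())]                  # (path probability so far, path prefix)
    for _ in range(3):                   # three layers: choose a, then b, then c
        candidates = [(score * p, path + (node,))
                      for score, path in beams
                      for node, p in enumerate(layer_probs(x, path))]
        beams = sorted(candidates, reverse=True)[:beam_size]
    return beams                         # top paths with scores p(a, b, c | x)
```

Each kept score is exactly the product of the per-layer conditionals along its path, so the search evaluates at most beam_size × K candidates per layer.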

Beam Search

  • User interest in path = [a,b,c][a, b, c]:
p(a,b,cx)=p1(ax)×p2(ba;x)×p3(ca,b;x)p(a, b, c | \mathbf{x}) = p_1(a | \mathbf{x}) \times p_2(b | a; \mathbf{x}) \times p_3(c | a, b; \mathbf{x})
  • Optimal path:
[a,b,c]=argmaxa,b,cp(a,b,cx)[a^\star, b^\star, c^\star] = \arg\max\limits_{a, b, c} p(a, b, c | \mathbf{x})
  • With a greedy algorithm (beam size = 1), the selected path [a,b,c][a, b, c] is not necessarily optimal.

Online Retrieval

  • Step 1: Given user features, use the neural network and beam search to retrieve a set of paths.

  • Step 2: Use indexes to retrieve a set of items.

    • Look up index path → List⟨item⟩.
    • Each path corresponds to multiple items.
  • Step 3: Rank items; select a subset.
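Steps 2 and 3 can be sketched as follows, assuming `beam_paths` is the (score, path) list produced by step 1's beam search; here the final "ranking" is simply truncation in path-score order, standing in for a real preliminary ranking model.

```python
def retrieve(beam_paths, path_to_items, max_items=100):
    """Map retrieved paths to items via the path -> List<item> index,
    dedupe (one item can sit on several retrieved paths), keep a subset."""
    seen, items = set(), []
    for _, path in sorted(beam_paths, reverse=True):   # highest-interest paths first
        for item in path_to_items.get(path, []):
            if item not in seen:
                seen.add(item)
                items.append(item)
    return items[:max_items]
```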

Training

Joint Learning of Neural Network Parameters and Item Representations

  • Neural network p(a,b,cx)p(a, b, c | \mathbf{x}) estimates user interest in path [a,b,c][a, b, c].

  • Represent each item as multiple paths {[a,b,c]}\{[a, b, c]\}; build indexes:

    • item → List⟨path⟩,
    • path → List⟨item⟩.
  • Positive sample (user, item): click(user,item)=1\text{click}(\text{user}, \text{item}) = 1.

Learning Neural Network Parameters

  • Item is represented by JJ paths: [a1,b1,c1],,[aJ,bJ,cJ][a_1, b_1, c_1], \dots, [a_J, b_J, c_J].

  • User interest in path [a,b,c][a, b, c]:

p(a,b,cx)=p1(ax)×p2(ba;x)×p3(ca,b;x)p(a, b, c \mid \mathbf{x}) = p_1(a \mid \mathbf{x}) \times p_2(b \mid a; \mathbf{x}) \times p_3(c \mid a, b; \mathbf{x})
  • If the user clicked the item, they are interested in all JJ paths.

  • Training should therefore maximize j=1Jp(aj,bj,cjx)\sum_{j=1}^{J} p(a_j, b_j, c_j \mid \mathbf{x}).

  • Loss function: The larger the total interest in JJ paths, the smaller the loss.

loss=log(j=1Jp(aj,bj,cjx))\text{loss} = -\log \left( \sum_{j=1}^{J} p(a_j, b_j, c_j \mid \mathbf{x}) \right)

Learning Item Representations

  • A user's interest in a path [a,b,c][a, b, c]:
p(pathuser)=p(a,b,cx)p(\text{path} \mid \text{user}) = p(a, b, c \mid \mathbf{x})
  • Relevance between an item and a path:
score(item,path)=userp(pathuser)×click(user,item)\text{score}(\text{item}, \text{path}) = \sum_{\text{user}} p(\text{path} \mid \text{user}) \times \text{click}(\text{user}, \text{item})

click(user,item)\text{click}(\text{user}, \text{item}) = 1 if the user clicked the item, 0 otherwise.

  • Select the JJ paths with the highest score(item,path)\text{score}(\text{item}, \text{path}), denoted Π={path1,,pathJ}\Pi = \{\text{path}_1, \dots, \text{path}_J\}, as the item's representation.

  • Loss function (select paths highly correlated with item):

loss(item,Π)=log(j=1Jscore(item,pathj))\text{loss}(\text{item}, \Pi) = -\log \left( \sum_{j=1}^{J} \text{score}(\text{item}, \text{path}_j) \right)
  • Regularization term (avoid too many items clustering on one path):
reg(pathj)=(number of items on pathj)4.\text{reg}(\text{path}_j) = (\text{number of items on path}_j)^4.

Greedy Algorithm to Update Paths

  • Suppose the item is already represented by JJ paths Π={path1,,pathJ}\Pi = \{\text{path}_1, \dots, \text{path}_J\}. Now update paths in Π\Pi.

  • Fix {pathi}il\{\text{path}_i\}_{i \neq l} and select a new pathl\text{path}_l from candidate paths:

    • Candidates are not limited to the current JJ paths; the selection pool can be the top-NN paths with the highest score(item,path)\text{score}(\text{item}, \text{path}).
    • loss(item,Π)\text{loss}(\text{item}, \Pi) is the loss of the path set formed by the fixed {pathi}il\{\text{path}_i\}_{i \neq l} together with the candidate pathl\text{path}_l.
    • reg(pathl)\text{reg}(\text{path}_l) prevents too many items from concentrating on one path.

pathlargminpathlloss(item,Π)+αreg(pathl)\text{path}_l \gets \arg\min_{\text{path}_l} \text{loss}(\text{item}, \Pi) + \alpha \cdot \text{reg}(\text{path}_l)
  • Selected paths have high score score(item,pathl)\text{score}(\text{item}, \text{path}_l) without too many items on the path.
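The greedy update can be sketched as follows. `update_path` is a hypothetical name; `score` and `items_on_path` are passed in as callables standing for score(item,path)\text{score}(\text{item}, \text{path}) and the current item count on a path, both of which would come from the surrounding training loop.

```python
import math

def update_path(item, Pi, l, candidate_paths, score, items_on_path, alpha=1.0):
    """Fix the other J-1 paths in Pi and greedily re-select path_l as the
    candidate minimizing loss(item, Pi) + alpha * reg(path_l)."""
    fixed = [p for i, p in enumerate(Pi) if i != l]
    best, best_obj = None, float("inf")
    for cand in candidate_paths:
        total = sum(score(item, p) for p in fixed) + score(item, cand)
        loss = -math.log(total)                 # loss(item, Pi) with cand as path_l
        reg = items_on_path(cand) ** 4          # reg(path_l)
        obj = loss + alpha * reg
        if obj < best_obj:
            best, best_obj = cand, obj
    new_Pi = list(Pi)
    new_Pi[l] = best
    return new_Pi
```

The regularization term makes a crowded path unattractive even when its relevance score is high.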

Comparison

Updating the Neural Network

  • Neural network predicts user interest in a path:
p(pathx)p(\text{path} \mid \mathbf{x})
  • Training data required:

    1. "Item → path" index.
    2. Items clicked by the user.
  • If the user clicked the item, and the item corresponds to path path\text{path}, update neural network parameters to increase p(pathx)p(\text{path} \mid \mathbf{x}).

Updating Item Representations

  • Assess item–path relevance:

    item user clicked item\longleftarrow_{\text{user clicked item}} user neural network score\longrightarrow_{\text{neural network score}} path

  • Associate each item with JJ paths:

    • Item and path must have high relevance.
    • A path cannot have too many items.

Summary

Retrieval: User → Path → Item

  • Given user features x\mathbf{x}, use the neural network to estimate user interest in path path=[a,b,c]\text{path} = [a, b, c]; score denoted p(pathx)p(\text{path} \mid \mathbf{x}).

  • Use beam search to find the ss paths with highest score p(pathx)p(\text{path} \mid \mathbf{x}).

  • Use the index "path → List⟨item⟩" to retrieve nn items on each path.

  • A total of s×ns \times n items are retrieved; do preliminary ranking and return the top-scoring items.

Training: Joint Learning of User–Path and Item–Path Relationships

  • An item is represented by JJ paths: path1,,pathJ\text{path}_1, \dots, \text{path}_J.

  • If the user clicked the item, update neural network parameters to increase:

j=1Jp(pathjx).\sum_{j=1}^{J} p(\text{path}_j \mid \mathbf{x}).
  • If user interest score p(pathx)p(\text{path} \mid \mathbf{x}) is high and the user clicked item item\text{item}, then item\text{item} and path\text{path} have relevance.

  • Find the JJ paths most relevant to item\text{item}, while avoiding too many items on one path.

Other Retrieval Channels

Geo-Location Based Retrieval

GeoHash Retrieval

  • Users may be interested in nearby events.
  • GeoHash: encoding of latitude/longitude; represents a rectangular geographic area.
  • Index: GeoHash → list of quality posts (sorted by time, descending).
  • This retrieval channel has no personalization.

Same-City Retrieval

  • Users may be interested in events in the same city.
  • Index: city → list of quality posts (sorted by time, descending).
  • This retrieval channel has no personalization.

Author-Based Retrieval

Followed Author Retrieval

  • Users are interested in posts published by authors they follow.

  • Index:

    user → followed authors
    author → published posts

  • Retrieval:

    user → followed authors → most recent posts

Interacted Author Retrieval

  • If the user is interested in a post (like, favorite, share), they may be interested in other posts by that author.

  • Index:
    user → interacted authors

  • Retrieval:
    user → interacted authors → most recent posts

Similar Author Retrieval

  • If a user likes an author, they may also like similar authors.

  • Index:
    author → similar authors (k authors)

  • Retrieval:
    user → interested authors (n authors) → similar authors (nk authors) → most recent posts (nk posts)

Cache-Based Retrieval


Idea: reuse full-ranking results from the previous nn recommendation sessions.

  • Background:

    • Full ranking outputs hundreds of posts and sends them to re-ranking.
    • Re-ranking applies diversity sampling and selects a few dozen.
    • More than half of full-ranking results are never exposed, so that computation is wasted.
  • Cache posts ranked in the top 50 by full ranking that were not exposed; use them as a retrieval channel.

Cache has a fixed size and requires an eviction mechanism.

  • Once a post is successfully exposed, it is evicted from the cache.
  • If the cache exceeds capacity, remove the oldest-entered post.
  • A post can be retrieved at most 10 times; it is evicted after 10 retrievals.
  • Each post is stored for at most 3 days; evicted after 3 days.
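The eviction rules above can be sketched as a small cache class; the class name, capacity default, and `now` parameter (injected timestamps, to keep the sketch testable) are assumptions.

```python
import time
from collections import OrderedDict

class RetrievalCache:
    """Caches unexposed top-ranked posts; evicts on exposure, on capacity
    overflow (oldest-entered first), after 10 retrievals, or after 3 days."""
    MAX_RETRIEVALS = 10
    TTL_SECONDS = 3 * 24 * 3600

    def __init__(self, capacity=500):
        self.capacity = capacity
        self.entries = OrderedDict()   # post_id -> (insert_time, retrieval_count)

    def add(self, post_id, now=None):
        now = time.time() if now is None else now
        if post_id not in self.entries:
            self.entries[post_id] = (now, 0)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)       # remove the oldest-entered post

    def on_exposed(self, post_id):
        self.entries.pop(post_id, None)            # exposed posts leave the cache

    def retrieve(self, now=None):
        now = time.time() if now is None else now
        results = []
        for post_id in list(self.entries):
            t, cnt = self.entries[post_id]
            if now - t > self.TTL_SECONDS:
                del self.entries[post_id]          # expired after 3 days
                continue
            results.append(post_id)
            cnt += 1
            if cnt >= self.MAX_RETRIEVALS:
                del self.entries[post_id]          # evicted after 10 retrievals
            else:
                self.entries[post_id] = (t, cnt)
        return results
```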

Exposure Filtering & Bloom Filter

Exposure Filtering Problem

  • If a user has seen an item, don't expose it to them again.

  • For each user, record items already exposed to them. (Xiaohongshu only retrieves posts published within the past month, so only the past month's exposure history needs to be recorded per user.)

  • For each retrieved item, determine if it has already been exposed to that user; exclude previously exposed items.

  • If a user has seen nn items and rr items are retrieved, brute-force comparison requires O(nr)O(nr) time.

Bloom Filter

  • Bloom filter determines whether an item ID is in the set of already-exposed items.

  • If the answer is no, that item is definitely not in the set.

  • If the answer is yes, the item is very likely in the set. (May have false positives: a non-exposed item is incorrectly judged as exposed and filtered out.)

  • Bloom filter represents an item set as an mm-dimensional binary vector.

  • Each user has a set of exposed items, represented as a binary vector requiring mm bits of storage.

  • Bloom filter has kk hash functions; each maps an item ID to an integer between 00 and m1m-1.

Bloom Filter

  • Exposed item set size is nn; binary vector dimension is mm; uses kk hash functions.

  • Bloom filter false positive rate: δ(1exp(knm))k\delta \approx \left( 1 - \exp \left( -\frac{kn}{m} \right) \right)^{k}.

    • Larger nn means more 1s in the vector; higher false positive rate. (The probability that an unexposed item's kk hash positions are all 1 increases.)

    • Larger mm means a longer vector; less likely hash collisions occur.

    • kk too large or too small is suboptimal; kk has an optimal value.

  • Given tolerable false positive rate δ\delta, optimal parameters are:

k=1.44ln(1δ),m=2nln(1δ)k = 1.44 \cdot \ln \left( \frac{1}{\delta} \right), \quad m = 2n \cdot \ln \left( \frac{1}{\delta} \right)

Bloom Filter Drawbacks

  • Bloom filter represents an item set as a binary vector.

  • Adding an item to the set only requires setting kk positions in the vector to 1. (If already 1, no change.)

  • Bloom filter only supports adding items; it does not support deleting items. Removing an item from the set cannot undo its effect on the vector.

  • Every day, items older than 1 month need to be removed from the item set. (Expired items can never be retrieved; no need to record them in the Bloom filter. Reducing nn lowers the false positive rate.)

