Wang Shusen Recommender Systems Study Notes — Retrieval
Retrieval
Item-Based Collaborative Filtering (ItemCF)
Basic Idea
If a user likes item $i_1$, and $i_1$ is similar to $i_2$, then the user is likely to also like $i_2$.
ItemCF Implementation
User's interest in an item: $\text{like}(\text{user}, \text{item}_j)$.
Similarity between items: $\text{sim}(\text{item}_j, \text{item})$.
Estimated user interest in a candidate item: $\sum_j \text{like}(\text{user}, \text{item}_j) \cdot \text{sim}(\text{item}_j, \text{item})$.
Item Similarity
Computing item similarity (not considering degree of user preference)
Let users who like item $i_1$ form set $W_1$.
Let users who like item $i_2$ form set $W_2$.
Define intersection $V = W_1 \cap W_2$.
Similarity between the two items: $\text{sim}(i_1, i_2) = \dfrac{|V|}{\sqrt{|W_1| \cdot |W_2|}}$.
Computing item similarity (weighted by degree of user preference)
Let users who like item $i_1$ form set $W_1$.
Let users who like item $i_2$ form set $W_2$.
Define intersection $V = W_1 \cap W_2$.
Similarity between the two items: $\text{sim}(i_1, i_2) = \dfrac{\sum_{v \in V} \text{like}(v, i_1) \cdot \text{like}(v, i_2)}{\sqrt{\sum_{u_1 \in W_1} \text{like}(u_1, i_1)^2} \cdot \sqrt{\sum_{u_2 \in W_2} \text{like}(u_2, i_2)^2}}$.
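To make the two similarity definitions concrete, here is a minimal Python sketch; the `like` dictionary, the example scores, and the function names are illustrative and not from the original notes.

```python
import math

# Hypothetical input: like[user_id][item_id] = preference score (e.g., clicks/likes).
like = {
    "u1": {"i1": 2.0, "i2": 1.0},
    "u2": {"i1": 1.0, "i2": 3.0, "i3": 1.0},
    "u3": {"i2": 1.0, "i3": 2.0},
}

def item_sim_binary(like, i1, i2):
    """sim(i1, i2) = |V| / sqrt(|W1| * |W2|), ignoring preference strength."""
    w1 = {u for u, items in like.items() if i1 in items}
    w2 = {u for u, items in like.items() if i2 in items}
    v = w1 & w2
    if not w1 or not w2:
        return 0.0
    return len(v) / math.sqrt(len(w1) * len(w2))

def item_sim_weighted(like, i1, i2):
    """Weighted by preference strength; a cosine-style normalization."""
    num = sum(items[i1] * items[i2]
              for items in like.values() if i1 in items and i2 in items)
    norm1 = math.sqrt(sum(items[i1] ** 2 for items in like.values() if i1 in items))
    norm2 = math.sqrt(sum(items[i2] ** 2 for items in like.values() if i2 in items))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return num / (norm1 * norm2)

print(item_sim_binary(like, "i1", "i2"))    # 2 / sqrt(2 * 3) ≈ 0.816
print(item_sim_weighted(like, "i1", "i2"))  # 5 / (sqrt(5) * sqrt(11)) ≈ 0.674
```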
Full ItemCF Retrieval Workflow
Offline Precomputation
Build "user → item" index
- Record item IDs the user has recently clicked or interacted with.
- Given any user ID, retrieve the list of items they have recently shown interest in.
Build "item → item" index
- Compute pairwise similarity between items.
- For each item, index the most similar items.
- Given any item ID, quickly find its most similar items.
Online Retrieval
- Given a user ID, use the "user → item" index to find the list of items the user has recently interacted with (last-n).
- For each item in the last-n list, use the "item → item" index to find the top-k similar items.
- For the retrieved similar items (up to $n \cdot k$ in total), use the formula above to estimate the user's interest score for each.
- Return the top 100 highest-scoring items as recommendation results.
Using indexes, offline computation is large but online computation is small.
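A minimal sketch of the online step, assuming the two indexes have already been precomputed offline; the index layout and names (`user2items`, `item2items`) are hypothetical.

```python
from collections import defaultdict

# Hypothetical precomputed indexes (illustrative data):
# user2items[user] = [(item, like_score), ...]   # last-n interactions
# item2items[item] = [(similar_item, sim), ...]  # top-k similar items
user2items = {"u1": [("i1", 2.0), ("i2", 1.0)]}
item2items = {
    "i1": [("i3", 0.8), ("i4", 0.5)],
    "i2": [("i4", 0.9), ("i5", 0.4)],
}

def itemcf_retrieve(user, top=100):
    scores = defaultdict(float)
    for item_j, like_uj in user2items.get(user, []):   # last-n items
        for cand, sim in item2items.get(item_j, []):    # top-k similar items per item
            scores[cand] += like_uj * sim               # sum_j like(u, j) * sim(j, cand)
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top]

print(itemcf_retrieve("u1"))
# roughly [('i4', 1.9), ('i3', 1.6), ('i5', 0.4)]
```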
Summary
ItemCF Principle
User likes item $i_1$; the user is then likely to also like item $i_2$, which is similar to $i_1$.
Item similarity:
- If the sets of users who like $i_1$ and who like $i_2$ have large overlap, then $i_1$ and $i_2$ are similar.
- Formula: $\text{sim}(i_1, i_2) = \dfrac{|W_1 \cap W_2|}{\sqrt{|W_1| \cdot |W_2|}}$.
ItemCF Retrieval Channel
Maintain two indexes:
- User → item list: items the user has recently interacted with.
- Item → item list: items with highest similarity.
Online retrieval:
- Use both indexes to retrieve up to $n \cdot k$ items per request.
- Estimate the user's interest score for each item: $\sum_j \text{like}(\text{user}, \text{item}_j) \cdot \text{sim}(\text{item}_j, \text{item})$.
- Return the top 100 highest-scoring items as retrieval results.
Swing Retrieval Channel
Swing Model
Let items liked by user $u_1$ form set $\mathcal{J}_1$.
Let items liked by user $u_2$ form set $\mathcal{J}_2$.
Define the overlap between the two users: $\text{overlap}(u_1, u_2) = |\mathcal{J}_1 \cap \mathcal{J}_2|$.
If users $u_1$ and $u_2$ have high overlap, they may come from the same small circle; their weight should be reduced.
Swing Model
Let users who like item $i_1$ form set $W_1$.
Let users who like item $i_2$ form set $W_2$.
Define intersection $V = W_1 \cap W_2$.
Similarity between the two items: $\text{sim}(i_1, i_2) = \sum_{u_1 \in V} \sum_{u_2 \in V} \dfrac{1}{\alpha + \text{overlap}(u_1, u_2)}$, where $\alpha$ is a hyperparameter.
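A small sketch of the Swing similarity as written above; the `user2items` data and the choice to include the $u_1 = u_2$ diagonal in the double sum are illustrative assumptions.

```python
# Hypothetical data: user2items[u] = set of items user u likes (illustrative).
user2items = {
    "u1": {"i1", "i2", "i3"},
    "u2": {"i1", "i2"},
    "u3": {"i1", "i2", "i4"},
}

def swing_sim(user2items, i1, i2, alpha=1.0):
    """sim(i1, i2) = sum over pairs of co-likers of 1 / (alpha + overlap(u1, u2))."""
    w1 = {u for u, items in user2items.items() if i1 in items}
    w2 = {u for u, items in user2items.items() if i2 in items}
    v = list(w1 & w2)                       # users who like both items
    sim = 0.0
    for ua in v:
        for ub in v:
            overlap = len(user2items[ua] & user2items[ub])
            sim += 1.0 / (alpha + overlap)  # downweight pairs from the same "circle"
    return sim

print(swing_sim(user2items, "i1", "i2"))
```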
Summary
- The only difference between Swing and ItemCF is item similarity.
- ItemCF: If the proportion of overlapping users between two items is high, the two items are considered similar.
- Swing: Additionally considers whether overlapping users come from the same small circle.
- Users who like both items form set $V$.
- For users $u_1$ and $u_2$ in $V$, their overlap is $\text{overlap}(u_1, u_2)$.
- High overlap between two users means they may come from the same small circle; their weight is reduced.
User-Based Collaborative Filtering (UserCF)
Basic Idea
If user $u_1$ is similar to user $u_2$, and $u_2$ likes a certain item, then $u_1$ is also likely to like that item.
UserCF Implementation
Similarity between users: $\text{sim}(\text{user}, \text{user}_j)$.
User's interest in an item: $\text{like}(\text{user}_j, \text{item})$.
Estimated user interest in a candidate item: $\sum_j \text{sim}(\text{user}, \text{user}_j) \cdot \text{like}(\text{user}_j, \text{item})$.
User Similarity
Computing user similarity
Let items liked by user $u_1$ form set $\mathcal{J}_1$.
Let items liked by user $u_2$ form set $\mathcal{J}_2$.
Define intersection $I = \mathcal{J}_1 \cap \mathcal{J}_2$.
Similarity between the two users: $\text{sim}(u_1, u_2) = \dfrac{|I|}{\sqrt{|\mathcal{J}_1| \cdot |\mathcal{J}_2|}}$.
Downweighting popular items
Let items liked by user $u_1$ form set $\mathcal{J}_1$.
Let items liked by user $u_2$ form set $\mathcal{J}_2$.
Define intersection $I = \mathcal{J}_1 \cap \mathcal{J}_2$.
Similarity between the two users: $\text{sim}(u_1, u_2) = \dfrac{\sum_{l \in I} \frac{1}{\log(1 + n_l)}}{\sqrt{|\mathcal{J}_1| \cdot |\mathcal{J}_2|}}$.
Where $n_l$ denotes the number of users who like item $l$, reflecting the item's popularity.
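A minimal sketch of the downweighted user similarity; the sample data and the `item_pop` counts are made up for illustration.

```python
import math

# Hypothetical data: user2items[u] = liked items; item_pop[i] = number of users who like i.
user2items = {
    "u1": {"i1", "i2", "i3"},
    "u2": {"i2", "i3", "i4"},
}
item_pop = {"i1": 10, "i2": 10000, "i3": 50, "i4": 3}

def user_sim(u1, u2, downweight_popular=True):
    j1, j2 = user2items[u1], user2items[u2]
    common = j1 & j2
    if downweight_popular:
        # Popular items say little about personal taste, so weight each by 1/log(1+n_l).
        num = sum(1.0 / math.log(1 + item_pop[i]) for i in common)
    else:
        num = float(len(common))
    return num / math.sqrt(len(j1) * len(j2))

print(user_sim("u1", "u2"))         # the very popular i2 contributes little
print(user_sim("u1", "u2", False))  # plain |I| / sqrt(|J1| * |J2|)
```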
Full UserCF Retrieval Workflow
Offline Precomputation
Build "user → item" index
- Record item IDs the user has recently clicked or interacted with.
- Given any user ID, retrieve the list of items they have recently shown interest in.
Build "user → user" index
- For each user, index the most similar users.
- Given any user ID, quickly find the most similar users.
Online Retrieval
- Given a user ID, use the "user → user" index to find the top-k most similar users.
- For each top-k similar user, use the "user → item" index to find the list of items they have recently shown interest in (last-n).
- For the retrieved items (up to $n \cdot k$ in total), use the formula above to estimate user interest scores.
- Return the top 100 highest-scoring items as retrieval results.
Summary
UserCF Principle
User $u_1$ is similar to $u_2$, and $u_2$ likes a certain item; then $u_1$ may also like that item.
User similarity:
- If users $u_1$ and $u_2$ have large overlap in liked items, they are similar.
- Formula: $\text{sim}(u_1, u_2) = \dfrac{|\mathcal{J}_1 \cap \mathcal{J}_2|}{\sqrt{|\mathcal{J}_1| \cdot |\mathcal{J}_2|}}$.
UserCF Retrieval Channel
Maintain two indexes:
- User → item list: items the user has recently interacted with.
- User → user list: users with highest similarity.
Online retrieval:
- Use both indexes to retrieve up to $n \cdot k$ items per request.
- Estimate the user's interest score for each item: $\sum_j \text{sim}(\text{user}, \text{user}_j) \cdot \text{like}(\text{user}_j, \text{item})$.
- Return the top 100 highest-scoring items as retrieval results.
Discrete Feature Processing
- Build a dictionary: map categories to indices.
- China → 1
- USA → 2
- India → 3
- Vectorize: map indices to vectors.
- One-hot encoding: Map indices to high-dimensional sparse vectors.
- Embedding: Map indices to low-dimensional dense vectors.
One-Hot Encoding
One-hot encoding for nationality features
Nationality: China, USA, India, etc. — 200 categories.
Dictionary: China → 1, USA → 2, India → 3, ...
One-hot encoding: Represent nationality as a 200-dimensional sparse vector.
- Unknown → 0 → [0,0,0,0,...,0]
- China → 1 → [1,0,0,0,...,0]
- USA → 2 → [0,1,0,0,...,0]
- India → 3 → [0,0,1,0,...,0]
Embedding
One-hot encoding can be mapped to embedding vectors.
Number of parameters: vector dimension × number of categories.
- Suppose embedding vectors are 4-dimensional.
- There are 200 nationalities.
- Number of parameters = 4 × 200 = 800.
Implementation: TensorFlow and PyTorch provide embedding layers.
- Parameters are stored as a matrix of size vector dimension × number of categories.
- Input is an index, e.g., "USA" has index 2.
- Output is a vector, e.g., "USA" corresponds to the 2nd column of the parameter matrix.
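A minimal PyTorch sketch of the embedding layer described above (TensorFlow's `tf.keras.layers.Embedding` is analogous). Note that `nn.Embedding` stores one row per category rather than one column, but the parameter count is the same.

```python
import torch
import torch.nn as nn

# 200 nationality categories, 4-dimensional embeddings -> 200 * 4 = 800 parameters.
num_nationalities = 200
emb_dim = 4
embedding = nn.Embedding(num_embeddings=num_nationalities, embedding_dim=emb_dim)

usa_index = torch.tensor([2])      # "USA" -> index 2 in the dictionary
usa_vector = embedding(usa_index)  # shape (1, 4): the embedding vector for "USA"
print(usa_vector.shape, embedding.weight.shape)  # torch.Size([1, 4]) torch.Size([200, 4])
```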
Summary
Discrete feature processing: one-hot encoding, embedding.
When the number of categories is large, use embedding.
- Word embedding.
- User ID embedding.
- Item ID embedding.
Matrix Completion
Embedding matrix A outputs a column vector a for each user; the number of users equals the number of columns.
Embedding matrix B outputs a column vector b for each item; the number of items equals the number of columns.
Basic Idea
User embedding parameter matrix is A. User $u$ corresponds to the $u$-th column, denoted vector $\mathbf{a}_u$.
Item embedding parameter matrix is B. Item $i$ corresponds to the $i$-th column, denoted vector $\mathbf{b}_i$.
Inner product $\langle \mathbf{a}_u, \mathbf{b}_i \rangle$ estimates user $u$'s interest in item $i$.
The goal of training is to learn matrices A and B so that the estimated values fit the observed true interest scores; A and B are the embedding layer parameters.
Dataset
Dataset: a collection of (user ID, item ID, interest score) triples, denoted $\Omega = \{(u, i, y)\}$.
Interest scores in the dataset are system-recorded, for example:
- Exposed but not clicked → 0 points
- Click, like, favorite, share → 1 point each
- Minimum score is 0, maximum is 4
Training
Map user ID and item ID to vectors.
- User $u$ → vector $\mathbf{a}_u$
- Item $i$ → vector $\mathbf{b}_i$
Solve the optimization problem using gradient descent to obtain parameters A and B: $\min_{A, B} \sum_{(u, i, y) \in \Omega} \left( y - \langle \mathbf{a}_u, \mathbf{b}_i \rangle \right)^2$
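A toy sketch of this optimization, assuming the squared loss over observed triples described above; the sizes, learning rate, and tiny dataset are illustrative.

```python
import torch

# Toy matrix completion (names and sizes are illustrative).
num_users, num_items, dim = 1000, 5000, 16
A = torch.nn.Embedding(num_users, dim)  # user embeddings (one row per user here)
B = torch.nn.Embedding(num_items, dim)  # item embeddings (one row per item here)
opt = torch.optim.SGD(list(A.parameters()) + list(B.parameters()), lr=0.01)

# Dataset Omega: (user id, item id, observed interest score y in [0, 4]).
dataset = [(3, 42, 2.0), (7, 42, 0.0), (3, 99, 4.0)]

for epoch in range(10):
    for u, i, y in dataset:
        a_u = A(torch.tensor([u]))
        b_i = B(torch.tensor([i]))
        pred = (a_u * b_i).sum()    # inner product <a_u, b_i>
        loss = (y - pred) ** 2      # squared error on the observed entry
        opt.zero_grad()
        loss.backward()
        opt.step()
```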
Matrix Completion
After training, the missing entries of the user–item interest matrix can be filled in (hence "matrix completion"), enabling recommendations based on the estimated interest scores.
Why It Performs Poorly in Practice...
Drawback 1: Only uses ID embedding; ignores item and user attributes.
- Item attributes: category, keywords, location, author information.
- User attributes: gender, age, geolocation, interested categories.
- Two-tower models can be seen as an upgraded version of matrix completion.
Drawback 2: Negative sample selection is wrong.
- Sample: (user, item) pair denoted .
- Positive sample: exposed and then clicked/interacted. (Correct approach)
- Negative sample: exposed but not clicked/interacted. (Wrong approach)
Drawback 3: Training method is suboptimal.
- Inner product is inferior to cosine similarity.
- Using squared loss (regression) is worse than cross-entropy loss (classification).
Model Storage
Training yields matrices A and B.
- Each column of A corresponds to one user.
- Each column of B corresponds to one item.
Store the columns of matrix A in a key-value table.
- Key is the user ID; value is a column of A.
- Given a user ID, return a vector (the user's embedding).
Storage and indexing of matrix B is more complex.
Online Serving
Use the user ID as key to query the key-value table and retrieve the user's vector, denoted $\mathbf{a}$.
Nearest-neighbor search: find the items the user is most likely interested in, as retrieval results.
- Item $i$'s embedding vector is $\mathbf{b}_i$.
- Inner product $\langle \mathbf{a}, \mathbf{b}_i \rangle$ estimates the user's interest in item $i$.
- Return the $k$ items with the largest inner product.
If enumerating all items, the time complexity is proportional to the number of items: to find the $k$ items of highest interest, interest scores for all items must be computed.
Systems Supporting Nearest-Neighbor Search
Systems: Milvus, Faiss, HnswLib, etc.
Nearest-neighbor criteria:
- Smallest Euclidean distance (L2 distance)
- Largest vector inner product (inner product similarity)
- Largest cosine similarity (cosine of the angle between the vectors)
a represents a user's embedding vector; the scatter points represent item embedding vectors. The goal is to find the nearest-neighbor vector to a (nearest neighbor in matrix completion means the largest inner product).
Partition the scatter points into regions; represent each region with a centroid vector, replacing all item vectors within that region.
Find the nearest-neighbor region vector to a, then compute the top-k nearest item vectors within that region.
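One possible setup with Faiss that mirrors the region/centroid idea above (an IVF index with inner-product metric); the dimensions and the `nlist`/`nprobe` values are illustrative, not recommendations.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, n_items = 64, 100_000
item_vecs = np.random.rand(n_items, d).astype("float32")  # stand-in for item embeddings

# IVF index: cluster items into nlist regions, each represented by a centroid vector.
nlist = 1024
quantizer = faiss.IndexFlatIP(d)  # inner-product metric
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(item_vecs)            # learn the region centroids
index.add(item_vecs)
index.nprobe = 16                 # how many regions to scan per query

user_vec = np.random.rand(1, d).astype("float32")
scores, item_ids = index.search(user_vec, 10)  # approximate top-10 by inner product
print(item_ids[0])
```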
Summary
Matrix Completion
- Embed item IDs and user IDs, mapping them to vectors.
- Inner product $\langle \mathbf{a}_u, \mathbf{b}_i \rangle$ as user $u$'s estimated interest in item $i$.
- Fit to observed true interest scores to learn embedding layer parameters.
- Matrix completion has many drawbacks and performs poorly.
Online Retrieval
- Use user vector $\mathbf{a}$ as query; find the item $i$ that maximizes $\langle \mathbf{a}, \mathbf{b}_i \rangle$.
- Brute-force enumeration is too slow. In practice, use approximate nearest-neighbor search.
- Vector databases such as Milvus, Faiss, and HnswLib support approximate nearest-neighbor search.
Two-Tower Model: Architecture and Training
Two-Tower Model
Training the Two-Tower Model
- Pointwise: Treat each positive and negative sample independently; do simple binary classification.
- Pairwise: Each batch contains one positive sample and one negative sample.
- Listwise: Each batch contains one positive sample and multiple negative samples.
Positive and Negative Sample Selection
- Positive samples: Items the user clicked.
- Negative samples [1,2]:
- Items not retrieved?
- Items retrieved but filtered by pre-ranking or full ranking?
- Items exposed but not clicked?
Pointwise Training
- Frame retrieval as a binary classification task.
- For positive samples, encourage $\cos(\mathbf{a}, \mathbf{b})$ to approach $+1$.
- For negative samples, encourage $\cos(\mathbf{a}, \mathbf{b})$ to approach $-1$.
- Control the ratio of positive to negative samples at 1:2 or 1:3.
Pairwise Training
The two item towers share the same parameters.
Basic idea: Encourage $\cos(\mathbf{a}, \mathbf{b}^+)$ to be greater than $\cos(\mathbf{a}, \mathbf{b}^-)$.
- If $\cos(\mathbf{a}, \mathbf{b}^+)$ exceeds $\cos(\mathbf{a}, \mathbf{b}^-) + m$, there is no loss; $m$ is a hyperparameter (the margin).
- Otherwise, loss = $\cos(\mathbf{a}, \mathbf{b}^-) + m - \cos(\mathbf{a}, \mathbf{b}^+)$.
Loss Functions
Triplet hinge loss: $L(\mathbf{a}, \mathbf{b}^+, \mathbf{b}^-) = \max\{0,\; \cos(\mathbf{a}, \mathbf{b}^-) + m - \cos(\mathbf{a}, \mathbf{b}^+)\}$.
Triplet logistic loss: $L(\mathbf{a}, \mathbf{b}^+, \mathbf{b}^-) = \log\!\left(1 + \exp\big(\sigma \cdot (\cos(\mathbf{a}, \mathbf{b}^-) - \cos(\mathbf{a}, \mathbf{b}^+))\big)\right)$, where $\sigma$ is a hyperparameter.
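A small PyTorch sketch of the two pairwise losses, assuming the user and item towers have already produced the vectors $\mathbf{a}$, $\mathbf{b}^+$, $\mathbf{b}^-$; batching and hyperparameter values are illustrative.

```python
import torch
import torch.nn.functional as F

def triplet_hinge_loss(a, b_pos, b_neg, margin=1.0):
    """max{0, cos(a, b-) + m - cos(a, b+)} averaged over the batch."""
    cos_pos = F.cosine_similarity(a, b_pos, dim=-1)
    cos_neg = F.cosine_similarity(a, b_neg, dim=-1)
    return torch.clamp(cos_neg + margin - cos_pos, min=0.0).mean()

def triplet_logistic_loss(a, b_pos, b_neg, sigma=1.0):
    """log(1 + exp(sigma * (cos(a, b-) - cos(a, b+)))) averaged over the batch."""
    cos_pos = F.cosine_similarity(a, b_pos, dim=-1)
    cos_neg = F.cosine_similarity(a, b_neg, dim=-1)
    return torch.log1p(torch.exp(sigma * (cos_neg - cos_pos))).mean()

# Toy usage: batch of 8 users, each with one positive and one negative item vector.
a, b_pos, b_neg = torch.randn(8, 32), torch.randn(8, 32), torch.randn(8, 32)
print(triplet_hinge_loss(a, b_pos, b_neg), triplet_logistic_loss(a, b_pos, b_neg))
```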
Listwise Training
One data point contains:
- One user, with feature vector $\mathbf{a}$.
- One positive sample, with feature vector $\mathbf{b}^+$.
- Multiple negative samples, with feature vectors $\mathbf{b}_1^-, \ldots, \mathbf{b}_n^-$.
Encourage $\cos(\mathbf{a}, \mathbf{b}^+)$ to be as large as possible.
Encourage $\cos(\mathbf{a}, \mathbf{b}_1^-), \ldots, \cos(\mathbf{a}, \mathbf{b}_n^-)$ to be as small as possible.
Apply softmax to the $n + 1$ cosine similarities: encourage the positive sample's output $s^+$ to approach 1 and the negative samples' outputs $s_1^-, \ldots, s_n^-$ to approach 0.
The loss function is the cross-entropy loss.
Summary
Two-Tower Model
- User tower and item tower each output a single vector.
- Cosine similarity of the two vectors serves as the estimated interest score.
- Three training approaches:
  - Pointwise: Each batch uses one user and one item (positive or negative).
  - Pairwise: Each batch uses one user, one positive sample, and one negative sample.
  - Listwise: Each batch uses one user, one positive sample, and multiple negative samples.
Models Not Suitable for Retrieval: Early Fusion Models
The two-tower model uses late fusion for retrieval — its advantage is that item representations can be precomputed via the item tower. Each time a user arrives, their representation is computed via the user tower; fast nearest-neighbor methods then retrieve the top-k items closest to the user representation.
If early fusion is used instead, item representations cannot be precomputed. To retrieve k items, the user's features must be fused with each item's features and passed through the neural network to get an interest score — requiring processing every item, losing the advantage of fast nearest-neighbor methods.
Two-Tower Model: Positive and Negative Samples
Positive Samples
- Positive samples: (user, item) pairs that were exposed and then clicked. (The user is interested in the item.)
- Problem: A small fraction of items account for most clicks, causing positive samples to be mostly popular items.
- Solution: Oversample rare items, or downsample popular items.
  - Oversampling (up-sampling): a sample appears multiple times.
  - Downsampling (down-sampling): some samples are discarded.
How to Select Negative Samples
Negative sample selection criteria differ for retrieval, pre-ranking, full ranking, and re-ranking.
Simple Negative Samples
Simple Negative Samples: Full Item Pool
- Items not retrieved are very likely to be uninteresting to the user.
- Items not retrieved ≈ the full item pool.
- Sample from the full item pool as negative samples.
- Uniform or non-uniform sampling?
Uniform sampling: Unfair to rare items.
- Positive samples are mostly popular items.
- If negative samples are uniformly sampled, negative samples are mostly rare items.
Non-uniform sampling: Aimed at suppressing popular items.
- Negative sample probability proportional to popularity (click count).
- Negative-sample probability of item $i$: $p_i \propto (\text{click count of item } i)^{0.75}$; 0.75 is an empirical value.
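A tiny sketch of sampling negatives with probability proportional to click count raised to the power 0.75; the click counts are made up.

```python
import numpy as np

# Hypothetical click counts per item; sampling probability ∝ clicks ** 0.75.
clicks = np.array([10_000, 500, 20, 3], dtype=np.float64)
p = clicks ** 0.75
p /= p.sum()

rng = np.random.default_rng(0)
neg_samples = rng.choice(len(clicks), size=5, p=p)  # indices of sampled negative items
print(p.round(3), neg_samples)
```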
Simple Negative Samples: In-Batch Negative Samples
- A batch has $n$ positive samples (user–item pairs).
- One user paired with the other $n - 1$ items in the batch forms $n - 1$ negative samples.
- The batch therefore has $n(n-1)$ negative samples in total.
- These are all simple negative samples. (E.g., the first user is assumed not to like the second user's positive item.)
- The probability of an item appearing in the batch is $\propto$ its click count; more popular items have a higher probability of appearing.
- The probability of an item becoming a negative sample should ideally be $\propto (\text{click count})^{0.75}$, but here it is $\propto \text{click count}$.
- Popular items therefore have too high a probability of becoming negative samples.
- Sampling probability of item $i$: $p_i \propto$ click count of item $i$; more popular items have a higher probability.
- Estimated user interest in item $i$: $\cos(\mathbf{a}, \mathbf{b}_i)$.
- During training, adjust it to $\cos(\mathbf{a}, \mathbf{b}_i) - \log p_i$. This debiases the sampling and prevents over-suppression of popular items. Reasoning: $\cos(\mathbf{a}, \mathbf{b}_i) - \log p_i$ is used for training, and more popular items have a larger $\log p_i$ and hence a smaller adjusted value. When a popular item and a rare item have the same raw $\cos(\mathbf{a}, \mathbf{b}_i)$, the popular item's adjusted value is smaller, so after training the popular item's $\cos(\mathbf{a}, \mathbf{b}_i)$ ends up larger than the rare item's.
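A minimal sketch of listwise training with in-batch negatives and the $\log p_i$ correction, assuming the two towers output one vector per positive pair; the function name and toy inputs are illustrative.

```python
import torch
import torch.nn.functional as F

def in_batch_listwise_loss(user_vecs, item_vecs, log_p):
    """
    user_vecs, item_vecs: (n, d) tower outputs for n positive (user, item) pairs.
    log_p: (n,) log of each item's sampling probability (∝ its click count).
    Row i's positive is item i; the other n-1 items in the batch are its negatives.
    """
    a = F.normalize(user_vecs, dim=-1)
    b = F.normalize(item_vecs, dim=-1)
    logits = a @ b.t()                  # cos(a_i, b_j) for all pairs
    logits = logits - log_p             # debias: cos(a_i, b_j) - log p_j (per column)
    labels = torch.arange(len(a))       # the diagonal entries are the positives
    return F.cross_entropy(logits, labels)

# Toy usage (shapes only; real p_j would come from item click statistics).
n, d = 16, 32
loss = in_batch_listwise_loss(torch.randn(n, d), torch.randn(n, d),
                              torch.log(torch.rand(n) + 1e-6))
print(loss)
```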
Hard Negative Samples
Hard Negative Samples
- Items filtered out by pre-ranking (relatively hard).
- Items ranked near the bottom by full ranking (very hard).
Binary classification of positive and negative samples:
- Full item pool (easy): high classification accuracy.
- Items filtered by pre-ranking (relatively hard): more likely to be misclassified.
- Items ranked near the bottom by full ranking (very hard): even more likely to be misclassified.
Training Data
- Mix multiple types of negative samples.
- 50% of negative samples come from the full item pool (simple negative samples).
- 50% of negative samples come from items that did not pass ranking (hard negative samples).
Common Mistakes
Items exposed but not clicked should not be used as negative samples for retrieval model training.
They can be used as negative samples for ranking model training.
Principles for Negative Sample Selection
Retrieval goal: Quickly find items the user may be interested in.
- Full item pool (easy): Vast majority are completely uninteresting to the user.
- Filtered by ranking (hard): User may be interested, but not enough.
- Exposed but not clicked (unusable): the user may well be interested; not clicking may simply be chance.
- Can be used as negative samples for ranking; cannot be used as negative samples for retrieval.
Summary
- Positive samples: exposed and then clicked.
- Simple negative samples:
  - Full item pool.
  - In-batch negative samples.
- Hard negative samples: retrieved but filtered out by ranking.
- Error: using exposed-but-not-clicked items as negative samples for retrieval.
Two-Tower Model: Online Retrieval and Updates
Online Retrieval
Two-Tower Model Retrieval
Offline storage: Store item vectors in a vector database.
- After training, use the item tower to compute the feature vector $\mathbf{b}_i$ for every item $i$.
- Store hundreds of millions of item vectors in a vector database (e.g., Milvus, Faiss, HnswLib).
- Vector database builds an index to accelerate nearest-neighbor search.
Online retrieval: Find the top-k items the user is most interested in.
- Given the user ID and profile, compute the user vector $\mathbf{a}$ online using the user-tower neural network.
- Nearest-neighbor search:
  - Use vector $\mathbf{a}$ as the query; call the vector database for nearest-neighbor search.
  - Return the items with the highest cosine similarity as retrieval results.
Why store item vectors offline but compute the user vector online?
- Each retrieval uses only one user vector $\mathbf{a}$ but hundreds of millions of item vectors $\mathbf{b}_i$. (Computing item vectors online is too costly.)
- User interests change dynamically, while item features are relatively stable. (Storing user vectors offline is possible, but it hurts recommendation quality.)
Model Updates
Full Update vs. Incremental Update
Full update: At midnight tonight, train the model using all of yesterday's data.
- Start from yesterday's model parameters (not random initialization).
- Train for 1 epoch on yesterday's data — each day's data is used only once.
- Release new user tower neural network and item vectors for online retrieval.
- Full update has lower requirements on data pipelines and systems.
Incremental update: Apply online learning to update model parameters.
- User interests change in real time.
- Collect online data in real time; apply stream processing to generate TFRecord files.
- Apply online learning to incrementally update ID Embedding parameters.
- Release updated user ID embeddings for the user tower to compute user vectors online.
Question: Can we only do incremental updates and skip full updates?
- User behavior differs across time periods within a day; hourly data is biased; minute-level data is even more biased.
- Full update: randomly shuffle one full day of data; train for 1 epoch.
- Incremental update: train for 1 epoch in chronological order.
- Random shuffling is better than sequential ordering; full training is better than incremental training.
Summary
Two-Tower Model
- User tower and item tower each output a vector; cosine similarity of the two vectors estimates interest.
- Three training approaches: pointwise, pairwise, listwise.
- Positive samples: items the user clicked.
- Negative samples: the full item pool (simple), items filtered out by ranking (hard).
Retrieval
- After training, store the item vectors $\mathbf{b}_i$ in a vector database for online nearest-neighbor search.
- Online retrieval: given the user ID and profile, call the user tower to compute the user vector $\mathbf{a}$ in real time.
- Use $\mathbf{a}$ as the query; search the vector database; find the item vectors with the highest cosine similarity; return their item IDs.
Model Updates
- Full update: at midnight, use all of yesterday's data to train the full neural network for 1 epoch of SGD.
- Incremental update: use real-time data to train the neural network; only update the ID embeddings; freeze the fully connected layers.
- Production systems:
  - Combine full updates and incremental updates.
  - Publish the latest user ID embeddings every few tens of minutes for the user tower to compute user vectors online.
Two-Tower Model + Self-Supervised Learning
Two-Tower Model Problem
- Recommender systems exhibit strong head effects:
- A small fraction of items account for most clicks.
- Most items have low click counts.
- High-click items learn good representations; long-tail items learn poor representations.
- Self-supervised learning: apply data augmentation to better learn long-tail item vectors.
Review of Two-Tower Model
In-Batch Negative Samples
- A batch has $n$ positive sample pairs.
- Form $n$ lists; each list has 1 positive pair and $n - 1$ negative pairs.
Listwise Training
- A batch contains $n$ positive sample pairs (user–item pairs with clicks): $(\mathbf{a}_1, \mathbf{b}_1), \ldots, (\mathbf{a}_n, \mathbf{b}_n)$.
- Negative samples: $(\mathbf{a}_i, \mathbf{b}_j)$ for all $i \neq j$.
- Encourage $\cos(\mathbf{a}_i, \mathbf{b}_i)$ to be as large as possible; encourage $\cos(\mathbf{a}_i, \mathbf{b}_j)$, $i \neq j$, to be as small as possible.
Loss Function
Loss for the $i$-th pair: $L_{\text{main}}[i] = -\log \dfrac{\exp(\cos(\mathbf{a}_i, \mathbf{b}_i))}{\sum_{j=1}^{n} \exp(\cos(\mathbf{a}_i, \mathbf{b}_j))}$.
Debiasing
- Sampling probability of item $i$: $p_i \propto$ click count of item $i$.
- Estimated interest of user $i$ in item $j$: $\cos(\mathbf{a}_i, \mathbf{b}_j)$.
- During training, replace it with $\cos(\mathbf{a}_i, \mathbf{b}_j) - \log p_j$.
Training the Two-Tower Model
- Randomly sample $n$ (user, item) pairs from click data to form a batch.
- Two-tower model loss function: $L_{\text{main}} = \dfrac{1}{n} \sum_{i=1}^{n} L_{\text{main}}[i]$.
- Apply gradient descent to minimize the loss $L_{\text{main}}$.
Self-Supervised Learning
- The two vector representations $\mathbf{b}_i'$ and $\mathbf{b}_i''$ of item $i$ should have high similarity.
- The vector representations $\mathbf{b}_i'$ and $\mathbf{b}_j''$ of different items $i$ and $j$ should have low similarity.
- Encourage $\cos(\mathbf{b}_i', \mathbf{b}_i'')$ to be as large as possible; encourage $\cos(\mathbf{b}_i', \mathbf{b}_j'')$ to be as small as possible.
Feature transformation converts an item's feature vector into another vector.
For example, an item's feature vector might be (audience gender, category, city, occupation) = (0, 5, 1.5, 9). Feature transformation converts (0, 5, 1.5, 9) into, say, (0.9, 7.3, 10.8, 9.6) (a simplified example).
Feature Transformation: Random Mask
- Randomly select some discrete features (e.g., category) and mask them.
- Example:
- An item's category feature is, e.g., $\mathcal{U} = \{\text{electronics}, \text{photography}\}$.
- After masking, the category feature becomes $\mathcal{U}' = \{\text{default}\}$, where default represents the missing value.
- Masking means discarding all values of a feature.
- For example, if electronics = 1 and photography = 2, with missing default = 0, the pre-mask category value might be 1.5; after masking, it becomes 0.
- Random mask does not mask all features — only some are randomly masked.
Feature Transformation: Dropout (applies only to multi-valued discrete features)
- An item can belong to multiple categories; category is a multi-valued discrete feature.
- Dropout: randomly discard 50% of feature values.
- Example:
- An item's category feature is, e.g., $\mathcal{U} = \{\text{beauty}, \text{photography}\}$.
- After dropout, the category feature becomes $\mathcal{U}'' = \{\text{beauty}\}$.
- For example, if beauty = 1 and photography = 2, the pre-dropout category value might be 1.5; post-dropout it becomes 1.
Feature Transformation: Complementary Features
- Suppose an item has 4 features: ID, category, keywords, city.
- Randomly split them into two groups, e.g., {ID, keywords} and {category, city}.
- {ID, default, keywords, default} → item representation $\mathbf{b}_i'$.
- {default, category, default, city} → item representation $\mathbf{b}_i''$.
- Since they represent the same item, encourage the two item representations to be similar.
Feature Transformation: Mask a Group of Correlated Features
-
A group of correlated features:
-
Audience gender:
-
Category:
-
and co-occur with high probability .
-
and co-occur with low probability .
-
-
: probability of a feature value being .
-
: probability that one feature is and another is simultaneously.
-
Offline: compute pairwise feature correlations using mutual information (MI): (e.g., computing MI between category and audience gender)
-
Suppose an item has features. Offline: compute pairwise MI for all features, producing a matrix.
-
Randomly select one feature as seed; find the features most correlated with the seed.
-
Mask the seed and its correlated features; keep the remaining features.
-
Example: An item has four features . The seed is keywords, and the most correlated feature is city. After masking: .
-
Pros and cons:
- Pro: Outperforms random mask, dropout, and complementary features.
- Con: Complex method, difficult to implement and maintain.
Training the Model
- Uniformly sample $m$ items from all items to form a batch.
- Apply two types of feature transformations; the item tower outputs two sets of vectors: $\mathbf{b}_1', \ldots, \mathbf{b}_m'$ and $\mathbf{b}_1'', \ldots, \mathbf{b}_m''$.
- Loss function for the $i$-th item: $L_{\text{self}}[i] = -\log \dfrac{\exp(\cos(\mathbf{b}_i', \mathbf{b}_i''))}{\sum_{j=1}^{m} \exp(\cos(\mathbf{b}_i', \mathbf{b}_j''))}$.
- Self-supervised learning loss function: $L_{\text{self}} = \dfrac{1}{m} \sum_{i=1}^{m} L_{\text{self}}[i]$.
- Apply gradient descent to minimize the self-supervised learning loss $L_{\text{self}}$.
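A minimal sketch of the self-supervised loss above, assuming the item tower has already produced the two transformed views of the same batch of items; shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def self_supervised_loss(b_prime, b_double_prime):
    """
    b_prime, b_double_prime: (m, d) item-tower outputs for the SAME m items under
    two different feature transformations (mask, dropout, complementary, ...).
    Pulls cos(b_i', b_i'') up and pushes cos(b_i', b_j''), i != j, down.
    """
    x = F.normalize(b_prime, dim=-1)
    y = F.normalize(b_double_prime, dim=-1)
    logits = x @ y.t()                       # cosine similarities between all pairs
    labels = torch.arange(len(x))            # item i should match its own second view
    return F.cross_entropy(logits, labels)   # mean of L_self[i] over the batch

# Toy usage; in the full objective this is added to the main loss with a weight alpha.
m, d = 8, 32
print(self_supervised_loss(torch.randn(m, d), torch.randn(m, d)))
```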
Summary
- The two-tower model learns poor vector representations for low-exposure items.
- Self-supervised learning:
  - Apply random feature transformations to items.
  - Feature vectors $\mathbf{b}_i'$ and $\mathbf{b}_i''$ have high similarity (same item).
  - Feature vectors $\mathbf{b}_i'$ and $\mathbf{b}_j''$ have low similarity (different items).
- Experimental results: recommendations for low-exposure items and new items become more accurate.
Training:
- Randomly sample $n$ (user, item) pairs from click data to form a batch.
- Uniformly sample $m$ items from all items to form a batch.
- Apply gradient descent to minimize the loss: $\dfrac{1}{n} \sum_{i=1}^{n} L_{\text{main}}[i] + \alpha \cdot \dfrac{1}{m} \sum_{j=1}^{m} L_{\text{self}}[j]$.
Deep Retrieval
- The classical two-tower model represents users and items as vectors and performs online nearest-neighbor search.
- Deep Retrieval represents items as paths; online search finds the paths that best match the user.
- Deep Retrieval is similar to Alibaba's TDM.
Outline
- Indexes:
  - path → List⟨item⟩
  - item → List⟨path⟩
- Estimation model: a neural network estimates the user's interest in a path.
- Online retrieval: user → path → item.
- Training:
  - Learn the neural network parameters.
  - Learn the item representations (item → path).
Indexes
Items Represented as Paths
- Depth: depth = 3 (number of layers).
- Width: width = $K$ (number of nodes per layer).
- An item is represented as a path, e.g., $[a, b, c]$, with one node per layer.
- An item can also be represented as multiple paths, e.g., $\{[a, b, c], [a', b', c']\}$.
Item-to-Path Index
Index: item → List⟨path⟩
- One item corresponds to multiple paths.
- A path is represented by 3 nodes: path = $[a, b, c]$.
Index: path → List⟨item⟩
- One path corresponds to multiple items.
Estimation Model
Estimating User Interest in a Path
- A path is represented by 3 nodes: path = $[a, b, c]$.
- Given user features $\mathbf{x}$, estimate the user's interest in node $a$: $p_1(a \mid \mathbf{x})$.
- Given $\mathbf{x}$ and $a$, estimate the user's interest in node $b$: $p_2(b \mid a; \mathbf{x})$.
- Given $\mathbf{x}$, $a$, and $b$, estimate the user's interest in node $c$: $p_3(c \mid a, b; \mathbf{x})$.
- Estimated user interest in path = $[a, b, c]$: $p(a, b, c \mid \mathbf{x}) = p_1(a \mid \mathbf{x}) \cdot p_2(b \mid a; \mathbf{x}) \cdot p_3(c \mid a, b; \mathbf{x})$.
The user feature vector $\mathbf{x}$ is passed through a neural network and then a softmax activation to output $p_1(\cdot \mid \mathbf{x})$. It holds the interest scores the network assigns to the nodes at layer L1; the higher the score, the more likely a node is selected. Since L1 has $K$ nodes, $p_1(\cdot \mid \mathbf{x})$ is a $K$-dimensional vector, and node $a$ is selected from L1's nodes based on it.
Node $a$ is embedded to get emb($a$). The original user feature vector $\mathbf{x}$ is concatenated with emb($a$) and fed into another neural network, followed by a softmax layer, to output $p_2(\cdot \mid a; \mathbf{x})$, the interest scores for the $K$ nodes at layer L2. Node $b$ is selected from L2.
Node $c$ is derived analogously.
Online Retrieval
Retrieval: user → path → item
- Step 1: Given user features, use beam search to retrieve a set of paths.
- Step 2: Use the index "path → List⟨item⟩" to retrieve a set of items.
- Step 3: Score and rank the items; select a subset.
Beam Search
- Suppose there are 3 layers, each with $K$ nodes; there are $K^3$ paths in total.
- Scoring all $K^3$ paths with the neural network is too expensive.
- Beam search reduces the computation.
- It requires setting the hyperparameter beam size (4 in the example below).
Select the top 4 nodes $a$ at layer L1 based on the neural network's scores $p_1(a \mid \mathbf{x})$ for the $K$ nodes.
For each selected node $a$ and every node $b$ at layer L2, compute the user's interest in the partial path $[a, b]$: $p_1(a \mid \mathbf{x}) \cdot p_2(b \mid a; \mathbf{x})$.
This gives $4K$ scores, one for each partial path; select the top 4 partial paths.
For each selected pair of nodes $[a, b]$ and every node $c$ at layer L3, compute the user's interest in the path $[a, b, c]$: $p_1(a \mid \mathbf{x}) \cdot p_2(b \mid a; \mathbf{x}) \cdot p_3(c \mid a, b; \mathbf{x})$.
Again compute $4K$ scores; select the top 4 paths.
Beam Search
- User interest in path = $[a, b, c]$: $p(a, b, c \mid \mathbf{x}) = p_1(a \mid \mathbf{x}) \cdot p_2(b \mid a; \mathbf{x}) \cdot p_3(c \mid a, b; \mathbf{x})$.
- Optimal path: $\arg\max_{[a, b, c]} p(a, b, c \mid \mathbf{x})$.
- The path selected by the greedy algorithm (beam size = 1) is not necessarily optimal.
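A minimal beam-search sketch over the 3-layer path tree; the stub `score_fn` stands in for the neural network's $p_1$, $p_2$, $p_3$ outputs, and the values of $K$, the beam size, and the scoring stub are illustrative.

```python
# Beam search over paths [a, b, c] in a 3-layer, width-K tree.
K = 8
BEAM = 4

def score_fn(prefix, node):
    # Stub standing in for the neural network's softmax output for extending
    # `prefix` with `node`, i.e. p1(a|x), p2(b|a;x), p3(c|a,b;x).
    return 1.0 / (1 + node + sum(prefix))

def beam_search(num_layers=3, width=K, beam=BEAM):
    beams = [((), 1.0)]                              # (path prefix, accumulated score)
    for _ in range(num_layers):
        candidates = []
        for prefix, score in beams:
            for node in range(width):                # extend each prefix by every node
                candidates.append((prefix + (node,), score * score_fn(prefix, node)))
        # keep only the `beam` best prefixes; greedy search is the special case beam=1
        beams = sorted(candidates, key=lambda t: -t[1])[:beam]
    return beams                                     # top paths [a, b, c] with scores

for path, score in beam_search():
    print(path, round(score, 4))
```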
Online Retrieval
- Step 1: Given user features, use the neural network and beam search to retrieve a set of paths.
- Step 2: Use indexes to retrieve a set of items.
  - Look up the index path → List⟨item⟩.
  - Each path corresponds to multiple items.
- Step 3: Rank the items; select a subset.
Training
Joint Learning of Neural Network Parameters and Item Representations
- The neural network $p(a, b, c \mid \mathbf{x})$ estimates the user's interest in path $[a, b, c]$.
- Represent each item as multiple paths $\{\text{path}_1, \ldots, \text{path}_J\}$; build indexes:
  - item → List⟨path⟩,
  - path → List⟨item⟩.
- Positive samples: (user, item) pairs where the user clicked the item.
Learning Neural Network Parameters
- An item is represented by $J$ paths: $\Pi = \{\text{path}_1, \ldots, \text{path}_J\}$.
- The user's interest in $\text{path}_j$: $p(\text{path}_j \mid \mathbf{x})$.
- If the user clicked the item, they are assumed to be interested in all $J$ of its paths.
- We should therefore maximize $\sum_{j=1}^{J} p(\text{path}_j \mid \mathbf{x})$.
- Loss function: $\text{loss}(\text{item}) = -\log\Big( \sum_{j=1}^{J} p(\text{path}_j \mid \mathbf{x}) \Big)$; the larger the total interest in the item's paths, the smaller the loss.
Learning Item Representations
- The user's interest in path = $[a, b, c]$: $p(\text{path} \mid \text{user}) = p(a, b, c \mid \mathbf{x})$.
- Relevance of an item and a path: $\text{score}(\text{item}, \text{path}) = \sum_{\text{user}} p(\text{path} \mid \text{user}) \times \text{click}(\text{user}, \text{item})$,
  where $\text{click}(\text{user}, \text{item}) = 1$ if the user clicked the item, and 0 otherwise.
- Select the $J$ paths with the highest $\text{score}(\text{item}, \text{path})$ as the item's representation.
- Select $\Pi = \{\text{path}_1, \ldots, \text{path}_J\}$ as the item's representation.
- Loss function (select paths highly relevant to the item): $\text{loss}(\text{item}, \Pi) = -\log\Big( \sum_{j=1}^{J} \text{score}(\text{item}, \text{path}_j) \Big)$.
- Regularization term (to avoid too many items clustering on one path): $\text{reg}(\text{path}_j)$ grows with the number of items already represented by $\text{path}_j$.
Greedy Algorithm to Update Paths
- Suppose the item is currently represented by $J$ paths $\Pi = \{\text{path}_1, \ldots, \text{path}_J\}$. Now update one path $\text{path}_l$ in $\Pi$.
- Fix the other $J - 1$ paths and select a new $\text{path}_l$ from the unselected paths: $\text{path}_l \leftarrow \arg\min_{\text{path}_l} \; \text{loss}(\text{item}, \Pi) + \alpha \cdot \text{reg}(\text{path}_l)$. The candidates are not limited to the item's current paths; the selection pool can be the paths with the highest $\text{score}(\text{item}, \text{path})$. Here $\text{loss}(\text{item}, \Pi)$ is the loss of the path set consisting of $\text{path}_l$ and the $J - 1$ fixed paths, and $\text{reg}(\text{path}_l)$ prevents too many items from piling onto one path.
- The selected path has a high $\text{score}(\text{item}, \text{path})$ without too many items already on it.
Comparison
Updating the Neural Network
- The neural network predicts the user's interest in a path: $p(\text{path} \mid \mathbf{x})$.
- Training data required:
  - The "item → path" index.
  - The items clicked by the user.
- If the user clicked an item, and the item corresponds to a path, update the neural network parameters to increase $p(\text{path} \mid \mathbf{x})$.
Updating Item Representations
- Assess item–path relevance: $\text{score}(\text{item}, \text{path}) = \sum_{\text{user}} p(\text{path} \mid \text{user}) \times \text{click}(\text{user}, \text{item})$.
- Associate each item with $J$ paths:
  - Item and path must have high relevance.
  - A path cannot have too many items.
Summary
Retrieval: User → Path → Item
- Given user features $\mathbf{x}$, use the neural network to estimate the user's interest in path $[a, b, c]$; the score is $p(a, b, c \mid \mathbf{x})$.
- Use beam search to find the paths with the highest scores $p(a, b, c \mid \mathbf{x})$.
- Use the index "path → List⟨item⟩" to retrieve the items on each of these paths.
- Score and rank the retrieved items; return the top-scoring items.
Training: Joint Learning of User–Path and Item–Path Relationships
- An item is represented by $J$ paths: $\text{path}_1, \ldots, \text{path}_J$.
- If the user clicked the item, update the neural network parameters to increase $\sum_{j=1}^{J} p(\text{path}_j \mid \mathbf{x})$.
- If the user's interest score $p(\text{path} \mid \text{user})$ is high and the user clicked the item, then the path and the item are relevant.
- Find the $J$ paths most relevant to the item, while avoiding too many items on any one path.
Other Retrieval Channels
Geo-Location Based Retrieval
GeoHash Retrieval
- Users may be interested in nearby events.
- GeoHash: encoding of latitude/longitude; represents a rectangular geographic area.
- Index: GeoHash → list of quality posts (sorted by time, descending).
- This retrieval channel has no personalization.
Same-City Retrieval
- Users may be interested in events in the same city.
- Index: city → list of quality posts (sorted by time, descending).
- This retrieval channel has no personalization.
Author-Based Retrieval
Followed Author Retrieval
-
Users are interested in posts published by authors they follow.
-
Index:
user → followed authors
author → published posts -
Retrieval:
user → followed authors → most recent posts
Interacted Author Retrieval
- If the user is interested in a post (like, favorite, share), they may be interested in other posts by that author.
- Index: user → interacted authors
- Retrieval: user → interacted authors → most recent posts
Similar Author Retrieval
- If a user likes a certain author, they also like similar authors.
- Index: author → similar authors ($k$ authors)
- Retrieval: user → interested authors ($n$ authors) → similar authors ($n \cdot k$ authors) → most recent posts ($n \cdot k$ posts)
Cache-Based Retrieval
Cache-Based Retrieval
Idea: reuse full-ranking results from previous recommendation requests.
- Background:
  - Full ranking outputs hundreds of posts and sends them to re-ranking.
  - Re-ranking applies diversity sampling and selects a few dozen.
  - More than half of the full-ranking results are never exposed; they are wasted.
- Cache posts ranked in the top 50 by full ranking that were not exposed; use them as a retrieval channel.
Cache has a fixed size and requires an eviction mechanism.
- Once a post is successfully exposed, it is evicted from the cache.
- If the cache exceeds capacity, remove the oldest-entered post.
- A post can be retrieved at most 10 times; it is evicted after 10 retrievals.
- Each post is stored for at most 3 days; evicted after 3 days.
Exposure Filtering & Bloom Filter
Exposure Filtering Problem
- If a user has already seen an item, do not expose it to them again.
- For each user, record the items already exposed to them. (Xiaohongshu only retrieves posts published within the past month, so only the past month's exposure history needs to be recorded per user.)
- For each retrieved item, determine whether it has already been exposed to that user; exclude previously exposed items.
- If a user has seen $n$ items and $r$ items are retrieved, brute-force comparison requires $O(n \cdot r)$ time.
Bloom Filter
- A Bloom filter determines whether an item ID is in the set of already-exposed items.
- If the answer is no, the item is definitely not in the set.
- If the answer is yes, the item is very likely in the set. (There may be false positives: a non-exposed item is incorrectly judged as exposed and filtered out.)
- A Bloom filter represents an item set as an $m$-dimensional binary vector.
- Each user's set of exposed items is represented as such a binary vector, requiring $m$ bits of storage.
- The Bloom filter has $k$ hash functions; each maps an item ID to an integer between $0$ and $m - 1$.
Bloom Filter
- The exposed item set size is $n$; the binary vector dimension is $m$; the filter uses $k$ hash functions.
- Bloom filter false positive rate: $\delta \approx \left(1 - e^{-kn/m}\right)^{k}$.
- Larger $n$ means more 1s in the vector and a higher false positive rate. (The probability that an unexposed item's $k$ hash positions are all 1 increases.)
- Larger $m$ means a longer vector, so hash collisions are less likely (at the cost of more storage).
- $k$ too large or too small is suboptimal; $k$ has an optimal value.
- Given a tolerable false positive rate $\delta$, the optimal parameters are: $k = 1.44 \cdot \ln(1/\delta)$ and $m = 2n \cdot \ln(1/\delta)$.
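A minimal Bloom filter sketch using the optimal-parameter formulas above; the hashing scheme (salted MD5) and the class interface are illustrative, and a production system would typically use a library implementation.

```python
import hashlib
import math

class BloomFilter:
    """A minimal Bloom filter sketch for exposure filtering (illustrative only)."""

    def __init__(self, n_expected, fp_rate):
        # Optimal parameters from the formulas above:
        # k = 1.44 * ln(1/delta), m = 2 * n * ln(1/delta).
        self.k = max(1, round(1.44 * math.log(1 / fp_rate)))
        self.m = max(1, round(2 * n_expected * math.log(1 / fp_rate)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item_id):
        # k salted hashes, each mapping the item ID to a position in [0, m-1].
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{item_id}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item_id):             # mark an exposed item
        for pos in self._positions(item_id):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def maybe_contains(self, item_id):  # False => definitely not exposed yet
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item_id))

bf = BloomFilter(n_expected=10_000, fp_rate=0.01)
bf.add("note_123")
print(bf.maybe_contains("note_123"), bf.maybe_contains("note_999"))  # True, (almost surely) False
```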
Bloom Filter Drawbacks
- A Bloom filter represents an item set as a binary vector.
- Adding an item to the set only requires setting the item's $k$ hash positions in the vector to 1. (If a position is already 1, it stays 1.)
- A Bloom filter only supports adding items; it does not support deleting them. Removing an item from the set cannot undo its effect on the vector.
- Every day, items older than 1 month need to be removed from the item set. (Expired items can never be retrieved, so there is no need to keep them in the Bloom filter; reducing $n$ lowers the false positive rate.)