
Embeddings - Useful or Hype?

Amit Choudhary, Co-founder & CEO
Mentee : What are vector embeddings?
Mentor : Vector embeddings, often just called "embeddings", are a type of representation for categorical data where each category (or item) is represented as a point in a continuous vector space. The goal is to place similar items close together and dissimilar items far apart in this space. Each item gets represented by a vector, which can be thought of as a list of numbers.

Cricket Batsmen Example :

Imagine we have a list of famous cricket batsmen: Sachin Tendulkar, Brian Lara, Ricky Ponting, AB de Villiers, Virat Kohli, and Babar Azam.


Now, let's say we want to represent each of these batsmen in a 2-dimensional vector space based on their playing style. Here's a hypothetical situation:

  1. Sachin Tendulkar and Virat Kohli have a similar playing style. So, they should be close in the vector space.

  2. Brian Lara and Ricky Ponting, being aggressive players, should also be close to each other but distant from Tendulkar and Kohli.

  3. AB de Villiers, known for his innovative shots, might be somewhere different.

  4. Babar Azam, being a technically sound player like Tendulkar and Kohli, should be closer to them.

Embedding Representation :

Let's represent each batsman with a 2D vector, as shown in the code snippet below:

import matplotlib.pyplot as plt

# Batsmen and their hypothetical vector embeddings
batsmen = ["Sachin Tendulkar", "Brian Lara", "Ricky Ponting", "AB de Villiers", "Virat Kohli", "Babar Azam"]
vectors = [(1, 2), (4, 5), (5, 4), (3, 3), (2, 1), (1.5, 2.5)]

# Extract x and y coordinates
x, y = zip(*vectors)

plt.figure(figsize=(10, 7))
plt.scatter(x, y, color='blue', s=100)
for i, name in enumerate(batsmen):
    plt.annotate(name, (x[i] + 0.1, y[i]), fontsize=10)

plt.xlabel('Style Dimension 1')
plt.ylabel('Style Dimension 2')
plt.title('Hypothetical Vector Embeddings of Cricket Batsmen')

# Render the figure (needed when running as a script rather than in a notebook)
plt.show()

Let's check out the result :

[Image: batsmen.png]

Here's the visual representation of our hypothetical vector embeddings for the cricket batsmen based on their playing styles. As you can see:

  • Sachin Tendulkar and Virat Kohli are close to each other, representing their similarity in playing style.

  • Brian Lara and Ricky Ponting are also near each other, indicating their aggressive playing styles.

  • AB de Villiers is somewhat in the middle, signifying his unique playing style.

  • Babar Azam is closer to Tendulkar and Kohli, hinting at his technical prowess.

Actual Creation of Vector Embeddings :

The above example was a simplification. In reality, embeddings are generated using algorithms that process vast amounts of data. For instance, in the context of words, Word2Vec trains a shallow neural network to predict a word from its neighbours (or the neighbours from the word), while GloVe fits vectors to global word co-occurrence statistics. For cricket batsmen, one could use match statistics, expert reviews, or other data sources to create embeddings that capture the nuances of each player's style and performance.

The idea is that the position of each point (batsman) in the vector space captures some semantic meaning about that batsman's playing style or abilities.
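To make the co-occurrence idea concrete, here is a deliberately tiny sketch. It is not Word2Vec or GloVe; it only counts which words appear near each other in three made-up commentary sentences and treats each word's row of counts as a crude vector. The corpus, the window size of 2, and the helper names are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

# Made-up commentary lines (illustrative only, not real data)
corpus = [
    "kohli plays a cover drive",
    "tendulkar plays a straight drive",
    "ponting plays a pull shot",
]

def cooccurrence_vectors(sentences, window=2):
    """Count, for every word, the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][words[j]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

vectors = cooccurrence_vectors(corpus)
sim_players = cosine(vectors["kohli"], vectors["tendulkar"])
sim_mixed = cosine(vectors["kohli"], vectors["shot"])
print(f"kohli vs tendulkar: {sim_players:.2f}")
print(f"kohli vs shot:      {sim_mixed:.2f}")
```

The two player names share the same neighbours ("plays", "a"), so their count vectors point in the same direction, while "kohli" and "shot" overlap less. Real algorithms learn dense, low-dimensional vectors from far larger corpora, but the intuition is the same: words used in similar contexts end up close together.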

Why are Embeddings Useful?

Embeddings can convert categorical data (like batsman names) into numerical vectors, which can then be fed into machine learning algorithms. This allows models to recognize patterns based on the relationships between the data points in the embedding space. For example, a model could identify that players similar to Tendulkar might have specific strengths or weaknesses, based on the patterns observed in the data for players close to his position in the embedding space.

Mentee : Once we create these embeddings, how do we use them? Do they have any practical applications?
Mentor : Vector embeddings have a wide range of practical applications, especially in the field of machine learning and data analysis. Here are some ways you can use embeddings :

1. Similarity & Recommendation :

Embeddings can be used to measure similarity between items. If two vectors are close in the embedding space, they are considered similar. This is fundamental in recommendation systems.

  • Example: If a user likes a particular batsman (say, Sachin Tendulkar), you could recommend other batsmen who are close to Sachin in the embedding space (like Virat Kohli, based on our previous example).
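A minimal sketch of this idea, reusing the hypothetical 2D vectors from the plotting example: it ranks the other batsmen by Euclidean distance to the one the user likes. The function name `recommend` and the choice of Euclidean distance (rather than, say, cosine similarity) are assumptions for illustration.

```python
import math

# Hypothetical 2-D embeddings, same values as in the plotting example
embeddings = {
    "Sachin Tendulkar": (1, 2),
    "Brian Lara": (4, 5),
    "Ricky Ponting": (5, 4),
    "AB de Villiers": (3, 3),
    "Virat Kohli": (2, 1),
    "Babar Azam": (1.5, 2.5),
}

def recommend(liked, k=2):
    """Return the k batsmen closest to `liked` by Euclidean distance."""
    target = embeddings[liked]
    others = [name for name in embeddings if name != liked]
    others.sort(key=lambda name: math.dist(target, embeddings[name]))
    return others[:k]

print(recommend("Sachin Tendulkar"))  # ['Babar Azam', 'Virat Kohli']
```

With these made-up coordinates, Babar Azam is slightly nearer to Tendulkar than Kohli is, so he comes first in the ranking; both are far closer than Lara or Ponting.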

2. Data Visualization :

By reducing embeddings to 2D or 3D using techniques like t-SNE or PCA, you can visualize complex datasets and infer relationships.
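As a rough from-scratch sketch of PCA, the snippet below projects 3D vectors down to 2D using power iteration to find the top two principal directions. The third coordinate of each vector is invented purely so there is something to reduce, and the helper names are hypothetical. In practice you would more likely reach for a library implementation; this version only aims to show what the reduction does.

```python
import math

# Hypothetical 3-D embeddings: the first two values match the earlier plot,
# the third is a made-up extra "style dimension"
embeddings = {
    "Sachin Tendulkar": [1.0, 2.0, 1.0],
    "Brian Lara": [4.0, 5.0, 2.0],
    "Ricky Ponting": [5.0, 4.0, 2.0],
    "AB de Villiers": [3.0, 3.0, 5.0],
    "Virat Kohli": [2.0, 1.0, 1.0],
    "Babar Azam": [1.5, 2.5, 1.0],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mat_vec(m, v):
    return [dot(row, v) for row in m]

def top_eigenvector(m, iterations=500):
    """Power iteration: repeatedly multiply by the matrix and renormalise."""
    v = [1.0] + [0.5] * (len(m) - 1)  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = mat_vec(m, v)
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return v

def pca_2d(rows):
    """Project rows onto their first two principal components."""
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - mu for x, mu in zip(row, means)] for row in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    pc1 = top_eigenvector(cov)
    lam1 = dot(pc1, mat_vec(cov, pc1))
    # Deflation: remove the first component so power iteration finds the second
    deflated = [[cov[i][j] - lam1 * pc1[i] * pc1[j] for j in range(d)]
                for i in range(d)]
    pc2 = top_eigenvector(deflated)
    projected = [[dot(r, pc1), dot(r, pc2)] for r in centered]
    return projected, pc1, pc2

points_2d, pc1, pc2 = pca_2d(list(embeddings.values()))
for name, (px, py) in zip(embeddings, points_2d):
    print(f"{name:18s} {px:6.2f} {py:6.2f}")
```

The resulting 2D points can be passed straight to a scatter plot. The signs of the components are arbitrary, so the picture may appear mirrored from one implementation to another without changing which players are close together.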

3. Input to Machine Learning Models :

Instead of using one-hot encoding or label encoding for categorical data, embeddings can be used as input features to machine learning models. This is especially popular in deep learning.

  • Example: Predicting a batsman's score in the next match based on his embedding, along with other features.
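The contrast with one-hot encoding can be sketched in a few lines. Here the embedding table reuses the hypothetical 2D vectors from earlier, and `recent_average` is a made-up extra feature standing in for "other features"; neither comes from real data.

```python
batsmen = ["Sachin Tendulkar", "Brian Lara", "Ricky Ponting",
           "AB de Villiers", "Virat Kohli", "Babar Azam"]

# Hypothetical embedding table (same values as the plotting example)
embedding_table = dict(zip(
    batsmen,
    [(1, 2), (4, 5), (5, 4), (3, 3), (2, 1), (1.5, 2.5)],
))

def one_hot(name):
    """One slot per batsman: the vector grows with the number of categories."""
    return [1 if b == name else 0 for b in batsmen]

def model_features(name, recent_average):
    """Dense input row: the batsman's embedding plus another numeric feature."""
    return list(embedding_table[name]) + [recent_average]

print(one_hot("Virat Kohli"))               # [0, 0, 0, 0, 1, 0]
print(model_features("Virat Kohli", 52.3))  # [2, 1, 52.3]
```

The one-hot vector says nothing about how Kohli relates to anyone else, whereas the embedding row places him near similar players, which is the information a downstream model can exploit.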

4. Clustering :

Embeddings can be used to group similar items together. You can run clustering algorithms (like K-means) on the embeddings to segment your data.

  • Example: Cluster batsmen based on their playing style or performance.
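Below is a small, dependency-free sketch of K-means on the hypothetical 2D vectors. The number of clusters (3) and the starting centroids are assumptions picked by hand so the run is deterministic; real implementations choose initial centroids more carefully and usually run several restarts.

```python
import math

names = ["Sachin Tendulkar", "Brian Lara", "Ricky Ponting",
         "AB de Villiers", "Virat Kohli", "Babar Azam"]
vectors = [(1, 2), (4, 5), (5, 4), (3, 3), (2, 1), (1.5, 2.5)]

def kmeans(points, initial_centroids, max_iter=100):
    """Plain K-means: assign each point to its nearest centroid, then re-average."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return labels

# Hand-picked starting centroids: Tendulkar, Lara, and de Villiers
labels = kmeans(vectors, [vectors[0], vectors[1], vectors[3]])
for name, label in zip(names, labels):
    print(f"cluster {label}: {name}")
```

On this toy data the three clusters line up with the story told earlier: Tendulkar, Kohli, and Babar Azam together, Lara and Ponting together, and AB de Villiers on his own.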

5. Transfer Learning :

Pre-trained embeddings can be used as a starting point for other tasks. This is common in natural language processing where embeddings trained on one task are used for another.

  • Example: Use batsmen embeddings trained on commentary data to predict match outcomes.

6. Semantic Search :

Instead of traditional keyword-based search, you can search based on the semantic meaning. When a query is converted to its embedding, you can retrieve items that are close in the embedding space.

  • Example: Searching for batsmen similar to a given player.

7. Anomaly Detection :

Items that are distant from all others in the embedding space can be considered anomalies or outliers.

  • Example: Identifying a batsman who has a unique playing style that's different from all others.
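One simple way to score "distant from all others" is the distance to the nearest neighbour: the larger it is, the more isolated the item. The sketch below applies that to the hypothetical 2D vectors. The scoring rule is an assumption chosen for illustration; other rules (for example, average distance to all points) can rank items differently.

```python
import math

# Hypothetical 2-D embeddings, same values as in the plotting example
embeddings = {
    "Sachin Tendulkar": (1, 2),
    "Brian Lara": (4, 5),
    "Ricky Ponting": (5, 4),
    "AB de Villiers": (3, 3),
    "Virat Kohli": (2, 1),
    "Babar Azam": (1.5, 2.5),
}

def isolation_scores(points):
    """Score each item by the distance to its nearest neighbour."""
    return {
        name: min(math.dist(vec, other)
                  for other_name, other in points.items()
                  if other_name != name)
        for name, vec in points.items()
    }

scores = isolation_scores(embeddings)
most_isolated = max(scores, key=scores.get)
print(most_isolated, round(scores[most_isolated], 3))
```

With these coordinates AB de Villiers has the largest nearest-neighbour distance, though only by a small margin, which is a reminder that on tiny toy data the "outlier" label should be read loosely.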

8. Sequence Models :

In the context of natural language processing or time series, embeddings can be fed into sequence models like RNNs, LSTMs, or Transformers to capture sequential dependencies.

  • Example: Predicting the next word in a cricket commentary.

9. Content Generation :

Embeddings can be used in generative models to produce new content.

  • Example: Generate commentary text based on the style of a particular batsman.

In summary, once you have embeddings, they serve as a dense, continuous representation of your data that captures semantic relationships. This representation can be leveraged in numerous ways, from simple similarity calculations to complex deep learning models.

Mentee : I've prepared a summary table for IPL teams which looks like this :

Screenshot 2023-10-04 063821.png

Can I say each row is a vector embedding for a given team?

Mentor : Yes, you can say that each row is a vector representation for a given team based on the provided features (like runs made, top 3 batting average, wickets lost, etc.). However, a few clarifications are in order :

1. Nature of the Vector: In the table you've provided, each row is essentially a feature vector for the team, derived directly from raw data or computed statistics. This is different from embeddings which are often derived from more complex processes (like neural networks, matrix factorization, etc.) and aim to capture deeper, often non-linear relationships in the data.

2. Dimensionality: The dimensionality of the vector is defined by the number of features you have. In your table, each team is represented as a 6-dimensional vector.

3. Normalization: Before using these vectors in many machine learning algorithms, you might consider normalizing or standardizing the values, especially if you're going to measure distances or similarities between the vectors. This ensures that no particular feature dominates the others simply because of its scale.

4. Embeddings vs. Feature Vectors: While the terms are sometimes used interchangeably, "embeddings" typically refer to vectors that have been learned or derived in order to capture semantic relationships, often in a lower-dimensional space than the original feature set. On the other hand, the rows in your table are more straightforward feature vectors based on raw or aggregated statistics.

That said, the concept is similar: each team is represented as a point in a multi-dimensional space, and the position of that point is determined by the team's performance metrics. This representation allows for comparison, clustering, and other analyses based on the provided features.
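The normalization point can be sketched as follows. The team names and numbers are invented placeholders, and only three of the features are shown; the sketch standardizes each column to zero mean and unit standard deviation (z-scores) so that a large-valued feature such as runs does not swamp the others when distances are computed.

```python
import statistics

# Hypothetical feature vectors: [runs_made, top3_batting_avg, wickets_lost]
# (placeholder teams and made-up numbers, not real IPL data)
teams = {
    "Team A": [2400, 38.5, 85],
    "Team B": [2150, 31.0, 97],
    "Team C": [2600, 42.0, 78],
}

def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # Guard against constant columns, whose standard deviation is 0
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

scaled = standardize(list(teams.values()))
for name, row in zip(teams, scaled):
    print(name, [round(v, 2) for v in row])
```

After scaling, every feature contributes on a comparable footing, so distance-based steps such as similarity search or clustering reflect all the columns rather than just the one with the biggest raw numbers.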

Mentee : Thank you Mentor, this was insightful. I learnt something new today :)