A very gentle and naive introduction to Natural Language Processing
September 25, 2025
At their very heart, computers can only understand numbers. This creates a central challenge: translating messy, nuanced human language into a precise, numerical representation.
We achieve this by creating vector representations—essentially, turning text into lists of numbers that capture some aspect of its meaning.
What is NLP?
Natural Language Processing (NLP) is a field of linguistics and AI that gives machines the ability to understand text and spoken words in much the same way as a human would. Our aim here is to bridge the gap between unstructured text and structured data that algorithms can process.
In an extremely naive definition, NLP turns words into numbers—vectors, tensors, graphs, or trees—so machines can analyze sentiment, translate languages, or even generate text.
Most NLP tasks use supervised learning, where models learn from labeled data. A simple way of breaking this down (made concrete in the sketch right after the list):
- Observations (x): sentences or documents—input
- Targets (y): labels, e.g., “positive” for sentiment
- Model (f): A function predicting y from x using learned parameters (w)
- Loss Function (L): Measures prediction error—aim to minimize this
- Optimizer: Like Stochastic Gradient Descent, it adjusts w to reduce the loss
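To make these pieces concrete, here's a minimal sketch with made-up toy numbers rather than real text: the model f is a logistic regression with parameters w, the loss L is cross-entropy, and the optimizer is plain gradient descent.

```python
# Toy supervised-learning loop: model f(x; w), loss L, and a gradient-descent optimizer
import numpy as np

# Observations (x): tiny 2-dimensional feature vectors; Targets (y): 0/1 labels
x = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]])
y = np.array([1, 1, 0, 0])

w = np.zeros(2)   # parameters the optimizer will adjust
lr = 0.5          # learning rate

for _ in range(200):
    p = 1 / (1 + np.exp(-x @ w))                               # model prediction f(x; w)
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))   # loss L (cross-entropy)
    grad = x.T @ (p - y) / len(y)                              # gradient of L with respect to w
    w -= lr * grad                                             # optimizer step

print(w, loss)   # w has been adjusted so the predictions match the targets
```

In practice you rarely write this loop by hand; libraries like scikit-learn or PyTorch provide the model, loss, and optimizer for you.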
Some Machine Learning Basics
Before diving in, here are a few ML essentials that will help with what comes next.
Machine Learning (ML) allows computers to spot patterns in data without explicit programming. In NLP, ML is fundamental for building models like decision trees or neural networks from large datasets (called corpora for NLP purposes).
Traditionally, text tasks like classification—spam vs. not spam—had two main steps:
- Feature Extraction: Turning documents into vectors
- Classification: Applying algorithms like SVM or k-NN
Today, deep learning often merges these steps into end-to-end neural models. Still, classical methods shine for their simplicity.
Some key ML algorithms for text:
- k-Nearest Neighbors (k-NN): Labels based on the majority of k closest “neighbors” in a vector space
- Support Vector Machines (SVM): Finds the hyperplane that separates classes with the maximum margin; works well with high-dimensional data
- Naive Bayes: A probabilistic method that picks the most likely class using conditional probabilities (and a strong independence assumption between features)
Modern NLP leans on neural networks, but these classical methods are the basic foundations.
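As a rough sketch of the classic two-step recipe above (feature extraction, then classification), here's what it might look like with scikit-learn and a made-up toy spam dataset; the class names assume a reasonably recent scikit-learn version.

```python
# Step 1: feature extraction (word counts); Step 2: a Naive Bayes classifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["win a free prize now", "meeting moved to 3pm",
        "free money, click here", "are we still on for lunch tomorrow?"]
labels = ["spam", "not spam", "spam", "not spam"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(docs, labels)

print(clf.predict(["claim your free money"]))   # most likely "spam"
```

Swapping MultinomialNB for an SVM or k-NN classifier only changes the second step; the feature extraction stays the same.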
Turning Text into Vectors: Bag of Words
Computers don't understand text; they "think" in numbers. So we need to represent words and documents as vectors that capture meaning.
This principle is based on distributional semantics: "You shall know a word by the company it keeps" (J.R. Firth, 1957). Words in similar contexts have similar meanings.
Bag of Words (BoW) is a simple model that treats a document as a “bag” of terms, no matter the order. Each word is a dimension in a vector.
How it works:
- We create a big vocabulary from every unique word in the dataset
- Each document is then represented by a massive vector where each position corresponds to a word from the vocabulary
- The value assigned to each position is simply the count of how many times that word appears in the document; this is known as Term Frequency
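A from-scratch sketch of those three steps, in plain Python with a few toy documents:

```python
# Bag of Words by hand: build a vocabulary, then count term frequencies per document
docs = ["the dog scares the cat", "the cat sleeps", "dogs sleep a lot"]

# 1. Vocabulary: every unique word in the dataset, one dimension per word
vocab = sorted({word for doc in docs for word in doc.split()})

# 2.-3. Each document becomes a vector of raw term counts (Term Frequency)
vectors = [[doc.split().count(word) for word in vocab] for doc in docs]

print(vocab)
for vec in vectors:
    print(vec)
```

In practice, a library such as scikit-learn's CountVectorizer does the same job (plus tokenization and lowercasing) in a couple of lines.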
What makes this great? BoW is incredibly effective for basic tasks like spam detection or news classification. It’s fast, intuitive, and provides a strong starting point.
What makes this not that great? It completely ignores grammar, word order, and context.
Smart Counting
As you might've noticed, there's a noticeable issue with the raw counts used as TF. In the sentence "A dog scares a cat," the word "a" appears twice, yet it's hardly informative.
Weighting is key, because raw frequencies aren't enough. A term's importance rises with its frequency in a document but falls if it's common across the corpus; words like "the," "a," and "an" carry almost no information. So there must be a way to weigh words by importance.
This is where Term Frequency-Inverse Document Frequency (TF-IDF) shines. TF-IDF is a cleverer scoring scheme than simple counts and has been a standard in NLP for a long time.
The idea behind this is:
- A word is important if it appears frequently in a single document—Term Frequency
- It becomes less important if it appears in many documents across the entire collection—Inverse Document Frequency
A cute itty-bitty math expression:
\[W_{x,y} = tf_{x,y} \times \log\left(\frac{N}{df_x}\right)\]
where,
- $tf_{x,y}$ = frequency of term x in document y
- $df_x$ = number of documents containing x
- $N$ = total number of documents
- $W_{x,y}$ = weight assigned to the term x within document y
TF-IDF is a small change that makes BoW much more powerful.
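Here's a tiny sketch that applies the formula above directly, using raw counts as the term frequency and a natural log (toy documents again; real libraries such as scikit-learn's TfidfVectorizer use a slightly smoothed variant of the same idea):

```python
import math

docs = [d.split() for d in
        ["the dog scares the cat", "the cat sleeps", "the dogs sleep a lot"]]
N = len(docs)   # total number of documents

def tf_idf(term, doc):
    tf = doc.count(term)                     # frequency of the term in this document
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    return tf * math.log(N / df)

print(tf_idf("the", docs[0]))      # 0.0: "the" appears in every document, so it gets no weight
print(tf_idf("scares", docs[0]))   # ~1.1: rare across the corpus, so it stands out
```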
Filling Up the Bag
BoW has some evident flaws:
- Treats synonyms as separate
- Can't handle polysemy (e.g., "bank" as a riverbank vs. a financial institution)
- Vectors are sparse and high-dimensional, making learning tough
To patch these weaknesses, we can enrich the representation with additional linguistic features:
- Part-of-Speech (POS) Tagging: Labels words as noun, verb, etc.—e.g., “plant” noun vs. verb
- Collocations/Phrases: Treat “hot dog” as one unit, not separate words
- Named Entity Recognition (NER): Spots entities like “Mexico” (location) or “Apple” (organization)
- N-grams: Sequences of n words (e.g., bigrams: "hot dog"). Captures some word order but makes the dimensionality explode; see the sketch below
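For instance, here's a quick sketch of the n-gram idea with scikit-learn's CountVectorizer, where ngram_range controls which n-grams are extracted; notice how quickly the number of dimensions grows:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I ate a hot dog", "the dog barked at the hot stove"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)   # single words only
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)    # single words + pairs

print(len(unigrams.vocabulary_))          # unigram features only
print(len(bigrams.vocabulary_))           # noticeably more dimensions
print("hot dog" in bigrams.vocabulary_)   # True: "hot dog" is now a feature of its own
```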
Embeddings Revolution
Modern NLP takes a radically different approach, still fully grounded in distributional semantics: words that appear in similar contexts probably have similar meanings. What if, instead of computing word counts, we could build a representation of a word's meaning based on its context, and learn it from data?
We are now moving from sparse, high-dimensional vectors to dense, low-dimensional vectors called Word Embeddings.
Instead of a vector of 10,000+ dimensions (mostly zeros), imagine representing each word by a vector of only 300 dense, continuous numbers; these aren't random: they encode semantic meaning.
Following this idea, the word “coffee” will often be near words like “cup,” “brew,” and “morning.” The word “tea” will be found in a very similar context, therefore, the vectors for “coffee” and “tea” should be very close together in this 300-dimensional space.
This leads us to the amazing property of being able to do arithmetic with words. A famous example used for representing this:
king - man + woman ≈ queen
Embeddings capture not just similarity but also relationships between words.
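You can try this arithmetic yourself with pretrained vectors; here's a sketch using gensim's downloader (it assumes gensim is installed and fetches a small GloVe model the first time it runs):

```python
import gensim.downloader as api

# Small pretrained GloVe vectors (50 dimensions per word)
vectors = api.load("glove-wiki-gigaword-50")

# king - man + woman ≈ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# The top hit typically comes out as "queen" (or something very close to it)
```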
But how are they created? Models like Word2Vec use a simple neural network to learn these embeddings. The network is trained to predict a word given its surrounding words, or vice versa. By doing this over a huge amount of text, it learns these semantic vector representations.
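And here's a hedged sketch of training such a model from scratch with gensim's Word2Vec (gensim 4.x API; a real model needs far more text than this toy corpus, so the numbers below are purely illustrative):

```python
from gensim.models import Word2Vec

# A tiny, made-up corpus: each sentence is already tokenized into a list of words
sentences = [
    ["i", "drink", "coffee", "every", "morning"],
    ["she", "brews", "tea", "every", "morning"],
    ["a", "cup", "of", "coffee", "please"],
    ["a", "cup", "of", "tea", "please"],
]

# Learn dense 50-dimensional vectors by predicting words from their context
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=200)

print(model.wv["coffee"][:5])                  # the first few numbers of a dense vector
print(model.wv.similarity("coffee", "tea"))    # similar contexts should yield similar vectors
```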
NLP in a Nutshell
- BoW computes simple word counts; it is effective but blind to context
- TF-IDF and other techniques allow us to add basic context weighting
- Word Embeddings are the leap to dense vectors that capture meaning and relationships among words
- Some advanced neural architectures, like RNNs or Transformers, use embeddings to understand language with unprecedented context and nuance
Despite advances in modern Language Models (LMs), many fundamental challenges remain: understanding ambiguity, cultural context, sarcasm, humor, and so on. LMs address these challenges by training on massive, diverse datasets, but they still often fall short of human understanding in complex scenarios.