Frequently Asked Questions

Do I need to understand autograd to use PyTorch effectively?

Yes. You don't need to implement it, but understanding that PyTorch builds a computation graph on every forward pass — and that calling .backward() traverses it to compute gradients — is essential for debugging training issues and writing custom layers.
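
To make the graph idea concrete, here is a minimal sketch. The tensor names are illustrative and not taken from any particular model:

```python
import torch

w = torch.tensor(3.0, requires_grad=True)  # leaf tensor we want a gradient for
x = torch.tensor(2.0)                      # plain input, not tracked

y = w * x + 1.0      # the forward pass records the operations that produced y
print(y.grad_fn)     # the backward node attached to y

y.backward()         # walk the graph from y back to w
print(w.grad)        # dy/dw = x, so tensor(2.)
```

When a loss misbehaves, checking grad_fn and .grad like this is often the quickest way to see whether a tensor is still connected to the graph.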

What's the difference between torch.nn.Module and a plain function?

nn.Module holds learnable parameters (registered via nn.Parameter) and sub-modules. PyTorch automatically tracks these for .parameters(), .to(device), and .eval()/.train() calls. Plain functions are fine for stateless transforms but can't hold weights.
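
A small sketch of the difference, using a hypothetical Scale layer and scale_fn helper invented for this example:

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    """Hypothetical layer: multiplies its input by a learnable scalar."""

    def __init__(self):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(1))  # registered automatically

    def forward(self, x):
        return x * self.gain

def scale_fn(x, gain=2.0):
    # Stateless: nothing in here is visible to an optimizer
    return x * gain

layer = Scale()
names = [name for name, _ in layer.named_parameters()]
print(names)           # ['gain']

layer.eval()           # mode switches are tracked by the module too
print(layer.training)  # False
```

Anything an optimizer should update belongs in a module; a pure transform such as a normalization with fixed constants can stay a function.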

Should I use DataLoader with num_workers > 0 on Windows?

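Yes, but you must put the code that creates and iterates the DataLoader under an if __name__ == '__main__': guard. Windows starts worker processes with the spawn method, which re-imports your main module in each worker. Without the guard, every worker re-runs your top-level training code, and the run typically fails with a RuntimeError instead of loading data.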
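
A minimal sketch of the guarded layout. The run function and the tiny TensorDataset are placeholders for your own training entry point and data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def run(num_workers):
    # Tiny stand-in dataset; a real project would build its own Dataset here
    dataset = TensorDataset(torch.arange(8, dtype=torch.float32).unsqueeze(1))
    loader = DataLoader(dataset, batch_size=4, num_workers=num_workers)
    return sum(1 for _ in loader)  # number of batches seen

if __name__ == "__main__":
    # Only the original process reaches this line; re-imported workers skip it
    print(run(num_workers=2))
```

Keeping everything that starts work inside a function called from the guard means the module can be imported safely by the worker processes.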

How do I choose between Adam and SGD with momentum?

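Start with Adam (lr=1e-3) for fast convergence during prototyping. For final training, switch to SGD with momentum (lr=1e-2, momentum=0.9) and a cosine learning-rate schedule: the resulting model often generalizes better, though SGD usually needs more learning-rate tuning than Adam. Treat these values as starting points, not tuned settings.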
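
Here is how the two setups might look side by side. The nn.Linear stand-in model and the epoch count are assumptions made for the sketch, and the training step itself is left as a placeholder:

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)  # stand-in model for illustration
epochs = 20               # hypothetical training length

# Prototyping: Adam with a common starting learning rate
adam = optim.Adam(model.parameters(), lr=1e-3)

# Final training: SGD with momentum, annealed by a cosine schedule
sgd = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
scheduler = optim.lr_scheduler.CosineAnnealingLR(sgd, T_max=epochs)

for epoch in range(epochs):
    # ... run one epoch of training with `sgd` here ...
    sgd.step()        # placeholder for the per-batch optimizer steps
    scheduler.step()  # once per epoch, after the optimizer has stepped

final_lr = scheduler.get_last_lr()[0]
print(final_lr)       # the schedule has annealed the rate down from 1e-2
```

T_max is set to the number of epochs so the learning rate reaches its minimum at the end of training.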


PyTorch Training Fundamentals: From Tensors to Custom Datasets

Stop copy-pasting PyTorch boilerplate. Understand tensors, autograd, datasets, and training loops from the ground up — the way an engineer should.

Intermediate

Rajath Kumar, Edge AI Engineer & Founder, Analog Data

2026-06-27 · 12 min read


Why Most Engineers Get PyTorch Wrong

They follow a tutorial, get a model training, and never go deeper. When training diverges or gradients explode, they're stuck — because they never understood what's happening underneath.

This post covers the four pillars you need to actually use PyTorch:

Tensors — the data structure
Autograd — the automatic differentiation engine
Datasets & DataLoaders — the data pipeline
The training loop — the thing you'll write 1000 times

1. Tensors: More Than Just Arrays

A tensor is a multi-dimensional array with three extra superpowers: GPU acceleration, gradient tracking, and seamless NumPy interop.

python

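import numpy as np
import torch

# Create from data
x = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)

# Move to GPU if one is available
if torch.cuda.is_available():
    x = x.to("cuda")

# NumPy interop: .numpy() only works on CPU tensors, so move back first
arr = x.cpu().numpy()  # zero-copy (shared memory) only if x was already on CPU

# The reverse direction is also zero-copy
t = torch.from_numpy(np.ones((2, 2), dtype=np.float32))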
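
"Shares memory" is easy to misread, so here is a small sketch of what it means for a CPU tensor. The variable names are illustrative:

```python
import torch

t = torch.zeros(3)   # CPU tensor
arr = t.numpy()      # NumPy view over the same buffer
arr[0] = 7.0         # write through the NumPy side
print(t)             # tensor([7., 0., 0.])

t2 = torch.from_numpy(arr)  # back to a tensor, still the same memory
t2[1] = 5.0
print(arr)           # [7. 5. 0.]
```

If you need an independent copy, call .clone() on the tensor or .copy() on the array before modifying it.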

Key gotcha: Operations are element-wise by default. Use @ or torch.matmul() for matrix multiplication, not *.

python

a = torch.tensor([[1, 2], [3, 4]])
b = torch.tensor([[5, 6], [7, 8]])

# Element-wise (Hadamard): rarely what you want
print(a * b)  # [[5, 12], [21, 32]]

# Matrix multiplication: what you usually want
print(a @ b)  # [[19, 22], [43, 50]]

2. Autograd: The Engine That Makes Deep Learning Possible

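Every operation on a tensor with requires_grad=True is recorded in a computation graph. Call .backward() on a scalar result and PyTorch traverses that graph in reverse, storing a gradient in the .grad attribute of every leaf tensor that requires one (in practice, your model's parameters).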

python

# Simple example: minimize f(x) = x²
x = torch.tensor(2.0, requires_grad=True)

for step in range(100):
    y = x ** 2          # forward: build graph
    y.backward()         # backward: compute gradients
    with torch.no_grad():
        x -= 0.1 * x.grad  # gradient descent step
    x.grad = None         # clear gradients for next iteration

print(x.item())  # ≈ 0.0

Three rules to remember:

with torch.no_grad(): — use this when updating weights. You don't want the update itself tracked in the graph.
x.grad = None or optimizer.zero_grad() — gradients accumulate by default. Clear them each iteration.
x.detach() — pulls a tensor out of the graph. Useful for logging or when using a value as input without tracking.
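
The three rules can be checked directly on a single scalar. This is a minimal sketch with illustrative variable names, not part of any training script:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# Gradients accumulate across backward() calls unless cleared
(x ** 2).backward()
(x ** 2).backward()
accumulated = x.grad.item()  # 4.0 + 4.0 = 8.0
x.grad = None

# An update inside no_grad() is not recorded in the graph
(x ** 2).backward()
with torch.no_grad():
    x -= 0.1 * x.grad        # 2.0 - 0.1 * 4.0 = 1.6
print(x.requires_grad)       # True: x is still a trainable leaf

# detach() returns a tensor that shares data but carries no graph history
y = x ** 2
logged = y.detach()
print(logged.requires_grad)  # False
```

Forgetting to clear gradients is the mistake this sketch makes most visible: the second backward() silently doubles the gradient.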

3. Custom Datasets and DataLoaders

The Dataset class is simple — implement __len__ and __getitem__:

python

import torch
from torch.utils.data import Dataset, DataLoader

class TimeSeriesDataset(Dataset):
    """Sliding windows over a series: seq_len steps in, the following step out."""

    def __init__(self, data, labels, sequence_length=30):
        self.data = data
        self.labels = labels
        self.seq_len = sequence_length

    def __len__(self):
        # Each window needs seq_len inputs plus one label that follows them
        return len(self.data) - self.seq_len

    def __getitem__(self, idx):
        x = self.data[idx:idx + self.seq_len]
        y = self.labels[idx + self.seq_len]
        return torch.tensor(x, dtype=torch.float32), torch.tensor(y, dtype=torch.float32)

# `data` and `labels` are assumed to be equal-length NumPy arrays (or lists) loaded elsewhere.
# DataLoader handles batching, shuffling, and parallel loading
loader = DataLoader(
    TimeSeriesDataset(data, labels),
    batch_size=32,
    shuffle=True,
    num_workers=4,
    pin_memory=True,  # speeds up CPU→GPU transfer
)
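
Before training on a windowed dataset, it is worth checking the shapes it produces. The sketch below repeats the class so it runs on its own, and uses a synthetic sine wave as stand-in data (an assumption for illustration only):

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class TimeSeriesDataset(Dataset):
    # Same windowing logic as above, repeated so this snippet is self-contained
    def __init__(self, data, labels, sequence_length=30):
        self.data, self.labels, self.seq_len = data, labels, sequence_length

    def __len__(self):
        return len(self.data) - self.seq_len

    def __getitem__(self, idx):
        x = self.data[idx:idx + self.seq_len]
        y = self.labels[idx + self.seq_len]
        return torch.tensor(x, dtype=torch.float32), torch.tensor(y, dtype=torch.float32)

# Synthetic stand-in: predict the next value of a sine wave
series = np.sin(np.linspace(0, 20, 200)).astype(np.float32)
dataset = TimeSeriesDataset(series, series)

loader = DataLoader(dataset, batch_size=32, shuffle=True)
batch_x, batch_y = next(iter(loader))
print(len(dataset))   # 170 windows from 200 points
print(batch_x.shape)  # torch.Size([32, 30])
print(batch_y.shape)  # torch.Size([32])
```

Note that the targets come out with shape (32,), not (32, 1). That matters when comparing them against a model whose last layer has one output.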

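pin_memory=True makes the DataLoader place each batch in page-locked (pinned) host memory, which speeds up CPU→GPU copies and lets them run asynchronously when you pass non_blocking=True to .to(device). It is usually worth enabling when training on a GPU; on a CPU-only machine it has no benefit.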
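
A sketch of how pinned memory is typically paired with asynchronous copies. The random TensorDataset is a placeholder, and pinning is switched on only when a GPU is present so the snippet also runs on CPU:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

dataset = TensorDataset(torch.randn(64, 30), torch.randn(64))
loader = DataLoader(dataset, batch_size=32, pin_memory=use_cuda)  # pin only when a GPU exists

for batch_x, batch_y in loader:
    # non_blocking=True lets the copy overlap with compute when the source is pinned
    batch_x = batch_x.to(device, non_blocking=True)
    batch_y = batch_y.to(device, non_blocking=True)
```

On a CPU-only machine the .to(device) calls are no-ops, so the same loop works unchanged in both environments.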

4. The Training Loop — Written Properly

Here's the loop you'll use in every project, with the parts that matter:

python

import torch
import torch.nn as nn
import torch.optim as optim

class SimpleModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SimpleModel(30, 64, 1).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

epochs = 50

# train_loader and val_loader are assumed to be DataLoaders built as in section 3,
# one over the training split and one over the validation split.
for epoch in range(epochs):
    # Training
    model.train()
    train_loss = 0.0
    for batch_x, batch_y in train_loader:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)

        optimizer.zero_grad()                 # 1. Clear gradients
        outputs = model(batch_x).squeeze(-1)  # 2. Forward pass; (batch, 1) -> (batch,) to match batch_y
        loss = criterion(outputs, batch_y)    # 3. Compute loss
        loss.backward()                       # 4. Backward pass (compute gradients)
        optimizer.step()                      # 5. Update weights
        train_loss += loss.item()

    # Validation
    model.eval()
    val_loss = 0.0
    with torch.no_grad():              # No gradient tracking during eval
        for batch_x, batch_y in val_loader:
            batch_x, batch_y = batch_x.to(device), batch_y.to(device)
            outputs = model(batch_x).squeeze(-1)
            val_loss += criterion(outputs, batch_y).item()

    print(f"Epoch {epoch+1}/{epochs} | Train: {train_loss/len(train_loader):.4f} | Val: {val_loss/len(val_loader):.4f}")

The five steps inside the training loop are sacred:

optimizer.zero_grad() — clear old gradients
Forward pass — compute predictions
Compute loss — measure how wrong you are
loss.backward() — compute gradients via autograd
optimizer.step() — update weights

Get these right, in this order, and a large class of training bugs disappears.
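
One way to keep the five steps from drifting is to put them in a single function. The sketch below uses a hypothetical train_one_epoch helper and a synthetic regression problem, both of which are assumptions made for illustration:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

def train_one_epoch(model, loader, optimizer, criterion, device):
    """Run the five steps over one pass of the data; return the mean batch loss."""
    model.train()
    total = 0.0
    for batch_x, batch_y in loader:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)
        optimizer.zero_grad()                      # 1. clear old gradients
        loss = criterion(model(batch_x), batch_y)  # 2 and 3. forward pass, loss
        loss.backward()                            # 4. compute gradients
        optimizer.step()                           # 5. update weights
        total += loss.item()
    return total / len(loader)

# Toy regression problem (synthetic): the target is the sum of the inputs
torch.manual_seed(0)
x = torch.randn(256, 4)
y = x.sum(dim=1, keepdim=True)
loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)

model = nn.Linear(4, 1)
optimizer = optim.Adam(model.parameters(), lr=1e-2)
losses = [train_one_epoch(model, loader, optimizer, nn.MSELoss(), "cpu") for _ in range(20)]
print(f"first epoch: {losses[0]:.4f}, last epoch: {losses[-1]:.4f}")
```

A loss that fails to drop on a toy problem like this usually points to one of the five steps being missing or out of order.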

5. Saving and Loading Models

python

# Save everything (model + optimizer state)
checkpoint = {
    "epoch": epoch,
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
    "loss": val_loss,
}
torch.save(checkpoint, "checkpoint.pt")

# Load
checkpoint = torch.load("checkpoint.pt", map_location=device)
model.load_state_dict(checkpoint["model_state"])
optimizer.load_state_dict(checkpoint["optimizer_state"])
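
To see that a checkpoint really restores both the weights and the optimizer state, here is a self-contained round trip. The nn.Linear model, the temporary file path, and the epoch number are placeholders chosen for the sketch:

```python
import os
import tempfile
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 1)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Take one step so the optimizer has state worth saving
model(torch.randn(8, 4)).sum().backward()
optimizer.step()

path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
torch.save({"epoch": 4,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict()}, path)

# Rebuild fresh objects, then restore both states before continuing
resumed_model = nn.Linear(4, 1)
resumed_optimizer = optim.Adam(resumed_model.parameters(), lr=1e-3)
checkpoint = torch.load(path, map_location="cpu")
resumed_model.load_state_dict(checkpoint["model_state"])
resumed_optimizer.load_state_dict(checkpoint["optimizer_state"])
start_epoch = checkpoint["epoch"] + 1  # resume from the next epoch
```

The model and optimizer must be constructed with the same architecture and parameter layout before their state dicts are loaded.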

For inference only (smaller file, no optimizer state):

python

# Save
torch.save(model.state_dict(), "model_weights.pt")

# Load: rebuild the architecture, load the weights, then move the model to the target device
model = SimpleModel(30, 64, 1)
model.load_state_dict(torch.load("model_weights.pt", map_location=device))
model.to(device)
model.eval()
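
The final model.eval() call is easy to forget. The sketch below shows what it changes, using a stand-in nn.Sequential model with a dropout layer (an assumption, since SimpleModel has none):

```python
import torch
import torch.nn as nn

# Stand-in model with dropout so the effect of eval() is visible
model = nn.Sequential(nn.Linear(30, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 1))
model.eval()  # switches dropout (and batch norm, if present) to inference behavior

window = torch.randn(1, 30)
with torch.no_grad():  # skip graph construction during inference
    a = model(window)
    b = model(window)

print(torch.equal(a, b))  # True: eval mode makes repeated predictions deterministic
print(a.requires_grad)    # False: nothing was recorded for backward
```

In train mode the same two calls would generally disagree, because dropout samples a new mask on every forward pass.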

Summary

Tensors: GPU-accelerated arrays with gradient tracking
Autograd: automatic differentiation — call .backward() and PyTorch handles the rest
Datasets/DataLoaders: implement __len__ and __getitem__, let DataLoader handle batching
Training loop: zero_grad → forward → loss → backward → step. Every. Single. Time.
Save checkpoints: include optimizer state for resuming training

Master these fundamentals and the rest of PyTorch — distributed training, mixed precision, custom layers — becomes incremental learning, not a wall of confusion.



