PyTorch Training Fundamentals: From Tensors to Custom Datasets
Stop copy-pasting PyTorch boilerplate. Understand tensors, autograd, datasets, and training loops from the ground up — the way an engineer should.
Why Most Engineers Get PyTorch Wrong
They follow a tutorial, get a model training, and never go deeper. When training diverges or gradients explode, they're stuck — because they never understood what's happening underneath.
This post covers the four pillars you need to actually use PyTorch:
- Tensors — the data structure
- Autograd — the automatic differentiation engine
- Datasets & DataLoaders — the data pipeline
- The training loop — the thing you'll write 1000 times
1. Tensors: More Than Just Arrays
A tensor is a multi-dimensional array with three extra superpowers: GPU acceleration, gradient tracking, and seamless NumPy interop.
import numpy as np
import torch

# Create from data
x = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)

# Move to GPU
if torch.cuda.is_available():
    x = x.to("cuda")

# NumPy interop: zero copy on CPU. A CUDA tensor must be copied back to the CPU first.
arr = x.cpu().numpy()  # shares memory with x only if x was already on the CPU

Key gotcha: Operations are element-wise by default. Use @ or torch.matmul() for matrix multiplication, not *.
a = torch.tensor([[1, 2], [3, 4]])
b = torch.tensor([[5, 6], [7, 8]])

# Element-wise (Hadamard): rarely what you want
print(a * b)  # [[5, 12], [21, 32]]

# Matrix multiplication: what you usually want
print(a @ b)  # [[19, 22], [43, 50]]

2. Autograd: The Engine That Makes Deep Learning Possible
Every operation on a tensor with requires_grad=True is recorded in a computation graph. Call .backward() on the result and PyTorch walks that graph in reverse, computing a gradient for every leaf tensor that requires one. In a model, those leaves are the trainable parameters.
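To make the graph concrete, here is a minimal sketch. The names w, b, and x and the toy expression are illustrative assumptions, not part of any real model. It shows which tensors are leaves, what a recorded operation looks like, and where the gradients land after .backward().

```python
import torch

# Leaf tensors: created directly, with gradient tracking switched on.
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)
x = torch.tensor(2.0)  # plain input, not tracked

# Forward pass: each operation adds a node to the graph.
y = w * x + b          # y = 3*2 + 1 = 7
loss = y ** 2          # loss = 49

# Non-leaf results remember the operation that produced them.
print(loss.grad_fn is not None)  # True
print(w.is_leaf, y.is_leaf)      # True False

# Backward pass: the chain rule is applied from loss back to the leaves.
loss.backward()
print(w.grad.item())  # d(loss)/dw = 2*y*x = 28.0
print(b.grad.item())  # d(loss)/db = 2*y   = 14.0
print(x.grad)         # None, because x was never tracked
```

Only w and b receive a .grad value. Intermediate results such as y take part in the chain rule but do not keep their gradients by default.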
# Simple example: minimize f(x) = x²
x = torch.tensor(2.0, requires_grad=True)

for step in range(100):
    y = x ** 2                  # forward: build graph
    y.backward()                # backward: compute gradients
    with torch.no_grad():
        x -= 0.1 * x.grad       # gradient descent step
    x.grad = None               # clear gradients for next iteration

print(x.item())  # ≈ 0.0

Three rules to remember:
- torch.no_grad(): use this context manager when updating weights. You don't want the update itself tracked in the graph.
- x.grad = None or optimizer.zero_grad(): gradients accumulate by default. Clear them each iteration.
- x.detach(): pulls a tensor out of the graph. Useful for logging or when using a value as input without tracking.
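The three rules can be checked directly. The following is a small sketch on a single scalar, with variable names (first, accumulated, logged) chosen only for illustration.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# Gradients accumulate across backward() calls unless they are cleared.
(x ** 2).backward()
first = x.grad.item()          # 2*x = 4.0
(x ** 2).backward()
accumulated = x.grad.item()    # 4.0 + 4.0 = 8.0
x.grad = None                  # reset before the next iteration

# An in-place update of a leaf tensor has to happen under no_grad().
with torch.no_grad():
    x -= 0.1 * first           # x is now 1.6 and still requires grad

# detach() returns a tensor with the same data but no link to the graph.
y = x ** 2
logged = y.detach()

print(first, accumulated)      # 4.0 8.0
print(x.requires_grad, y.requires_grad, logged.requires_grad)  # True True False
```

If the x.grad = None line is dropped, every later step would use the sum of all previous gradients, which is a common cause of a loss that diverges for no obvious reason.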
3. Custom Datasets and DataLoaders
The Dataset class is simple — implement __len__ and __getitem__:
import torch
from torch.utils.data import Dataset, DataLoader

class TimeSeriesDataset(Dataset):
    def __init__(self, data, labels, sequence_length=30):
        # data and labels are assumed to be equal-length sequences (e.g. NumPy arrays)
        self.data = data
        self.labels = labels
        self.seq_len = sequence_length

    def __len__(self):
        # One sample per full window that still has a label after it
        return len(self.data) - self.seq_len

    def __getitem__(self, idx):
        x = self.data[idx:idx + self.seq_len]   # input window
        y = self.labels[idx + self.seq_len]     # value right after the window
        return torch.tensor(x, dtype=torch.float32), torch.tensor(y, dtype=torch.float32)

# DataLoader handles batching, shuffling, and parallel loading
loader = DataLoader(
    TimeSeriesDataset(data, labels),
    batch_size=32,
    shuffle=True,
    num_workers=4,
    pin_memory=True,  # speeds up CPU→GPU transfer
)

pin_memory=True makes the DataLoader place each batch in page-locked (pinned) host memory, which speeds up copies to the GPU. It is worth enabling whenever you train on a GPU; on a CPU-only machine it brings no benefit.
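To see what the loader actually yields, here is a sketch that runs the dataset on made-up data. The sine-wave series is a hypothetical stand-in for real data, the class is repeated so the block runs on its own, and num_workers=0 is used only to keep the example portable.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class TimeSeriesDataset(Dataset):
    """Same sliding-window dataset as above, repeated so the sketch is standalone."""
    def __init__(self, data, labels, sequence_length=30):
        self.data, self.labels, self.seq_len = data, labels, sequence_length

    def __len__(self):
        return len(self.data) - self.seq_len

    def __getitem__(self, idx):
        x = self.data[idx:idx + self.seq_len]
        y = self.labels[idx + self.seq_len]
        return torch.tensor(x, dtype=torch.float32), torch.tensor(y, dtype=torch.float32)

# Hypothetical data: 1000 points of a sine wave, predicting the next value.
series = np.sin(np.linspace(0, 20 * np.pi, 1000)).astype(np.float32)
dataset = TimeSeriesDataset(series, series)

# Pin memory only when a GPU is present; no worker processes in this sketch.
loader = DataLoader(dataset, batch_size=32, shuffle=True,
                    num_workers=0, pin_memory=torch.cuda.is_available())

batch_x, batch_y = next(iter(loader))
print(len(dataset))    # 970 windows
print(batch_x.shape)   # torch.Size([32, 30])
print(batch_y.shape)   # torch.Size([32])
```

Note that batch_y has shape (32,), while a model ending in nn.Linear(hidden_dim, 1) returns (32, 1). The two need to be brought to the same shape before they are passed to a loss function.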
4. The Training Loop — Written Properly
Here's the loop you'll use in every project, with the parts that matter:
import torch
import torch.nn as nn
import torch.optim as optim

class SimpleModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SimpleModel(30, 64, 1).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

epochs = 50

# train_loader and val_loader are DataLoaders built as in section 3
for epoch in range(epochs):
    # Training
    model.train()
    train_loss = 0.0
    for batch_x, batch_y in train_loader:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)

        optimizer.zero_grad()                 # 1. Clear gradients
        outputs = model(batch_x).squeeze(-1)  # 2. Forward pass; (N, 1) -> (N,) to match batch_y
        loss = criterion(outputs, batch_y)    # 3. Compute loss
        loss.backward()                       # 4. Backward pass (compute gradients)
        optimizer.step()                      # 5. Update weights
        train_loss += loss.item()

    # Validation
    model.eval()
    val_loss = 0.0
    with torch.no_grad():  # No gradient tracking during eval
        for batch_x, batch_y in val_loader:
            batch_x, batch_y = batch_x.to(device), batch_y.to(device)
            outputs = model(batch_x).squeeze(-1)
            val_loss += criterion(outputs, batch_y).item()

    print(f"Epoch {epoch+1}/{epochs} | Train: {train_loss/len(train_loader):.4f} | Val: {val_loss/len(val_loader):.4f}")

The five steps inside the training loop are sacred:
- optimizer.zero_grad(): clear old gradients
- Forward pass: compute predictions
- Compute loss: measure how wrong you are
- loss.backward(): compute gradients via autograd
- optimizer.step(): update weights
Get these right, in this order, and a large share of training bugs disappear.
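As a quick sanity check of the five steps, here is a sketch that wraps them in one function and runs them on synthetic data. The data (random rows whose target is the row mean), the train_step helper, and the nn.Sequential model are all assumptions made for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical regression data: the target is the mean of each 30-value row.
X = torch.randn(256, 30)
y = X.mean(dim=1)

model = nn.Sequential(nn.Linear(30, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

def train_step(batch_x, batch_y):
    """Run the five steps once and return the loss as a Python float."""
    optimizer.zero_grad()                  # 1. clear old gradients
    outputs = model(batch_x).squeeze(-1)   # 2. forward pass, (N, 1) -> (N,)
    loss = criterion(outputs, batch_y)     # 3. compute loss
    loss.backward()                        # 4. backward pass
    optimizer.step()                       # 5. update weights
    return loss.item()

# Full-batch training for a few hundred steps is enough to see the trend.
losses = [train_step(X, y) for _ in range(200)]
print(f"first: {losses[0]:.4f}  last: {losses[-1]:.4f}")
```

A loss that falls on a tiny, easy problem like this does not prove a model is good, but a loss that refuses to fall here usually means one of the five steps is missing or out of order.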
5. Saving and Loading Models
# Save everything (model + optimizer state)
checkpoint = {
    "epoch": epoch,
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
    "loss": val_loss,
}
torch.save(checkpoint, "checkpoint.pt")

# Load
checkpoint = torch.load("checkpoint.pt", map_location=device)
model.load_state_dict(checkpoint["model_state"])
optimizer.load_state_dict(checkpoint["optimizer_state"])

For inference only (smaller file, no optimizer state):
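# Save
torch.save(model.state_dict(), "model_weights.pt")

# Load: rebuild the architecture first, then restore the weights
model = SimpleModel(30, 64, 1).to(device)
model.load_state_dict(torch.load("model_weights.pt", map_location=device))
model.eval()  # switch layers such as dropout and batch norm to inference mode

Summary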
- Tensors: GPU-accelerated arrays with gradient tracking
- Autograd: automatic differentiation; call .backward() and PyTorch handles the rest
- Datasets/DataLoaders: implement __len__ and __getitem__, let DataLoader handle batching
- Training loop: zero_grad → forward → loss → backward → step. Every. Single. Time.
- Save checkpoints: include optimizer state for resuming training
Master these fundamentals and the rest of PyTorch — distributed training, mixed precision, custom layers — becomes incremental learning, not a wall of confusion.
I build things that run on chips and the software that talks to them. ESP32, STM32, FreeRTOS, FastAPI, TinyML — from bare-metal firmware to cloud backends to on-device inference. Based in Bengaluru. Founder of Analog Data.