Modules

tensor

The tensor module contains the main Tensor class responsible for providing automatic differentiation capability.

The Tensor object is a NumPy array holding gradients, supporting a variety of common operations, computing gradients when request.

Values and Gradients

In its simplest form, Tensors are initialized with a value. Values accepted are (non-complex) numeric types, lists, and NumPy arrays.:

a = Tensor(value=1)
b = Tensor(np.random.normal(0,1,(100,100)))
c = Tensor([2])

Gradients of each Tensor are initialized 0, stored in .grad, and are of the same shape as the passed-in value. Gradients are 0 until backpropagation is manually called upon either the Tensor or a referee:

print(a.value, a.grad)
# -> 1.0 0.0

Operations and Topological Graphs

Tensors support a variety of operations as methods:

  • dunder methods: __add__, __sub__, __neg__, __truediv__, __mul__, __matmul__, __pow__

  • other common ops: sum, reshape, transpose, T, mean, std, conv2D, mask_idcs

  • activations: relu, tanh, sigmoid, softmax

  • NumPy-level broadcasting and slicing

For example, the below are all valid:

x = Tensor(value=np.ones((10,20,30)))

# built-ins
y = x + 2*x - x/2 + 3
y = (x**2)@np.ones((10,30,20))
y = x @ np.ones((1,20,30))      # broadcasting is performed
y = (-1*(x+2*x-3*x)/4 @ np.ones((10,30,20))) **2

# other
y = x.sum(axis=0); x.sum(axis=(0,1,2)); x.sum()
y = x.mean() + x.std()
y = x.transpose(); x.T
y = x.reshape((30,-1))
x2     = np.ones((1, 3, 32, 32))
kernel = np.ones((1, 1, 3, kH=5, kW=5))
kernel.Conv2D(x2)

# activations
x.relu()
x.tanh()
x.reshape((-1,20,1)).sigmoid()
x.softmax()

Applying an operation on a Tensor will always produce a new Tensor whose operands are those Tensors’ children. By applying successive operations on a Tensor, a computational graph is built and stored, thus, each Tensor has memory of the Tensors used to create it. The direct children of a Tensor can be found in its ._prev attribute.

A Tensor’s computational graph can be generated at any point in time with .create_graph(). This method returns two computational graphs: a complete topological graph, topo and a weights-only graph weights.

Both graphs are reverse-ordered (pre-order traversed) lists of operands performed on self, with self as root.

  • topo stores all Tensors that interacted to produce the Tensor.

  • weights only Tensors with learnable weights that produced the Tensor.

The following illustrates their difference:

x = Tensor(value=1, label='x')
y = x**2 + x + 1
topo, weights = y.create_graph()

In the above, 3 operations are performed on a created Tensor x, thus, topo is a list with the following elements in this order:

  1. x

  2. x**2

  3. x**2 + x

  4. y=x**2 + x + 1

weights however only contains the x Tensor, as in producing y, only one Tensor had learnable weights, with all intermediary Tensors not contributing uniquely to y.

Backwards propagation

Having a set of weights for a Tensor allows performing backpropagation on that Tensor, updating only those Tensors whose values directly contribute, ignoring the rest. This is how backward() is implemented in the library.

Backpropagation can be directly performed on a Tensor at any time using the .backward() method. This populates the gradients of all contibuting weights to that Tensor.

x = Tensor(value=1, label='x')
y = x**2 + x + 1
y.backward()
print(x.grad)
# -> 3.0
y.backward()
print(x.grad)
# -> 3.0
y.backward(reset_grad=False)
print(x.grad)
# -> 11.0

Backprop can be applied as many times as needed on a Tensor, however will default to resetting all previous backwards passes. To perform backpropagation multiple times on the same computational graph, set reset_grad=False. Each new backprop adds the previous gradients to the new one, using these added gradients for gradient computations further down the computational graph.

Gradient Descent example

The following shows a simple example of performing gradient descent on the Tensor x=1.

from pygrad.tensor import Tensor

n_iters = 1000
stepsize= 0.01
x       = Tensor(1)

for _ in range(n_iters):
    loss_fn = (x-1.5)**2
    loss_fn.backward(reset_grad=True)
    x.value = x.value - stepsize*x.grad

print(x.value, loss_fn.value)
# -> 1.4336... 0.0045...

For performing backprop automatically with more complex functions, use the optims module. Vectorized backprop with batched data is also supported, however to not modify the underlying model whilst performing vectorized forward and backwards passes will require creating a subclass of Module.

For all Tensor methods see: Tensor Methods

basics, activations, and losses

The basics, activations, and losses modules extend the functionality of the Tensor, by providing Pytorch-like classes that create a variety of higher-order Tensors commonly used in deep-learning.

These include:

  • Dropout, AddNorm, Linear, Softmax, Flatten, Conv2D layers,

  • ReLU activation,

  • BCELoss, CCELoss losses.

The above classes contain no dependencies other than the Tensor object and NumPy. Since Tensors``s use ``NumPy arrays under the hood, creating custom classes is thus very simple. For example, defining Dropout is done as follows:

from pygrad.tensor import Tensor
import numpy as np

class Dropout:
    def __init__(self, rate:float=0.1):
        self.rate = rate

    def __call__(self, x:Tensor, training:bool=True) -> Tensor:
        if training:
            n_points        = int(np.prod(x.shape)*self.rate)
            arr_indices     = np.unravel_index(np.random.choice(np.arange(0, np.prod(x.shape)),
                                                size=n_points,
                                                replace=False), x.shape)
            dropouted_pts   = x.mask_idcs(arr_indices)
            return dropouted_pts
        else:
            return x

Backpropagation can now be done using this Class, no different than with any other Tensor:

x = Tensor(np.array([1,1,1,1]))
d = Dropout(0.5)
otp = d(x)
otp.value
# -> array([1., 0., 0., 1.])
otp.backward()
x.grad
# -> array([1., 0., 0., 1.])

For all methods in basics, activations, and losses, see: basics, activations, losses.

optims

Classes for gradient descent such as SGD, SGD with Momentum, RMSProp, and Adam are defined here. Optimizers are designed to work with the weights of a Tensor (called a model), each having a .zero_grad method for resetting Tensor gradients, a .step method for updating model weights given a loss function, and a .step_single method for updating model weights progressively in a memory-sensitive manner when model weights are large. This method is further explained under module.

Basic usage is the same across all optimizers; initialize the optimizer with the model weights along with optimizer-specific parameters; reset the model gradient; do a forward pass and a backwards pass with a specified loss function; and step with the optimizer, feeding in the loss function.

from pygrad.tensor import Tensor
from pygrad.optims import SGD

x     = Tensor([1])
y     = x**2 + 1
model = y.create_graph()[1]                # fetching .weights from Tensor y
optim = SGD(model, lr=0.01)

for _ in range(100):
    optim.zero_grad()
    y    = x**2 + 1
    loss = (y-1.5)**2
    loss.backward()
    optim.step(loss)

print(x.value, y.value, loss.value)
# -> 0.7100436 1.50433688 1.88085134e-05

For all available options, see: optims options.

module

The Module class gives the ability to perform batched forward and backward passes on the model without mutating the model. Functions defined as classes representing models can also easily use optimizers as defined in optims.

Below shows how to convert a class-defined function into one subclassing Module.

class DNN:
    """Dense Neural Network, (28,28) -> (10) """
    def __init__(self, dtype=np.float32):

        self.dtype          = dtype
        self.flatten        = Flatten()
        self.dense1         = Linear(i_dim=28*28, o_dim=100)
        self.relu1          = ReLU()
        self.dense2         = Linear(i_dim=100, o_dim=10)

    def forward(self, x:Tensor):
        x = self.flatten(x)
        x = self.dense1(x)
        x = self.relu1(x)
        x = self.dense2(x)
        return x
The following have to now take place:
  1. Subclassing (pygrad.module.Module)

  2. A line super().__init__, passing in the expected model forward-pass inputs, with:
    • each input that is of type Tensor has to have set leaf=True

  3. Any calling of the model that has Tensor inputs requires the tensor to have leaf=True

  4. Any calling of the model requires keyword inputs.

class DNN2(Module):
    def __init__(self, dtype=PRECISION):

        self.dtype          = dtype
        batch_size          = 1
        self.flatten        = Flatten()
        self.dense1         = Linear(i_dim=28*28, o_dim=100)
        self.relu1          = ReLU()
        self.dense2         = Linear(i_dim=100, o_dim=10)
        super().__init__(x=Tensor(np.ones((batch_size, 1, 28, 28), dtype=self.dtype), leaf=True))

    def forward(self, x:Tensor):
        x = self.flatten(x)
        x = self.dense1(x)
        x = self.relu1(x)
        x = self.dense2(x)
        return x

By subclassing, model forward passes can be performed by calling the model on the needed inputs, ensuring that all input keyword arguments are specified:

model = DNN2()
model(**kwargs)

The below now illustrates the difference:

dnn1    = DNN()
dnn2    = DNN2()

batch_size = 16
x          = Tensor(np.ones((batch_size,28,28), dtype=np.float32), leaf=True)

fwd1    = dnn1.forward(x)   # shape=(batch_size, 1, 10)
fwd2    = dnn2(x=x)         # shape=(batch_size, 1, 10)

fwd1.backward()
fwd2.backward()

dnn1.dense2.W.value.shape, dnn2.dense2.W.value.shape
# -> ((batch_size, 100, 10), (1, 100, 10))
dnn1.dense2.W.grad.shape, dnn2.dense2.W.grad.shape
# -> ((batch_size, 100, 10), (1, 100, 10))

Both versions are able to apply batched forward passes on the input. However, due to Tensor automatically rescaling due to broadcasting, only the model subclassing Module is able to maintain the originally instantiated shape of values and gradients.

Performing gradient descent on the original model would require resetting values and gradient shapes of each model weight, and updating gradients according to the batched versions, undoing any broadcasting. This process is done automatically when subclassing Module, with the batched copy of the model available under model.copy.

Using Module makes it easy to perform gradient descent with optims:

  • Model weights are found in model.weights. These weights are given to the optimizer for updating.

  • Batched model data which is stored in the model after calling on batched data is reset with model.model_reset(). If this is not reset, the previous model gradients will accumulate. This will also stop the model from training if different batch sizes are given from one training epoch from the next.

from pygrad.tensor import Tensor
from pygrad.optims import SGD
import numpy as np

model       = DNN2()                        # defined previously, subclassing Module
optim       = SGD(model.weights, lr=0.1)    # model.weights property is available

n_epochs    = 25
batch_size  = 4

for _ in range(n_epochs):
    x           = Tensor(np.random.uniform(0,1,(batch_size,28,28)), leaf=True)
    y_true      = Tensor(np.ones((batch_size,1,10), dtype=np.float32), leaf=True)

    model.model_reset()
    optim.zero_grad()
    y_pred = model(x=x)
    loss   = ((y_pred - y_true)**2).sum(axis=-1).mean(axis=0, keepdims=False) # averages over the batch
    loss.backward()
    optim.step(loss)

The model’s weights will be updated here according to losses averaged over the batch, but without any change to their originally defined shape. For more training examples, see Examples.

For class methods, see Module