Lecture 3: Training Deep Neural Networks on a GPU

Session Recordings:
English: https://youtu.be/Qn5DDQn0fx0
Hindi: https://youtu.be/9MqR_BsogDw


Additional Resources

What to do after the lecture?

  • Run the Jupyter notebooks shared above (try other datasets)
  • Ask and answer questions on this topic (scroll down)
  • Start working on Assignment 3 - Feed Forward Neural Networks

Asking/Answering Questions:
Reply to this thread to ask questions. Before asking, scroll through the thread and check if your question (or a similar one) is already present. If yes, just like it. We will give priority to the questions with the most likes. The rest will be answered by our mentors or the community. If you see a question you know the answer to, please post your answer as a reply to that question. Let’s help each other learn!

How/when do we decide whether to use ‘cuda’ or ‘cpu’? Is there any data-size calculation to be done?

cuda == GPU, much faster, but limited by the size of VRAM
cpu == CPU, slower, but can usually use more data (limited by RAM/virtual memory)


Use cuda:

  • When an NVIDIA GPU is available
  • When speed is a bigger priority than cost (usually while training, or while serving predictions to many clients at once)

Use cpu:

  • When you have an AMD GPU or only a CPU available
  • When speed is not the priority (e.g. serving periodic predictions)

The batch size needs to be chosen based on the model and the amount of VRAM/RAM available on your machine.
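A minimal sketch of the usual device-selection pattern in PyTorch (the tensor shape here is just an arbitrary example):

```python
import torch

# Pick 'cuda' when an NVIDIA GPU is available, else fall back to 'cpu'.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Moving a tensor (or a whole model, via model.to(device)) is the same
# call either way, so the rest of the training code doesn't change.
x = torch.randn(64, 784)   # one batch of flattened 28x28 images
x = x.to(device)
print(x.device.type)       # 'cuda' or 'cpu', depending on the machine
```

If a batch doesn't fit in VRAM you'll get a CUDA out-of-memory error, which is usually the signal to reduce the batch size.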

How do we decide which function to use for non-linearity? And why do we only use functions like ReLU, sigmoid, or tanh, which become constant beyond a certain level?

1 Like

Why ReLU - the derivative is incredibly easy to calculate:
1 - above 0
0 - below or equal to 0
Sometimes you can also encounter leaky ReLU, which treats negative values a bit differently. Its derivative is still easy to calculate, and it works well for approximating arbitrary functions.
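You can check the “easy derivative” claim directly with autograd; a small sketch (the input values are arbitrary):

```python
import torch
import torch.nn.functional as F

# ReLU: derivative is 1 above zero, 0 at or below zero.
x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)   # tensor([0., 0., 1., 1.])

# Leaky ReLU: same idea, but negative inputs get a small constant
# slope instead of 0, so gradients don't die there.
y = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)
F.leaky_relu(y, negative_slope=0.01).sum().backward()
print(y.grad)   # tensor([0.0100, 0.0100, 1.0000, 1.0000])
```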

Tanh/sigmoid - they’re mainly used whenever you want to be sure that the values produced by the network stay in the desired range.
You can also use a sigmoid to represent the probability of something belonging to some class (useful for multi-label classification, since you can’t use regular softmax cross-entropy there — each label needs its own binary cross-entropy instead).
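A small sketch of that multi-label setup, with made-up logits and targets for 2 samples and 3 independent labels:

```python
import torch
import torch.nn as nn

# Hypothetical logits: 2 samples, 3 independent labels each.
logits = torch.tensor([[ 2.0, -1.0, 0.5],
                       [-0.5,  3.0, -2.0]])

# Sigmoid squashes each logit to (0, 1) independently,
# giving a per-label probability (they need not sum to 1).
probs = torch.sigmoid(logits)
print(probs)

# For training, BCEWithLogitsLoss applies the sigmoid internally
# and computes a binary cross-entropy per label.
targets = torch.tensor([[1.0, 0.0, 1.0],
                        [0.0, 1.0, 0.0]])
loss = nn.BCEWithLogitsLoss()(logits, targets)
print(loss.item())
```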

How do we decide - that’s the question :smiley:
There have been some self-normalizing activation functions, but honestly, I’ve yet to notice any nice results coming from them.
I like parametrized functions very much — the network not only learns the weights for classification, but also adjusts the activation function itself to match whatever problem it encounters.
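PyTorch ships one such parametrized activation, `nn.PReLU`, whose negative slope is a learnable parameter updated alongside the weights; a quick sketch:

```python
import torch
import torch.nn as nn

# nn.PReLU learns its negative slope; it's initialised to 0.25.
prelu = nn.PReLU()
x = torch.tensor([-2.0, 1.0])
print(prelu(x))   # negatives scaled by the slope: [-0.5, 1.0]

# Dropping it into a model adds the slope to the parameter list,
# so the optimizer updates it during training like any weight.
model = nn.Sequential(nn.Linear(784, 32), nn.PReLU(), nn.Linear(32, 10))
print(sum(p.numel() for p in model.parameters()))  # includes the extra slope
```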

Why am I getting this error ?

    image, label = dataset[0]
    print('image.shape:', image.shape)
    plt.imshow(image.permute(1, 2, 0), cmap='gray')  # <--- error encountered
    print('Label:', label)

Error message:
TypeError: Invalid shape (28, 28, 1) for image data

Because your image has only one channel. You need to use squeeze(), not permute(), when viewing such single-channel images.

If your images have only one channel in this assignment, then I suppose there’s an error somewhere else (CIFAR-10 has RGB images).


Thanks for your response, but I still don’t get it. This is the exact same code from the lecture 3 starter notebook.

OK, I’ve noticed that you’re trying to use the MNIST dataset, not CIFAR-10, but the code seems to come from the notebook where the CIFAR-10 dataset is used.

I’ve looked at the starter notebook for this lecture, and there’s no such code in there. Are you sure you didn’t copy some code from another notebook?

this is the notebook : https://jovian.ai/learn/deep-learning-with-pytorch-zero-to-gans/lesson/lesson-3-training-deep-neural-networks-on-a-gpu

And please check cell In [5].

Gotta admit that it’s weird (mention for @aakashns because it’s his notebook).

Maybe it’s some newer/older version of matplotlib.

As someone pointed out though, you need to use squeeze() still if such error happens :stuck_out_tongue:

1 Like

:sweat_smile: or maybe we can use image[0, :, :] as an alternative.

What’s wrong with squeeze? :stuck_out_tongue:
It’s a bit cleaner to use image.squeeze() than image[0, :, :] imo.
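A quick sketch of the difference, using a random single-channel tensor in place of the MNIST image:

```python
import torch

# A grayscale tensor shaped like an MNIST sample:
# (channels, height, width) = (1, 28, 28).
image = torch.rand(1, 28, 28)

# permute(1, 2, 0) keeps the size-1 channel axis -> (28, 28, 1),
# which plt.imshow rejects on some matplotlib versions.
print(image.permute(1, 2, 0).shape)   # torch.Size([28, 28, 1])

# squeeze() drops the size-1 axis -> (28, 28), which imshow accepts.
print(image.squeeze().shape)          # torch.Size([28, 28])

# Indexing the channel works too, and is equivalent here.
print(image[0, :, :].shape)           # torch.Size([28, 28])
```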


Thanks for the response can you suggest any good place to read and understand activation functions

Depends on what you really want to know about them. They’re just non-linear functions, and that’s all there is to them :stuck_out_tongue:

The link above describes the Swish function, which I’ve parameterized and animated — you can see it in my previous post. There are probably more functions out there, but I think it really depends on the use case.


Here is my assignment :blush: Please feel free to check it out.

The highest accuracy my model got on this dataset is 49%. I have tried many different loss functions, activation functions, optimizers, numbers of hidden layers, and batch sizes, but the accuracy never went up.
Is it normal? Please do tell me if I am missing something. :blush: Thank you.

does anybody know why am I getting this error ?
The same accuracy function worked fine during the training phase.

Perhaps you defined a variable named accuracy somewhere else, which shadows the original function. Since training runs successfully, I would look at the code after the training phase.
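The shadowing bug is easy to reproduce; here’s a minimal sketch with a hypothetical accuracy function (your actual definition may differ):

```python
import torch

def accuracy(outputs, labels):
    # Fraction of predictions that match the labels.
    preds = torch.argmax(outputs, dim=1)
    return (preds == labels).float().mean()

outputs = torch.tensor([[0.9, 0.1], [0.2, 0.8]])
labels = torch.tensor([0, 1])
print(accuracy(outputs, labels))   # tensor(1.)

# The bug: assigning the result to a variable with the same name
# shadows the function, so the next call fails with
# "'Tensor' object is not callable".
accuracy = accuracy(outputs, labels)
try:
    accuracy(outputs, labels)
except TypeError as e:
    print(e)
```

Renaming the result variable (e.g. `acc = accuracy(...)`) fixes it.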

1 Like

Thanks mate. :smile: :smiley: