Can anyone explain to me in detail what is happening here? What exactly is torch doing?
import torch

# Convert the padded token IDs and the attention mask into tensors
input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)
# Run the forward pass without tracking gradients
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)
The output is the same with and without using torch. Moreover, I didn’t observe any significant difference in the run time either.
Sebgolos (Sebastian) December 25, 2020, 10:20am #3
I’m not sure which part of torch you mean here. I’m sure the output wouldn’t be the same if you didn’t use torch at all.
If it’s torch.no_grad(), then the output will not change, because no_grad is used only to avoid gradient calculation.
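For example, here is a minimal sketch (with a hypothetical tiny nn.Linear model standing in for BERT) showing that the forward output is identical either way; only the gradient tracking differs:

import torch
import torch.nn as nn

# Hypothetical tiny model and input, just to illustrate the point
model = nn.Linear(4, 2)
x = torch.randn(1, 4)

out_tracked = model(x)            # gradient tracking on
with torch.no_grad():
    out_untracked = model(x)      # gradient tracking off

print(torch.allclose(out_tracked, out_untracked))  # True: same values
print(out_tracked.requires_grad)                   # True
print(out_untracked.requires_grad)                 # False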
What I meant was

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

gives the same output as

last_hidden_states = model(input_ids, attention_mask=attention_mask)

Why am I using torch.no_grad() here?
Sebgolos (Sebastian) December 25, 2020, 12:49pm #5
TL;DR: to avoid gradient calculation.

A bit longer answer:
You are only asking for a prediction for the given input (a forward pass), so there is no need to track gradients, which would only be needed for backpropagation.
By skipping gradient calculation, you keep memory usage low, because the intermediate values autograd keeps around for the backward pass usually take a lot of memory.
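As an illustration (again with a hypothetical small nn.Linear model rather than the BERT model from the notebook), inside torch.no_grad() no autograd graph is recorded, so nothing is kept around for a backward pass:

import torch
import torch.nn as nn

# Hypothetical small model, just to show the autograd bookkeeping
model = nn.Linear(1024, 1024)
x = torch.randn(64, 1024)

out = model(x)
print(out.requires_grad)   # True: autograd recorded the forward operations

with torch.no_grad():
    out = model(x)
print(out.requires_grad)   # False: no graph was built, nothing extra is stored for backward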