What is GELU activation?
I was going through the BERT paper, which uses GELU (Gaussian Error Linear Unit) and states the equation as
$$\text{GELU}(x) = xP(X \le x) = x\Phi(x),$$ which is approximated by $$0.5x\left(1 + \tanh\left[\sqrt{2/\pi}\,(x + 0.044715x^3)\right]\right)$$
Could you simplify the equation and explain how it has been approximated?
activation-function bert mathematics
asked 17 hours ago by thanatoz
2 Answers
For this type of numerical approximation, the key idea is to find a similar function (based on experience), parameterize it, and then fit it to a set of points from the original function.
Let's expand the cumulative distribution function $\Phi(x)$:
$\text{GELU}(x):=x\Phi(x)=0.5x\left(1+\text{erf}\left(\frac{x}{\sqrt{2}}\right)\right)$
Note that this is a definition, not an equation (a relation).
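As a quick sanity check of this definition, here is a minimal sketch using only the Python standard library. The function names `gelu_erf` and `gelu_cdf` are made up for illustration; the point is only that the erf form and $x\Phi(x)$ agree numerically.

```python
import math
from statistics import NormalDist

def gelu_erf(x):
    """GELU written with the error function: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_cdf(x):
    """GELU written as x * Phi(x), with Phi the standard normal CDF."""
    return x * NormalDist(mu=0.0, sigma=1.0).cdf(x)

# Compare the two forms on a few sample points.
for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
    print(x, gelu_erf(x), gelu_cdf(x))
```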
Knowing that $\text{erf}(x)$ is very close to $\tanh(x)$ and that the first derivatives of $\text{erf}\left(\frac{x}{\sqrt{2}}\right)$ and $\tanh\left(\sqrt{\frac{2}{\pi}}x\right)$ coincide at $x=0$, where both equal $\sqrt{\frac{2}{\pi}}$, we proceed to fit
$$\tanh\left(\sqrt{\frac{2}{\pi}}(x+ax^2+bx^3+cx^4+dx^5)\right)$$ (or more terms) to some points $\left(x_i, \text{erf}\left(\frac{x_i}{\sqrt{2}}\right)\right)$. I fitted this function to 20 samples in $(-1.5, 1.5)$ (using this site), and here are the coefficients:

By setting $a=c=d=0$, $b$ was estimated to be $0.04495641$. With more samples from a wider range (that site only allowed 20), the coefficient $b$ should be closer to the paper's $0.044715$. Finally we get:
$\text{GELU}(x)=x\Phi(x)=0.5x\left(1+\text{erf}\left(\frac{x}{\sqrt{2}}\right)\right)\approx 0.5x\left(1+\tanh\left(\sqrt{\frac{2}{\pi}}(x+0.044715x^3)\right)\right)$
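The fitting step can be sketched roughly as follows. This is not the tool used above: it is a standard-library illustration that fixes $a=c=d=0$ and finds $b$ with a golden-section search on the sum of squared errors. The helper names (`fit_b`, `sse`, `linspace`), the search interval $[0, 0.1]$, and the second, wider sample grid are all assumptions made for the example, so the fitted values need not match the figures quoted in this answer exactly.

```python
import math

K = math.sqrt(2.0 / math.pi)  # shared slope of both functions at x = 0

def target(x):
    """erf(x / sqrt(2)), the function being approximated."""
    return math.erf(x / math.sqrt(2.0))

def model(x, b):
    """tanh(sqrt(2/pi) * (x + b * x**3)), the one-parameter candidate."""
    return math.tanh(K * (x + b * x ** 3))

def sse(b, xs):
    """Sum of squared errors between model and target on the sample points."""
    return sum((model(x, b) - target(x)) ** 2 for x in xs)

def fit_b(xs, lo=0.0, hi=0.1, iters=100):
    """Golden-section search for the b minimising sse on [lo, hi].

    Assumes sse is unimodal in b on this interval.
    """
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
        if sse(c, xs) < sse(d, xs):
            hi = d  # minimum lies in [lo, d]
        else:
            lo = c  # minimum lies in [c, hi]
    return (lo + hi) / 2.0

def linspace(start, stop, n):
    """n evenly spaced points from start to stop, inclusive."""
    return [start + (stop - start) * i / (n - 1) for i in range(n)]

xs_narrow = linspace(-1.5, 1.5, 20)   # same count and range as in the answer
xs_wide = linspace(-3.0, 3.0, 200)    # hypothetical wider, denser grid

b_narrow = fit_b(xs_narrow)
b_wide = fit_b(xs_wide)
print("b fitted on 20 points in [-1.5, 1.5]:", b_narrow)
print("b fitted on 200 points in [-3, 3]:  ", b_wide)
```

A one-dimensional search is enough here because only $b$ is free; fitting $a$, $c$ and $d$ as well would need a proper nonlinear least-squares routine.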
Note that if we did not utilize the relation between the first derivatives, the factor $\sqrt{\frac{2}{\pi}}$ would have been absorbed into the parameters as follows
$$0.5x\left(1+\tanh\left(0.797885x+0.035677x^3\right)\right)$$
which is less beautiful (less analytical, more numerical)!
A similar relation holds between $\text{erf}(x)$ and $2\left(\sigma(x)-\frac{1}{2}\right)$ (sigmoid), which is proposed in the paper as another approximation.
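To get a feel for how the two approximations compare, here is a small sketch that measures their maximum absolute error against the exact GELU on a grid. The sigmoid form $x\,\sigma(1.702x)$ and its constant $1.702$ are taken from the GELU paper as I recall it, not from this answer, and the grid $[-5, 5]$ with step $0.01$ is an arbitrary choice.

```python
import math

def gelu_exact(x):
    """Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """tanh approximation with the paper's coefficient 0.044715."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def gelu_sigmoid(x):
    """Sigmoid approximation x * sigmoid(1.702 * x) (constant assumed from the GELU paper)."""
    return x / (1.0 + math.exp(-1.702 * x))

# Evaluation grid: -5.00, -4.99, ..., 5.00 (an arbitrary choice for illustration).
xs = [i / 100.0 for i in range(-500, 501)]

err_tanh = max(abs(gelu_tanh(x) - gelu_exact(x)) for x in xs)
err_sigmoid = max(abs(gelu_sigmoid(x) - gelu_exact(x)) for x in xs)

print("max |tanh approx - exact|    on [-5, 5]:", err_tanh)
print("max |sigmoid approx - exact| on [-5, 5]:", err_sigmoid)
```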

edited 11 hours ago
answered 11 hours ago by Esmailian
First note that $$\Phi(x) = \frac12 \mathrm{erfc}\left(-\frac{x}{\sqrt{2}}\right) = \frac12 \left(1 + \mathrm{erf}\left(\frac{x}{\sqrt2}\right)\right)$$ by parity of $\mathrm{erf}$. We need to show that $$\mathrm{erf}\left(\frac x {\sqrt2}\right) \approx \tanh\left(\sqrt{\frac2\pi} \left(x + a x^3\right)\right)$$ for $a \approx 0.044715$.
For large values of $x$, both functions are bounded in $[-1, 1]$. For small $x$, the respective Taylor series read $$\tanh(x) = x - \frac{x^3}{3} + o(x^3)$$ and $$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \left(x - \frac{x^3}{3}\right) + o(x^3).$$
Substituting, we get that $$
\tanh\left(\sqrt{\frac2\pi} \left(x + a x^3\right)\right) = \sqrt{\frac{2}{\pi}} \left(x + \left(a-\frac{2}{3\pi}\right)x^3\right) + o(x^3)
$$
and
$$
\mathrm{erf}\left(\frac x {\sqrt2}\right) = \sqrt{\frac2\pi} \left(x - \frac{x^3}{6}\right) + o(x^3).
$$
Equating the coefficients of $x^3$ gives $a - \frac{2}{3\pi} = -\frac16$, so we find
$$
a = \frac{2}{3\pi} - \frac16 \approx 0.04553992412
$$
close to the paper's $0.044715$.
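The Taylor-matched value can be checked numerically with a short standard-library sketch. The helper `max_error` and the two evaluation ranges are assumptions made for illustration; the idea is to compare $a = \frac{2}{3\pi} - \frac16$ with the paper's $0.044715$ both close to zero, where the series argument applies, and on a wider interval.

```python
import math

K = math.sqrt(2.0 / math.pi)

a_taylor = 2.0 / (3.0 * math.pi) - 1.0 / 6.0  # from matching the x^3 coefficients
a_paper = 0.044715                            # value quoted from the paper

def erf_scaled(x):
    """erf(x / sqrt(2)), the left-hand side of the approximation."""
    return math.erf(x / math.sqrt(2.0))

def tanh_model(x, a):
    """tanh(sqrt(2/pi) * (x + a * x**3)), the right-hand side."""
    return math.tanh(K * (x + a * x ** 3))

def max_error(a, half_width, n=2001):
    """Largest |tanh_model - erf_scaled| on n evenly spaced points in [-half_width, half_width]."""
    xs = [-half_width + 2.0 * half_width * i / (n - 1) for i in range(n)]
    return max(abs(tanh_model(x, a) - erf_scaled(x)) for x in xs)

print("a from Taylor matching:", a_taylor)
for half_width in (0.5, 4.0):
    print("range +/-", half_width,
          "| Taylor a:", max_error(a_taylor, half_width),
          "| paper a:", max_error(a_paper, half_width))
```

Matching Taylor coefficients only constrains the behaviour near $x=0$, which is one plausible reason a value fitted over a wider range differs slightly from it.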
edited 11 hours ago
answered 11 hours ago by BookYourLuck