Is it 40% or 0.4%?
A variable that should contain percentages also contains some "ratio" values, for example:
0.61
41
54
.4
.39
20
52
0.7
12
70
82
The true distribution parameters are unknown, but I guess the distribution is unimodal, with most (say over 70% of) values occurring between 50% and 80%; very low values (e.g., 0.1%) are also possible, however.
Are there any formal or systematic approaches to determine the likely format in which each value is recorded (i.e., ratio or percent), assuming no other variables are available?
data-cleaning
2
I'm voting to close this question as off-topic because it is impossible to definitively answer. If you don't know what the data mean, how will strangers on the internet know?
– Sycorax
yesterday
2
What the data mean != what is the (data) mean.
– Nick Cox
yesterday
1
You have 3 options: your big numbers are falsely big and need a decimal point in front; your small numbers are falsely small and need a 100x multiplier; or your data are just fine. Why don't you plot the qqnorm of all three options?
– EngrStudent
yesterday
2
There are plenty of potentially efficient ways to approach this. The choice depends on how many values are 1.0 or less and how many exceed 1.0. Could you tell us these quantities for the problem(s) you have to deal with? @EngrStudent The interest lies in (hypothetical) situations where some of the very low values actually are percents. That can lead to exponentially many options (as a function of the dataset size) rather than just three (actually two: two of your options lead to the same solution).
– whuber♦
23 hours ago
6
I'm guessing that "ask the people who collected the data" isn't a valid option, here?
– nick012000
19 hours ago
asked yesterday, edited yesterday by Orion
2 Answers
Assuming:
- The only data you have are the percents/ratios (no other related explanatory variables).
- Your percents come from a unimodal distribution $P$ and the ratios come from the same unimodal distribution $P$, but squished by $100$ (call it $P_{100}$).
- The percents/ratios are all between $0$ and $100$.
Then there is a single cutoff point $K$ (with $K < 1.0$, obviously) such that everything under $K$ is more likely to have been sampled from $P_{100}$ and everything over $K$ is more likely to have been sampled from $P$.
You should be able to set up a maximum-likelihood function with a binary parameter on each data point, plus any parameters of your chosen $P$.
Afterwards, find $K :=$ the point where the densities of $P$ and $P_{100}$ intersect, and use that to clean your data.
In practice, just split your data into 0-1 and 1-100, fit and plot both histograms, and fiddle around with what you think $K$ is.
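As an illustrative sketch of the likelihood comparison described above (not part of the original answer): assume, purely for the example, that $P$ is roughly normal, fit it on the unambiguous values above 1, and classify each small value by comparing its density under $P$ against its density under $P_{100}$. The data and all names below are hypothetical.

```python
import math
import statistics

# Hypothetical example data, taken from the question.
values = [0.61, 41, 54, 0.4, 0.39, 20, 52, 0.7, 12, 70, 82]

# Fit the assumed-normal P using only unambiguous percents (values > 1
# cannot be ratios, since a ratio lies in [0, 1]).
percents = [v for v in values if v > 1.0]
mu = statistics.mean(percents)
sigma = statistics.stdev(percents)

def normal_pdf(x):
    """Density of the assumed normal P at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def classify(v):
    """Label v 'percent' or 'ratio' by comparing likelihoods under P and P_100.

    If r is a ratio, the corresponding percent is 100*r, so by change of
    variables the density of P_100 at r is 100 * normal_pdf(100*r).
    """
    if v > 1.0:
        return "percent"  # out of range for a ratio
    as_percent = normal_pdf(v)
    as_ratio = 100 * normal_pdf(100 * v)
    return "ratio" if as_ratio > as_percent else "percent"

# Rescale the values classified as ratios onto the percent scale.
cleaned = [100 * v if classify(v) == "ratio" else v for v in values]
```

The implied cutoff $K$ is where the two densities cross; with a heavier-tailed or mixture choice of $P$, the same comparison goes through unchanged.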
Here's one method of determining whether your data are percents or proportions: if there are out-of-bounds values for a proportion (e.g., 52, 70, 82, 41, 54, to name a few), then they must be percents.
Therefore, your data must be percents. You're welcome.
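The boundary check this answer relies on can be sketched in a couple of lines (the data below are the example values from the question):

```python
# Hypothetical example data from the question.
values = [0.61, 41, 54, 0.4, 0.39, 20, 52, 0.7, 12, 70, 82]

# A proportion must lie in [0, 1], so anything above 1.0 can only be a percent.
must_be_percents = [v for v in values if v > 1.0]
# Values at or below 1.0 are the genuinely ambiguous ones.
ambiguous = [v for v in values if v <= 1.0]
```

As the comments below point out, this settles only the values above 1; the small ones remain ambiguous.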
3
The issue is that the two are mixed together. It's not all percents or all ratios/proportions. 49 is a percentage, but 0.49 could be either.
– The Laconic
yesterday
3
If you can't assume there is a unified format for all of the rows, then the question is obviously unanswerable. In the absence of any other information, it's anyone's guess whether the 0.4 is a proportion or a percentage. I chose to answer the only answerable interpretation of the question.
– beta1_equals_beta2
yesterday
(First answer: answered 23 hours ago by djma; edited 5 hours ago by Nick Cox.)
(Second answer: answered yesterday by beta1_equals_beta2.)