Is it 40% or 0.4%?

A variable that should contain percentages also contains some values recorded as ratios, for example:



0.61
41
54
.4
.39
20
52
0.7
12
70
82


The true distribution parameters are unknown, but I guess it is unimodal, with most (say over 70% of) values falling between 50% and 80%; very low values (e.g., 0.1%) are also possible.



Are there any formal or systematic approaches to determine the likely format in which each value is recorded (i.e., ratio or percent), assuming no other variables are available?










  • I'm voting to close this question as off-topic because it is impossible to definitively answer. If you don't know what the data mean, how will strangers on the internet know? – Sycorax, yesterday

  • What the data mean != what is the (data) mean. – Nick Cox, yesterday

  • You have three options: your big numbers are falsely big and need a decimal point in front; your small numbers are falsely small and need a 100x multiplier; or your data are just fine. Why don't you plot the qqnorm of all three options? – EngrStudent, yesterday

  • There are plenty of potentially efficient ways to approach this. The choice depends on how many values are 1.0 or less and how many exceed 1.0. Could you tell us these quantities for the problem(s) you have to deal with? @EngrStudent The interest lies in (hypothetical) situations where some of the very low values actually are percents. That can lead to exponentially many options (as a function of the dataset size) rather than just three (actually two; two of your options lead to the same solution). – whuber, 23 hours ago

  • I'm guessing that "ask the people who collected the data" isn't a valid option, here? – nick012000, 19 hours ago
















data-cleaning






asked yesterday by Orion (edited yesterday)
2 Answers


















5













Assuming:




  • The only data you have are the percents/ratios (no other related explanatory variables).

  • Your percents come from a unimodal distribution $P$, and the ratios come from the same unimodal distribution squished by $100$ (call it $P_{100}$).

  • The percents/ratios are all between $0$ and $100$.


Then there is a single cutoff point $K$ (with $K < 1.0$, obviously) below which every value is more likely to have been sampled from $P_{100}$, and above which every value is more likely to have been sampled from $P$.



You should be able to set up a maximum-likelihood function with a binary parameter on each data point, plus any parameters of your chosen $P$.



Afterwards, find $K$, the point where the densities of $P$ and $P_{100}$ intersect, and use it to clean your data.



In practice, just split your data into 0–1 and 1–100, fit and plot both histograms, and fiddle around with what you think $K$ is.
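The per-value comparison above can be sketched in a few lines. This is a minimal illustration, not the full mixture MLE: it assumes (purely for the example) that $P$ is roughly normal, estimates its parameters from the unambiguous values above 1, and then asks, for each value at or below 1, whether it is more plausible as a ratio (so rescale by 100) or as a genuinely tiny percent:

```python
import math

values = [0.61, 41, 54, 0.4, 0.39, 20, 52, 0.7, 12, 70, 82]

# Values above 1 cannot be ratios, so they are unambiguous percents.
# Use them to estimate P; treating P as normal is an illustrative
# assumption, not something the answer prescribes.
clear_percents = [v for v in values if v > 1.0]
mu = sum(clear_percents) / len(clear_percents)
sigma = math.sqrt(sum((v - mu) ** 2 for v in clear_percents)
                  / (len(clear_percents) - 1))

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# For each ambiguous value v <= 1, compare the likelihood that v is a
# ratio (100*v drawn from P) with the likelihood that v is a genuinely
# tiny percent (v itself drawn from P); rescale when the ratio reading
# is more plausible.
cleaned = []
for v in values:
    if v <= 1.0 and normal_pdf(100 * v, mu, sigma) > normal_pdf(v, mu, sigma):
        cleaned.append(100 * v)   # classified as a ratio: convert to percent
    else:
        cleaned.append(v)         # classified as already being a percent
```

With the sample data above, all four sub-1 values (0.61, .4, .39, 0.7) get rescaled to 61, 40, 39, and 70, since those land much nearer the bulk of the clear percents. Note that under the plug-in normal assumption every sub-1 value is pulled toward the bulk; if genuinely tiny percents are expected, a heavier-tailed or explicitly mixture-based fit for $P$ is needed.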






– djma, answered 23 hours ago (edited 5 hours ago by Nick Cox)
    0













    Here's one method of determining whether your data are percents or proportions: if there are out-of-bounds values for a proportion (e.g. 52, 70, 82, 41, 54, to name a few) then they must be percents.



    Therefore, your data must be percents. You're welcome.






    – beta1_equals_beta2, answered yesterday

    • The issue is that the two are mixed together. It's not all percents or all ratios/proportions. 49 is a percentage, but 0.49 could be either. – The Laconic, yesterday

    • If you can't assume there is a unified format for all of the rows, then the question is obviously unanswerable. In the absence of any other information, it's anyone's guess whether 0.4 is a proportion or a percentage. I chose to answer the only answerable interpretation of the question. – beta1_equals_beta2, yesterday










