nm0940: Today's lecture follows on very directly from what I was saying yesterday, and I'm afraid only one overhead is working, so I'll have to be over here all the time. Yesterday I was introducing the idea of significance, and I talked about five key ideas, which I'm going to run over very quickly to start with today; then I'm going to add two new ideas on the end. The five key ideas from yesterday were the null hypothesis, the test statistic, the P-value, the critical region and the significance level. Those are five key ideas we've got to get on board before we can go any further, so let me quickly review them. First, the null hypothesis, H-sub-zero. At a rather general level, the null hypothesis is just some statement about the distribution of the data we have observed. Second, the test statistic, in the notation I used yesterday: a test statistic is a function of the data which we look at in order to tell us whether we think H-nought is reasonable or not in the light of the data. Usually the test statistic is some kind of difference between what we've observed and what we would expect under the null hypothesis, and the idea is that the bigger the value of the test statistic, the more doubt we have that H-nought is actually reasonable in the light of the data. So if we're going to be accepting and rejecting H-nought, which we're going to be doing with ever increasing frequency as we go along, the logic is that we will want to reject H-nought, that is, think H-nought unreasonable, when T is large; and the larger T is, the more unreasonable we think H-nought is. That is encapsulated by the third idea, the P-value: the probability that by chance, if the null hypothesis is true, I would get a value of the test statistic greater than or equal to the one I actually got. In the notation I talked about yesterday, we pretend for a moment that the null hypothesis actually is true, and we ask ourselves what the chance is that we would have come up with a value of the test statistic which is as extreme as, or more extreme than, the one we've actually got. Notice the convention of capital letters for random variables and small letters for observed values that I keep going on about; you really see the importance of that convention here. The fourth idea I talked about yesterday was the critical region. If we have a significance test, that is, some procedure for deciding on the basis of X whether or not H-nought is reasonable, the critical region is just the set of all possible data sets that would lead us to think that H-nought was unreasonable, or in the jargon, the sets of data which would lead us to reject H-nought, or if you like, to decide that H-nought is false. The first thing I'm really going to do today, formally, is to point out that a sensible critical region should consist of those values of the data X whose P-value is sufficiently small. The critical region is a general concept that applies to any decision procedure we might propose. Finally, the significance level, alpha; everybody uses that notation. The significance level is the probability that, if H-nought is true, we would by chance get a data set in the critical region.
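In symbols, using the capital-and-small-letter convention just mentioned, the five ingredients can be written compactly:

```latex
\begin{align*}
&H_0: \text{ some statement about the distribution of the data } X;\\
&T = T(X): \text{ the test statistic, with large } T \text{ casting doubt on } H_0;\\
&p(x) = P\bigl(T(X) \ge T(x) \mid H_0\bigr): \text{ the $P$-value of the observed data } x;\\
&C: \text{ the critical region, the set of all } x \text{ for which we reject } H_0;\\
&\alpha = P(X \in C \mid H_0): \text{ the significance level.}
\end{align*}
```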
So those are the five things I talked about yesterday, and understanding these five key ideas is absolutely mandatory if we're going to get anywhere at all in talking about significance. The first thing I want to do today is to try, formally, to explain what the connection is between the P-value and the significance level. The idea, which I've already mentioned very briefly, is that a sensible critical region should be the set of data whose P-values are less than some threshold, and that threshold is none other than alpha. I'm going to state that as a theorem, although you can hardly justify the word in a mathematical sense. The theorem says: take the significance test given by this particular critical region, the set of data x such that the P-value corresponding to x is less than or equal to a threshold alpha. If we take that as the critical region, in other words if we decide H-nought is false when the P-value is less than or equal to alpha, then that is precisely a significance test with significance level alpha. The proof, once you see the idea, is completely trivial, and rather than trying to write it out formally it's easiest just to look at a picture. Think of the values of the test statistic: here is the distribution of the test statistic we would expect to see if the null hypothesis is true. It looks like a normal distribution, though of course it doesn't have to be; we've just got some distribution for T under H-nought. Now, what do we do with the P-value? We locate, on the scale of the test statistic, the particular value of the test statistic we've observed; notice a little x here, since capital X is the general one. From the definition, which you can still see at the top of the screen, the P-value is this area: the probability of getting by chance a value of T(X) bigger than or equal to T(x). So what we have to show is that if we take as the critical region all possible data sets such that this area is less than or equal to alpha, then that test has significance level alpha. In other words (this black pen is getting a bit worn out, so I'll do it in blue) we have to show that the probability that X belongs to the critical region, calculated under the assumption of H-nought, is equal to alpha. From the definition of C in the theorem, that is the probability that p(X) is less than or equal to alpha. And this is where we use the picture: when is this P-value, this area I shaded, less than or equal to alpha? Obviously, exactly when the left-hand end of the shaded area is at or beyond the one-minus-alpha quantile (remember quantiles). So this is exactly the same as the probability that the value of T I get is greater than or equal to the point under this distribution which cuts off area alpha to the right, and in the notation we used before, that point is the one-minus-alpha quantile of the distribution of T, with everything calculated under the assumption of the null hypothesis. So what is the chance that a random variable will by chance give you a value greater than or equal to its one-minus-alpha quantile? The suffix one-minus-alpha means you have a chance one-minus-alpha to the left, so the chance to the right is one minus one-minus-alpha, which is alpha. That's the proof; it's kind of trivial really, it's just a matter of understanding which probabilities we're talking about.
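The same argument as a single chain of equalities, with q standing for the one-minus-alpha quantile of T under H-nought (and assuming, for the last step, that T has a continuous distribution, so the quantile behaves exactly):

```latex
P(X \in C \mid H_0)
  = P\bigl(p(X) \le \alpha \mid H_0\bigr)
  = P\bigl(T(X) \ge q_{1-\alpha} \mid H_0\bigr)
  = 1 - (1 - \alpha)
  = \alpha .
```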
So this shows that the significance test rejecting H-nought when the P-value is less than or equal to alpha is a significance test with significance level alpha. Now, in a sense this is extremely subtle, because we now have two interpretations of alpha. (I'm sorry I don't have another overhead to switch over to, so I hope you're not losing too much of this on the screen.) Can I just emphasise this by noting up here that this theorem gives us two interpretations of the quantity alpha. The first interpretation, the one you can still see at the top there: alpha is a threshold for the P-value. The P-value is a measure of how surprised we are to get a particular value of our test statistic, so this interpretation is saying something like: alpha is a measure of how surprised we have got to be before we reject H-nought, where by reject I mean declare it to be false, or believe it to be false. So the first interpretation is a threshold on the P-value: how surprised we have to be, with the data set in front of us, in order to think that H-nought is false. It's a kind of measure of surprise, a measure of how we think about the data. The second interpretation is given by the result of the theorem: alpha is a significance level, and that is an error rate. It is the probability that the data fall in the critical region given that H-nought is true; and to say the data are in the critical region means that we reject, that is, decide H-nought is false. So alpha is the probability that you will decide H-nought is false, calculated on the assumption that H-nought is true. It's an error rate: it tells you how often you make a mistake if you use the significance test given by comparing P-values with this threshold. It's a natural long-run frequency probability; a small simulation of this long-run reading is sketched below. To exercise this error-rate idea a little more carefully, people use the jargon "type one error rate". Why it's called a type one error rate will be clear, hopefully, in twenty minutes' time when I've talked about the type two error rate; it's type one because it's about making an error in one direction only: if H-nought is true, what's the chance we think it's false? Of course you might make an error the other way round as well, and that's what I'm going to talk about later; but for the moment it's just the one direction. So: two subtly different interpretations of this quantity alpha.
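As a minimal sketch of that long-run error-rate reading (this assumes Python with numpy and scipy, and normal data with a two-sided T-test, purely for illustration; none of these choices come from the lecture itself):

```python
# Simulate many data sets with H0 true and check that the rule
# "reject H0 when the P-value is <= alpha" rejects a fraction alpha of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, n_trials = 0.05, 25, 100_000

x = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n))     # H0: mean is zero
t = x.mean(axis=1) / (x.std(axis=1, ddof=1) / np.sqrt(n))  # T statistic per data set
p = 2 * stats.t.sf(np.abs(t), df=n - 1)                    # two-sided P-values
print((p <= alpha).mean())   # long-run rejection rate: close to alpha = 0.05
```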
Now let me talk about an example, just to reinforce this idea of what H-nought is, what a test statistic is, what P-values are, et cetera; and again I'm sorry this silly projector isn't working. This is an example in genetics, and it's an example of historical interest. I think I've mentioned once or twice already that early work in biology, particularly in quantitative genetics, had a very informative role in the early development of statistics. On example sheet five, which I gave out to you the other day, there is a rather famous experiment of Charles Darwin's, which I've put in one of the questions for you to look at; and here is another famous experiment, done by a geneticist called Frets, who published his paper in nineteen-twenty-one. Frets was interested in the question of the inheritance of human characteristics. We're all very familiar with the idea that human characteristics like facial features do tend to be inherited: how often have we seen mothers and daughters looking really very similar, and sisters looking similar, and brothers looking similar? We see that all the time, don't we, so there is strong inheritance in facial characteristics, and Frets was one of the first biologists who really tried to get to grips with this quantitatively. He asked the question: how can we measure how much of facial features is inherited, and how much is just random occurrence that nobody can explain? Frets did a rather famous experiment. He was interested in people's faces: he took a number of families and tried to compare the faces of different brothers in the same family, and he was particularly interested in head length. Now, length is a kind of funny word to use here: the head length is the distance, measured in millimetres, from here, with the mouth shut, up to the top of my head. Really it's more of a height, this measurement, though I suppose it is a natural length if the person is lying down flat with a tape measure. So he measured people's head lengths; call this dimension L. What he did was to find a sample of families in which there were two or more brothers, and he measured the head heights, or head lengths as he called them, for the first son and for the second son, and tried to see how similar they were. Basically the idea is to show that brothers of the same family have faces which are much more similar to each other than faces sampled generally from the population. We know now that that is of course true, but a hundred years ago it wasn't obviously true, and that's what he tried to find out. So this is just a little extract from what he did. What he measured, which I'll now call X, was the value of L, this dimension of the head, for the first son in a family, minus the value for the second son. So the particular measure I'm going to talk about from his work is the difference: the value for the first son minus the value for the second son. He did this for twenty-five families, so he got twenty-five values of X, and those are the data I'll be talking about now.
What question was he interested in? Well, he's interested in how similar these things are, and in particular there may be an order effect here: this is the first son and this is the second son, and the mother is by definition older when she has her second son, so maybe the sons change over time. So one thing he looked at was time trends, and the question I want to talk about now, really the most basic question of his work, is this: does L tend to get bigger or smaller? Is there any evidence that the first son has a bigger head than the second son, or vice versa? If the value of L gets bigger from first to second son then of course X is negative, and if the value of L gets smaller then X is positive, so really he's interested in the sign of X, essentially. So here are his data. I'm somewhat frustrated, because I really should have put the data on the other projector, so it has all got to go here. There were twenty-five families, and I'm going to arrange the data in the kind of way statisticians usually arrange data, in what's called a stem and leaf plot; you'll see in a moment why that's a sensible way of writing out data. I write the values in order, from the most negative to the most positive, grouped in class intervals of width five. Between minus-ten and minus-five there are four values: minus-nine, minus-nine, minus-seven, minus-six. Between minus-five and nought there are seven: two minus-fives, two minus-fours, a minus-three and two minus-ones. Then there was a nought (these are millimetre measurements, by the way, so this is a family where, to the nearest millimetre, the two head lengths were exactly the same), and a one, a two and a four. Then two families with plus-five, and a seven, an eight and a nine; then a family with ten, two families with twelve, and a family with thirteen; and there was a family with a difference of sixteen. That's a stem and leaf plot: it's just writing down the data, ordered in a nice little way, ranked from smallest to largest, and you notice I've grouped them in class intervals of width five: minus-ten to minus-five, minus-five to nought, et cetera. The advantage of being able to do that is that you can immediately spot what the histogram looks like. If I just draw a tiny little histogram down here, you can immediately spot there are four observations in the first group, seven in the second, four in the third, five in the fourth, and then another four and another one. So there's the histogram: a typical sort of histogram you get in biological experiments.
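A sketch of that grouping in code (Python here; the individual values are as read out above, so treat the digits as approximate, though the bin counts agree with the histogram just described):

```python
# Group the 25 differences into class intervals of width five,
# reproducing the counts read off the stem and leaf plot.
x = [-9, -9, -7, -6,
     -5, -5, -4, -4, -3, -1, -1,
      0,  1,  2,  4,
      5,  5,  7,  8,  9,
     10, 12, 12, 13,
     16]

counts = {b: sum(b <= v < b + 5 for v in x) for b in range(-10, 20, 5)}
print(counts)   # {-10: 4, -5: 7, 0: 4, 5: 5, 10: 4, 15: 1}
```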
Now we want to analyse these data in such a way that we shed light on the question of whether L tends to get longer or shorter, and I want to talk about two approaches. The first approach is the T-test, and it's called the T-test because it's based on the T distribution; it's exactly the same, really, as what I was talking about last week when I was talking about confidence intervals with a T distribution. So the first approach is to fit a normal distribution to all this and to discuss it all in terms of inference for a normal sample. What would be a sensible way of thinking about these data from a normal perspective? Well, there's my histogram: a sort of normal shape. Not very good, really, but we've only got twenty-five observations, so that's about as close to a normal distribution as you could ever expect to get. So normality is probably a reasonable assumption, and all biologists assume normality without worrying about it, so we will as well: X has a normal distribution, in the usual notation with mean mu and variance sigma-squared. So there's the model; that's the first of the ingredients I was reminding you about earlier. The next ingredient is the null hypothesis. The question, here we are, just at the top there, is whether there is any evidence that L gets longer or shorter, so the natural null hypothesis is that L on average stays the same. That means the mean of X is zero, and the mean is mu, isn't it, so the null hypothesis of the T-test is that mu is zero. Next ingredient: the test statistic. What test statistic are we going to take? We need to analyse the data now: we need to work out the sample mean in order to use the T distribution, and according to my calculations the sample mean is one-point-nine-six. That's where zero is over there on the histogram, by the way, so clearly the distribution tends to be pushed over a bit to the right: the mean is a positive one-point-nine-six. We also need the sample standard deviation for the T statistic, and according to my calculator this was seven-point-four-zero; and of course, as I've already said, N is twenty-five. So we're all set now to form the T statistic, the T random variable which tells us about the mean of a normal distribution. Remember what that is; we have temporarily to go to capital letters now, because I'm talking about the random distribution of these things. The T statistic has a standard normal random variable on the top: it's the sample mean X-bar minus its mean. But we're constructing a test statistic now, so we do that under the assumption of the null hypothesis, and the mean is zero, so we don't have to subtract anything. Then we scale by dividing by the standard deviation of X-bar, which is S over the square root of N, and dividing by that is just the same thing as sticking a square root of N up on top. Let's get the numbers in: one-point-nine-six times the square root of twenty-five, which is five, divided by seven-point-four-zero; and according to my calculator that gives one-point-three-two.
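Assembled in symbols, that calculation is:

```latex
T \;=\; \frac{\bar{X} - 0}{S/\sqrt{N}}
  \;=\; \frac{\bar{X}\,\sqrt{N}}{S}
  \;=\; \frac{1.96 \times \sqrt{25}}{7.40}
  \;\approx\; 1.32 .
```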
So we've got a value of the T random variable now, one-point-three-two; this is an intermediate step, and now we're getting to the test statistic. What are we going to take as the test statistic? Well, this quantity intuitively measures the difference between X-bar and what we'd expect under the null hypothesis, namely zero; and since the question we're asking ourselves, here we are, still at the top there, is whether there is any evidence that L gets either bigger or smaller, we put mod bars around this thing, because deviations in either direction are equally interesting. So the test statistic is just the absolute value of this T statistic. We now go on to calculating the P-value, the third ingredient in this whole procedure. Let me do a little diagram over here. Here's the distribution of T; on how many degrees of freedom? We have twenty-five observations, so it's twenty-four degrees of freedom; you remember it's N-minus-one because of the chi-squared argument behind all this. So there's the distribution of T on twenty-four degrees of freedom, centred on zero, and then we look at the value we've got, which is up here somewhere: one-point-three-two. The P-value, then, is the probability of getting a test statistic greater than or equal to the one we've got, so, in red now, that's going to be that area there: the probability of getting a T-value bigger than one-point-three-two. But we've put mod bars around the test statistic, because deviations in the negative direction are also equally interesting, so I shade this area down here as well, below minus-one-point-three-two. Writers of elementary textbooks like talking about two-tailed and one-tailed tests and so on; this is a two-tailed test, because we're interested in deviations in both directions. So we work out the P-value, which is the sum of these two red areas, and this is where you have to go to your statistical tables, or go to S-plus if you have S-plus switched on at your desk, which I do. I just get S-plus to work out the tail area for me, from the cumulative distribution of T, and I make that nought-point-one-eight-six, so the P-value is about nineteen per cent. So what does that tell us? General discussion over now: what we're trying to spot is whether the P-value is small, because this probability is a measure of how much of a fluke it would be to get this value of T. Well, it's not much of a fluke, is it? A nineteen per cent chance is not really very unlikely at all. So by any usual canons of significance testing, whatever significance level you use, five per cent or one per cent or whatever, you wouldn't declare this to be significant. The conclusion is that although the data give a bit of a hint of a positive difference, which would mean that head lengths tend to get smaller as you go from first to second son, it's not significant: the difference is not big enough to be clearly evident from such a small sample size. So that's the first approach to these data, the T-test; the whole calculation is a couple of lines in a package, sketched below.
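A sketch of that computation from the summary statistics (Python with scipy here, standing in for the S-plus mentioned in the lecture):

```python
# Two-tailed T-test of mu = 0 from the quoted summary statistics.
from math import sqrt
from scipy import stats

xbar, s, n = 1.96, 7.40, 25           # sample mean, sample s.d., sample size
t = xbar * sqrt(n) / s                # about 1.32
p = 2 * stats.t.sf(abs(t), df=n - 1)  # two tails of t on 24 degrees of freedom
print(t, p)   # p comes out around 0.19-0.20, i.e. the "about nineteen per cent"
              # quoted in the lecture, up to rounding of the mean and s.d.
```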
Now let me mention another approach, because I want to get over the idea that the choice of a test statistic is somewhat arbitrary. Here's another way of looking at the data, called the sign test, and the sign test is much simpler than the T-test. It goes back to the original question, here we are, still up there: does L tend to get smaller or longer? The sign test simply looks at the sign of the difference: if X is positive then L is getting shorter, isn't it, and if X is negative then L is getting bigger. So the natural quantities to look at now are the probability that X is positive, which means L is actually getting less, and the probability that X is negative, which means L gets bigger. If there's nothing going on here, if on average there's no trend over time, you'd expect to see L getting bigger as often as getting smaller. So another way of formulating the null hypothesis is that the probability that X is positive should be the same as the probability that X is negative. If we have a normal distribution with mean zero, then that certainly satisfies this requirement; but notice that this is a much weaker null hypothesis than the T-test one. The T-test puts all this baggage on it, like assuming normality and working out standard deviations and things; this is a much more crude, much more primitive way of looking at the data, just counting up how many positives and negatives there are and testing this hypothesis. So let's count up how many positives and negatives we've got. Using F for frequency, let F-plus be the number of positive Xs and F-minus the number of negative Xs. The probabilities of positive and negative should be the same according to the null hypothesis, so on average F-plus should be the same as F-minus; they will hardly ever be exactly the same, of course, but on average the difference between these things would be zero and one will be the same as the other. So that's the null hypothesis, and this is what we'll look at now. We don't have the Xs any more: we've replaced them by the frequencies, so the test statistic, which is a function of the data, is now really a function of F-plus and F-minus. What is it about these frequencies that would lead us to doubt that the null hypothesis is true? It's the difference, isn't it, intuitively: if we take F-plus minus F-minus with absolute-value bars around it, then that's a good test statistic; the more different these two frequencies are, the more evidence we have that X is more likely to be positive than negative, or vice versa. So this is the sign test, a completely different way of looking at the data, and this is now the test statistic. How do we work out the P-value? Let's get the data (and I'm really very fed up we haven't got the other projector; anyway, there are the data again): how many positives and how many negatives have we got? We've got four and seven in the first two cells, haven't we, so we've got eleven negatives. How many positives? All the rest are positive, except for the zero, which we leave out because it doesn't tell us anything about which way we're going. The rest of them comes to fourteen, but one of those is the zero, so the number of positives is thirteen: three there, five there, four there and one there. So the P-value, then, is the probability that by chance the absolute difference between F-plus and F-minus is going to be greater than or equal to what we have observed, and what we have observed is two, isn't it, since there's a difference of only two between these two frequencies.
This probability is going to be pretty big, if you just think about it. So how are we going to get this probability? The easy way to think about it is this: we've got thirteen positives and eleven negatives, so think of tossing a coin twenty-four times and getting thirteen heads and eleven tails; it's essentially the same problem, isn't it, recast in terms of coin tossing. So this is the probability that the absolute difference between the number of heads and the number of tails in twenty-four tosses is greater than or equal to two. Well, two is a pretty small number, so the easiest thing is to work out the opposite of that: one minus the chance that the absolute difference between F-plus and F-minus is strictly less than two. Now, F-plus minus F-minus is always going to be an even number, if you think about it, because if you put the number of heads up by one, the number of tails goes down by one. So the only possible sample result which doesn't satisfy this inequality is when the two are exactly equal, and the P-value is one minus the probability that F-plus equals F-minus, both therefore equalling twelve. And we know what that is, don't we: it's the chance that if you toss a coin twenty-four times you will get exactly twelve heads and twelve tails, in exact balance, which is twenty-four-C-twelve times one over two to the power of twenty-four. So it comes down to a binomial probability, as it must, because we're talking about binomial distributions basically. If you work that probability out on your calculator, the probability of exact balance is about nought-point-one-six, so one minus that is nought-point-eight-three-nine. So that's the P-value calculated by this other way of looking at it (the arithmetic is checked in the sketch below).
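A quick check of that arithmetic (Python again, purely as a calculator):

```python
# Sign test: with 13 positives and 11 negatives out of 24 non-zero signs,
# P(|F+ - F-| >= 2) = 1 - P(F+ = F- = 12).
from math import comb

p_equal = comb(24, 12) / 2**24   # exactly 12 heads in 24 fair tosses: about 0.161
p_value = 1 - p_equal            # about 0.839, as quoted in the lecture
print(p_equal, p_value)
```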
So the main point to note is that we've got exactly the same data, and both tests are doing something which is very sensible: the first test is looking at the mean of the distribution in the classical way, the second test is looking at how many positives and negatives there are. Both are perfectly plausible, and yet they give different answers; and they give different answers because they're using different test statistics. One is looking at the T statistic, with the baggage of the standard deviation; the other test statistic is just looking at the difference between two integers. So the moral of this example is that significance levels are not everything; the significance level isn't really the full story. We want somehow to get at the idea that the second test is actually worse than the first test; we want to get to grips with the fact that the first test looks at the data in much more detail than the second test. We're making lots of assumptions in the first test, like normality and things like that, so we would expect to get a premium for doing the first method rather than the second. There must be a sense in which the P-value from the first method is better than the P-value from the second method, and the only way we can get to grips with that is by looking at the error rate the other way round. We're not just interested in how often we reject the null hypothesis when it's true; we've also got to worry about the sensitivity of the test: the probability that, if H-nought actually isn't true, we would detect that. We would like a test with good sensitivity, one that is more likely to detect the lack of truth of H-nought, rather than simply to confirm its truth. That's how the rest of the theory of significance tests goes, which is what I'm going to begin talking about now; it's the main topic for next week, really. And what I'm going to be talking about from now on is much later work, historically: P-values go back about a hundred years, but the idea of looking at these types of error goes back to the late nineteen-thirties, so it's a more recent innovation. So, finally today, let me define the two more concepts I said I would; here they are. The first thing to define is the alternative hypothesis, and you see now why I write H-nought: because I'm now going to have H-one. H-one stands for the alternative hypothesis, and H-one is simply the complement of H-nought in the set-theory sense: if H-nought is true then H-one, its opposite, must be false, and similarly if H-nought is false then its opposite, H-one, is true. So that's the first definition, the alternative hypothesis: we've got two hypotheses going on now, H-nought and H-one, and we're trying to decide which is true. The second quantity is the error rate the other way around, and everybody uses the symbol beta now, instead of alpha. This is the probability that we accept the null hypothesis, in other words believe the null hypothesis is true, but now calculated on the assumption that H-one is the truth. So this is the error rate the other way round: the chance that H-one actually is true, in other words that H-nought really is false, but we conclude that H-nought is true, so we've made a mistake. I defined alpha to be the type one error rate a moment ago; this is the type two error rate, the probability of getting things wrong in the other direction. The theory of significance tests now goes down the route of looking at the values of alpha and beta together: not only do we want to control alpha, which is what I've talked about so far, we also want to choose significance tests, that is, choose test statistics, for which beta is also small. The two error rates are set out side by side in symbols below, and that's what I'm going to talk about next Monday and Thursday.
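For reference, the two error rates in symbols (with "accept" and "reject" used exactly as in the lecture; one minus beta is the sensitivity just mentioned):

```latex
\alpha = P(\text{reject } H_0 \mid H_0 \text{ true}) \quad \text{(type one error rate)}
\qquad
\beta = P(\text{accept } H_0 \mid H_1 \text{ true}) \quad \text{(type two error rate)}
```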