nf0955: i wanted to start by saying that at the end o-, o-, well not quite at the end of the lecture i'd made a correction but somebody came and asked me at the end about the correction so can you just check in your notes where i was writing the variance of G da-, G-of-X approximately equal to G-dashed at mu right i forgot the squared initially can you check you've all i think you've probably all checked that i'd put the squared into all the other formulae but just not in here so can you just check that you've got that correct in your notes or you might get slightly puzzled when you come back to them right i was then going on to talk about the Kaplan-Meier estimate so Kaplan- Meier estimator and it just incidentally was a paper published by two people called Kaplan and Meier in nineteen-fifty-eight er just to show you that quite a lot of the stuff we do isn't late well second half of the twentieth century sometimes if you try to explain to people you're doing research in statistics they look at b-, you blankly and say what is there to do they seem to think all these things just appeared instantly er and nobody ever thought about them right and the notation i was using was you should just about be at this stage in your notes anyway this should already be written down so the notation was we're looking at T-zero T-one up to did i say T-N or T-K N sf0956: N nf0955: [laugh] er and we as usual have T-nought equal to zero and then i gave you a description of the intervals which i did correctly verbally and not correctly in the written form we think about the intervals I-I being zero T-one sorry i knew i just that's what got me confused you have to have that square bracket of the initial zero T- one to T-two and so on so the intervals are with the exception of at zero T-I T- I-plus T-I-plus-one with the square bracket so that you're actually including the interval doesn't really make sense t-, to put that down as an equation okay then we need exactly the same thing as in 
an actuarial life table except that instead of looking at intervals we're looking at point estimates so that for all those intervals I equals one up to N we let D-of-T- I be equal to the number who died at T-I and now we are requiring that we have the exec-, exact death time so it's not in an interval it's at a particular time and we're going to need again we're going to need to know how many people there are so for I-equals-one up to N and zero er N-of-T-I is the number alive and at risk at time I so those two are r-, really no different from the actuarial life tables the difference comes in in the censoring which i- , so again for any time we want to know C ah s-, C-of-T-I but this time it's in a different interval so it's the number censored sorry the number censored in T- I-minus-one T- I and the equality and inequality changed round we defined the time points to be the point at which deaths occur not at which points at which censoring occur so we can actually get things happening in an interval for censorings unlike for deaths and this is the point i was making that if we actually had a death at some value say eight doesn't matter what the eight is then the number of deaths at point eight would be equal to one but if there was a censored observation at time eight as well we regard that as happening after that death in other words it has to go if the because the interval will be something eight for the censoring where that's strictly values less than eight the censoring goes into the interval above this it's actually much easier to do this often than to er think about the worry too much about the formulae what it does mean of course is that you can also say therefore the number at time I-plus-one well what's going to be there at the number at time I-plus-one it'll start off with a number at the beginning at the previous interval and we're going to have to subtract the deaths that happened at that time and then we're going to subtract the censoring from the next 
time er oh sorry just used an abbreviated notation sorry minus er sorry you quite often do land up writing D-I it's putting in the T-I just to make it explicit that it's on the original all the times the other thing obviously we need is that the number at time zero is the total number in the study okay i've said obviously that in fact yesterday we were discussing how you modify this in the occasions when when that doesn't actually happen there's some interesting situations so this is really by way of setting up definitions which actually give us unsurprisingly an estimator that looks the same just make that T and once again we're going to have a product this time i'm going to write the product as all T-I such that T-I is less than or equal to T and not surprisingly that's the product over all those D-T-Is N-T-Is what we can also notice from this is that you can r-, write this more directly as a recursion by using the same product but let's write each of these as S-of- T-I and writing it in this product form is the background to the other name that this is given which is this is also known as the product limit estimator one of the other things that can be established on this is that again it can be shown to be the maximum likelihood estimator okay well you may need to essentially specify conditions on nothing the how is it not changing when there's no information and the other thing we're going to want to know is the formula and sorry the the variance and we don't need to derive the variance of it again because the variance is similar in the obvious way to Greenwood's formula for the l-, er life table if you do do some reading around on the survival analysis books you might find a couple of other approxim-, approximations to the variance that you can use particularly if there's no censoring if there is no censoring then at any point you're just using a simple binomial estimate so you can just use the binomial formula er in general you're going to be doing this kind 
of thing where you've got enough data to want to use a package which will calculate the variance for you so the simple approximations are not that critical okay i put some data up on Wednesday and er i have got an estimator Kaplan-Meier curve for that which i shall hand out in due course the main reason i'm not handing it out yet is i'm going to ask you to do something which has a solution on the back of the sheet er and that just goes through the calculation for you so you've got the list of times and for practice you can try doing it yourselves and then see whether you get the same answers one of the other things that's useful to do which may or may not endear me entirely to er people fiddling around watching what i'm up to er right i should get these lights out okay what's quite often you'll do with an actuarial life test like a life table is draw a survival curve and you can see the fact that we have estimates only at times in which events occur by the jumps whenever there's an event and you can see that the jumps don't always happen at identical intervals you can also see in this group this is actually gastric cancer with a couple of comparisons with time and days you can see a a very dramatic drop in survival so that five-hundred days you've only got about a third of the people still alive and then the survival levels off a little bit but the other reason for showing this curve again there's a there's a printed one in the handout i'll give you so you can have a look at that being drawn up the other reason for showing the curve is that of course generally what we're interested in is not just the survival i mean we may well want to know what the median survival is looks like about a a year in the radiation group with chemotherapy and maybe about a year and a half to two years in the other group but we actually would want to do a comparison of those in that group it looks as though the dotted line has got a better survival which is the chemotherapy- only group 
but how would we think about formally testing that well that's where we get back to a log-rank test which i'll talk about in formulae for in a minute i'm just trying to see if i can pick up let's concentrate on just that little bit of the graph at the moment what we could say at that point is that in the one group er there will be some number at whatever the time is and that's going to be the number from the chemotherapy and radiation group and of those chemotherapy and radiation group there was a death at that we cannot count the number of deaths at that time in this case just here there are no deaths whereas here there'll be some fixed number in the chemotherapy- only group and the death at that time we can see there is a jump so there must have been a death and we can think of that as then being a comparison of two binomials we could even set it up as a small two by two table and do a formal comparison of given the numbers in the two groups so if there were equal numbers in the two groups at this point at risk and we saw one death then we would divide that one death in terms of expectation into half in one group half in another and provided we can assume th-, independence we can actually do that along the whole curve and that's the basis of the log-rank test so to come back to the blackboard and i'll leave that a minute 'cause i can see a couple of pens blackboard and chalk in fact i think we're coming back to a couple of minutes of you talking about what i've just said and whether it makes sense while i clean the blackboard or anything else you want to talk about nf0955: okay have you got any questions on that no okay log-rank test i'm not really sure where the word log comes into this but the reason we talk about ranks is that we simply look at the order inven-, which the events occur and we don't look at how far apart they are in other words we've got the rank times but not the actual values of the times and the point of a log-rank test is to compare two 
survival gr-, two or more survival curves so if we wish to test whether two survival curves are equal in a non-parametric context so non-parametrically we can use the log-rank test okay as an aside that you don't need to write down but it's probably quite useful general knowledge non-parametric tests are essentially tests based on things like ranks that don't take the actual values into account we don't happen to teach very much about them at all on MORSE i think in fact this course is the only one that probably mentions them although i know at least some of you have discovered them in doing your reading of the literature er they tend to be very popular in the social sciences probably historically as much as anything else i've said we can use the log-rank test because there is also a Wilcoxon i've been asked about the Wilcoxon test this morning and the name of the Wilcoxon test that actually relates to survival has just escaped me there is a Wilcoxon test that's related to survival so if you're reading any of the survival texts you might find reference to both of them er anything based on ranks tends to in-, require a lot more hard work which is why i'm not going to describe it but it does exist non-parametric tests are around er so if somebody at an interview or anything asks you about them you've heard of them you just haven't studied them er so e-, end of that aside we're being non-parametric we're comparing two groups and as i said what we're really wanting to do is what we're going to do is look at each point but let's think about if we're comparing and doing a significance test we need a null hypothesis and it's slightly complicated in this context to write it down it's not quite as trivial as some other things so that the null hypothesis is that the cur-, the curves are identical I-E S for group one of T equals S for group two of T that's a form we usually can write hypotheses in but here we have to add in the comment for all T T greater than or equal to 
zero and you probably want to write what i just said where S-one S-two are the curves for groups one and two the for all T is the most powerful way of doing things in other words we want to compare what's happening on the whole survival it is actually very common in medicine to look at survival up to thirty days after an operation or survival up to one year that's certainly convenient as a summary it's just not as informative as it could be as a test so yesterday er er one of the medical professors was talking about survival after a difficult operation and that was expressed in terms of survival to thirty days for general discussion but the formal tests were done in terms of complete survival curves what we actually do is we again think of at about each point so at each time T-I at which a death occurs at which at least i should say one death occurs okay and that's one death occurs in either group what we do is we form a two by two table okay i mean in reality we don't form the whole table but that's what we're doing conceptually and what does that table look like died not dead group one group two and i'm probably going to swap to a yeah i'm going to swap to a shorthand notation drop the T so let's just make that D-one-I which can of course now be zero because just 'cause somebody's died in group one they don't have to have died in group two as i showed D-one-I D-two-I and this is total at risk at that point so N-one-I N-two-I two by two tables you've seen before and the standard way of dealing with those is look doing an observed minus expected comparison so the expected number of deaths in group one at time I is expected value of the random variable D-one-I we will put down as N-one-I over N-one-I plus N-two-I times the total number of deaths that occur at that particular point okay that's nothing new to probably first year what i'm not sure is whether you're going to remember the variance expression for that anybody remember the variance might just about regret 
starting writing it on that bit of the board sm0957: is the not dead column the same as the total risk column or nf0955: oh sorry i've i was just being lazy and not filling in the to-, the the whole thing those were all the all those that are at risk of whom those died and these ones didn't die at that point you can could also write D-I N-I minus D-I where you're summing up over the subscripts right the variance term involves quite a lot of elements if you really want to s-, think about these tables in detail you actually land up with a hypergeometric distribution er which would be a nice thing to set as a exam question for second year but for this year we're talking about N-one-I N-two-I so those two multiplied by D-I N-I minus D-I so you're multiplying the four margins sorry yeah the four marginals together row margins and column margins and then what you divide by is a function of the total it's actually N-I- squared N-I-minus-one so if any decent sample size is just N-I-cubed and let's write that one again think it should be fairly clear that er again it's the kind of thing you write a program for rather than enjoy doing on your calculator because this is for only one time and we're going to need to think about it for a whole lot of times so in think i'll just got to clean the board anyway i'll clean these two while you or at least one of them while you decide if you've got any questions on that nf0955: okay so what we do that's the expected value for a single time point what we want to do is to let E- one expected for group one well it's the fairly obvious thing you're going to do you're going to sum up over all your times and use E er E-one-I where E-one-I is precisely the expected value of E D-I the n-, the expected number of deaths at each interval and similarly for E-two-I obvious thing we're going to want to do as well is have the an observed so we're going to have the same notation observed simply equals the number of deaths actually occurring in group 
one at each of those times and the other thing we're going to want is the variance which shouldn't come as a surprise either the variance as i said there's an independence assumption that we make the variance is the sum of the variances at each point and we could of course use a shorthand notation just calling it v-, V-I at each point why am i calling it V-I and not V-one-I yes sm0958: nf0955: 'cause if you look at that if i swap those round all i'd be doing is swapping the position of the N-one and N-two it would make no difference okay so we've got an expected value an observed value a variance so the log-rank test comes back to something that would often be denoted by a Z statistic so the log-rank test uses Z equals observed minus expected divided by the square root of the variance and it uses that that's why we tend to use Z compared to the standard normal there's another way in which it's quite commonly done alternatively Z-squared in other words something that's obviously looks like a chi-squared term O-minus-E squared over V is compared to a chi-squared on one degree of freedom and there's a reason for mentioning that nf0955: it's not immediately obvious how we're going to generalize a two by two table which is quite a nice thing to say a two by three table which is why one can think of a simpler version than the log-rank the log-rank is what you would use if you've only got two groups but if you want to think about a generalization alternatively we can use and as i don't use this very often i definitely don't remember it we use something that requires us to think about E- two which is pretty simple that's just the sum of the expected value sorry of D- two-I in the obvious notation and if you want to write it explicitly so that's N-two-I D-I over N-I er similarly O-two is the should really have memorized this the er obvious definition is just the sum of all the deaths the one thing about that Z- squared on one degree of freedom that doesn't look completely 
standard is to being divided by the variance so for this one we just use that completely standard form use X-squared equal to E-one-minus-O-one squared over E-one plus E-two-minus-O-two squared over E-two and anybody who feels really energetic can start playing with the formulae and seeing just how different they might get we're again referring to a chi-squared just on one degree of freedom but what can we say about this well the disadvantage why don't we use the simpler one well the disadvantage is that X-squared is conservative and by conservative i mean that if you had something that was at the borderline say over five per cent significance level if you did the log-rank test it would show as significant if you did the simpler test it would tend to show not as significant so that's what we mean by conservative but looking at that formula the advantage is well if you think of the question i asked you how do you generalize this to three groups if instead of having chemotherapy and radio-, versus chemotherapy plus radiotherapy we'd had a third group radiotherapy only how would you generalize this what's the obvious generalization sm0959: the term for E-three nf0955: you just put in an E-three term or however many you like because these terms are now pretty obvious so the advantage is ease of generalizing to N groups okay what i thought i would do i thought it might have been slightly nearer the middle rather than nearer the end of the lecture er i take it you've all got the dataset from that i m-, m-, listed last week 'cause i haven't written it down for the control group i'll give you the first few observations not necessarily all of them i mean i'll write them all down but you don't need to copy them all down what i want you to do is to try to write down the first couple of lines of a table to calculate what you need for a log-rank test er the table i've got has got ten columns so have a think about which columns and what you're going to put into those columns 
'cause that way you're more likely to remember it if i decide to put this into an exam which is the other advantage of a simple procedure put it into an exam more easily okay so in the control group the times were two three nf0955: and i'm planning to ask somebody to come and write up what they've thought on the board so as a just a gentle aid to actually addressing the problem nf0955: probably give you another at least three or four minutes if not more and i probably won't ask for a volunteer probably just ask someone if i ask for a volunteer i know who's hard-working and who will probably have an answer so those of you who are quiet i've no idea how good you are nf0955: okay i'm going to resist the temptation to use the advant-, the fact that i know some people's names and not others okay er but what i'm going to do is go for colour so gentleman in the nice red sweatshirt [laugh] i think you guessed that one was coming when i said colour [laugh] come on come and write down what i don't doesn't matter whether you've got it right or wrong sm0960: down nf0955: yes but you've had a discussion 'cause i can see that sm0960: [laughter] nf0955: [laugh] did anyone come up with anything about how you're going to tackle it i can start working systematically through all of you namex what column headings would you have had sm0961: you've got to look at the it has something to do with two sets of trials you have the control group and the the nf0955: drug group sm0961: the drug group from last time nf0955: yeah okay so what do we need to have in those er let's go to the back row what what information are we going to need to have on those to fulfil the formulae sf0962: nf0955: so we're going to have the time so we're going to have the number dead at a particular time and sf0962: nf0955: total number at risk and i should probably put in that T is one for control drug yeah so most of you worked that much out yep how are you going to work out the ex-, what do you need to work out 
the expected numbers in either of those groups oh i rubbed the formulae off but you've got in your notes sf0963: total number of deaths nf0955: D-T think i'll sw-, swap between T and I and therefore so the total number and that will allow you to go for E-one at time T E-two at time T and the variance at time T so you're all now going to be able to remember that without having to be told it she says cheerfully er what's the very first time you've got in the datasets we're talking about and i should actually say er we're starting with twenty-two people in each group what's the first time between those two datasets middle of the threesome what's the first time sm0964: er T nf0955: right and what do we need to fill in for the rest of that column sm0964: sorry nf0955: what do we need to fill in for the rest of that row sm0964: er the deaths nf0955: yes sm0964: i er er nf0955: which are [laughter] sm0964: er two well one in each so two for the next one er yeah [laughter] forty- two nf0955: it's actually forty-four it's the tot-, it's the the two together sm0964: nf0955: yeah then goes down to forty-two for here expected it's really easy let's go to the person sitting on his own expected number of deaths in each group well you've got you've got a f-, er formula up there what's the expected number of deaths in group two how many deaths in group two at this time sm0965: nf0955: yeah one how many deaths in total sm0965: nf0955: and how many if it was total number oops sorry it was total sorry i'm putting the forty-four at the bottom and the twenty sorry total total number er at risk in one group twenty-two total number forty-four and sorry i'm going to write the answer down much more easily which is what i did rather than writing the formula down let's just do it the way i was thinking of it the total numbers which is the way you just think number in this group is a n-, proportion of the number of that group is a half which is why i was writing a half down and the total 
number of deaths was two so the expected number has to be equal to one this one's a really easy one 'cause there's one death in each group and the group sizes are equal so one expected one expected and the variance term okay the next time is six there's only one death but the group sizes are equal again sm0965: nf0955: sorry sm0965: are they not three and four nf0955: oops i'm sorry this is time three not time six [laugh] thank you [laugh] er at time three there is one death in the control group no deaths there group sizes are equal so we can see the expecteds come in at a half each and for the rest of the table it's all written out in the handout and that wraps up the non-parametric side of survival analysis actuarial life tables Kaplan-Meier or product limit log-rank to compare survival curves so on Monday we'll go over to parametric methods for survival so any questions and the handout is at the front so you don't need to worry about copying down this 'cause it's all on the handout
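editor's note: the log-rank table described in the lecture can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the handout's calculation: the function names `logrank_row` and `logrank_summary` are made up here, group one is assumed to be the control group, and only the first two rows worked through on the board are used (first death time: twenty-two at risk in each group, one death in each; time three: twenty-one at risk in each group, one death in the control group and none in the drug group). The full dataset is on the handout and is not reproduced here.

```python
from math import sqrt


def logrank_row(n1, n2, d1, d2):
    """Expected deaths and variance for the 2x2 table at one death time.

    n1, n2: numbers at risk in groups one and two at this time.
    d1, d2: numbers of deaths in groups one and two at this time.
    """
    n = n1 + n2  # total at risk
    d = d1 + d2  # total deaths
    e1 = n1 / n * d  # expected deaths in group one
    e2 = n2 / n * d  # expected deaths in group two
    # Hypergeometric variance: product of the four margins over n^2 (n - 1).
    v = n1 * n2 * d * (n - d) / (n ** 2 * (n - 1)) if n > 1 else 0.0
    return e1, e2, v


def logrank_summary(rows):
    """Sum observed, expected and variance over all death times."""
    o1 = o2 = e1 = e2 = v = 0.0
    for n1, n2, d1, d2 in rows:
        row_e1, row_e2, row_v = logrank_row(n1, n2, d1, d2)
        o1 += d1
        o2 += d2
        e1 += row_e1
        e2 += row_e2
        v += row_v
    z = (o1 - e1) / sqrt(v)  # log-rank statistic, compared to a standard normal
    # Simpler (conservative) form, referred to chi-squared on one degree of freedom.
    x2 = (e1 - o1) ** 2 / e1 + (e2 - o2) ** 2 / e2
    return {"O1": o1, "E1": e1, "O2": o2, "E2": e2, "V": v, "Z": z, "X2": x2}


# First two rows of the table from the lecture (assumed: group one = control).
rows = [(22, 22, 1, 1), (21, 21, 1, 0)]
summary = logrank_summary(rows)
print(summary)
```

With only two rows the statistics mean nothing in themselves; the point is to check the row arithmetic (expected of one in each group at the first time, a half each at time three) before filling in the rest of the table from the handout.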