nm0658: this is the last of five sessions which are specially for your group to look at the design analysis of experiments er from next week we are going back to the general group and we're going to look at advanced methods which i think are equally applicable for everybody so we're then going to look at advanced modelling ideas which i think are the same sort of models you need to consider whether you're doing a survey or an experiment the two main problems we will tackle there will be what happens if your design is unbalanced and some experiments are unbalanced and most surveys have unbalanced data and the second thing we will tackle is what happens when your data may not come from a normal distribution the traditional statistics says that if your data come from a normal distribution then everything is fine and if they don't come from a normal distribution then you first panic and then you transform your data and then you hope your panic is over and there are modern methods now where you can analyse data from a non-normal distribution much more flexibly than was possible before so that's what we're going to do together with the er students from wildlife management er and also from vegetation surveys so what i want to do today is to review the ideas er of experimental design and analysis and then go through one more advanced topic which is a very common topic and which is the subject of repeated measures so that's where you measure repeatedly on the same unit be it the same animal the same tree or the same plot you go back to it perhaps repeatedly through the season and what i want you to do with this single example of a more advanced method is to see whether the weapons you have learned through the last four weeks how much can they help because what i would like you to get to by the end of this course is er so that the subject of statistics is not always expanding you've got to give yourself a framework so that when you have a new problem you say well what parts 
of the problem are new and what parts can fit into the the subject that i already know and i want to show you that although repeated measures appears to be a new subject once you understand some of the basic ideas we've tried to cover then there are only a few extra changes to be able to follow the ways you could analyse a repeated measures example so what i'm going to do this morning is to review some of the ideas of designing experiments bringing in what Genstat can do to help you look very quickly at data management er and then go into as part of the analysis how to analyse a repeated measures experiment [cough] you should have just two handouts although it says here you have three i couldn't think of anything extra to write for the worksheet so this is the one week where there isn't a worksheet so you have a lecture note and you have the practical exercise for the practical er which is as usual in the Met department at eleven o'clock [cough] now we're going to look at the ideas of designing experiment and i've kept repeating that you start with your objectives they lead to the treatments the way you're doing your experiment leads to the layout you must decide on your measurements and we've now moved for the last four weeks to do the analysis and we're going to consider these components in turn yet again and see at the same time how Genstat can assist with the randomization of an experiment so the menu that you're going to use here we see Genstat it's not very clear but it's probably clearer on your slide er that you're on the usual stats menu but instead of going to analysis of variance you go up one to an option called design and there are a variety of things you can do and all we're going to do is look at one or two standard designs and we're going to use them first to show you how to randomize an experiment but also to review some of the ideas of different experiments here's the sort of menu you get and almost all the experiments that we've been discussing 
you can randomize using this menu for the very simplest experiment you might consider is a one-way design where you've just got treatments and the experiment is laid out in randomized blocks so then in this menu you have to say what do i call the blocks we've just left it with an A block and how many blocks are there what are we going to call the units well we'll call them plots and what are we going to call the treatment factor which might be variety here i've just called it treat and how many levels sm0659: one question please what is the criteria for one-way or two-way two-ways design nm0658: you mean this sm0659: yeah what will be the criteria for this to choose one-way or nm0658: two-way designs sm0659: yeah i believe there must be a two-way nm0658: this is where on the second component i've said that you first think of the treatments if you're if you have one treatment factor like variety that is a one-way design if you have two treatment factors let us say three varieties and two levels of fertility that would be two treatment factors and then you would have a two-way design in randomized blocks sm0659: different ways for the nm0658: well when you have sm0659: bias in the experiment nm0658: when you have a one-way design it's simple whether you call it a factor or a treatment it's just the same thing when you have a two-way design take the example i've given you where you have three varieties and two levels of fertility you can either think of this as a two-way design for the treatments or you can think of it as a one-way design where that one way has six treatments because it's you can think of it as three by two or you can think of it as the six and it's up to you whether you want to think of it as just six treatments or whether you want to say well i would like to know right from the beginning the layout of my experiment in terms of both factors so when it's more complicated than a one-way design you can choose whether to take all combinations of the
treatment factors and call them one big factor with all those levels or to split it up by the components [cough] sf0660: do you ever have three-way design nm0658: ooh lots of times sf0660: so it's like er on the er er dialogues within nm0658: yes sf0660: and you just have one and two-way and then generalize the model nm0658: this is this parallels here what you would have for the analysis sf0660: yeah nm0658: and you can either choose to do the analysis thinking what is my design let me give it a name or and i prefer to say it's a general design sf0660: mm nm0658: how many treatment factors do i have and this menu system because it's for simple designs only it only allows you up to five factors but five is probably enough so if you go to the general then it would say how many treatment factors do you have and you could have up to five different treatment factors there's no limit in Genstat on the number of treatment factors but we find most experiments five is enough i would like one of the main bits of work i did when i was working in er in Niger was to try and encourage people who only had experiments with one or two factors er to include more factors once you click on okay this is what you will get so you will now get a Genstat spreadsheet and you can see here er let me return to this previous slide just to remind you what i did this particular example i have four blocks and three levels of the treatment so it's a simple design i have twelve plots so there's a twelve down at the bottom which Genstat works out automatically just multiplying the four by the three that this was an experiment with twelve plots so when i click on okay it will produce the randomization and automatically you will now get i've called it data but it's really just the design you will now get this structure in a Genstat spreadsheet now i want you to notice and we will emphasize this later on that this way of laying out the data is part of the simple data management and i've seen one or two 
examples this term because you are all addicts of Excel you will find that Excel gives you too much freedom in laying out your data and that can cause lots of problems later on so i emphasize what we discussed in session three that when you have twelve plots in an experiment the layout of the data for a statistics package is to have twelve rows of data one for each plot you will have the label for the plot and you will then have another column which says which block am i using which plot do i have and which treatment and these treatments have been randomized so there's your randomization so this is your data laid out in the sort of field order ready to add further columns if you like with your measurements once you take them and so this is all ready to be a data collection form and this is in Genstat so the next part of your strategy you see is to say well after i haven't got any measurements yet because i'm just designing my experiment but when i've got my measurements they will make more columns here do i want to enter these measurement data into Genstat or into Excel i can choose so if you choose to save this as an Excel file which i did then here is the same information saved in Excel so now as you see you've got A B C and D in Excel and once you measure a few things the height of the plants or the mean height of the plants er the yield and various things those just become more columns which you can enter into back into Genstat for the analysis and you will remember from the er session on management we have said please enter your data straight from the field in the randomized order and now you see that's a very easy routine you can start in your stats package to randomize your experiment if you choose to do your data entry in Excel then that's fine you export all the details before the season here you now collect data and as you collect the data you just type in your extra columns of data and your data are ready entered and in many of the courses that i give i 
try and encourage scientists not to enter data at the end of the season when they've got everything collected but if after a few weeks you've measured the data fifty per cent flowering then the day you measure it you just type it in it's very quick and so it's there in your field book and in it goes same day and then if there's a problem you can probably have a look very quickly and you've seen how easy it is to having even if you enter in Excel to then get your data into Genstat for the analysis so the very day you collect your data there is no problem doing your first analysis and the procedure works very nicely any questions so far [cough] the second example this now can follow up your initial question this is now an experiment where i have got three factors so i've got three treatment factors i'm assuming irrigation with four levels i'm assuming fertility with two levels i'm assuming variety with three levels so i'm assuming a much more complicated experiment er how many treatments sm0661: twenty-four nm0658: twenty-four treatments so we're just taking we're going to have an experiment with all levels of these this would be a simple experiment which would mean the total number of treatments is twenty-four so i could enter this as a one-way design with a treatment factor going from one to twenty-four and associate each number with a combination of irrigation fertility or variety or i could choose to enter straight away and say i want to do this and er randomize it straight away for my irrigation fertility and variety i've decided to have four replicates and i assume also i've decided to have a split plot design and in my split plot design i'm going to have my level of irrigation and fertility on the main plots and i'm going to have my variety on the subplots i assume i've decided this maybe you decide this in conjunction with a statistician before you come to the experiment now you decide that's the design i would like to try [cough] so how many subplots in this 
experiment sm0662: nm0658: how many how many plots altogether you've already said there are twenty-four treatments sm0663: ninety-six nm0658: there are ninety-six plots just twenty-four by four because i've got four replicates and how many main plots sm0664: eighteen sm0665: thirty-two sm0666: sixteen nm0658: sixteen let's have a look well the corresponding design dialogue this is an answer to your question now can you have more than two i now go to the general design i've chosen to make it split plot so i'm going to the general split plot design and it will now say to me how many treatment factors do you have so i must understand these questions what do i mean by a treatment factor which i hope you understand quite clearly from the course and how many of those treatment factors are on the subplots well my design that i was considering had two treatment factors on the main plot and one on the subplots how many replicates four on my main i'm going to call the main plot factor M plot i'm going to call the subplot factor subplot whole plot treatment factor one i'm going to say is irrigate whole plot treatment factor two i'm going to say is fertility with those number of levels and subplot treatment factor one variety has three levels sm0659: why is that nm0658: sorry sm0659: why is that nm0658: i chose that i decided that was my experiment you can have any number so you have to decide how many varieties do you have to compare sm0659: er why did you call those subplots variety nm0658: because i thought that that would be a factor which i didn't need big plots for that's part of the design process you have to you have to decide do you need the same size plots for all your factors if you don't then are there any factors that need larger plots than the others here irrigation obviously often needs large plots sometimes fertility levels need large plots because you get leaching from one plot to another and often varieties you can have quite small plots breeders have very
small plots so i'm assuming that either because of your expertise or in discussion with a statistician you decide that you can get away with small plots for variety but you need larger plots for here and that's why you chose a split plot design if that had not been the case then as we've said before i would recommend that you don't have a split plot design you'd have a randomized plot design and then you you'd just have a different menu and you just say these are the three factors and what i want to show you in a minute is that we'll come and check well was it a good idea our design sf0660: can i just ask what's the randomization seed is that the way of generating random numbers nm0658: that's correct er yes i haven't explained these things down at the bottom the randomization seed means that that's the point at which it starts its random number generation so if you were to make a note of this then you could always regenerate the same randomization yet again you don't have to keep everything by typing this number in yourself so usually you allow the computer to choose this and it's different every time but if you chose to make it the same you'd get the same randomization and that's quite a good way of keeping a record of the randomization you had without noting everything down if somebody would say please could you print me another copy of that well you can always print the resulting spreadsheet but if you said well sending the information all you have to do is fill this in this way and give that random number seed and you'll get the same randomization and this is the randomization that follows so this is the output when you press okay it goes on of course down to ninety-six plots we only have fourteen plots here but you see it gives you a column which says which block which main plot which subplot which level of irrigation which level of fertility and which variety these of course are not randomized but the this which is the level of irrigation fertility and
variety that is associated is randomized to a certain extent what you should notice is that the variety is randomized on the little plots so here's the first main plot which goes one one one while the subplots go one two three you will notice that irrigate is at the same level one for all those three because that's at the main plot level and so is fertility but the varieties go one two three because they are randomized on the subplot level and when we go to the next main plot here we have main plot two subplots one two three these two again remain the same because it's a split plot design and these three this is a complete repeat of all the values here so each main plot is a total repeat for all the levels of variety but it just is one plot from here sm0667: er if they're completely randomized will it be numbers one to ninety-six or nm0658: er sorry sm0667: if they're completely randomized nm0658: if it were completely randomized sm0667: mm nm0658: without even any replicates then it would randomize everything for the numbers one to ninety-six so you can choose do i have replicates which is a sort of blocking idea er do i have treatments and how do i randomize you probably wouldn't use this menu if you just wanted a set of random numbers there's an easier way of doing that but that would be the extreme for the randomization that's right now you asked just now about this randomization seed but there are other options down at the bottom that you could also ask for and i've used one or two of these er i've asked could i have a dummy ANOVA table which might answer a bit more of your question why did i choose to put varieties at the subplot level what i would like to see occasionally is is the resulting design a sensible one for me to continue with and so i could learn a little bit about that with a dummy ANOVA table and i will show you in a minute what that is i could ask for trial ANOVA with random data what it does there please don't make too much of that the random
data is only there to show you what will the results look like from this sort of experiment when you have some data how will the results be laid out so it can show you the way in which the results will be laid out and i think that's quite useful ahead of the experiment to say is that a sensible thing so this is the sort of tool that i would like also for discussion with people such as yourself when you're saying i'm thinking of doing this sort of experiment and i can say well this is the sort of result you will get are you happy A with the design and B that you can then understand the results so let's show you what this works out because i ticked the dummy ANOVA table here i have the dummy ANOVA from Genstat i asked you how many main plots and you see this is straight print out from Genstat and here you see here is the main plot section so this is all revision but you can see that from the point of view of these two factors this is a simple experiment not with ninety-six plots but with ninety-six divided by three plots so there are thirty-two plots thirty-two main plots which is my twenty-one plus three plus one plus three plus the three plus the extra one so here we are at the main plot section so when you want to evaluate if this was a good experiment from the point of view of the main plots you can start having a look just up here was it a good experiment to study irrigation and fertility and the interaction and then when you come along to the subplot section you come along here and you have all the degrees of freedom down here ninety-five is the total because that's ninety-six minus one there's your n-minus-one [cough] i haven't given this but you'll try this in the practical Genstat also provides a sample ANOVA with random data to show you how the results will look and i find that quite useful in teaching and we're asking you to get this in the in the practical to have a look at that and the sort of question you can now answer is how would the degrees of 
freedom change if you moved fertilizer to the subplots if instead of having two main plot factors and one subplot factor you said well fertilizer could go on the little plots i wonder w-, how that would look and you have now two ways of doing that you can do that from first principles yourself or you could run the randomization again and move one of the factors from the main plot level to the subplot level i hope you can see from here if we move fertilizer from here down to here then this would come down here and also this would come down here so we would now be doing an experiment with three degrees of freedom there and just three values here because this stays and so this now the intera-, the residual would have nine degrees of freedom we've got ourselves rather a small experiment at the main plot level that's not such a good idea so it's those sorts of things you can now do either by studying this or by running through the design again and that's what we've asked you to do in the practical any more comments or questions on this design part i'm now moving to data management sf0660: er i'd like to ask something on the number of plots that you have for your er main blocking practice the main treatment practice nm0658: yep sf0660: er from where he's given us the three factors nm0658: yes sf0660: er i don't know if you've got irrigation as four levels fertility two levels how do you then get can you not say that you had nine plots for your two nm0658: with the design as randomized that's what you've got sf0660: so for your main plot from that did you say you've got how many plots nm0658: for my main plot i said i've got thirty-two main plots which is the ninety-six that i had divided by three sf0660: but if you didn't go through that ANOVA table er is there any way of working that out just from what you get at the very beginning er structure nm0658: yes if i didn't have that at the beginning i would say that at the main plot level what i've got is four blocks sf0668: 
nm0658: and four levels of irrigation which is four times four sf0668: nm0658: times two levels of fertility which is four times four is sixteen times two is thirty-two sf0668: nm0658: any more questions the second topic sm0669: er excuse me er er did and the design in the in in reality nm0658: yeah sm0669: i was thinking how about three factors plot that's to say maybe to have the er irrigation out the way then the fertility as a subplot and then the variety as the sub-subplot in that case we will have been having more of a how will i call it degrees of freedom than irrigation and the fertility as a main plot and then having the variety as a subplot how would you nm0658: okay can anybody er let me repeat the question and then i'm interested if any of you can now provide an answer you are now you've had fifteen weeks of statistics so you're all semi-statisticians [laughter] now the question was that this was er deliberately something a little more complicated as an experiment than you've had before it had three factors and the question was well if you have three factors why did we just have a single split or if i understand your question correctly why didn't we have two splits where perhaps we have irrigation on the main plots and then we have er fertility on the sort of middle-size plots and variety on the little plots and that is quite common does anybody have any comments on that suggestion supposing that when you're back home and starting to work somebody says i'm doing a three factor experiment so i'm doing it on a split-split that's called a split-split plot design very difficult to say if you've had a few drinks [laughter] does anybody have any anybody have any comments on whether they would encourage that whether they would like that sm0670: nm0658: ah [laugh] sm0670: and it was how would we differentiate the effect of the two that you combine in that one like the the variety the irrigation in the block nm0658: yes here sm0670: combine make the main nm0658: 
i had here i had irrigation and fertility on the main plots sm0670: yes nm0658: yes sm0670: how would you s-, nm0658: how sm0670: how would you differentiate the effect on your experiment of nm0658: well if you look at the ANOVA table that's the same question as though you just did a randomized block design this is like a randomized block design and you'll have a set of treatment means which give you the mean for each level of irrigation and you will have another one which gives you the mean for each level of fertility just the two means 'cause there's only two levels and then you'll have another table for the interaction and that is one reason if you're not sure how that's going to look in practice that's one reason why Genstat gives you a dummy analysis not just with the degrees of freedom which you got here but also with random data so you can see how all the means will look and you can see i wonder if that will give me sufficient information to understand all the components of the treatments that i've applied to my experiment i think it's a similar question that most people feel that if they've got many factors they're much happier if each factor's at a different sort of level which leads you towards having two factors in a split plot experiment and three factors in a split-split plot and i hope you never have five factors because then you've got huge plots and and so on does does anybody have any thoughts about would you encourage sm0671: well just a general stab in the dark nm0658: yes have a s-, sm0671: you split your plot into levels of fertilizer and split it again and they're very small plots the smaller the sample er if you have a larger sample variations tend to be absorbed in a large sample rather than a small sample is that nm0658: that that's that's almost there sm0659: depends on the way degrees of freedom nm0658: degrees of freedom will come in er you would it would be like having the top level it would be like just having irrigation at the top 
level does anybody have any general feelings about whether they would encourage experiments to be at lots of different levels or whether they would re-, prefer information to be at a single level sm0672: sf0673: wouldn't it depend on what you're looking nm0658: i-, sf0673: at what you were interested in 'cause if you want to to find er equally the effect on irrigation and part of the effects of fertilizer and variety and on interactions then by splitting it up into several levels you're going to lose a lot of degrees of freedom for your upper levels nm0658: right sf0673: er so surely the information that you'll obtain will be er you won't have er i don't know how to finish that sentence nm0658: er let let me try and finish it for you well maybe try and ask somebody else sf0673: mm nm0658: because i think you were voting for the side of not having too many levels sf0673: mm nm0658: er there were other people who i think instinctively said let's have more levels is there anybody who would who approves of the idea of having lots of levels sm0671: in blocks why didn't we use one of those factors nm0658: right e-, each one going down sm0671: yeah that's the way they nm0658: right okay having your idea of having two factors in the split plot design and three factors therefore in a split-split plot design is extremely common my view is your other intervention which is to say let's not have too many levels unless we need to so the general view i have is that lots of levels causes complication if you can have your experiment at a single level life is simpler all your tables are compared at the same level all your plots are the same size and the analysis the design is simpler and i think the analysis is simpler so the split plot analysis has lots of different levels and you will find when you look at the standard errors that that indicates that the analysis becomes very messy and complicated so i would prefer not to have split plot experiments the only reason i would have
a split plot experiment is if some factors need large size plots like irrigation and other factors don't and then if the l-, if irrigation needs large size plots and we want to have irrigation and variety if we want to have no different levels then all plots have to be very very large whereas because variety only needs small plots we can choose to split the large plots that we need for irrigation into subdivision that seems a good reason i find there is no other good reason for having split plot designs we will see later today that the whole nightmare of repeated measures analysis is a similar argument that the repeated measures are repeated within the plot they're like a sort of split plot where time is each measurement and you will find as soon as you have lots of different levels in your data your analysis is getting messier so i'm not very comfortable with the idea that if you increase the number of factors you also increase the layout problems remember at the very beginning we've said that when you look at design please think of your treatment structure we said we want three factors because that ties in with our objectives and then please think of your layout now what the people who do split-split plot all the time are doing is they are thinking of these two together the treatments and the layout and they keep confusing them together and i would like people to think of the treatments first and say i could satisfy my objectives by having three factors and there'd be lots of objectives i could satisfy now can i have a very simple layout a simple layout is a randomized complete block sort of layout with twenty-four plots in this case for each block and so there are no different levels for the treatments and if you can manage that please do it if you can't manage that you say well maybe i'll have to go to a split plot sort of layout with two levels but don't volunteer for it and say because i've got two factors i will automatically go for two levels i think that 
causes many problems in the analysis of experiments and is part of the reason people don't exploit their data as much as they could sm0669: er my last question er would you be able to tell what a reasonable portion is statistically speaking of the effects of say irrigation confidently then confidently them i was thinking when you split them and you could do these things see i would have gone in for irrigation as the main and then the er er fertility i mean the maybe the fertility that and then gradually move on to the variety and have an idea whether even there is an interaction between variety i mean irrigation nm0658: if you lay it out in a randomized block with just one level you can answer all those questions you can answer the questions about irrigation about variety and about the interaction whether it's laid out as a split plot or whether it's laid out as a randomized block so the questions you can answer about the treatments are the same there are some people who would argue that when it's laid out as a split plot compared to random and let me come back a step when it is laid out as a randomized block you are sort of treating each factor equally they are all on plots of the same size one of the arguments given in the textbooks for having a split plot is where you want more information in this case let's say on variety so you're going to put them on the little plots and then they're always close together and you ye-, you want less information on the irrigation so you put those on big plots that are by definition further apart on average and so when you compare a randomized block with a split plot in the split plot according to the textbooks you get more information on the subplot factors and less information on the main plot factors the problem i have is that you get much less on the main plot compared with a very little gain on the subplot and also that does not account for the fact you also add in unnecessary complication in interpreting the results
which means that i don't like that as a reason for doing a split plot experiment the only reason i like is the fact that you have to for practical purposes and there are many of those okay the next subject then we'll have a break and we'll discuss repeated measures after the break now what i hope to show you is that the ideas of data management which we distributed in session three which was very early last term that was data management for any sort of data when we translate those ideas into experimental data experimental data are quite simple usually and so you shouldn't have any problem if you manage the data sensibly and if you keep the principles so we're going to review the standard ways of entering experimental data which i think follow automatically if you've understood the ideas of design and what we're also going to show you is what happens if your data have been entered differently and this is where we're hitting a new problem now because of all these people as i've said before who are maniacs for Excel and that means you can enter data in all sorts of crazy ways doesn't it er and and then you can have problems reorganizing your data so you can do the sensible analysis and i have to tell you that as statisticians now this is serious because in the olden days we found that we spent all our time helping people on the analysis now most of your time seems to be spent on rescuing your data from poor data management the analyses are very quick you just click on the ANOVA button in Genstat so the analysis step is very very quick and you've done it many times this term the step which isn't quick is reorganizing your data because they were entered in a funny way so there are two ways to avoid this and i want to indicate both of them the first is please enter your data sensibly and then you won't have this problem and the second is if you haven't entered your data sensibly please don't re-enter please use the computer to reorganize your data but accept that don't 
get annoyed at either Excel or at Genstat or Minitab because that's taking the time i'm afraid it is the data manipulation that does take the time so allow a little bit of time for that so we're going to show you both and you either need to reorganize the data because you've entered it in an odd way or because alternative analyses require different layouts so sometimes when the data aren't so simple you might have three different analyses and for one analysis it was good to enter the data across and for another analysis it was good to enter the data down so you need to become a little adept at data manipulation or you're going to waste a lot of time and a lot i do mean it perhaps i'll er tell you one er not horror story but close the subject of data management is not well reported in the textbooks when i arrived back a couple of years ago one of the first advisees i saw was somebody who was just finishing his PhD at Reading and he had been on this course but we hadn't been discussing much about data management in the olden days and he'd followed a little bit of the notes on how to organize the data and it was discussions with him that clarified to me that we must include data manipulation data management in this course which is why we've changed this course to include this he had an experiment which was done here and he'd done his experiment at two different places for each of the three years of his PhD and each experiment he had measured ten different things so he'd had ten measurements it's all very simple these were very simple experiments he had forty-eight plots in each experiment he had now looked in a textbook and he looked at his notes to see how to enter the data and he had entered the data quite nicely organized with three columns the first column was the yield or the measurement the second column was the block and the third column was the treatment there were twelve treatments and there were four blocks so he had measurement block treatment because he'd 
looked in a textbook and textbooks only seem to deal with experiments when you have a single measurement which is unlike the real world where you always have lots of measurements he decided to enter his ten measurements in ten different files so he now had ten little files each one with three columns the block and the treatment columns were the same and the measurement was different and then he had six experiments so he now had sixty files each one with three columns in and each one followed precisely the method of analysis he'd been taught in his course and he'd finished his analysis and he was three weeks away from submitting his PhD and his supervisor looked at his information and said there are two interesting things i would like you to do in addition the first is that i've noticed that you have measured the yield in two different ways i'd like you to do what's called an analysis of covariance where you adjust one measurement for the values of another measurement could you please do a bi-, a simple analysis of covariance for each of your six experiments and he gave him a very simple thing and showed him it was in the textbook the second question was even more perplexing to him he said could you do a simple combined analysis where you combine the information for the six experiments together to see how the results you have reinforce each other and so he came to statistics to do this he knew no computing he only knew how to turn the computing handle so he was now being asked a simple question of data management well it would have been simple if his data had been organized sensibly but can you see that this idea of covariance which would have been trivial with Genstat normally meant that he had two data files which he had to merge together because the two columns were in different files nobody'd taught him about merging files and the combining of the experiments the six experiment meant not just that you merged the files but you then had to put them end to end 
because he had forty-eight but now he wanted forty-eight times six with another column which said which experiment it was so when he came with three weeks to go we explained these ideas to him but because he had no concept of data management or of computer ideas such as merging files he never succeeded the only way it could possibly have happened is if somebody else had taken his data and done it all for him that's not his PhD or if he'd learned a little about data manipulation and these were very very simple tasks i have to say these are much easier tasks now you've got Windows than they were when you had to merge files using DOS commands but nevertheless this was not possible for him to do and that's data manipulation taken to its illogical extreme but i have to say that the way he was encouraged to enter the data first was exactly what was recommended in a very popular textbook which we use on our courses says when you're entering your data into the computer this is the sort of layout and it encouraged the layout which caused his disaster because it didn't say of course in practical experiments you will have more than one measurement and just put them end to end don't make them in a different file so let me just remind you this was the yield data that we discussed in session three you have nothing new you need to learn if you understood session three notice that the way that we have laid this out with the block numbers is exactly the same as the way Genstat has just randomized your experiment at the beginning so here are the blocks the repetitions and the treatments and here are the measurements that you just type in when you get them so as long as you don't mess things up from where you started the rules are very simple how many plots do you have each plot becomes a row each column is a measurement i can't think of anything that's more simple and that's what you have to do you will find that there's one complication which we do discuss in session three which is 
what happens if some of the measurements are made at the plant level or the plot level here sorry the plot level and other measurements are made at the plant level so for example it's very popular to measure the yield by harvesting at the plot level but you might measure the height of twenty plants in each plot i wonder how you would enter that to which the answer is either in Excel or in Genstat you have one sheet for your plot level and you have another sheet for your plant level and we've covered that both in the discussion in section three and in the practical that you did so as long as you're happy with those you don't have a data management problem you should enter your data in this way i can leap ahead a little bit to these repeated measures ideas supposing that here's an example i'm sorry it's a bit in French but this is the weight of small potatoes middling potatoes et cetera we're going later on to consider repeated measures which is measuring things after twenty days twenty-five days thirty days and things like that how should you enter those data answer they are just measurements so enter them across just as though you were entering different measurements so just because you made the measurements which differ in time don't get more complicated just treat them as a measurement and so if you have six repeated measurements just they're six measurements enter them across they haven't changed the number of plots in your experiment so they go across this is how not to enter experimental data this is the same data here we have the treatment information and here we've entered the data for replicate one replicate two replicate three for one measurement that is very popular isn't it [laugh] i i'm sorry i'm i'm looking at you because i-, i-, you know it's not you that's made this as a this is encouraged in many departments and it seems very obvious and it works quite well for simple problems also notice that if you were doing the analysis by hand in the olden days 
this is exactly what you would do because you could work out the mean of these and you could get the mean for that treatment so you could get all the treatment means down here and all the block means across here so if you are in the habit of confusing entry with analysis this is wonderful you can use Excel to work out calculated columns and it all seems to work quite well until you try and do a full analysis and then it all falls apart do not confuse analysis with entry if later on you want to analyse your data either with Genstat or Excel you can enter the data in the proper way that's like this if you would now like to look at the data like this i've said this is how not to enter experimental data if you want to look at the data like this that's terrific it's called tabulation you enter the data in long columns and then you say please tabulate the columns and across the top i want one factor and down here i want the other factor and then i'll put some summaries i don't have a problem looking at the data like that i have a problem entering the data it's confusing entry with analysis in Genstat that's called tabulation in Excel it's called pivot tables i don't care h-, whether you do it in Excel or whether you do it in Genstat so enter your data in Excel the proper way enter your data in Genstat the proper way if you want to look at the data like this tabulate the data present the data very nice you can see what's going on you can see what a nightmare this is for your data entry by looking and explaining to somebody how you would enter these data in that format because that only works if you've got one measurement but in the real world we have here we have one two three four five six measurements are you going to enter them on six different sheets it's all getting messy this is very simple and if ever you want to transform your data and get the total it just becomes another column down here sf0668: why doesn't he have repeated measurements of so many different 
factors nm0658: if you've got repea-, lots of repeated measurements they still keep going across sf0668: nm0658: and you might have sets of repeated measurements you might measure lots of things after twenty days another set of things after thirty another set of things after forty they can go across so you can go way across here er sm0669: what if experiment where you take in data from de-, for instance you do maize and pea or whatever it is and then you have all similar columns like this will it make safe to try to box them together or have separate because on them really i mean they're independent nm0658: okay sm0669: sort of shoot up into for say maize height nm0658: okay sm0669: nm0658: right th-, we'll that can be the last question before we have a quick break 'cause you have such a nice new coffee place downstairs er th-, er the question i hope it's understood to everybody there are some experiments which are called mixed cropping experiments does anybody not know what is a mixed cropping experiment you you all happy that this is a an experiment where you might have maize on sub-, some plots and beans on other plots and then the aim of the experiment is to see how the maize and the beans mix together so some plots have both maize and beans together and my simple rule for this is please stay simple so that is that please do exactly the same down here you have all your plots your treatments will say is it maize sole or bean sole or a mixture and then here are all your measurements and in your measurements you will have lots of blanks for the observations that don't the maize observations don't have any beans leave it blank or put it as missing it doesn't matter keep life simple in the past there has been a problem particularly related to a package called Mstat that is very poor at data manipulation and that has caused people to say well maybe i'll enter the maize data in one file and the bean data in another file or maybe i'll enter the maize sole data in one 
file and the bean sole data in another and the mixed data in a third and then they get very confused er so to avoid all that confusion keep my simple message don't confuse the entry with the analysis the reason people choose the different files is to simplify the analysis simplify the analysis afterwards but for the entry the entry is simple if you say how many plots did i have and i will enter observations and if there are some observations i don't have they're like missing values the reason is different but you just leave it blank or you put a missing value code in let's have a nice simple life and then you will find that the analysis is also simple after the break er we'll look at how to manipulate the data and then quickly on to repeated measures okay we'll have a break nm0658: i'm realizing as usual that luckily you have the slides for everything because i don't think i'm going to finish yet again and i do want to discuss repeated measures so i may have to leave out a little bit of one of the topics on data management i just want quickly to review the ideas to compare these two formats you've seen our recommended way of data entry this is all revision now has one row for each plot or unit and that's the lowest possible level if it's a subplot it's one row for each subplot so in the example in lecture three there are twenty-seven rows because there were three reps by nine treatments there were twenty-seven plots in the randomized one just now there were ninety-six rows because there were ninety-six plots therefore there's one column for each measurement the other way looks like a textbook example for a hand analysis also unfortunately and they've been written too many times but because they're all powerful they don't have to listen to anybody er in Excel it's the way you need to lay out the data for Excel to do an analysis of variance we do not recommend you use analysis of variance in Excel if you're going to do analysis of variance use it in Genstat or 
Minitab or anything but Excel's analysis of variance is not very good sf0674: statistically it's not very good nm0658: it's it doesn't do enough it doesn't show you residuals sf0674: nm0658: it doesn't encourage a critical looking at data it only goes up to two level factorials and it's much better if you're going to use Excel for your statistical analysis you go up to simple description and tabulation maybe you do graphics in Excel but don't get into ANOVA and regression in Excel you've gone off the end of Excel use the proper tool for the task you have when you have complicated analyses of data there are many statistics packages use the one that's most appropriate for you i think for experimental data it's absolutely clear that Genstat is the most appropriate but the important thing is that you use one that is appropriate and you've fallen off the end of a very general purpose tool which is a spreadsheet [cough] so as i've said before this lecture try to use the standard format for your entry it will save time on the analysis later but if somebody has entered the data differently it's usually quicker to reorganize the data than to have to retype so don't say oh gosh you've made a big mistake you'd better start again use the computer to help reorganize the data use either a statistics package or a spreadsheet i find that Genstat the competition between Genstat and Excel for reorganizing the data for you would be quite a good one because i think for reorganizing if you knew neither package it would be quicker to learn and use Genstat because it's built to do that sort of reorganizing than to use Excel but life is not equal most of you know Excel very well and you hardly know Genstat so some of you will prefer to reorganize your data in Excel and other people would prefer to learn a bit more of Genstat to do the reorganizing you should choose the method that's most appropriate for you and that will depend on your work in the future but if you're thinking of using 
Genstat seriously in the future then try and learn a bit about its methods for data manipulation and i'll show you a little bit about what that means first you have to understand what you're trying to do here is an example where i've taken the one from before where i assume we've entered the data in the wrong way and we want to go from this way to this way that's the way we would have had if we'd entered it correctly with the replicates and the treatments and the dry matter so you ought to see what it is we are trying to do we want to go i've said from this wide format to the long format we are trying to take these and i think you can see i hope you can see we're trying to stack these one below another so what goes across becomes stacked it's not er only that you could picture that's very easy in Excel but we want to do a little more than that because we want to take this which are already in the right form and we want to repeat that so this one on the left hand side gets repeated three times once for each column so it stays opposite not just this measurement but this measurement and these reps one two and three we want to have a new column called rep which goes one two three so it's not just stacking it's a bit more what do we mean in practice usually you want to stack the measurements and you want to carry the factors let's have a look at how that works in Genstat Genstat has a dialogue called stack you just have to go you should be getting to the stage with Genstat that you become curious and you think well there must be there somewhere and you go to spread manipulate and you will find stack within stack it will say how many columns do you want to stack together well it should be quite easy for you to say i want three columns because i have rep one rep two rep three i want those stacked one below the other i want to record the column source i want to record where they came from in a column called rep that's a new column and then you put your observations which 
were called rep one rep two rep three which were actually the yields and you put them there you notice it puts a one beside to say they're all going to be one column you might have had more things as well and then it would have twos beside because that would now be a second column and you have a repeat column which is treat you want treat to be repeated each time that's what we said was needed and when you click on okay then you will find that it will produce exactly that spreadsheet so it will take that and produce that i don't think that's too difficult so that is stacking in Genstat once you've got that far and you realize that's quite easy i just occasionally you want to go the other way round not often but sometimes you have it long and you want to go wide and if you ever need that there is you should now not be surprised if there's a stack dialogue there is also an unstack and this is like stacking things on a shelf you you can either stack them up or you can take them from there and you move them sideways so that they're now unstacked so that also exists that is one sort of data manipulation the second one i'm going to mention that it exists and then leave it because i think there is no not time for the repeated measures so let me just mention that the next it's in your notes that the next is that you sometimes have data at two levels and you must summarize one level to move it to another level the example that you have in the notes is data that were measured yields were measured on the plot level but tuber number was measured on the plant level and you want to summarize the tuber number onto the plot level to analyse with your other data and you'll be doing that in the practical one of the examples does that and that is called data summary before the analysis and that's very common and is usually where measurements are made at a lower level than the one where the treatments were applied so here the varieties were applied at the plot level but you measured 
something on plants within the same plot and so i just give you the three examples that you've had already in session three there were tuber measurements on twenty plants in each plot in session in the last session you had tree measurements this was what namex described on four trees in each plot so you have a plot was four trees and you measured the girth and other measurements on each tree within the plot but you want to analyse at the plot level because that's where you applied your treatments and in the Genstat guide the very simple example that you will have seen in the first part of the guide had four replicates and three treatments and there were twelve pens but there were two sections to each pen and that's the example in the Genstat guide so there we have twelve plots and two sections in a pen so we have twenty-four sections and we want to summarize from the twenty-four up to the twelve now i've put at the bottom check you recognize these these three as the same problem because in the Genstat guide we show you in detail how to solve this problem you will meet it very often and if you recognize that this is the same problem as this and this and many others then you can use Genstat to do that initial summary and let me just show you the the dialogue that you get with Genstat without giving you there's your data at your low level you want to get the total or the mean at the higher level and you want to carry other things along so your analysis can proceed at the higher level and so again you should find there is a summarize the spreadsheet which is another dialogue which will help you when you've got very detailed data at one level and you want to move up to another level so that's in your notes and it'll be in the practical does anybody have any i haven't covered it because i'm a little behind and i would like to tackle the repeated measures does anybody have any comments or questions on that yes sm0675: case we have er four trees er in each plot we know for 
example the directional reach of the trees nm0658: yes sm0675: so we wanted to know the production for the block separately then we add all the production of the four trees and then er we'd like to know er for the whole plot nm0658: that's correct that that that's very common er you you don't have to and in the notes we describe you can do the analysis at the tree level but you must accept that you applied your treatment at the plot level so the usual thing is to say what was the production for the whole plot and then you must say well do i want the mean production per tree or the total production and that depends on the problem on er some experiments i analysed er on disease there were measurements a an an insecticide was applied at the whole plot and then ten plants were measured to see how diseased they were and the idea was now for each plot you wanted a measure of disease in the olden days people always used the mean disease score as a measure of the disease per plot we decided that because we wanted to see the most effective insecticide we should also calculate the maximum disease score namely the disease score of the worst plant in the plot as a summary number to say that characterizes the worst that could possib-, if that worst is pretty good it's a good insecticide so it's not always usually you have the total or the mean in the plot but it can be the maximum or the minimum that is also useful so you calculate a summary statistic for the plot which you then analyse at the plot level last subject repeated measures these are very common in designed experiments where measurements are repeated on the same unit they can be in time or in space and the problem from a statistical point of view is the same so animals weighed each week would be i'm assuming the animal receives a particular treatment at the beginning of the experiment or has having a diet as you go through the experiment and is weighed each week so rather than just an ordinary yield experiment where 
you measure just once at the end you record successively in time that is a repeated measure in time most repeated measures are measures in time occasionally here's another example tree diameter is recorded every six months to see about the growth in human experiments you often do lots of measurements in time you measure the weight of babies every month for a year you measure the effectiveness of a treatment for cancer by recording every three months the effect on patients and so on so this is very common in many fields of application just occasionally these repeated measures are in space an example would be if you have a hedge and you want to see the effect of the hedge on the plants that are close to the hedge you might have a hedge er with a certain tree species and then you have some rows of maize which grow and you have six rows going away from the hedge and now you want to measure each row so you applied your plot consists of your hedge with six rows each side and now you want to repeat the measurement namely the yield but not for the whole plot but for each row to see the effect in space as you go further away from the hedge and that is a problem of repeated measures in space those repeated measures introduce problems of data management and analysis which we're going to look at and it reviews many of the ideas from this part of the course we'll have a very simple example and so you have more information it's in the Genstat guide there are five replicates this is an example for you because it's an example where there are no blocks one or two er so there are just fifteen petri dishes little dishes and there are three isolates of a fungus and they're repeated five times but there's no group of five here group of five here there are just fifteen of them and there were six measurements made on days three four five six seven and eight [cough] how many plots how many measurements is it clear about the normal data and how many plots how many plots easy question fifteen 
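[editorial sketch, not part of the lecture] The layout rule described for this example (one row per plot, one column per measurement, then stacking the measurement columns while carrying the factors) can be sketched roughly as follows. This assumes Python rather than the Genstat or Excel used on the course. The isolate and rep labels, the column names `day3` to `day8`, and the function names `wide_layout` and `stack` are hypothetical, and the measurement cells are left empty because the actual values are not given here.

```python
# Hypothetical labels: the lecture does not name the isolates or reps.
ISOLATES = [1, 2, 3]
REPS = [1, 2, 3, 4, 5]
DAYS = [3, 4, 5, 6, 7, 8]


def wide_layout():
    """One row per petri dish (the plot), one column per measurement day."""
    rows = []
    unit = 0
    for isolate in ISOLATES:
        for rep in REPS:
            unit += 1
            row = {"unit": unit, "isolate": isolate, "rep": rep}
            for day in DAYS:
                row[f"day{day}"] = None  # placeholder, no real data here
            rows.append(row)
    return rows


def stack(rows):
    """Stack the day columns into one column and carry the factors along.

    The new 'day' column records which source column each value came from.
    """
    long_rows = []
    for row in rows:
        for day in DAYS:
            long_rows.append({
                "unit": row["unit"],
                "isolate": row["isolate"],
                "rep": row["rep"],
                "day": day,
                "value": row[f"day{day}"],
            })
    return long_rows


wide = wide_layout()
long = stack(wide)
print(len(wide), "rows in the wide layout")  # one per petri dish
print(len(long), "rows after stacking")      # plots times time points
```

The wide form has fifteen rows because there are fifteen plots, whatever the number of time points. The stacked form has ninety rows, one per observation, which is the shape a column-based package would want for fitting a line to each dish. This sketch stacks dish by dish; a package may order the stacked rows column by column instead, which holds the same information.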
sm0676: fifteen nm0658: it is i'm sorry you you probably were confused 'cause it's too easy er it depends whether you think of your your confuse your measurements with your plots you see if you don't it should be obvious you have a question sf0677: destructive measurements on the same plot as the nm0658: the des-, tha-, that's a very good question when you are measuring on the same plot you have the choice of measuring let's say height which is something you can go back to exactly the same plant and measure or harvesting a few plants in the plot from a stats point of view it's roughly the same thing because they are they're still within the same plot but from a precision point of view it's much better if you can i-, i-, it isn't quite the same because if you go back to the same plant then your repeated measure is at the level plant if you're measuring the height whereas if it's destructive and you're measuring let's say the height of four plants and then you throw them away then and you measure the height of four more plants then you're still repeating the measure but the level is the plot level not the plant level because you can't go back to the same plant because it's harvested so where it differs is the level at which you're able to do the repeat usually we find that the lower the level you can do the repeat the better it is so we often find that non-destructive measurements are tremendously useful and last week i was examining somebody whose thesis is on taking aerial photographs of plots where you can measure the the area roughly of each plant repeatedly very very easily by taking a photograph and that is non-destructive and was shown to be a very good way compared to these small harvests that people often take where you get exactly what you want namely the harvest but it destroys it so you can't measure the same plants later on here's the data well sorry er i haven't i'd asked you how many plots there are but it hasn't answered you should be saying there are 
six measurements so that's six columns the fact they're on days three four five six seven eight doesn't affect and there are fifteen plots therefore i'm going to have fifteen rows of data and so the data are going to look well here's an example of the way the data could look where there's the unit there's the isolate there's the rep and there's my measurements on day three four five six seven eight [cough] now you now have to think of your strategy for the analysis and here we begin with a slight problem that is i would like you to look constructively at the data exploratory analysis we said is very important i wonder how you would like to explore these data well a very common way that people would like to explore the data is to see what's the change in observation over time sort of notice this goes on the first petri dish i go three-point-seven five six-point-one seven-point-five eight-point-three nine-point-ei-, they seem to be going up with time to understand my data maybe it would be nice to look at that sort of graph as a function of time that would be a nice exploratory method but unfortunately for a stats package exploration works best on columns so you may wish to do that exploration in Excel or you will find Genstat helps so a strategy for the analysis nothing changes you've changed the problem but you haven't changed the strategy which is please start by looking critically at your data so start with data exploration and that's usually graphs so you look at all the data are there any odd observations you could do those in Excel or in Genstat and you could get one graph for each plot so there'd be fifteen graphs be one way of exploring the data let's have a look at that first i'll come back to that for the second part so there's a way in Genstat of exploring the data and you will find that Genstat has a little menu which puts this out automatically so here's the graphs for er one treatment the second treatment and the third treatment and here is the graph for the means 
for the three treatments so there's a bit of analysis but here we have all our data this is the ninety observations there are fifteen lines because we have fifteen plots and each line has six points because we have six time points so we actually have all our data here and we could see if there was some odd observations i don't see anything particularly odd sm0678: when you say odd what exactly do you mean nm0658: the full set of numbers consists here's all our data these are all the numbers and what we have is we have fifteen plots so we're actually using all the numbers in those graphs you can see every number somewhere there so we're using all our data in producing those plots sm0678: my question is by looking at the graphs what are we looking for nm0658: er okay does anybody have anything they've found without knowing seriously what they're looking for sm0679: there is a nm0658: any impressions sm0679: increase er to an X -axis and nm0658: there seems to be an increase any er re-, remember your chick experiment and the increases sort of straight lines or curves sm0680: curves sm0681: curves nm0658: curves we we sort of like that sm0682: yes nm0658: or wiggeldy sf0683: some of them sm0682: mm some of them were sf0683: and some were straight mm nm0658: i don't notice any that go sort of right like that sort of starting going up to the top and coming down remember we're worrying about statistics there's variation so th-, everything you can't have things that are exactly straight 'cause we're just connecting the points do you do you think that for some of these it would be sensible to have a straight line model would that be a rough reasonable summary sm0684: nm0658: for all the plots are there any plots where you think a straight line model isn't going to be sensible i don't actually see many which is surprising usually you find maybe one treatment is curved and the other treatments are straight as you found with the barley and the wheat one was more curved than 
another you might want to recognize that does anybody see any very surprising observations i don't i don't i don't sort of see a sudden spike like this which might be a recording error so this is exploration and exploration can be positive or negative usually i find you notice one or two very odd observations here i don't see anything very odd and things seem to be increasing so that here where this is the mean of that one i feel reasonable confidence that drawing a straight line which is the average for those points is probably a reasonable summary and i notice now that this straight line so this is analysis and this is just presentation of the raw data and i now notice in this summary that these all three seem to be going up but this maybe is going up more gradually A it's lower and it's going up more gradually so from these i feel that this is probably a fair summary of the data i don't see any reason to say oh gosh it's not fair because of this and in here i'm starting my analysis and i've done that in a very simple way and visually does that help to answer your question [cough] okay so we have our data and i was looking for the strategy okay so i suggest for all analyses you start with a simple summary and then you go on to simple analyses what could the simple analyses be well here we have the data one simple analysis could be to analyse the data on day three just take one time point another one day eight so there's a very simple analysis you could analyse each of your observations separately the next simple analysis could be to take a useful summary one summary might be the difference between day eight and day three has the change been the same for each treatment and each replicate so that's what we suggest i keep losing the er so i've suggested the first simple analysis could be the data at each time point then we could have simple function like the final minus the initial or we could have the slope of each line now you don't want the slope if too many 
lines could be very curved but here i think getting the slope of each line might be a sensible summary as a strategy what sort of strategy is this well i would claim that the repeated measures are like observations at a lower level they're not exactly the same but they're a little like a split plot analysis they're like taking day as a level within a petri dish and then in this example we have six observations within each petri dish for the six days or we have ten weights within each animal so it's a little like a split plot experiment where the factor time is within the treatment just as an ordinary split plot but it's not quite a split plot because we don't randomize the times we can't like anything like that the analysis will be simpler if we first get a summary value at the plot level so our analysis is going to be simple if we summarize up to the petri dish whatever we do we've got fifteen petri dishes whatever summary we'd like to get that would be a simple analysis if we can go you were asking about split plots and i was saying simple experiments are at one level well repeated measures bring in a second level there are many methods for analysing repeated measures bringing in the two levels but the simplest is not to have two levels but is to summarize the data from the repeated measures up to the one level because that's where we apply the treatment which summary is appropriate depends on the data and the objectives and the booklet on analysis that i gave out in week eleven gives you some more details so the graphical display indicates that a useful summary might be the slope of the regression line for each petri dish so we then have a problem how do we get the fifteen slopes in Excel we could use the data as they stand in Genstat be better to stack the data so now to do those slopes because Genstat works with columns be better to stack the data and that was shown earlier and once you analyse with the stack data you will find and the way we'd go through that 
and get the regression you will find that here are the fifteen observations and here are the fifteen slopes and here is the analysis where we're actually analysing the slopes we're getting the individual slopes and there's fourteen degrees of freedom here because there's fifteen petri dishes and we find that er the effect of slope is statistically significant and these are the three slopes which are the lin-, the slopes of those three lines and we find that these first two treatments are about the same but this slope is rather flatter there are many other methods of analysis of repeated measures and Genstat has a whole set of dialogues specially for that they all try and get more levels more information by leaving the data at the two levels rather than summarizing up to one level they're often attractive in principle they're needed that should be an if they're needed if there is not enough data but to me they have a major problem that they are much more complicated and often their real problem is you can't tailor the analysis to the precise objectives of your research so they're like many complicated analyses that they're wonderful in principle but in practice they don't help that they are playing with data often and my conclusion is use the simple methods wherever possible and if you do use the more complex methods which you can because Genstat provides them from menus then don't just use them make sure they add constructively to what you were able to do quite simply with the simple methods okay practical work the practical follows the topics covered in this session so it's useful for you to review those so we're doing some on designing some on managing data and some on repeated measures in each case i've deliberately used examples from the Genstat guides so you don't have to finish just concentrate on those parts you find most interesting and that will help you get more experience in using Genstat and two final slides this is now the end of the five sessions 
which are specifically for the analysis of experimental data you now should have two things the first is the broad picture of the role of statistics in research projects which has come from sessions one to five that was last term so you should have an idea of how you use statistics in design in data management you should have reviewed basic statistical techniques statistical inference ANOVA simple regression that was the second part last term now you should be familiar with some of the special methods for analysing experimental data and hopefully you are therefore ready for a brief introduction to the role of modern statistical methods to help you in processing your research data the lecture room next week is the plant sciences lecture room so we're not here next week we're together with the other group in plant sciences ss: nm0658: in ss: nm0658: sorry ss: nm0658: the ground floor i pres-, the ground floor lecture room in plant sciences sm0685: you mean the small lecture theatre sm0686: there's one nm0658: i hope not it's w-, sf0687: there's one lecture theatre anyway isn't it nm0658: sorry sf0687: just one there isn't it nm0658: there's just one one lecture theatre i've been told it's got to take seventy of us sf0687: no it's large sm0688: nm0658: so it's the large lecture theatre ss: nm0658: and next week the practical is again for you in the Met department sm0689: is it just for us or will it be everybody else here as well nm0658: everybody is in the lecture and the practical is still split into the two groups as you go can you please we want your critical review of these five lectures and so we have another of the evaluations this is on this session any comments remember we've changed the course a lot this year so any comments they don't have to be polite and i'll collect this in the practical so can you take one as you go out and then i'll collect these in the practical
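A minimal sketch of the summary-measures strategy described in this session: reduce each plot's six repeated measurements to a single slope, then compare the fifteen slopes across the three treatments with a one-way analysis of variance. It is written in Python as an illustration, not as the Genstat or Excel steps used in the lecture. The data values, the assumption that the six time points are days 3 to 8, the equal split of five plots per treatment, the treatment labels A, B and C, and the function names ols_slope and one_way_anova are all hypothetical.

```python
from statistics import mean


def ols_slope(x, y):
    """Least-squares slope of y on x for one plot (one petri dish)."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx


def one_way_anova(groups):
    """One-way ANOVA on a dict {treatment: [values]}.

    Returns (df_between, df_within, F statistic).
    """
    all_values = [v for vals in groups.values() for v in vals]
    grand = mean(all_values)
    ss_between = sum(len(vals) * (mean(vals) - grand) ** 2
                     for vals in groups.values())
    ss_within = sum((v - mean(vals)) ** 2
                    for vals in groups.values() for v in vals)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return df_between, df_within, f_stat


# Hypothetical data: 3 treatments x 5 plots, each measured on 6 days.
days = [3, 4, 5, 6, 7, 8]                    # assumed time points
assumed_slopes = {"A": 2.0, "B": 2.1, "C": 1.0}  # made-up treatment trends
wiggle = [0.3, -0.2, 0.1, -0.3, 0.2, -0.1]   # made-up day-to-day variation

data = {}
for trt, slope in assumed_slopes.items():
    for rep in range(5):
        plot_slope = slope + 0.05 * (rep - 2)  # plots differ a little
        data[(trt, rep)] = [10 + rep + plot_slope * d + w
                            for d, w in zip(days, wiggle)]

# Step 1: summarize up to the plot level, one slope per plot.
slopes = {plot: ols_slope(days, ys) for plot, ys in data.items()}

# Step 2: analyse the fifteen slopes as a simple one-level experiment.
groups = {}
for (trt, rep), s in slopes.items():
    groups.setdefault(trt, []).append(s)

df_between, df_within, f_stat = one_way_anova(groups)
mean_slopes = {trt: mean(vals) for trt, vals in groups.items()}

print("mean slope per treatment:", mean_slopes)
print("df:", df_between, df_within, "F:", f_stat)
```

With fifteen plots the two degrees of freedom for treatments and twelve for residual add up to the fourteen mentioned in the lecture. The same two steps apply to other plot-level summaries, such as the final minus the initial value, by swapping ols_slope for a different summary function.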