Don’t Cry Over Spilled Validity


An analysis of A Level English Literature


In his introduction to Testing Times, exemplifying the fitness-for-purpose of a test, Gordon Stobart states that we ‘do not want somebody to be given a driving licence solely on the basis of a theory test’ (2008, p.14). He posits three questions which force a philosophical and practical evaluation of any assessment:

  1. What is the principal purpose of this assessment?
  2. Is the form of the assessment fit-for-purpose?
  3. Does it achieve its purpose? (p.14)

To apply Stobart’s exemplar to the study of English Literature – to equate the doling out of driving licences to the handing out of A Level certificates in English Literature – presents us with more questions. What licence is granted by bestowing an A Level in English Literature? Does its test equip students to ‘drive’ a deep understanding and application of literary theory at university?

Taking Stobart’s three elemental questions, this article explores the purposes of the study of English Literature in relation to issues of validity and reliability: English is presented as problematic in its breadth of interpretation (both in students’ practices and in teachers’ conceptions of expertise in the subject domain). Attempts to make the evaluation of English Literature more reliable risk narrowing the scope of the domain and compromising its validity. In addition, attempts to systematize evaluation could erode the English professional community upon which rests a judgement of what constitutes evidence of knowledge, skills and understanding in English Literature.


Purposes – what is the purpose of A Level English Literature?

Proposing his ‘three best questions to ask of any assessment’, Stobart (2008, p.14) guards us against being deceived by their simplicity. Unpacking these questions, we find the core theoretical issues that underpin any thoughtful consideration of assessment: validity, reliability and what Stobart calls ‘the spectre of unintended consequences’ (p.14).

Stobart outlines three broad purposes for assessment:

  • selection and certification
  • determining and raising standards
  • formative assessment – assessment for learning. (p.16)

He argues that an assessment can be made to serve multiple purposes: for example, grades at A Level can both certify individual students for entry onto an undergraduate course and determine the standard of tuition of a particular course.

Newton (2010) formulates a list of 22 purposes for assessment. He considers the utility of categorizing assessment purposes and concludes that highlighting the variety of purposes ‘helps to foreground the necessity of tailoring assessment design to assessment purpose’ (2010, p.394). One cannot evaluate the effectiveness of a qualification without first establishing its purpose.

Before considering the purposes of A Level English Literature, it is useful first to consider both the purposes of the study of English and the purposes of the A Level qualification.

Following the Education Reform Act 1988, the Department of Education and Science and the Welsh Office published English for Ages 5 to 16 (more commonly referred to as The Cox Report), in which they sought to define targets and programmes of study for the newly instituted National Curriculum. Its authors identified five purposes for English study: ‘personal growth’, ‘cross-curricular’, ‘adult needs’, ‘cultural heritage’ and ‘cultural analysis’ (1989, p.60). They made it clear that these ‘are not the only possible views, they are not sharply distinguishable, and they are certainly not mutually exclusive’ (p.60).

The process of gathering these views is not defined within the report and these views ignore the practical purposes of qualification and certification, which we will explore in greater depth when we consider the purposes of the A Level qualification in particular. However, these views inform the structure of the subject and illustrate tensions evident within both the English Literature curriculum and assessment of English Literature.

Whilst the Cox Report’s five views of English overlap and inform aspects of both English Language and English Literature, it is English Literature that is most concerned with promoting cultural heritage by fostering ‘detailed knowledge and understanding of individual works of literature’ (DfE, 2014, p.1). In addition, students are asked to ‘identify and consider how attitudes and values are expressed in texts’ (p.3), linking to the ‘cultural analysis’ view. In terms of ‘personal growth’, both the A Level Subject Criteria and the curriculum as a whole carry an expectation that students should develop an ‘interest in and enjoyment of literature’ (p.1).

A clue to the wider purpose of A levels can be found in the Department for Education’s 2010 white paper, The Importance of Teaching: ‘A levels are a crucial way that universities select candidates for their courses, so it is important that these qualifications meet the needs of higher education institutions’ (DfE, 2010, p.49). Following this logic, one of the primary purposes of A Level English is to certify students for the study of English Literature at undergraduate level. This is echoed in the stated rationale for the reform of A Levels. Quoted in the House of Commons briefing paper, ‘GCSE, AS and A level reform (England)’, the 2013 Schools Minister, David Laws, explained that A levels were being reformed in order that they would provide students with ‘the right skills and knowledge to allow them to progress’ (Long, 2017, p.14).

These statements couch A level English Literature firmly within Stobart’s ‘selection and certification’ categorization of assessment purposes (p.16). If so, our determination of whether the English Literature qualification meets its purpose is a question of whether it qualifies students to study related subjects at an undergraduate level.

Newton posits that ‘certification might be read as an implicit qualification purpose’ (2007, p.162): sometimes a loose ‘grading’ and sometimes a more precise indication of ‘the attainment of a general competence or profile of competences’ (p.162). Newton warns that approaching purposes of assessment as discrete and unproblematic risks compromising the validity of such assessment. He explains, using the example of rising average points scores at GCSE, that it is ‘questionable whether GCSE results possess the stability of currency’ to support both national accountability and individual qualification as mutually supportive purposes (p.167). If we regard qualification for undergraduate study as the primary purpose of A Level English Literature, we can then explore issues which might compromise its validity for that purpose.


Fitness-for-purpose: Validity

Definitions of validity

Before establishing whether A Level English Literature is a valid measure of preparedness for undergraduate study, it is necessary to consider the contested nature of the term validity and how this might influence the breadth of our analysis.

Messick (1989) argues that the definition of validity cannot be considered solely in terms of the properties of a test. He defines content validity as ‘judgmental evidence in support of the domain relevance and representativeness of the content of the test instrument’ (1989, p.7). In short, does the test reflect appropriately the content of the curriculum?

Messick defines criterion validity as ‘selected relationships with measures deemed criterial for a particular applied purpose in a specific applied setting’ (p.7). Whereas content validity is preoccupied with what has been taught and remembered, criterion validity is more concerned with how a student can demonstrate proficiency. Here we can see a deepening sophistication in the evaluation of validity: we are no longer concerned simply with the validity of the content of the test; we widen our concern to the criteria for evaluation, which form part of our consideration of the test’s overall validity. Even though an English Literature paper contains extracts from taught texts, previously unseen texts and questions related to the taught curriculum, this does not constitute an unproblematic measure of validity. Messick argues for ‘construct validity’, which integrates ‘any evidence that bears on the interpretation or meaning of the test scores’ (p.7).

Cronbach (1971, quoted in Newton [2013, p.3]) problematizes the idea of validity as a discrete property of a particular test; he argues that ‘one validates, not a test, but an interpretation of data arising from a specified procedure’. This guards us against evaluating a test’s validity purely on the grounds of its content and process.

Koretz (2009) defines validity not simply as an interpretation, but as an inference that exists on a continuum: ‘inferences are rarely perfectly valid. The question to ask is how well-supported the conclusion is’ (p.31). He argues that not simply testing, but test preparation is part of the consideration of a test’s overall validity: ‘What matters is the inference from the tested sample to proficiency in the domain, and any form of test preparation that weakens that link undermines the validity of conclusions based on scores’ (p.34). To refine our earlier question: to what extent does a particular test give grounds for evidence of such proficiency in the academic domain of the study of English Literature?


Is A Level English Literature a valid test?

In England, A Level English Literature examinations are regulated by the Office of Qualifications and Examinations Regulation (Ofqual). Whilst the Department for Education determine the subject content via national curricular documentation, it is Ofqual who ‘provide the framework within which the awarding organisation creates the detail of the specification’ (Ofqual, 2011a, p.3). There are currently four main awarding organisations (exam boards) for A Level English Literature in England: AQA; Edexcel (a subsidiary of the Pearson publishing group); OCR (part of the Cambridge Assessment Group); and WJEC Eduqas, the Welsh Joint Education Committee’s brand for offering qualifications to English schools.

For the purpose of assessing the qualification’s validity, we will explore the major changes to the qualification from 2010, when reforms were first announced (Long, 2017, p.3) to the present day. We will also consider these reforms in relation to the aforementioned contested purposes of English as first defined in The Cox Report.


From 2015, AS and A Levels were decoupled, and by 2018 all A and AS Levels will be taught in a linear fashion with terminal examinations at the end of each course. The change from modular units to a linear approach was justified in parliamentary debate by the 2013 Schools Minister, David Laws: ‘[Students] and their teachers have spent too much time thinking about exams and re-sitting them, encouraging in some cases a “learn and forget” approach’ (Long, 2017, p.14).

Reforms to the structure of the qualifications were accompanied by more detailed changes to the subject criteria for English Literature at A Level. In 2011, Ofqual published a comparative review of standards across exam boards in the provision of A Level English Literature between 2005 and 2009. One of their key findings was that ‘the demand of qualifications varied’ due to ‘the amount of choice available to candidates within the specifications and through varied schemes of assessment (including the opportunities to choose between coursework and question papers)’ (Ofqual, 2011b, p.3). This has been addressed in the 2015 A Level by imposing a mandatory level of 20% assessed coursework for all exam boards.

Cushing et al’s (2015) review of the new English A Levels identified four common changes to the qualification. Three are listed together: ‘the reduction of coursework to 20%, the reduction of the number of set texts from 12 to 8, and the requirement for all specs to have an unseen element in the examination’ (p.24). The fourth is that each board’s exam specification must ‘specify at least 3 pre-1900 texts out of the total of 8, and one of the 8 must be post-2000’ (p.24).

What is most notable about the reforms is an attempt to standardize provision across exam boards and restrict the number of pathways through to certification. Reducing the number of re-sits, harmonising the proportion of coursework and mandating an unseen element in each English Literature exam: these efforts address issues of reliability in assessment by reducing the variability of assessment practices between exam boards for the same qualification. However, as Koretz argued above, all aspects of a test impact upon validity. Harlen (2007) goes further, asserting that the ‘key interaction is between reliability and validity’ and that ‘an assessment cannot have high validity and high reliability’ (p.21). Reforms designed to strengthen the reliability of a qualification have an impact upon its validity. Harlen uses the term ‘dependability’ to show ‘the extent to which reliability is optimised while ensuring validity’ (p.21). She warns that ‘attempts to increase reliability by standardizing the tasks that are assessed by teachers lead to narrow artificial tasks of low validity’ (p.21).

An example of this narrowing is evident in Ofqual’s review of A Level English Literature which concludes in part that ‘formulaic questions in some specifications reduced demand’ (2011b, p.3). Teachers familiar with particular exam specifications become accustomed to particular question types that re-emerge year after year (for example, OCR’s repeated use of a particular question on ‘The Franklin’s Tale’ [p.15]), giving rise to explicit preparation for particular types of formulaic written response.

How the contested purposes of English impact upon test validity

Marshall (2011) argues that in English ‘judgement is practised and criticism exercised’ (p.10): there is a complex interplay between delivery and acquisition that is particular to the subject. She invokes Dewey in order to claim English as an arts-based rather than a technical subject: ‘There is all the difference in the world between having to say something and having something to say’ (Dewey quoted in Marshall, 2011, p.11).

The A Level calls for such practised judgement and rewards sophisticated originality. For example, an Edexcel Drama Paper 1 mark scheme rewards ‘a critical evaluative argument’ that displays ‘sophisticated understanding of the writer’s craft’ (Pearson, 2017, p.4). The link between the curriculum and the assessment is straightforward, appearing to ask for the type of critical evaluation that would be called for in the curriculum. The curriculum asks for wide independent reading, the application of ‘knowledge of literary analysis and evaluation’, critical engagement and contextual awareness (DfE, 2014). The resulting exam specifications provide ‘an adequate prerequisite for students wishing to pursue a range of English related degrees in [Higher Education]’ (Cushing et al, 2015, p.34). But is this enough evidence to assure us that A Level English Literature is a valid test?

In a response to Newton’s (2012) attempt to clarify the definition of validity in relation to testing, Pollitt (2012) makes a bold and interesting assertion relevant to our analysis: ‘validity can only be created at the very beginning of the process – after that, validity can never be created, it can only be lost’ (p.102). He charts the assessment process as a gradual erosion of validity: ‘The perfect validity with which the enterprise began will be diminished at each step; with a lot of cleverness, an awful lot of effort and attention to detail, and a bit of luck, there will still be sufficient validity in the procedure when the final user makes her interpretations’ (p.102). In the case of English Literature, the test might reflect the curriculum, but processes that precede and succeed the test all have an impact upon its fitness-for-purpose.

To return to Messick’s (1989) distinction between content and criterion validity, we can begin to unpick the extent to which issues with A Level English Literature may not sit simply with the tests themselves, but with the breadth of possible interpretation of their criteria. In fact, a simple evaluation of the content validity of English Literature would be deceptively unproblematic: the test tests what is taught.

What is problematic is the interpretation of the criteria, which is symptomatic of a wider issue with English Literature as a subject: the sheer breadth of textual interpretation and what constitutes a ‘good answer’.


Fitness-for-purpose: Reliability

Definitions of reliability

Harlen (2007) defines reliability as ‘the extent to which the [test] results can be said to be of acceptable consistency or accuracy for a particular use’ (p.20). Ofqual defines reliability in relation to educational assessment as ‘the extent to which a candidate would get the same result if the testing procedure was repeated’ (2013, p.2). It defines reliability as ‘an essential aspect of validity’ (p.4). This is an important acknowledgement that reliability in testing is a part of its validity – it cannot be separated from any consideration of validity; nor can steps to improve the reliability of testing be taken without due consideration of their impact upon validity. Baird and Black (2013) describe the relationship between reliability and validity as ‘a prevailing tension’ (p.4) and warn against the pursuit of a perfect reliability. They argue that public examinations in England ‘have been accused of fostering narrow teaching to the test’ (p.3), echoing the earlier quoted criticism voiced by the 2013 Schools Minister, David Laws. They go on to conclude that the pursuit of perfect reliability ‘may actually worsen the situation’ (p.4).

Wiliam (2001) uses a ‘stage lighting’ metaphor (p.21) to clarify the relationship between reliability and validity. If we illuminate a small area of staging brightly, we get high reliability but complete darkness surrounding the one patch of bright light: we learn a lot about one thing at the expense of the surrounding context. Conversely, if we illuminate the entire stage, we see everything in broad terms but have ‘no clear detail anywhere (low reliability)’ (p.21). If we want both high reliability and high validity, we would need to assess over much longer periods of time. He calculates, for example, that if we wanted to improve the reliability of Key Stage 2 tests ‘we should need to increase the length of the tests in each subject to over 30 hours’ (p.19). There is an acceptable measurement error in pragmatic assessment that can only be reduced by stretching the procedure of a test to impractical limits. This is supported in both Harlen’s (2007) ‘acceptable consistency’ and in Ofqual’s own assertion that an ‘internal consistency level of over 0.85 is normally considered an acceptable level of internal reliability’ (Ofqual, 2013, p.6). In short, a reliability coefficient of 0.85 means that around 85% of the variation in candidates’ observed scores on a single test is taken to reflect variation in their ‘true scores’ (a hypothetical attainment based on their expected score over a protracted number of varying tests), while the remaining 15% is attributable to measurement error.
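The trade-off between reliability and test length can be sketched with two standard results from classical test theory. Neither is quoted in the sources above, so what follows is an illustrative assumption rather than Wiliam’s or Ofqual’s own calculation: reliability is treated as the share of observed-score variance due to true scores, and the Spearman-Brown formula estimates how much longer a test of comparable items must be to reach a target reliability. The starting reliability, target reliability and test length are hypothetical figures chosen only for illustration.

```python
def error_variance_share(reliability: float) -> float:
    """Share of observed-score variance attributable to measurement error."""
    return 1.0 - reliability


def lengthening_factor(current: float, target: float) -> float:
    """Spearman-Brown: factor by which a test must be lengthened
    (with comparable items) to move from `current` to `target` reliability."""
    if not (0 < current < 1 and 0 < target < 1):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    return (target * (1 - current)) / (current * (1 - target))


# Hypothetical figures, not taken from Wiliam (2001) or Ofqual (2013).
current_reliability = 0.80
target_reliability = 0.95
current_hours = 2.0

factor = lengthening_factor(current_reliability, target_reliability)
print(f"Error share at reliability 0.85: {error_variance_share(0.85):.2f}")
print(f"Lengthening factor: {factor:.2f}")
print(f"Required test length: {current_hours * factor:.1f} hours")
```

The point of the sketch is the shape of the relationship: because the required length grows with target / (1 - target), each further gain in reliability costs disproportionately more testing time as the target approaches 1.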

The interplay between validity and reliability, alongside the acceptance of a level of measurement error is crucial to our evaluation of A Level English Literature since its reliability needs to be appreciated within the context of accepted error and pragmatic considerations of what is both possible and desirable in assessment.


Is A Level English Literature a reliable test?

The key measure of a test’s reliability is the extent to which it reflects a ‘true score’ within an acceptable margin of error. Whilst we may not be able to establish a clear link between such a ‘true score’ and an assessed outcome, we can look at the relationship between predicted and actual grades.
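One conventional way to express an ‘acceptable margin of error’ is the standard error of measurement from classical test theory. This is an assumption brought in for illustration, not a procedure described in the sources cited here, and the mark, spread of marks and reliability below are hypothetical. Placing the band around the observed mark is also a common simplification.

```python
import math


def standard_error_of_measurement(score_sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability), in the same units as the marks."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must lie between 0 and 1")
    return score_sd * math.sqrt(1.0 - reliability)


def score_band(observed: float, score_sd: float, reliability: float, z: float = 1.96):
    """Approximate band around an observed mark (z = 1.96 for roughly 95%)."""
    sem = standard_error_of_measurement(score_sd, reliability)
    return observed - z * sem, observed + z * sem


# Hypothetical essay: 28 marks awarded, marks spread with SD 6, reliability 0.85.
low, high = score_band(observed=28, score_sd=6, reliability=0.85)
print(f"SEM: {standard_error_of_measurement(6, 0.85):.2f} marks")
print(f"Approximate band: {low:.1f} to {high:.1f} marks")
```

Under these assumed figures the band spans several marks either side of the awarded mark, which is the kind of accepted error within which any comparison of predicted and actual grades has to be read.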

Williams & Williams (2017) cite Ofqual’s 2016 figures for Enquiries about Results (EaR) – the procedure through which students and schools can request remarking if actual results are felt not to reflect predicted grades – showing that English Literature ‘topped all three tables (AS and A-level and GCSE) for results which were raised on appeal by two grades’ (p.256). This does not tell us that the test is unreliable, but it does indicate, as Williams & Williams argue, that English enjoys a wider ‘outcome space’ than other subjects: its 40-mark essays giving rise to ‘the widest range of possible acceptable candidate responses’ (p.255). Efforts to narrow this outcome space without compromising the validity of the subject have plagued assessment of English since the standardization of its assessment was first provisioned in the Butler Education Act of 1944 (Marshall, 2011, p.13).

Marshall’s (2011) detailed history of the evolution of English assessment in the United Kingdom reveals a testing regime that is held in tension between impression-based and analytical approaches to marking: she relates how ‘English teachers have a tendency to want to consider the whole of a piece rather than look at its constituent parts’ (p.12). Elliott (2017) makes mention of James Britton’s 1964 research into impression marking, in which he concluded that a process of using multiple impression markers would be both more reliable and valid than exam boards’ prevailing use of analytical marking grids, which were deemed to be too restrictive and to give rise to formulaic responses.

The current process for examining A Level English papers is summarized in Elliott’s (2017) study of examiner training. She conducted a discourse analysis of two examiner training meetings and concluded that examiners identify ‘safe and unsafe signals to follow in terms of the characteristics which are supposed to be representative of a given level’ (p.72). Examiners cultivate a bank of ‘imagined representative characteristics’ (p.70) which inform their judgement. Like Marshall before her, Elliott links this ‘representative’ impression of ‘what a good one looks like’ to both Sadler’s (1989) definition of ‘guild knowledge’ and Wiliam’s distinction between content, criterion and construct-referenced assessment (Elliott, 2017, p.62).

In defining guild knowledge, Sadler said: ‘experienced teachers carry with them a history of previous qualitative judgments, and where teachers exchange student work among themselves or collaborate in making assessments, the ability to make sound qualitative judgments constitutes a form of guild knowledge’ (1989, p.126). Therefore, uniformity of judgement in English assessment is dependent upon a shared understanding of what constitutes performance. The reliability of assessment is inextricably linked to teachers building ‘guild knowledge by shared assessment and discussion, and through a set of shared values of what it means to be ‘good’ at the subject’ (Elliott, 2017, p.62).

It is tempting to conclude that the reliability of A Level English Literature is, if not in question, then certainly at risk: the assessment of English Literature has a wide outcome space that might only be navigable by examiners with rich guild knowledge; the graded outcomes of English Literature are the most contested, in terms of Enquiries about Results (EaR); and four different exam boards offer four subtly different versions of the same exam specification.

There are, however, a range of mechanisms in play that seek to mitigate the potential erosion of reliability. The 2011 implementation of ‘comparable outcomes’ by Ofqual sought to ensure that grade boundaries were set in broad alignment with academic performance in prior years and has resulted in discouraging ‘the historical tendency to set grade boundaries a mark lower than is recommended statistically’ (Benton, 2016, p.17). In addition, the aforementioned EaR process could be considered as a constituent in the ‘dependability’ of the overall assessment as understood by Harlen (2007): the very fact that agreement can be sought between a predicted and actual grade could be said to strengthen the reliability of the overall assessment programme.

Recent research into the use of comparative judgement – ‘judging the relative quality of two different sets of work’ (Benton & Elliott, 2016, p.353) – may go some way to reducing the outcome space in English assessment by simply averaging judgement across multiple expert markers.
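How paired judgements might be pooled into a single rank order can be sketched as follows. Comparative judgement data are often analysed with a Bradley-Terry-style model; whether Benton and Elliott use exactly this procedure is not stated here, so the model choice is an assumption, and the scripts and judgements below are invented for illustration.

```python
from collections import defaultdict


def bradley_terry(judgements, iterations=200):
    """Estimate relative script quality from (winner, loser) pairs.

    Uses the standard iterative (minorise-maximise) update for the
    Bradley-Terry model. Assumes every script wins at least once and
    loses at least once; otherwise the estimates are not well defined.
    Returns a dict mapping script -> strength, with strengths summing to 1.
    """
    scripts = sorted({s for pair in judgements for s in pair})
    wins = defaultdict(int)
    meetings = defaultdict(int)
    for winner, loser in judgements:
        wins[winner] += 1
        meetings[frozenset((winner, loser))] += 1

    strength = {s: 1.0 / len(scripts) for s in scripts}
    for _ in range(iterations):
        updated = {}
        for s in scripts:
            denom = sum(
                meetings[frozenset((s, other))] / (strength[s] + strength[other])
                for other in scripts
                if other != s
            )
            updated[s] = wins[s] / denom if denom else strength[s]
        total = sum(updated.values())
        strength = {s: v / total for s, v in updated.items()}
    return strength


# Hypothetical judgements by several markers: (preferred script, other script).
judgements = [
    ("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"),
    ("C", "B"), ("B", "A"), ("A", "C"),
]
strengths = bradley_terry(judgements)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

No marker in this sketch assigns a mark against a grid; the rank order emerges only from many holistic ‘which is better?’ decisions, which is why the quality of the judges’ shared understanding still matters.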

It is tempting to be reassured by the employment of such systematic routines and statistical models, but as Williams and Williams point out, ‘the human aspects of examining keep re-surfacing’ (2017, p.270). The quality of examining is founded upon deep and rich guild knowledge. Unless this guild knowledge is preserved, the mitigating use of any such practices would represent little more than an average of ill-informed judgement constrained by the grades given in the previous year.



To paraphrase Harlen (2007, p.21), A Level English Literature might be deemed ‘dependable’ in that it balances validity and reliability – in terms of providing a snapshot of performance in literary analysis across a range of texts both seen and unseen, in addition to providing opportunities for students to engage in longer-term coursework projects.

However, its reliability rests largely upon guild knowledge: the sum of all possible responses to a literary text shared and evaluated by English teaching professionals. Crucial to the maintenance of its reliability – and by extension its validity – is the maintenance of this community of English teaching professionals.

Any drive to improve the reliability of the marking of English Literature papers, whether technological, procedural or both, should go hand in hand with the preservation and enhancement of the professional community, both generally and more specifically during the process of setting standards and training examiners.

Returning to Pollitt (2012, p.102), measures to enhance the reliability of English assessment can both narrow the tested scope of its domain and erode the networks of its professional community: its validity is there for the losing.




Benton, T. (2016). Comparable Outcomes: Scourge or Scapegoat? Cambridge Assessment Research Report. Cambridge Assessment: Cambridge.

Benton, T. and Elliott, G. (2016). ‘The reliability of setting grade boundaries using comparative judgement’ in Research Papers in Education, 31:3. Routledge: London.

Cushing, I., Giovanelli, M., Snapper, G. (2015). ‘The New English A Levels 2015: a guide to the specs’ in Teaching English, 7. National Association for the Teaching of English (NATE): Sheffield.

DES and Welsh Office (1989). English for Ages 5 to 16 (The Cox Report). HMSO: London. Accessed via:

Department for Education (DfE). (2010). The Importance of Teaching: The Schools White Paper 2010. The Stationery Office: London.

Department for Education (DfE). (2014). GCE AS and A level subject content for English Literature. DFE: London. Accessed via:

Elliott, V. (2017). ‘What does a good one look like? Marking A-level English scripts in relation to others’ in English in Education, 51:1. National Association for the Teaching of English (NATE): Sheffield.

Harlen, W. (2007). Assessment of Learning. Sage: London.

Koretz, D. (2009). ‘What is a Test?’ pp. 16-34 in Measuring Up. Harvard University Press: Cambridge MA.

Long, R. (2017). Briefing Paper: GCSE, AS and A level reform (England). Parliament Commons Library: London. Accessed via:

Marshall, B. (2011). Testing English. Continuum: London.

Messick, S. (1989). ‘Meaning and Values in Test Validation: The Science and Ethics of Assessment’ in Educational Researcher, 18:2. American Educational Research Association: Washington DC.

Newton, P. (2007). ‘Clarifying the purposes of educational assessment’ in Assessment in Education, 14:2. Routledge: London.

Newton, P. (2010). ‘The Multiple Purposes of Assessment’ in International Encyclopedia of Education. Elsevier: Oxford.

Newton, P. (2012). ‘Clarifying the Consensus Definition of Validity’ in Measurement: Interdisciplinary Research & Perspective, 10:1-2. Routledge: London.

Ofqual (2011a). GCE AS and A Level Subject Criteria for English Literature. Office of Qualifications and Examinations Regulation: Coventry.

Ofqual (2011b). Review of Standards in GCE A Level English Literature. Office of Qualifications and Examinations Regulation: Coventry.

Ofqual (2013). Research and analysis: Introduction to the concept of reliability. Office of Qualifications and Examinations Regulation: Coventry. Accessed via:

Pearson (2017). Mark Scheme (Results) Summer 2017 Pearson Edexcel GCE in English Literature (9ET0_1) Paper 1: Drama. Pearson: London. Accessed via:

Pollitt, A. (2012). ‘Validity Cannot Be Created, It Can Only Be Lost’ in Measurement: Interdisciplinary Research & Perspective, 10:1-2. Routledge: London.

Sadler, D. R. (1989). ‘Formative assessment and the design of instructional systems’ in Instructional Science, vol. 18. Kluwer: Dordrecht.

Stobart, G. (2008). Testing Times. Routledge: London.

Wiliam, D. (2001). ‘Reliability, validity, and all that jazz’ in Education 3-13, 29:3. Association for the Study of Primary Education: London.

Williams, Y. and Williams, D. (2017). ‘How accurate can A Level English Literature marking be?’ in English in Education, 51:3. National Association for the Teaching of English (NATE): Sheffield.

