I. WHY ARE STANDARDS IMPORTANT?
II. HOW ARE STANDARDS CREATED IN OTHER JURISDICTIONS?
III. HOW DO SASKATCHEWAN STAKEHOLDERS CREATE STANDARDS?
IV. HOW DO WE DETERMINE THE REASONABLENESS OF A STANDARD AND OF A DECISION?
Trustees, school administrators and teachers are given
responsibility for making fair and reliable decisions in a wide range of
educational endeavours. Standards are part of the process of making
choices that affect the lives of students, teachers, support staff, parents,
and community. Because of their impact, standards and the decisions
made with them must be set consistently, prudently and fairly.
The purposes of this study were threefold: to explore
issues in setting educational standards; to analyze the process by which
stakeholders define provincial standards in literacy; and to describe a legal
and ethical framework for determining the reasonableness of a decision
involving values.
Part I of this report describes the importance of standards. Part II surveys the ways in which standards are created in other parts of North America. Part III outlines the method used in Saskatchewan for setting standards. Part IV explores the notion of reasonableness in administrative decision-making. This document concludes with advice for educators in setting reasonable standards and making trustworthy evaluative judgments.
Standards do not drop from the heavens in tablet
form. Rather they are made by human beings with their feet on earth.
This fact must be kept in mind as questions about standards or expectations
for schools, students, educators and school officials become an issue in
Saskatchewan, across Canada, and throughout North America. The popular
rhetoric of educational reform is increasingly coloured with the terminology
of standards as public policy views shift from an exclusive focus on inputs
into schools -- such as grants allocated, curricula produced, pupil-teacher
ratios, and teacher qualifications -- to consideration of outcomes, such
as student achievement, graduation rates, and drop-out rates.
At one time, standard-setting – the process of defining
points for educational decision-making – was primarily of interest to measurement
specialists and psychometricians. Now on both the national and provincial
levels, educational administrators and policy-makers are engaging both
educators and noneducators in formal exercises to define acceptable and/or
desirable levels of student performance for the school system. As
such, the process of setting educational standards can be seen from a number
of perspectives. Standard-setting can be viewed as a(n):
Standard-setting in the United States
Canadian interest in the standards issue has been
kindled by both the media in this country, and by an often heated public
debate south of the border. In 1989, during his first year in office, President
George Bush, who had pledged to be the "education president" in response
to poor American results from international assessments, summoned the governors
of the fifty states to Charlottesville, Virginia with the aim of elevating
education to the top of the national agenda. There, the chief executives
pledged themselves to six national goals for education, ranging from a
high school graduation rate of 90% to American pre-eminence in the world
in mathematics and science achievement.
Originally dubbed America 2000 and codified into
Goals 2000 by the Clinton administration, the goals have been supported
on both sides of Congress. Legislators have reauthorized the Elementary
and Secondary Education Act which makes federal education funding to the
states contingent on conformity with a national system of standards and
assessments. A National Goals Report was created in 1991, and updated
in 1995, to show state and national progress toward the six national educational
goals, including states' performance on the National Assessment of Educational
Progress. Goals 2000 has also spawned a series of panels and
bodies engaged in defining a variety of educational standards, ranging from
the 1994 National Education Standards and Improvement Council, to a New
Standards Project centred at the University of Pittsburgh, to the National
Board for Professional Teaching Standards, to the National Council for
Accreditation of Teacher Education, to the National Council of Teachers
of Mathematics (Rothman, 1995). President Clinton recently lent new vibrancy to
the movement at a National Education Summit, telling American governors
and business leaders that, "We can only do better with tougher standards
and better assessments, and you should set the standards" (American Educator,
1996, p. 11). At present, forty-eight of the fifty states have
developed, or are in the process of creating, standards (Willis, 1997).
In the accompanying debate, supporters claim that
standards can improve student achievement by clearly defining subject matter
content and specifying desired performance (Taylor, 1994). Explicit
standards lend coherence to the educational system and clarify the work
of teachers, curriculum writers, educational institutions, software designers,
and test experts. Moreover, proponents argue that standards establish
the principle of equality of opportunity and provide "consumer protection"
by supplying accurate information to students and parents (Ravitch, 1995).
Detractors argue that standards are exclusionary and detrimental to the
multicultural character of North American society (Aronowitz, 1996).
Furthermore, they contend that standards undermine local control of education (Gittell, 1996)
and promote further secularization of schooling (Berube, 1996).
Three categories of standards have been identified
in this debate. Content standards describe what teachers are
supposed to teach and students are expected to learn, and include an emphasis
on learning subject matter through critical-thinking and problem-solving
skills. Opportunity-to-learn or delivery standards define
the resources, conditions and desirable processes of learning that the
education system is to provide to ensure equality of opportunity to learn
(Howe, 1994; Porter, 1995). Performance or outcome standards
define degrees of student mastery or attainment considered to be satisfactory.
If content standards relate to the quality of curriculum inputs, and opportunity
standards relate to the processes and conditions in school systems, performance
standards describe inadequate, acceptable or outstanding accomplishment
in student outcomes (Ravitch, 1995; Lewis, 1995). While most effort
has focused on content standards in Saskatchewan and the United States
with the development of curricula, experts believe that performance or
outcome standards will increase in importance because of the high cost
of remediation (Willis, 1997).
Yet there are significant differences between standard-setting
in Canada and the United States, where minimal competency testing became
popular as a prerequisite to high school graduation in the 1970s.
In Canada, standards are increasingly required by policy-makers to define
public expectations for student performance in programs or institutions.
Rather than making "high-stakes" decisions about individual students and
their life chances, such as Grade 12 exit exams, provincial ministries
are setting standards for making judgments about systemic rather than individual
performance. The performance standards, along with scoring rubrics,
results and exemplars of student performance are subsequently held up for
emulation, and not yet generally used in direct application to determine
individual student marks or to make program placement decisions.
In the United States, standards are usually considered
narrowly for a specific testing situation, rather than in terms of a larger-
or longer-term framework. Education indicator systems are in vogue
in Canada as provincial governments address issues of public accountability.
In Saskatchewan, the Provincial Education Indicators Program annually publishes
context, process and outcomes data about the performance of the provincial
education system. In a parallel fashion, on the national level, the
Council of Ministers of Education, Canada is developing a Pan-Canadian
Education Indicators Program as a comprehensive monitoring system.
It will encompass achievement, student flows, satisfaction measures, citizenship
behaviours and a variety of other gauges of school system effectiveness
across the country. Changing standards thus become valuable indicators.
If panelists are drawn from the same constituencies over a series of test
cycles, and if one employs the same instrumentation over these cycles,
then the standards may be conceived as incarnating or embodying a set of
public expectations for student or school performance at given points in
time. In this sense, setting performance standards becomes a sociometric
as well as a psychometric technique.
Likewise, with large-scale assessment programs operating
on recurrent cycles, and standards set with each assessment, new conceptions
of a performance standard are necessary. When many people hear the word
“standard”, they tend to think of something etched in granite. Rather
than remaining a fixed, static and enduring entity, a standard is now an
evolving point of comparison that may or should adjust from one test cycle
to another, as test circumstances, test populations, and test questions
change. Even the gold standard varies. Moreover, panelists
themselves, even when drawn from similar constituencies, may provide varying
judgments as circumstances change. The role of precedent thus becomes
important, not as a way of tempering but of temporally linking judgments
to establish continuity between test cycles.
Although Canadian interest in standards mirrors
that south of the border, its origins are home-grown. In 1987, public
officials were alarmed by a Southam News literacy study that purported
to show poor educational outcomes from Canadian schools. This was
a sensitive point because the Canada-US free trade agreement and globalization
meant that a national economy depended on a highly skilled labour force
rather than on tariff barriers. This explained the Economic Council
of Canada’s swan-song report, A Lot to Learn (1992), which called
for a more coherent education system linking employers, schools and governments
to boost standards and to produce graduates better equipped for a more
competitive work world.
Such was the climate when the CMEC decided in 1992
to conduct annual pan-Canadian assessments to determine 13-year-old and
16-year-old student competencies in basic skills, in both official languages.
The first round of the School Achievement Indicators Program was conducted
in mathematics in 1993, and in reading and writing in 1994. In 1996
the program expanded to encompass science, to parallel an interprovincial
accord to develop a national science curriculum framework. Simultaneously,
the CMEC with Statistics Canada has begun to develop a Pan-Canadian Education
Indicators Program to collect a wider array of information about the performance
of education systems across the country. The first biennial Report
on Education in Canada was released in November 1995, and a
Pan-Canadian Education Indicators Report was released in November 1996,
as forerunners of a national reporting system.
Since then, the CMEC has defined criterion-referenced
performance expectations for the 1996 SAIP Science, 1997 Mathematics, and
1998 Reading and Writing assessments. And it will establish performance
expectations for all future assessments (Council of Ministers of Education,
Canada, 1997). As carried out in each of the 1996, 1997 and 1998
assessments, the exercises have involved approximately 85 educators and
noneducators who were empanelled in one of four regional sessions across
Canada. Those participating have answered the question: "What percentage
of students should achieve each performance level and above?" for those
test components involved in each assessment. The expectations of
individual judges, who were selected from stakeholder groups in every province,
were aggregated and equally weighted to derive a median that has become
the first national performance standard in three key subject areas.
The pan-Canadian standard describes expected performance for Canadian 13-
and 16-year-old students of science and has been used to clarify the work
of Departments of Education across the country.
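The aggregation step described above can be illustrated with a minimal sketch. It assumes that "equally weighted" means each judge's ballot counts once and that there are five performance levels. The ballot figures are hypothetical and are not drawn from any SAIP session.

```python
from statistics import median

# Hypothetical ballots (not SAIP data): each row is one judge's answer to
# "What percentage of students should achieve each performance level and
# above?" for levels 1 to 5 (five levels is an assumption of this sketch).
ballots = [
    [95, 80, 55, 20, 5],
    [90, 75, 50, 25, 5],
    [98, 85, 60, 30, 10],
    [92, 70, 45, 15, 3],
    [96, 78, 52, 22, 6],
]

def median_expectations(ballots):
    """Return the median expectation for each level, one vote per judge."""
    # zip(*ballots) regroups the ballots by level rather than by judge.
    return [median(votes) for votes in zip(*ballots)]

standard = median_expectations(ballots)
for level, pct in enumerate(standard, start=1):
    print(f"Level {level} and above: {pct}% of students")
```

A median is less sensitive than a mean to one unusually high or low ballot, which may be one reason for preferring it when panels are small.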
Standard-setting in Other Provinces
Even though provincial standard-setting exercises
have not been extensively studied in the scholarly literature, pioneering
work in four provinces is described in public documents. British
Columbia's learning assessment program began in 1976, and has consistently
employed "interpretation panels" of teachers to judge grade level student
performance in various dimensions of mathematical, scientific and communications
skill, depending on the assessment. Although the procedure has varied
from one assessment to another, constants have included the exclusive use
of professionals as judges, preparatory recording of expectations as estimates
for provincial performance on individual test items according to "acceptable"
and "desirable" categories, and subsequent formal summary and consensual
judgments of performance according to 4- to 6-point scales ranging from
unsatisfactory to excellent by the empanelled judges.
Alberta’s educational standards are, unlike those of most
other provinces, used for “high-stakes” purposes at multiple grade levels.
Our western neighbour has aimed "to widen the process of setting assessment
standards as much as possible over previous years and especially to provide
for community input and feedback". To that end, five committees have
been struck, as part of the Provincial Achievement Testing Program, to
define two standards in relation to the curriculum being tested.
These committees are composed of curriculum and test developers, educational
administrators, teachers from across the province, psychometricians and
statisticians, as well as representatives from professional, business and
community organizations. Each committee is challenged to determine
what score a student must obtain, or how many questions a student must
answer correctly, to be judged as having achieved an acceptable or an excellent
standard. A summary standard is determined by a Final Standards Review
Committee, using provisional standards, review commentary, and representatives
from the original five committees.
While the British Columbia and Alberta ministries
define standards in relation to large-scale assessment results, the Toronto
Board of Education's Benchmarks Program avoided evaluating its student
population against external standards associated with a testing program.
Rather, more than 100 Benchmarks for language arts and mathematics have
been developed as model activities for teacher emulation in the classroom
setting. Based on provincial and system objectives, developed and
field-tested informally by teacher committees, and emphasizing complex
but observable tasks, the Benchmarks set out performance levels and criteria,
but not standards (Larter, 1991).
Benchmarks differ from standards by the amount of
authority invested in the latter. While benchmarks describe representative
performance for the general purpose of professional guidance, a performance
standard has consequences attached to it as a point of educational or administrative
decision making. The benchmark score on a test may be 65%, but if
test results fall below the standard of 50%, then a student fails or is
assigned to a different program. Because a decision, action or consequence
flows from the user's application of information in relation to a standard,
a greater duty or administrative responsibility attaches to it than to
a benchmark, which serves largely as a point of professional reference.
Standard-setting in Saskatchewan
The Saskatchewan Department of Education has sponsored
several standard-setting sessions since 1993 as part of its large-scale
Curriculum Evaluation and Provincial Learning Assessment Programs.
Standards have been set in three curriculum evaluations for the key learnings
prescribed in new Core Curricula, by representative panels of teachers
following student assessment. In three-round exercises using modifications
of an American method, the teachers have been asked to estimate the percentage
of students who would attain each of five levels of performance, considering
the number of years the curriculum has been implemented, the difficulty
of test questions, and the degree of mastery sought by a curriculum.
By contrast, five Provincial Learning Assessments conducted with reading
and writing in 1994 and 1996, mathematics in 1995 and 1997, and listening
and speaking in 1998, have empanelled both educators and noneducators in
multi-round voting exercises. Panelists have been asked what percentage
of students should be expected to attain three or five performance levels.
Whereas the curriculum standards have been set in relation to learning
objectives in curriculum guides, the standards associated with learning
assessments have been based on broader, foundational objectives for English
language arts and numeracy.
Standard-setting in Saskatchewan stems from the
work of a Minister's Advisory Committee that reviewed high school education
in the province in 1994. It identified five types of standards relating
to student evaluation, and called for an equitable province-wide assessment
process for Grade 12 student outcomes. Criterion-referenced standards
were suggested as the most appropriate type of standard, combined with
a benchmarks system that would identify minimal, acceptable performance
levels. Yet significant dissent was expressed within the committee
when it made recommendations relating to standards and testing. A
business representative called for universal testing to extend beyond the
Grade 12 level, while Aboriginal committee members opposed the use of standardized,
paper-and-pencil tests as incongruent with the diverse school situations
in the province (High School Review Advisory Committee, 1994).
A 1996 symposium sponsored by the Saskatchewan School
Trustees Association amplified these conflicting perspectives, and showcased
the kaleidoscope of opinion in the province about the standards issue (Saskatchewan
School Trustees Association, 1996). An official of the Canadian Federation
of Independent Business asserted that "educational standards are important
to Saskatchewan business to ensure that minimum competencies, understandings
and skills are consistently assured by graduates of our school system as
part of a quality labour force" (p. 5). He was admittedly blunt in
reporting that "business people do not want to do some of the 'most basic
product recall work' on behalf of our educational factories" (p. 5).
Likewise, a trustee speaking on behalf of the Saskatchewan School Trustees
Association advised those in attendance that "we must agree on accountability
measures that will tell us how well students are meeting objectives.
If we don't develop such measures, outside pressures will force them upon
us [...] Standards will help us answer the question, ‘How do we know we
are doing a good job?’" (p. 65). Similarly, a Saskatchewan Department
of Education official asserted that "it is virtually impossible to argue
we shouldn't have standards" (p. 42), but emphasized that opportunity to
learn and content standards are perhaps more important than focusing on
outcomes.
However, many doubts were expressed in the January
forum. Speaking on behalf of the League of Educational Administrators,
Directors and Superintendents, one administrator cautioned that the province
must remain "loyal to standards development that stresses the processes
of learning" as opposed to only product skills (p. 63). For a Saskatchewan
Teachers' Federation representative at the forum, the call for standards
was misguided and contrary to provincial approaches to education.
Likening teacher-student bonds to a farmer's attachment to the land, she
asserted that, "The relationship between teachers and students, at its
best, is a marvellous, even sacred thing. It cannot be captured in
lists of outcomes, in scope-and-sequence charts, in taxonomies of standards,
in rubrics" (p. 60).
In general, both administrators and evaluators must
answer five key questions when designing standard-setting exercises.
These issues may be summarized by the journalistic device of asking who,
what, how, where and when?
The most important question is who should set the
standard? Is an educational standard a professional responsibility,
a bureaucratic creation, or a social construction? Many scholars
(Shepard, 1980; Hambleton & Powell, 1983; Jaeger, 1978) suggest that
standard-setters be drawn from different constituencies, so that the standard-setting
process can systematically represent different value positions and areas
of interest. Yet few specific guidelines have been formulated either
for selecting these panelists, or for meaningfully incorporating educational
stakeholder groups into a standard-setting process. The underlying
issue revolves around the degree to which judges should have expertise
in the subject matter being tested, experience in the curriculum design
and instructional policies that prevail in schools, knowledge of the attributes
of the population being tested, or an understanding of the maturational
possibilities of youth. Likewise, we do not know whether a panel
of classroom teachers will produce more appropriate standards than a mixed
panel of educators and non-educators. Some suggest a standard produced
by a blue-ribbon panel may be more credible than that produced by an anonymous
jury. Others argue that the panel should consist of those who have a stake
in the decisions that result from the standard that is defined, and not
only those who understand student competence or potential.
Second, what is the nature of the standard?
Should it represent a short-term target, an ideal, or a realistic estimate
in terms of the current range of student skill or ability? The nature
of the standard is determined by the wording of the question that standard-setters
answer. In American judgmental processes for establishing minimal
competencies, panelists are asked to estimate the percentages of
students who "would" answer a test question correctly, as in Angoff's method,
or the percentages of students who "should" answer a question correctly,
as in Jaeger's method. A "would" question produces a realistic standard
that defines anticipated student achievement in light of the evidence which
has been assembled. A "should" question, on the other hand, asks
for a formulation of student potential in optimal circumstances, and thus
produces an idealistic standard. Rather than anticipating performance,
a "should" question may ask panelists to provide aspirations rather than
estimates. Originally, the term "desire" meant "to expect from the
stars." As a target to aim for, the "should" standard may become
unattainable. Groucho Marx stated this problem succinctly when he
quipped, “I have my standards, and some day I hope to live up to them”.
Of course, the standard should reflect test purposes: a "would" question
may be more appropriate for public accountability purposes, whereas a "should"
question may be suitable for the purposes of program improvement.
A "would" question yields a descriptive threshold of acceptability, whereas
the "should" question produces a prescriptive statement to suggest needed
improvements.
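A "would" question of the Angoff type can be turned into a passing score in more than one way, and the report does not specify the arithmetic. The sketch below follows the common textbook form: the estimates refer to borderline (minimally competent) students, each judge's item estimates are summed into an expected score, and the judges' expected scores are averaged. These details and all ratings are assumptions made for illustration.

```python
from statistics import mean

# Hypothetical ratings: ratings[j][i] is judge j's estimate of the percentage
# of borderline students who "would" answer item i correctly.
ratings = [
    [70, 55, 80, 40],
    [65, 60, 85, 35],
    [75, 50, 90, 45],
]

def angoff_cut_score(ratings):
    """Average the judges' expected raw scores for a borderline student."""
    # A judge's expected score is the sum of that judge's item probabilities.
    expected_scores = [sum(judge) / 100 for judge in ratings]
    return mean(expected_scores)

cut = angoff_cut_score(ratings)
print(f"Cut score: {cut:.2f} out of {len(ratings[0])} items")
```

Substituting "should" estimates for "would" estimates leaves the arithmetic unchanged; only the meaning of the resulting threshold shifts from descriptive to prescriptive.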
A third issue revolves around the question of how
we ensure that standard-setters reflect society’s and educators’ expectations?
This question of generalizability revolves around the size of the panel.
Theoretically, the audience for a public accountability report in Saskatchewan
would include almost the entire adult population of over seven hundred
thousand people. A statistically generalizable sample of panelists, with
acceptable rates of error, would number approximately one thousand in Saskatchewan.
Yet practical considerations of cost and coordination necessitate smaller
panels. In fact, standard-setters in Canada may better be described
as “jurors” rather than “judges”, the term used in American psychometric
literature. The label of “judge” suggests specialized expertise and
advanced professional preparation in an academic discipline, whereas the
public administrative standard-setting in Canada draws on the more general,
lay qualities of common sense, ability to approach evidence in an unbiased
manner, and good judgment sought in a typical court room juror. A
better comparison is with the jury of twelve people drawn from a variety
of walks of life, and without legal training, used in the legal system.
If impartial nonexperts are deemed acceptable for making “high-stakes”
rulings in criminal and civil actions, then a panel of nonexperts should
by analogy be sufficient to represent the informed, “low-stakes” judgments
of citizens as part of a program evaluation.
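The figure of approximately one thousand panelists is consistent with a standard sample-size calculation for estimating a proportion. The sketch below assumes a 95% confidence level, a margin of error of three percentage points, and maximum variability (p = 0.5). These parameters are illustrative assumptions and are not stated in the study.

```python
import math

def sample_size(population, margin=0.03, z=1.96, p=0.5):
    """Sample size for estimating a proportion in a finite population.

    margin, z and p are illustrative assumptions: a +/-3 point margin of
    error, a 95% confidence level, and maximum variability.
    """
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

# Adult population of roughly seven hundred thousand, as cited in the text.
n = sample_size(700_000)
print(f"Panelists required: {n}")
```

Under these assumptions the result is a little over one thousand. Because the finite population correction is negligible at this scale, the required sample would be much the same for a province several times larger.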
The fourth and related issue concerns the wherewithal
for bringing diverse viewpoints together to yield a trustworthy standard.
A number of procedures have been developed in the United States, all of
which involve groups of experts making judgments about test items individually
or as groups, or about the competencies of examination candidates, to define
a passing score. All procedures aim to foster deliberative reflection among
panelists over several rounds of voting, and to eventually produce agreement
among them. Yet consistency is not the same thing as consensus. There
may be degrees of engagement within a consensual decision, ranging from
apathy to acquiescence to consent to consensus to commitment. We
have all sat on committees where people’s enthusiasm for a decision varies
dramatically. In other words, there is an affective element that
may mean that there is a meeting of minds about a decision, but not a wedding
of wills. As such, consensus means not only dissolving contradictory
views on acceptable student performance, but also extinguishing individual
positions and fostering group resolve. Extensive and careful preparatory
training of judges, provision of extensive evidence about the test and
the typical performances of students, statistical averaging or calculating
medians and ranges of ratings, and even exclusion of the erratic panelist,
have been recommended by many scholars as ways of ensuring uniform, reliable,
informed judgments.
A fifth issue is, when should the standard be set?
On first impulse, many would respond that the standard should be defined
before the test. Surely, they would say, it is a principle of fair
evaluation that those being evaluated should know ahead of time what the
standard is. Similarly, people often say that panelists’ judgments
should not be influenced by knowledge of test results, out of concern that
their expectations will be lowered or elevated because they will know how
students actually performed.
Yet virtually all scholars recommend setting a standard
after the test has been administered and the scores obtained, for three
reasons. First, we should not confuse the standard with the scale
used for marking student work. Fair evaluation means that students
should know ahead of time the criteria and rules for making judgments,
but decisions about the values assigned to information should wait until
after all data is collected using the scale. Second, panelists must
have the full range of information about test circumstances available to
make fair and fully informed decisions. A standard-setter in Manitoba
or Quebec needs to know if province-wide flooding or an ice storm may have
affected the learning or test performance of students in schools.
And third, information about how students actually performed must be considered
by panelists to make a fair judgment. Figure-skating judges do not
rate skaters before they’ve stepped on the ice, nor do courtroom jurors
render a decision before the plaintiff has become entangled in a dispute.
Thus, totally unrealistic judgments are avoided because panelists have
all the information before them. In some instances, statisticians
have had to adjust the standards afterward as a “compensatory technique”
because the judgments seemed unreasonable both to educators and to those
public officials who must assume responsibility for the standard.
The research project, a quantitative and descriptive
case study, addressed the problem: how reasonable are panelists' decisions
when setting criterion-referenced performance standards? The study
analyzed the evidence or reasons standard-setters offered when making judgments
about the quality of student outcomes. It explored collaborative
decision-making and educational standard-setting for reading and writing
outcomes in Saskatchewan. Trustees’, teachers’, business people’s, curriculum
writers’ and administrators’ views on the determinants of educational quality
were investigated as part of the project.
Standard setting is the second-last phase, before
report-writing, in the Provincial Learning Assessment Program. The
low-stakes, random sample testing program is designed to provide reliable
information to the Department of Education, and to the general public about
Grades 5, 8, and 11 student skills in reading and writing. The Learning
Assessment Program's purposes are: to address issues of public accountability;
to provide data for program improvement; to enhance the skills of educators
in student evaluation; and to determine student achievement at two-year
intervals so that a time series of student proficiency in "basic" and "higher
order" skills can be assembled.
Who set provincial literacy standards in 1996?
In fall 1996, twenty-five panelists were nominated by the SSTA, the STF, the Chamber of Commerce, LEADS and the Department of Education’s Curriculum and Instruction Branch to set standards for the 1996 Provincial Language Arts Learning Assessment. The 13 STF representatives included seven classroom teachers, two principals, two vice-principals, and two central office program consultants. One of the three trustees had experience as a classroom teacher. The three Saskatchewan Education positions were filled by two language arts curriculum writers under secondment from classroom teaching duties, with one curriculum writer doing double duty on the Grade 5 and 8 panels. LEADS appointed an assistant director, a director and a superintendent of instruction. The Chamber of Commerce delegates were a personnel officer from a crown corporation, a retired manager from a government department who was currently managing a Chamber office, and a former engineer who was currently operating a consulting firm. Of the 25 panelists in total, 16 were female and 9 were male. In terms of parental status, 12 had school-age children and 11 did not; two did not indicate whether they had children.
How are standards set in Saskatchewan?
Actual performance standards were developed in a
three-stage multi-round voting process, repeated consecutively for each
skill domain of writing and reading under review. In the first stage,
the Department facilitator reviewed the scoring criteria used for student
performance, described actual scoring procedures, and provided examples
of student work which illustrated each scale point used for categorizing
student achievement. Judges were asked both to describe in their
own words the student skill under review, and to rewrite the performance
descriptions found in the 5-point scoring rubric for the audiences of the
Saskatchewan Education Indicators Program – the general public and public
officials – in two to three sentences each. This activity served
simultaneously as a means for having judges learn about the five-point
scale used for scoring student work, as a way of stimulating and addressing
questions about the scoring of student work, and as a source of useful
terminology for describing student performance in subsequent report-writing.
The rewritten performance descriptions were discussed collectively.
Panelists were explicitly advised that student performance falling at the
first and second levels was deemed "unacceptable". Thus, level 3
performance was pre-defined as minimally acceptable performance.
Panelists were then asked, "In this skill area,
what percentage of the regular stream school population should attain each
performance level?" Without consulting others, each panel member
was invited to privately write down on the ballot form his or her preliminary
estimates of proportions of students who should attain each of the five
levels. Ballot forms were collected, and a mean distribution was
calculated by a psychometrician using a laptop computer. The provisional
distribution was visually displayed for all panelists on a liquid crystal
monitor in the form of a vertical bar graph, along with the upper and lower
estimates that had been offered by panelists. The psychometrician
verbally described the panel's provisional standards, and focused the group's
attention on those estimates that were most divergent for each level of
performance.
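The tabulation in this first round can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the three sample ballots are hypothetical, and the report does not say what software the psychometrician used. The sketch averages the ballots for each of the five levels, records the lowest and highest estimate offered, and flags the level where panelists were furthest apart.

```python
from statistics import mean

LEVELS = (1, 2, 3, 4, 5)  # the five performance levels of the scoring rubric


def summarize_ballots(ballots):
    """Return the mean, lowest and highest estimate for each level.

    Each ballot is a dict mapping level -> percentage of students who
    should attain that level; a ballot must total 100 percent.
    """
    for ballot in ballots:
        total = sum(ballot[level] for level in LEVELS)
        if abs(total - 100) > 0.01:
            raise ValueError(f"ballot totals {total}, expected 100")
    summary = {}
    for level in LEVELS:
        estimates = [ballot[level] for ballot in ballots]
        summary[level] = {
            "mean": mean(estimates),
            "low": min(estimates),
            "high": max(estimates),
        }
    return summary


def most_divergent_level(summary):
    """Level with the widest gap between lowest and highest estimates."""
    return max(summary, key=lambda lv: summary[lv]["high"] - summary[lv]["low"])


# Hypothetical ballots from three panelists (not data from the study).
ballots = [
    {1: 5, 2: 10, 3: 40, 4: 30, 5: 15},
    {1: 0, 2: 15, 3: 30, 4: 35, 5: 20},
    {1: 10, 2: 10, 3: 50, 4: 20, 5: 10},
]
summary = summarize_ballots(ballots)
for level in LEVELS:
    print(level, summary[level])
print("Most divergent level:", most_divergent_level(summary))
```

The same routine would apply unchanged to the revised ballots collected in the later rounds, since each round ends with the same averaging step.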
In the second stage, panelists were invited to individually
and orally provide comments on the preliminary mean distribution, and to
reveal their individual estimates if they wished. Standard-setters
were asked to focus on the nature or complexity of the task or test questions,
the criteria used for scoring student work, the examples of student work
presented, and attributes of the school population. Standard-setters
were also invited to comment on other factors that they deemed important
considerations when appraising provincial student performance. Once
every panelist had spoken, a short group discussion was conducted to allow
additional viewpoints to be expressed. Members were then given the
opportunity to privately revise their preliminary estimates in light of
the insights and comments generated by the panel. The revised estimates
were written down on a ballot form, collected, and averaged to produce
a revised mean distribution.
In contrast to the first two "blind" rounds, the
third was an informed review: actual student results were provided.
This was accomplished by graphically displaying to panelists the revised
provisional distribution of estimates, and the upper and lower estimates
for each of the five performance levels, alongside actual provincial results
in parallel vertical bar graphs. The psychometrician provided a verbal
description of the provisional mean standards and actual student achievement,
focusing group attention on the upper and lower range estimates.
Then the Department facilitator again invited committee members in turn
to comment on the panel's revised mean distribution, and to participate
in a short group discussion. Having heard everyone speak, panelists
were allowed another opportunity to privately revise their estimates in
light of the comments made and the actual results presented. Ballots
were collected, and a mean distribution of the panel's expectations was
calculated to produce a provincial performance standard for the reading
or writing skill domain under consideration.
What is the Nature of a Saskatchewan Learning Assessment Standard?
Provincial literacy standards were set for four domains
of writing skills and four types of reading skills for each of the three
grades involved in the literacy assessment, using a variation of the Angoff
(1971) method. Three modifications were made to Angoff’s method.
First, the question was modified from “What percentage of students
would answer…” to “What percentage of students should attain…” so as to
yield desired rather than simply anticipated performance. Second,
panelists were asked to identify the percentage of students who should
attain each of 5 performance levels, rather than define a single level
of competence or incompetence. Developing a range of expectations
provides more sophisticated information for educators than setting a minimal
competence standard. Third, panelists provided global estimates for
each of the literacy domains adjudicated, rather than test item-by-item
ratings to be aggregated. The focus was on the five performance levels
of criteria, rather than on individual test questions.
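For contrast, the item-by-item aggregation that the Saskatchewan panels did not use can be sketched as follows. This is a minimal illustration of the conventional Angoff-style calculation, in which each judge estimates, item by item, the proportion of students who would answer correctly, and the judges' summed ratings are averaged. The function name and the sample ratings are hypothetical.

```python
from statistics import mean


def item_by_item_standard(ratings_by_judge):
    """Aggregate item-level ratings into one expected score.

    ratings_by_judge holds one list per judge; each list gives that
    judge's estimated proportion (0 to 1) of students answering each
    test item correctly. Every judge must rate the same items.
    """
    n_items = len(ratings_by_judge[0])
    if any(len(ratings) != n_items for ratings in ratings_by_judge):
        raise ValueError("every judge must rate every item")
    judge_totals = [sum(ratings) for ratings in ratings_by_judge]
    return mean(judge_totals)


# Hypothetical ratings: two judges, three test items.
ratings = [
    [0.5, 0.75, 0.25],
    [0.5, 0.5, 0.5],
]
print(item_by_item_standard(ratings))  # expected score out of 3 items
```

Under the third modification described above, panelists skipped this item-level step and gave one global percentage for each performance level within a domain.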
To investigate the reasoning patterns within the
Standards Committee, panelists were asked to identify the types of evidence
that influenced their thinking in reaching decisions. A 22-item,
Likert-type scaled questionnaire was used. Evidence was categorized
as direct, contextual or preconceptual, categories that conform roughly
to the legal concepts of direct, circumstantial and hearsay evidence. Ratings
were collected, tabulated and analyzed using multidimensional scaling techniques
to ascertain the panel’s perception of evidential relevance, and to determine
whether there were specific gender, parental, occupational or stakeholder
patterns in reasoning. The positions of panelists were mapped in
two-dimensional space. At the same time, actual voting patterns across
the exercise were analyzed to ascertain whether the Standards Committee
adopted a consensual and collaborative approach in its choices.
The study found that structured practice enabled
panelists to anchor their decisions more squarely in direct assessment
evidence. Equal emphasis was placed on preconceptual sources of information,
but decreasing emphasis was placed on contextual evidence. Preparatory
training, however, did not enable panelists to adopt a shared perspective
on the pertinence of the body of evidence available for their decision-making.
Panelists based their decisions on evidence that
related to their personal and professional knowledge and experience with
youth, and with literacy processes. Decisions were also substantively
grounded in actual achievement results and most types of direct assessment
information. Panelists did not base their decisions on information
derived from the broad social context. Standards Committee members perceived
the evidence as directly related to the large-scale assessment, as related
to personal and professional experience and knowledge, and as contextual
or societal in its origins (See Appendix A). Panelists’ occupation
and stakeholder affiliation were not related to evidential preferences;
nor was parental status or gender (See Appendix B).
Stakeholder judgments converged when defining standards
for overall writing ability, but diverged when adjudicating reading comprehension.
However, stakeholders adopted a shared outlook on that evidence which was
pertinent in decision making about provincial literacy standards.
After reviewing the voting patterns of panelists, the types of evidence
considered as relevant during the exercise, and the positions of panelists
when setting standards, the study concluded that the educational standards
set for the 1996 Learning Assessment appeared reasonable.
The accompanying graph illustrates Grade 8 results for the
1996 Reading and Writing Assessment. Horizontal bars show test outcomes
for the various dimensions of student performance assessed, in terms of
the percentage of the provincial student population who reached acceptable
levels of achievement. The margin of statistical error, stemming from possible
measurement and sampling error, shows the interval in which we can have
confidence in the results. The triangles illustrate the Standards
Committee’s work, which defined desirable provincial student performance.
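The comparison that the graph draws between actual results, the margin of error, and the Committee's standard can be sketched in Python. The sketch assumes that "acceptable" means levels 3 through 5, following the earlier statement that levels 1 and 2 were deemed unacceptable. The function names, the three-way labels and all numbers are hypothetical illustrations, not results from the 1996 assessment.

```python
ACCEPTABLE_LEVELS = (3, 4, 5)  # assumption: levels 1 and 2 are unacceptable


def percent_acceptable(distribution):
    """Percentage of students at level 3 or above.

    distribution maps performance level -> percentage of students.
    """
    return sum(distribution[level] for level in ACCEPTABLE_LEVELS)


def compare_to_standard(actual_pct, margin, standard_pct):
    """Place the standard relative to the interval actual +/- margin."""
    low, high = actual_pct - margin, actual_pct + margin
    if standard_pct > high:
        return "below standard"
    if standard_pct < low:
        return "above standard"
    return "within margin of standard"


# Hypothetical figures for one skill domain.
actual = {1: 5, 2: 15, 3: 40, 4: 30, 5: 10}      # actual results
expected = {1: 2, 2: 8, 3: 40, 4: 35, 5: 15}     # panel's standard

actual_pct = percent_acceptable(actual)          # bar in the graph
standard_pct = percent_acceptable(expected)      # triangle in the graph
print(actual_pct, standard_pct,
      compare_to_standard(actual_pct, margin=3, standard_pct=standard_pct))
```

Treating a standard that falls inside the error interval as met is a design choice of this sketch; the report itself does not specify how ties with the margin of error were read.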

The central question, then, is whether the triangles
in the graph are fairly and appropriately placed to signify provincial
literacy standards for student performance. Of course, readers will
have different answers to this question of reasonableness in expected performance
– depending on their values, their experiences, and their views of the
education system’s quality. Because of likely divergence in views,
the credibility of a standard must rest on something other than the perspectives
of the bystander. Appeal courts in the Canadian justice system do
not consider the opinions of ill-informed observers, but rather examine
the ways in which lower courts have reasoned with evidence to make their decisions.
In other words, the types of information considered by panelists, and their
reasoning with that information, are key to understanding whether a standard
is sound. A reasonable standard is one that has been carefully
grounded in relevant evidence.
Before answering the question of reasonableness
in decision-making, it is useful to consider alternate notions of reasonableness
as discussed in ethics – the philosophical study of concepts of the desirable.
The most formal definition is grounded in logic. The presumption
is that one can deductively draw conclusions from evidence which will be
compelling or true across all circumstances. Reasonableness is tied
to the rules of logic and strict rules of evidence. In this definition,
a reasonable decision about educational standards would focus on the inferential
link between direct test evidence and the articulated performance standard.
It would consider neither the contradictory value positions of panelists
nor the social interactions involved in creating the standard.
More inclusive and flexible notions of reasonableness
respect diversity of opinion (Burbules, 1995). Reasonableness is
not found in the quality of the logic, but rather in such human virtues
as willingness to compromise, consideration of the context, and processes
of deliberation, reflection, discussion and change. It is in how
and when persons change their minds that their reasonableness manifests
itself. Reasonableness is a socially constructed notion which includes
approaching problems with an open mind and sensitivity in a pluralistic
society, and a willingness and a capacity to adapt to alternative positions.
In these lights, the extent to which standard-setters consider varying
cultural perspectives, attend to and accommodate co-panelists’ views, and
make prudent and moderate adjustments in light of a variety of discordant
evidence, would be criteria for determining the reasonableness of an educational
standard. Procedural fairness ranks high in this notion of reasonableness.
A third view suggests that reasonableness does not
stem from a given social context, but rather from our ideals or beliefs about
the purposes of education. For Siegel (1988), a reasonable judgment
must appeal to something other than the process through which the judgment
was reached. In his view, a reasonable standard would be substantively
grounded in the decision-maker’s vision for education, in curriculum objectives,
in knowledge of literacy processes, in test purposes and consequences,
and only secondarily in actual test material. It is not the decision-maker’s
disposition to adapt, nor her or his repositioning in light of changing
evidence, but the participant’s predispositions which are central.
Reviewers may also find legal notions of reasonableness
useful when considering the decisions of a committee or board. Certainly,
the “low-stakes” purposes of the Provincial Learning Assessment Program
and the Standards Committee’s legal status as an advisory body, not an
administrative tribunal or quasi-judicial authority, make it unlikely that
its decisions would be reviewed by a superior court. The Minister
can choose to endorse or disregard the advice tendered. Yet judicial
reviews of decision making by tribunals and other administrative agencies
have defined a largely negative criterion of reasonableness. The idea
of a “patently unreasonable interpretation” has taken shape in a series
of Supreme Court of Canada decisions, originating with Canadian Union
of Public Employees Local 963 v. New Brunswick Liquor Corporation (1979).
Courts will not intervene in the decisions of administrative bodies unless
they are “patently unreasonable.” Justice Dickson illustrated what
he meant by this with examples from Service Employees’ International
Union, Local No.333 v. Nipawin District Staff Nurses Association, (1975):
The ultimate audiences for provincially-defined performance
standards from the Provincial Learning Assessment Program are principals
and teachers. Section 175 (k) of the Education Act invests the principal
with the responsibility for determining school level standards; her or
his duties include establishing, in consultation with the teaching staff,
"the procedures and standards to be applied in evaluation of the progress
of pupils." Thus, it is clearly the responsibility of the school administrator,
in concert with her or his staff, to determine educational standards for
determining whether individual students pass or fail in “high stakes” situations.
Provincial standards are for educators’ professional consideration and
guidance in their work.
Nevertheless, central office administrators and
school boards are given responsibility for making decisions in a wide range
of educational endeavour. Those decisions affect the lives of students,
teachers, support staff, parents, and community members. Legal and
ethical notions of reasonableness are important because educational and
administrative decisions involve various forms of standards. Standards
are, in essence, value-laden choices about what we deem to be desirable
or undesirable. We cannot avoid those choices, but we can ensure
that they are prudently made. Thus, educators and trustees might
consider the following seven points when questions about standards and
decisions arise:
American Educator (1996). Presidential address to the national education summit, 20 (1), 8-12.
Angoff, W.H. (1971). Scales, norms and equivalent scores. In R.L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508-600). Washington, DC: American Council on Education.
Aronowitz, S. (1996). National standards would not change our cultural capital. The Clearinghouse, 69 (3), 144-147.
Berk, R.A. (1986). A consumer's guide to setting performance standards on criterion-referenced tests. Review of Educational Research, 56 (1), 137-172.
Berube, M. (1996). The politics of national standards. The Clearinghouse, 69 (3), 151-153.
Bourque, M.L. & Hambleton, R.K. (1993). Setting performance standards on the national assessment of educational progress. Measurement and Evaluation in Counselling and Development, 4 (26), 41-48.
Burbules, N. (1993). Rethinking rationality: On learning to be reasonable. Proceedings of the forty-ninth annual meeting of the Philosophy of Education Society. New Orleans, LA.
Burbules, N. (1995). Reasonable doubt: Toward a postmodern defense of reason as an educational aim. In Wendy Kohli (Ed.), Critical conversations in philosophy of education (pp. 82-102). New York: Routledge.
Canadian Union of Public Employees Local 963 v. New Brunswick Liquor Corp., (1979) 2 S.C.R. 227.
Cizek, G.J. (1996). Standard-setting guidelines. Educational Measurement: Issues and Practice, 15(1), 13-21.
Council of Ministers of Education, Canada. (1997). School achievement indicators program: 1996 Report on science assessment. Toronto: Author.
Council of Ministers of Education, Canada. (1997). 1996 SAIP science assessment: Pan-Canadian expectations-setting sessions. Toronto: Author.
Deutsch, M. (1975). Equity, equality and need: What determines which value will be used as the basis of distributive justice? Journal of Social Issues, 31 (3), 137-149.
Economic Council of Canada. (1992). A lot to learn: Education and training in Canada. Ottawa, ON: Supply and Services Canada.
Eisner, E. N. (1995). Standards for American schools: Help or hindrance. Phi Delta Kappan, 76 (10), 758-764.
Gittell, M. (1996). National standards threaten local vitality. The Clearinghouse, 69 (3), 148-150.
Glass, G.V. (1978). Standards and criteria. Journal of Educational Measurement, 15, 237-261.
Gunn, L.D. (1982). Debra P. v. Turlington: Due process enters the classroom, but how far? Journal of Law and Education, 11 (4), 573-585.
Hambleton, R.K. & Powell, S. (1983). A framework for viewing the process of standard-setting. Evaluation & the Health Professions, 6 (1), 3-24.
Hambleton, R.K. & Eignor, D. (1978a). A practitioner's guide to criterion-referenced test development, validation, and test score usage. Laboratory of Psychometric and Evaluative Research Report No. 70. Amherst, MA: University of Massachusetts.
High School Review Advisory Committee. (1994). Final report. Regina, SK: Saskatchewan Education, Training and Employment.
Howe, K. (1994). Standards, assessment, and equality of educational opportunity. Educational Researcher, 23 (18), 27-33.
Hunter, D. & Gambell, T. (1996). Setting standards for a provincial literacy assessment: Premises and procedures. McGill Journal of Education, 31(2), 195-214.
Jaeger, R.M. (1989). Certification of student competence. In R.L. Linn (Ed.), Educational measurement (pp. 485-514). London: Collier-Macmillan.
Jaeger, R.M. (1991). Selection of judges for standard-setting. Educational Measurement: Issues and Practice, 10 (2), 3-10.
Jaeger, R.M. (1995). Setting standards for complex performances: an iterative, judgmental policy-capturing strategy. Educational Measurement: Issues and Practice, 14 (4), 16-20.
Jones, R. & Hunter, D. (1996). Setting achievement standards/expectations for large-scale student assessments. The Canadian Journal of Program Evaluation, 11(1), 35-61.
Kane, M. (1994). Validating the performance standards associated with passing scores. Review of Educational Research, 64 (3), 425-461.
Larter, S. (1991). Benchmarks: The development of a new approach to student evaluation. Toronto: Toronto Board of Education.
Lewis, A.C. (1995). Overview of the standards movement. Phi Delta Kappan, 76 (10), 744-750.
Livingston, S.A. & Zieky, M.J. (1982). Passing scores: A manual for setting standards of performance on educational and occupational tests. Princeton, NJ: Educational Testing Service.
Logar, A. (1984). Minimum competency testing in schools: Legislative action and judicial review. Journal of Law and Education, 13 (1), 35-49.
Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). Washington, DC: The American Council on Education and the National Council on Measurement in Education.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23 (2), 13-23.
Messick, S. (1995). Standards of validity and the validity of standards in performance assessment. Educational Measurement: Issues and Practice, 14 (4), 5-8.
Norcini, J.J., Shea, J. & Kanya, D.T. (1988). The effect of various factors on standard-setting. Journal of Educational Measurement, 25 (1), 57-65.
Popham, W.J. (1978a). Criterion-referenced measurement. Englewood Cliffs, NJ: Prentice-Hall.
Popham, W.J. (1978b). Setting performance standards. Los Angeles, CA: Instructional Objectives Exchange.
Porter, A. (1995). Uses and misuses of opportunity-to-learn standards. Educational Researcher, 24 (1), 21-27.
Principles for fair student assessment practices for education in Canada. (1993). Edmonton, Alberta: Joint Advisory Committee.
Ravitch, D. (1995). National standards in American education. Washington, DC: Brookings Institution.
Rothman, R. (1995). Measuring up: standards, assessment and school reform. San Francisco: Jossey Bass.
Saskatchewan School Trustees Association. (1996). Setting standards in education: Saskatchewan Standards Symposium. SSTA Research Centre Report # 96-02.
Service Employees International Union, Local No. 333 v. Nipawin District Staff Nurses Association, (1975) 1 S.C.R. 382.
Shepard, L. (1980). Standard-setting issues and methods. Applied Psychological Measurement, 4 (3), 447-467.
Siegel, H. (1988). Educating reason: Rationality, critical thinking, and education. New York: Routledge.
Siegel, H. (1992). Two perspectives on reason as an educational aim: The rationality of reasonableness. Proceedings of the forty-seventh annual meeting of the Philosophy of Education Society. Normal, IL.
U.E.S., Local 298 v. Bibeault, (1988) 2 S.C.R. 1048.
Willis, S. (1997). National standards: Where do they stand. Education Update: Association for Supervision and Curriculum Development, 39(2), 1-8.
Panel: Saskatchewan Reading and Writing Learning Assessment
| Evidential Type | | | | | | |
| Personal Experience | | | | | | |
| Professional Experience | | | | | | |
| Vision for Education | | | | | | |
| Test Questions/Tasks | | | | | | |
| Assessment Procedures | | | | | | |
| Scoring Procedures | | | | | | |
| Co-Panelists' Views | | | | | | |
| Item Difficulty Statistics | | | | | | |
| Examples of Student Work | | | | | | |
| Test Results | | | | | | |
| Organizational Standards | | | | | | |
| Precedent | | | | | | |
| General Reports on System | | | | | | |
| Friends and Colleagues | | | | | | |
| Media Reports | | | | | | |
| Personal Sense of Performance | | | | | | |
| Curriculum Objectives | | | | | | |
| Curriculum Implementation | | | | | | |
| Student Population Descriptions | | | | | | |
| Knowledge of Literacy Processes | | | | | | |
| Varying Cultural Perspectives | | | | | | |
| Test Purposes and Consequences | | | | | | |
| N | | | | | | |

