How Good Tests Are Made

Anyone can throw together a test in an afternoon—people do it all the time. However, such tests do not make good decisions.

Developing tests that make dependable and accurate decisions about test taker abilities is a difficult undertaking. It takes time, a lot of money and, most of all, a lot of expertise.

In this section, we describe the process in a number of steps. These are:

  • Clarify the purpose of the test
  • Define the construct
  • Design the test
  • Create the items (or tasks)
  • Pilot every item (or task)
  • Select the measurement model
  • Create the IRT scale
  • Evaluate the items
  • Assemble the test forms
  • Create a reporting scale and equate the forms
  • Set performance standards
  • Write up documentation
  • Field test and validate the system
  • Ongoing test development, review and evaluation

Step One: Clarify the Purpose of the Test

We start by deciding what decisions need to be made based on the test results. For example: has the test taker successfully completed a particular course of study? Does the test taker have enough English to successfully undertake an undergraduate course of study at an English-speaking university? Does the test taker have enough French to live independently in France?

These are the decisions that the test needs to make. The quality of a test is determined by the extent to which the test makes ‘appropriate’ or ‘right’ decisions.

Step Two: Define the Construct

The construct is what the test measures: the particular knowledge, skill, or ability that the test must measure to enable it to make the right decisions.

There are many ways to determine this. In some cases it is clear what the test needs to measure. In other cases it may be necessary to carry out a research study to find out. This is called a Needs Analysis, or sometimes a Job Analysis. The researcher looks specifically at what the test taker should do. Typically researchers use:

  • Observations
  • Questionnaires
  • Interviews
  • Discussions with experts
  • Etc.

A Needs Analysis is a serious research project, and may be a big undertaking to do properly.

Step Three: Design the Test

This is a creative process in which all the previous information is brought together into one design. The design of the test is a document called the Test Specification. In effect, this is an operational definition of the construct. The Table of Specifications will describe how many sections are on the test, how many items are in each section, and what the items are like. There is usually also a set of Item (or Task) Specifications that lay out in detail what each item should do, what information it should target, how it should be written, and so forth.

Step Four: Create the Items (or Tasks)

This is a high-level skill that takes many years to learn well, and is better done by professionals. If the domain of knowledge is very specialized, the item writers will work closely with content experts. Test items are very complex, and item writers cannot always imagine the many ways test takers can respond to their items. So after they have been written, items need to go through a number of review stages. They are generally reviewed:

  • To ensure that they are skillfully written
  • To ensure the content is appropriate
  • To identify any bias or sensitivity issues

Since some items will be lost during the piloting stage, it is customary to write more items than are necessary. The number of additional items depends on the importance of the test. For important tests it is advisable to write 50% to 100% more than will be required for the final test forms. For less important tests, 25% may be enough. Less experienced item writers will lose more items in piloting than more experienced writers.
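As a rough worked example of the overage figures above, the sketch below computes how many items to commission for a given final item count. The function name and the pool size of 120 items are illustrative assumptions, not part of any standard procedure.

```python
def items_to_write(items_needed: int, overage_percent: int) -> int:
    """Return the number of items to commission, rounding the extra items up.

    overage_percent is the extra share to write, e.g. 50 for 50% more.
    """
    if items_needed < 0 or overage_percent < 0:
        raise ValueError("items_needed and overage_percent must be non-negative")
    # Integer ceiling division avoids floating-point rounding surprises.
    extra = -(-items_needed * overage_percent // 100)
    return items_needed + extra


# Hypothetical case: the final test forms need 120 items in total.
for percent in (25, 50, 100):
    print(f"{percent}% overage: write {items_to_write(120, percent)} items")
```

Under these assumptions, a less important test would start from 150 items, and an important one from somewhere between 180 and 240.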

Step Five: Pilot Every Item (or Task)

Test items are very complex, and often do not work as the writer intended. Thus all test items, and all test tasks of any type, need to be piloted. That means they must be tried out on a representative sample of target test takers, and the sample must be large enough for the particular statistical procedures to be carried out. Depending on the type of analysis planned, each item should be taken by at least 40 test takers for classical item analysis, about 200 to use the Rasch Model (a 1-parameter IRT model), and at least 500 to use a 3-parameter model. Piloting is inconvenient, time-consuming and expensive, but it is not possible to develop good tests without it.
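To give a feel for what these sample sizes mean in practice, here is a minimal planning sketch. It assumes, purely for illustration, that the item pool is split into separate pilot forms, that each test taker sits exactly one form, and that no items are shared between forms. The dictionary keys and function name are hypothetical labels; the thresholds are the ones quoted above.

```python
# Minimum number of test takers per item, as quoted in the text.
MIN_RESPONSES_PER_ITEM = {
    "classical": 40,   # classical item analysis
    "rasch": 200,      # Rasch (1-parameter) model
    "3pl": 500,        # 3-parameter model
}


def pilot_test_takers_needed(pool_size: int, items_per_form: int, analysis: str) -> int:
    """Estimate total pilot test takers under the one-form-each assumption."""
    if pool_size <= 0 or items_per_form <= 0:
        raise ValueError("pool_size and items_per_form must be positive")
    # Ceiling division: a partly filled last form still needs a full sample.
    forms = -(-pool_size // items_per_form)
    return forms * MIN_RESPONSES_PER_ITEM[analysis]


# Hypothetical pool of 180 items piloted in forms of 30 items each.
for analysis in ("classical", "rasch", "3pl"):
    print(analysis, pilot_test_takers_needed(180, 30, analysis))
```

This is a lower bound. Calibrating all items onto one scale requires the pilot forms to be linked through common items or common test takers, which raises the numbers.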

Step Six: Select the Measurement Model

At this point in the process, we have a pool of items, but this is not a test. We need to take these items and turn them into a measurement instrument. There are a number of different ways to do that, using either Classical Test Theory or Item Response Theory (IRT). IRT is the preferred method, and nowadays all important tests are made using IRT. In this example, we will describe the test development process using the Rasch Model, the simplest and most widely used IRT model.

Step Seven: Create the IRT Scale

The first step in building the measurement instrument is to create an IRT scale and calibrate all items on that scale, using the data from the pilot administration. The Rasch scale is a probabilistic scale, with both item difficulty and test taker ability expressed on the same scale. The units of the Rasch scale are logits. Although the scale is open-ended, it usually runs from approximately -4.0 to +4.0 logits, with 0 set to be equal to the mean of the item difficulties.
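To make "probabilistic scale" concrete: under the Rasch model, the probability of a correct response depends only on the gap between ability and difficulty, both in logits. The short sketch below evaluates that probability. The function name is an illustrative choice, and the code shows only the model equation, not the calibration procedure itself.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the Rasch model.

    P = exp(ability - difficulty) / (1 + exp(ability - difficulty)),
    with ability and difficulty both expressed in logits.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Probability of success on an item of difficulty 0 at several ability levels.
for ability in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"ability {ability:+.1f}: P(correct) = {rasch_probability(ability, 0.0):.3f}")
```

When ability equals difficulty the probability is exactly 0.5. A test taker one logit above the item succeeds about 73% of the time, and one two logits above about 88% of the time.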

Step Eight: Evaluate the Items

Based on this analysis, items are evaluated for quality and appropriacy. This usually involves looking at their difficulty, their fit to the Rasch model, and correlations with other items or parts of the test. Good items are kept, while poor ones are thrown out, or sent for revision.
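One common way to check fit to the Rasch model is with the outfit and infit mean-square statistics. The text does not say which fit statistics are used, so treat the following as one possible sketch. It assumes that ability estimates for the pilot test takers and the item's difficulty are already available from the calibration, and the response data are made up for illustration.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def item_fit(responses, abilities, difficulty):
    """Return (outfit, infit) mean-square statistics for one item.

    responses: 0/1 scores on the item, one per test taker.
    abilities: the corresponding ability estimates in logits.
    """
    if not responses or len(responses) != len(abilities):
        raise ValueError("responses and abilities must be non-empty and equal length")
    squared_residuals = 0.0
    variances = 0.0
    standardized_squares = 0.0
    for score, ability in zip(responses, abilities):
        expected = rasch_probability(ability, difficulty)
        variance = expected * (1.0 - expected)
        residual_sq = (score - expected) ** 2
        squared_residuals += residual_sq
        variances += variance
        standardized_squares += residual_sq / variance
    outfit = standardized_squares / len(responses)  # unweighted mean square
    infit = squared_residuals / variances           # information-weighted mean square
    return outfit, infit


# Hypothetical data: five test takers, one item of difficulty 0 logits.
abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]
orderly = [0, 0, 1, 1, 1]    # abler test takers succeed
reversed_ = [1, 1, 0, 0, 0]  # weaker test takers succeed, abler ones fail

print("orderly :", item_fit(orderly, abilities, 0.0))
print("reversed:", item_fit(reversed_, abilities, 0.0))
```

Values near 1.0 mean the responses are about as noisy as the model expects. The reversed pattern produces values well above 1.0, which would mark the item for revision or removal. Acceptable ranges vary by testing program and are not specified here.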

Step Nine: Assemble the Test Forms

Versions of the test are then built, based on the Test Specifications, item content and the statistical qualities of the items. These various test forms should be as similar to each other as possible. They should be parallel in structure, in layout, in the number of items of each type, and in content. They should be designed to measure the same skills.

Step Ten: Create a Reporting Scale and Equate the Forms

The Rasch scale is not a suitable scale for reporting test scores. A typical score on the Rasch scale might be -0.763, or +2.699. Most users are uncomfortable with a negative score, and with small decimal numbers. A reporting scale needs to be created that score users feel more comfortable with. This must be a linear transformation of the Rasch scale, and can be anything that is acceptable. Scales running from 0 to 100, or from 200 to 800 are typical.

As part of this process, the various test forms are equated. Since the different test forms contain items of different difficulties, they will vary in difficulty. One form will be harder than another and a certain number of items correct on one form will not represent the same ability level as the same number correct on a different form. Thus each form will have a different conversion to the reporting scale, to take account of this.
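The sketch below illustrates both ideas under stated assumptions. The linear transformation maps -4 to +4 logits onto a 200 to 800 scale (slope 75, intercept 500), and each form's raw-score conversion is obtained by finding the ability at which the expected score on that form equals the raw score. The item difficulties, function names, and the choice of this particular conversion method are illustrative assumptions, not taken from the text.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def ability_for_raw_score(raw_score, difficulties, low=-10.0, high=10.0):
    """Find the ability (logits) whose expected score on the form equals raw_score."""
    if not 0 < raw_score < len(difficulties):
        raise ValueError("zero and perfect scores have no finite ability estimate")
    # Bisection works because the expected score rises steadily with ability.
    for _ in range(100):
        middle = (low + high) / 2.0
        expected = sum(rasch_probability(middle, b) for b in difficulties)
        if expected < raw_score:
            low = middle
        else:
            high = middle
    return (low + high) / 2.0


def to_reporting_scale(logit: float, slope: float = 75.0, intercept: float = 500.0) -> float:
    """Linear transformation from logits to the reporting scale."""
    return slope * logit + intercept


# Hypothetical calibrated difficulties; Form B is harder than Form A.
form_a = [-1.0, -0.5, 0.0, 0.5, 1.0]
form_b = [-0.5, 0.0, 0.5, 1.0, 1.5]

for raw in range(1, 5):
    score_a = to_reporting_scale(ability_for_raw_score(raw, form_a))
    score_b = to_reporting_scale(ability_for_raw_score(raw, form_b))
    print(f"raw {raw}: Form A -> {score_a:.0f}, Form B -> {score_b:.0f}")
```

The same raw score earns a higher reported score on the harder form, which is why each form needs its own conversion table. An operational scale would also need rules for rounding, for zero and perfect scores, and for the limits of the scale.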

Step Eleven: Set Performance Standards

If tests are to be used for certification purposes, or if passing scores need to be set, then we need a Standard Setting Study. This is a complex procedure, and is usually a separate research study in its own right. A group of experts is assembled, and their advice is sought. There are many different ways this can be done, and the process is both complex and contentious. The basic process involves discussion of the passing requirements, individual ratings of the items, collective discussion and comparison of these ratings, and consideration of the implications of these for a passing score. The process usually involves a number of iterations until consensus is reached, and often takes a number of days.
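The text does not name a specific standard-setting method. As one illustration of how individual item ratings can turn into a passing score, the sketch below follows the widely used Angoff-style approach: each expert estimates, for every item, the probability that a minimally competent test taker would answer correctly, and the cut score is the average of the experts' summed ratings. The ratings shown are invented for the example.

```python
from statistics import mean


def angoff_cut_score(ratings):
    """Return a raw-score cut from Angoff-style ratings.

    ratings[j][i] is expert j's estimated probability that a minimally
    competent test taker answers item i correctly.
    """
    if not ratings or any(len(row) != len(ratings[0]) for row in ratings):
        raise ValueError("every expert must rate the same set of items")
    expert_totals = [sum(row) for row in ratings]
    return mean(expert_totals)


# Hypothetical ratings: three experts, four items, one round of rating.
round_one = [
    [0.6, 0.7, 0.5, 0.8],
    [0.5, 0.8, 0.4, 0.9],
    [0.7, 0.6, 0.5, 0.7],
]

print(f"Recommended raw cut score: {angoff_cut_score(round_one):.2f} out of 4")
```

In a real study this calculation would be repeated after each round of discussion, and the final figure would be only one input into the decision on the passing score.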

Step Twelve: Write up Documentation

Any testing system needs to be accompanied by many different documents: Item development manuals, Administration manuals, Test taker guidelines, Score interpretation guides, Technical manuals, Validation reports, Research studies. Writing such manuals requires experience and specialized skills. Each manual addresses a different audience, each having different requirements and needing different expertise. Good manuals need to be customized to meet the needs of the particular users.

Step Thirteen: Field Test and Validate the System

For many large-scale testing systems, it is normal to carry out large-scale field testing of the system. This has a number of purposes: it tests the operational aspects of the system to see how well things work, it provides normative data for specific groups of interest, and it provides evidence of the validity of the test.

Step Fourteen: Ongoing Test Development, Review and Evaluation

Of course, developing the test is only the first part. During the useful life of the test, new items will need to be written, and new test forms will need to be developed. Test performance needs to be monitored on a regular basis. Test taker data needs to be analyzed on a regular basis, and revised technical reports need to be created. Validation is an ongoing requirement, and any high-stakes testing system needs to accumulate a variety of research studies and validation reports, which continue to explore the test and the meaning of the test scores.

Conclusion

This is a lot of work, but this is how good tests are made. If your tests are making serious decisions — decisions that have an impact on the lives of the test takers — then you need to be going through a similar process.