How Good Tests Are Made

Anyone can throw together a test in an afternoon—people do it all the time. However, such tests do not lead to good decisions.

Developing tests with results that can lead to dependable and accurate decisions about test taker abilities is a difficult undertaking. It takes time, a lot of money and most of all, a lot of expertise.

In this section, we describe the process in a number of steps. These are:

1. Clarify the purpose of the test
2. Define the construct
3. Design the test
4. Create the items (or tasks)
5. Pilot every item (or task)
6. Select the measurement model
7. Create the IRT scale
8. Evaluate the items
9. Assemble the test forms
10. Create a reporting scale and equate the forms
11. Set performance standards
12. Write up documentation
13. Field test and validate the system
14. Ongoing test development, review and evaluation

1. Clarify the Purpose of the Test

We start by deciding what decisions need to be made based on the test results.

For example: has the test taker successfully completed a particular course of study; or does the test taker have enough English proficiency to successfully undertake an undergraduate course of study in an English-speaking university; or does the test taker have enough French to live independently in France?

These are the decisions that the test needs to support. The quality of a test is determined by the extent to which its results help users make the ‘appropriate’ or ‘right’ decisions.

2. Define the Construct

The construct is what the test measures: the particular knowledge, skill or ability that the test must capture so that the results can contribute to solid decision-making.

There are many ways to determine this. In some cases, it is clear what the test needs to measure. In others, it may be necessary to carry out a research study to find out. This is called a Needs Analysis, or sometimes a Job Analysis. The researcher looks specifically at what the test taker needs to be able to do. Typically, researchers use:

  • Observations
  • Questionnaires
  • Interviews
  • Discussions with experts

A Needs Analysis is a serious research project and may be a big undertaking to carry out properly.

3. Design the Test

This is a creative process in which all the previous information is brought together into one design.

The design of the test is a document called the Test Specifications. In effect, this is an operational definition of the construct.

The Table of Specifications will describe how many sections the test has, how many items are in each section, and what the items are like.

There is usually a set of Item (or Task) Specifications that lay out in detail, for each item, what an item should do, what information it should target, how it should be written, and so forth.

4. Create the Items (or Tasks)

This is a high-level professional skill that takes many years to learn well and is best left to professionals. If the domain of knowledge is very specialized, the item writers will work closely with content experts.

Test items are very complex, and item writers cannot always imagine the many ways test takers can respond to their items. So after they have been written, items need to go through a number of review stages. They are generally reviewed:

  • To ensure that they are skillfully written
  • To ensure the content is appropriate
  • To identify any bias or sensitivity issues

Since some items will be lost during the piloting stage, it is customary to write more items than are necessary. The number of additional items depends on the quality and/or importance of the test. For important tests, it is advisable to write 50 to 100 percent more than will be required for the final test forms. For less important tests, 25 percent more may be enough. Less experienced item writers will lose more items during piloting than more experienced writers.
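The arithmetic behind these percentages can be sketched in a few lines. This is only an illustration: the function name `items_to_write` and the example of a 60-item requirement are our own assumptions, and the percentages are the rough guidelines given above, not fixed rules.

```python
import math


def items_to_write(items_needed: int, overage_percent: int) -> int:
    """Return how many items to draft for a planned overage.

    items_needed    -- items required across the final test forms
    overage_percent -- extra items to write, as a percentage
                       (e.g. 25 for a less important test,
                       50 to 100 for an important one)
    """
    extra = math.ceil(items_needed * overage_percent / 100)
    return items_needed + extra


# Hypothetical example: the final forms need 60 items in total.
low_stakes_target = items_to_write(60, 25)
high_stakes_target = items_to_write(60, 100)
```

A team with less experienced item writers would presumably plan toward the upper end of the range.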

5. Pilot Every Item (or Task)

Test items are very complex and often do not work as the writer intended. Thus, all test items—all test tasks of any type—need to be piloted. That means they must be tried out on a representative sample of target test takers, and the sample must be large enough for the particular statistical procedures to be carried out.

Each item should be taken by at least 40 test takers for classical item analysis, about 200 to use the Rasch Model (a 1-parameter IRT model), and at least 500 to use a 3-parameter model.
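As a rough planning aid, these minimums can be written down as a lookup. The sketch below is an assumption-laden illustration: the dictionary keys and the function name are hypothetical, and the figure for the Rasch Model is an approximate guideline rather than a hard threshold.

```python
# Minimum responses per item, taken from the guidelines above.
# The keys are illustrative labels, not standard identifiers.
MIN_RESPONSES_PER_ITEM = {
    "classical": 40,   # classical item analysis
    "rasch": 200,      # Rasch Model (1-parameter IRT), approximate
    "3pl": 500,        # 3-parameter IRT model
}


def enough_pilot_data(responses_per_item: int, analysis: str) -> bool:
    """Check whether a pilot sample meets the guideline for an analysis."""
    return responses_per_item >= MIN_RESPONSES_PER_ITEM[analysis]


def supported_analyses(responses_per_item: int) -> list:
    """List the analyses a pilot sample of this size could support."""
    return [name for name, minimum in MIN_RESPONSES_PER_ITEM.items()
            if responses_per_item >= minimum]
```

For instance, a pilot in which each item is answered by 150 test takers would meet the guideline for classical item analysis but not for Rasch calibration.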

Piloting is inconvenient, time-consuming and expensive, but it is not possible to develop good tests without this step.

6. Select the Measurement Model

At this point of the process, we have a pool of items, but this is not a test. We need to take these items and turn them into a measurement instrument.

There are a number of different ways to do that using Classical Test Theory or Item Response Theory (IRT). IRT is the preferred method, and nowadays all important tests are made using IRT. In this example, we will describe the test development process using the Rasch Model, the simplest and most widely used IRT model.

7. Create the IRT Scale

The first step in the process of creating a measurement instrument from our pool of items is to create an IRT scale and calibrate all items on that scale using the data from the pilot administration.

The Rasch scale is a probabilistic scale, with both item difficulty and test taker ability expressed on the same scale. The units of the Rasch scale are logits. Although the scale is open-ended, it usually runs from approximately -4.0 to +4.0 logits, with 0 set equal to the mean of the item difficulties.
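To make the idea of a probabilistic scale concrete, here is a minimal sketch of the Rasch Model's response function, together with the centering step that puts 0 at the mean item difficulty. The function names are our own; real calibration is done with specialized software that estimates the difficulties from pilot data, which this sketch does not attempt.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the Rasch Model.

    Both arguments are in logits. The probability depends only on
    the difference between ability and difficulty.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def center_difficulties(difficulties):
    """Shift item difficulties so that their mean is 0 logits."""
    mean = sum(difficulties) / len(difficulties)
    return [d - mean for d in difficulties]
```

When ability equals difficulty, the probability of success is exactly 0.5. A test taker one logit above an item's difficulty has a probability of about 0.73.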

8. Evaluate the Items

Based on the calibration results, items are evaluated for quality and appropriateness. This usually involves looking at their difficulty, their fit to the Rasch model and their correlations with other items or parts of the test. Good items are kept, while poor ones are thrown out or sent for revision.
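Two of these checks, difficulty and correlation with the rest of the test, can be sketched with classical statistics. The code below is a simplified illustration under our own assumptions: responses are scored 0 or 1, the function names are hypothetical, and Rasch fit statistics are left to calibration software and not shown.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns NaN if either variable is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def item_statistics(responses):
    """Compute simple statistics for each item.

    responses -- one row per test taker, each a list of 0/1 item scores
    Returns, per item, the proportion correct and the correlation
    between the item score and the total on the remaining items.
    """
    totals = [sum(row) for row in responses]
    stats = []
    for i in range(len(responses[0])):
        scores = [row[i] for row in responses]
        rest = [total - score for total, score in zip(totals, scores)]
        stats.append({
            "item": i,
            "proportion_correct": sum(scores) / len(scores),
            "item_rest_correlation": pearson(scores, rest),
        })
    return stats


# Tiny made-up data set: four test takers, three items.
example_responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
example_stats = item_statistics(example_responses)
```

An item with a low or negative item-rest correlation is a typical candidate for revision, since stronger test takers are not doing better on it than weaker ones.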

9. Assemble the Test Forms

Versions of the test are then built based on the Test Specifications, item content and statistical qualities of the items.

These various test forms should be as similar to each other as possible. They should be parallel in structure, layout, number of each type of item and content. They should be designed to measure the same skills.

10. Create a Reporting Scale and Equate the Forms

The Rasch scale is not a suitable scale for reporting test scores. A typical score on the Rasch scale might be -0.763 or +2.699. Most users are uncomfortable with negative scores and with small decimal numbers. So, we must create a reporting scale that score users feel more comfortable with. This scale must be a linear transformation of the Rasch scale, and it can span any range that score users find acceptable. Scales running from 0 to 100, or from 200 to 800, are typical.
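A linear transformation of this kind is fully determined by two anchor points. The sketch below assumes, purely for illustration, that -4.0 logits should map to 200 and +4.0 logits to 800; the function name and the anchor values are our own choices, and an operational scale would be set by the test developer.

```python
def make_reporting_scale(logit_low, logit_high, report_low, report_high):
    """Build a linear conversion from logits to a reporting scale.

    The two (logit, reported score) anchor points fix the slope and
    intercept of the transformation.
    """
    slope = (report_high - report_low) / (logit_high - logit_low)
    intercept = report_low - slope * logit_low

    def convert(logit):
        return slope * logit + intercept

    return convert


# Assumed anchors: -4.0 logits -> 200, +4.0 logits -> 800.
to_reported = make_reporting_scale(-4.0, 4.0, 200.0, 800.0)

# The two example Rasch scores from the text, rounded for reporting.
reported_low = round(to_reported(-0.763))
reported_high = round(to_reported(2.699))
```

With these anchors, each logit is worth 75 reporting points and 0 logits maps to 500. Abilities outside the anchor range would fall outside 200 to 800, so a real scale would also need a rule for truncating extreme scores.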

As part of this process, the various test forms are equated. Since the different test forms contain items of different difficulties, the forms will vary in difficulty. One form will be harder than another and a certain number of items correct on one form will not represent the same ability level as the same number correct on a different form. Thus each form will have a different conversion to the reporting scale to account for this.
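The effect of form difficulty on the conversion can be sketched as follows. Under the Rasch Model, the ability estimate for a raw score is the point at which the expected number correct on that form equals the raw score. The code assumes both forms have already been calibrated on the same logit scale; the item difficulties and function names are made up for illustration, and this is one simple way to derive conversions, not a full equating design.

```python
import math


def expected_raw_score(ability, difficulties):
    """Expected number correct on a form for a given ability (logits)."""
    return sum(1.0 / (1.0 + math.exp(-(ability - b))) for b in difficulties)


def ability_for_raw_score(raw_score, difficulties, tol=1e-6):
    """Find the ability whose expected raw score equals raw_score.

    Uses bisection; zero and perfect scores have no finite estimate.
    """
    if not 0 < raw_score < len(difficulties):
        raise ValueError("raw score must be between 0 and the item count")
    low, high = -10.0, 10.0
    while high - low > tol:
        mid = (low + high) / 2
        if expected_raw_score(mid, difficulties) < raw_score:
            low = mid
        else:
            high = mid
    return (low + high) / 2


# Hypothetical item difficulties in logits; Form B is harder than Form A.
FORM_A = [-1.0, -0.5, 0.0, 0.5, 1.0]
FORM_B = [-0.5, 0.0, 0.5, 1.0, 1.5]

# The same raw score of 3 corresponds to different abilities.
ability_a = ability_for_raw_score(3, FORM_A)
ability_b = ability_for_raw_score(3, FORM_B)
```

Because every item on Form B is half a logit harder, three correct answers on Form B indicate an ability half a logit higher than three correct on Form A. Each form therefore gets its own raw-score-to-reporting-scale table.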

11. Set Performance Standards

If tests are to be used for certification purposes, or if passing scores need to be set, then we need a Standard Setting Study. This is a complex procedure and is usually a separate research study in its own right.

A group of experts is assembled, and their advice is sought. There are many different ways this can be done; the process is complex and experts often disagree about the best methods. The basic process involves discussion of the passing requirements, individual ratings of the items, collective discussion and comparison of these ratings and consideration of the implications these factors have for a passing score. The process usually involves a number of iterations until consensus is reached and often takes a number of days.
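As one illustration of how item ratings can turn into a passing score, the sketch below follows the general idea of the Angoff method, a widely used approach that we bring in here as an example; the process described above does not depend on it. Each expert estimates, for every item, the probability that a borderline test taker would answer correctly. The function name and the ratings are hypothetical.

```python
def angoff_cut_score(ratings):
    """Summarize one round of Angoff-style ratings.

    ratings -- one list per judge; each value is that judge's estimated
               probability (0 to 1) that a borderline test taker
               answers the item correctly
    Returns the panel's recommended raw cut score (the mean of the
    judges' summed ratings) and the spread between judges.
    """
    judge_totals = [sum(judge) for judge in ratings]
    cut_score = sum(judge_totals) / len(judge_totals)
    spread = max(judge_totals) - min(judge_totals)
    return cut_score, spread


# Made-up ratings from two judges on a four-item test.
round_one = [
    [0.5, 0.5, 1.0, 1.0],
    [0.5, 0.5, 0.5, 0.5],
]
cut, spread = angoff_cut_score(round_one)
```

A large spread between judges is the kind of disagreement that the discussion rounds are meant to resolve before the next iteration of ratings.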

12. Write up Documentation

Any testing system needs to be accompanied by many different documents, including:

  • Item development manuals
  • Administration manuals
  • Test taker guidelines
  • Score interpretation guides
  • Technical manuals
  • Validation reports
  • Research studies

Writing such manuals requires experience and specialized skills. Each manual addresses a different audience, each with different requirements and different kinds and levels of expertise. Good manuals need to be customized to meet the needs of their particular users.

13. Field Test and Validate the System

For many large-scale testing systems, it is normal to carry out large-scale field testing of the system. This has a number of purposes:

  • It tests the operational aspects of the system to see how well things work
  • It provides normative data for specific groups of interest
  • It provides evidence of the validity of the test

14. Ongoing Test Development, Review and Evaluation

Of course, developing the test is only the first part.

During the useful life of the test, new items will need to be written, and new test forms will need to be developed.

Test performance needs to be monitored on a regular basis. Test takers’ data needs to be analyzed regularly, and revised technical reports need to be created. Validation is an ongoing requirement, and any high-stakes testing system needs to accumulate a variety of research studies and validation reports, which continue to explore the test and the meaning of the test scores.


This process is a lot of work, but this is how good tests are made. If the results of your tests are contributing to making serious decisions—decisions that have an impact on the lives of the test takers—then you need to be going through a similar process.

