Item Writing Policy

Richard A. Feely, D.O.

The principal type of multiple-choice item format we use is the conventional multiple-choice item with four distractors; rarely will we use matching, extended matching, problem-solving item sets, or vignette/scenario item sets. Four areas of item-writing guidelines follow.

I. Content Guidelines

Every item reflects specific content and a single specific cognitive process, as called for in the test specifications.

1) The content should come from a topic outline, list of major topics, books, etc., as dictated by the particular board of examiners or college course. All content can essentially be reduced to facts, concepts, principles, and procedures, though topics generally subsume this distinction. The cognitive demand is usually recall or understanding, but if the intent of the item is to infer status on an ability such as problem solving, the application of knowledge and skills is assumed.

2) Base each item on something important to learn; avoid trivial content. Judging the importance of content and its cognitive demand is subjective and is appropriately done by experts when deciding whether a particular item belongs on a test.

3) Use novel material to test for understanding and other forms of learning. Testing for understanding, as opposed to recall, can be done using several strategies: a concept, principle, or procedure is tested in a novel way. To achieve this novelty, content presented in the textbook is not reproduced in the test item; instead, one asks the student to identify an example of something, or paraphrases a definition to see whether the student can link the paraphrased definition to a concept, principle, or procedure. This type of concept learning is usually presented in a scenario or vignette that asks for critical thinking and problem solving. Below are two examples. The first item provides dictionary definitions; the second provides examples of writing, and the student who understands what a metaphor is should select the correct example.

Which is the best definition of a metaphor?
A) Metaphor describes something as if it were something else.
B) Metaphors make comparisons to other things.
C) Metaphors are trite, stereotype expressions.

Which of the following is a metaphor?
A) At the breakfast buffet, I ate like a pig.
B) My cat has fur like knotted wool.
C) She is like a rose full of thorns and smelly.

4) Keep the content of the item independent of the content of other items on the test. A tendency when writing sets of items is to provide information in one item that helps the test taker answer another item. Do not do that.

5) Avoid over-specific or over-general content. The specificity of knowledge lies on a continuum from specific to general, and most items should be written with this in mind, avoiding the extremes. Over-specific knowledge tends to be trivial relative to the domain of knowledge intended; over-general knowledge may have many exceptions or be ambiguous. Each item's specificity or generality should be reviewed by others who can help judge it.

6) Avoid opinion-based items. Items should reflect well-known, publicly supported facts, concepts, principles, and procedures. If an opinion must be tested, qualify it, e.g., "According to the Nei Jing…"

7) Avoid trick items. Trick items are intended to deceive the test taker into choosing a distractor instead of the right answer. They are difficult to illustrate, and they can be deliberately written by the item writer or can accidentally become tricky to test takers. There are seven types of items that students perceive as tricky: 1) the item writer's intention appears to be to deceive, confuse, or mislead test takers; 2) trivial content is represented; 3) the discrimination among options is too fine; 4) the item has window dressing that is irrelevant to the problem; 5) multiple correct answers are possible; 6) principles are presented in ways that were not learned, thus deceiving the students; 7) the item is so highly ambiguous that even the best students have no idea of the right answer.

The following are open-ended trick items: 1) Is there a 4th of July in England? 2) Some months have 31 days. How many have 28?

II. Style and Format Concerns.

8) Format items vertically instead of horizontally. Horizontal formatting is harder to read, confusing students and lowering test scores.

9) Edit items for clarity. Early in its development, each item should be subjected to scrutiny by a qualified editor to determine whether the central ideas are presented as clearly and concisely as possible.

10) Edit items for correct grammar, punctuation, capitalization, and spelling. Acronyms may be used, but their use should be handled carefully; generally, an acronym is explained in the test before being reused.

11) Simplify vocabulary. The purpose of most multiple-choice achievement tests is to measure knowledge and skills that were supposed to be learned. Research shows that limited-English-proficiency students perform better when the language is simplified.

12) Minimize reading time. Items may be unnecessarily wordy, and verbosity is an enemy of clarity. The time spent on verbose items limits the number of items that can be asked in the time available, which in turn hurts the adequacy of content sampling and the reliability of test scores.

13) Proofread each item. Proofreading is a highly recommended step in the production of any test. A good rule of thumb from expert editors: if you spot three errors in the final proofreading phase of test development, you have probably missed one.

III. Writing the Stem.

14) Make directions as clear as possible. The stem should be written so that the test taker knows what the focus of the item is. Of the two questions below, the first has unclear directions in the stem and the second has clear directions.

Bad example:
A plant in a flower pot fell over. What happened?
Clear example:
A plant growing in a flower pot was turned on its side. A week later, what would you expect to see?

15) Make the stem as brief as possible.

16) Place the main idea in the stem, not in the choices. An item's stem should always contain the main idea; the test taker should know what is being asked after reading the stem alone. When an item fails to perform as intended with a group of students who have received appropriate instruction, there are often many reasons, and one common reason is that the stem did not present the main idea.

17) Avoid irrelevant information (window dressing). Some items contain entire sentences that have nothing to do with the problem stated in the stem, often added to make the item look more lifelike or realistic. For the reasons listed in guidelines 9, 11, 12, 14, and 15, window dressing is not needed. However, extra verbiage in the stem may be appropriate in problems where the test taker must sort through information and distinguish relevant from irrelevant material; in that case, the purpose of the excess information is to see whether the examinee can separate the useless from the useful.

18) Avoid negative words in the stem. Researchers have shown that negative words in the stem have negative effects on students and their responses, and that students have difficulty with negatively phrased items. A negatively phrased item requires twice as much working memory as its positively phrased form; negative words appearing in both the stem and one or more options may require four times as much working memory as a positively phrased equivalent. Other experts have noted that the human brain cannot easily think in the reverse of an idea. If a negative term is used (such as "not" or "except"), it should be emphasized by placing it in bold type, capitalizing it, underlining it, or all of the above.

IV. Writing the Choices.

19) Use as many choices as possible, though three seems to be the natural limit. In most of our tests we require four choices. Unless otherwise requested, please provide four.

20) Vary the location of the right answer according to the number of options; assign the position of the correct answer randomly. Our computer technology randomly assigns the positions of the options. For your purposes as an item writer, ALWAYS make "A" the correct answer. Please also list the source underneath "A": the book, author, date, publisher, and page number.

21) Place options in logical or numerical order. Numerical answers should be arranged in ascending or descending order. Decimal points in quantitative answers should be aligned for easy reading, and a zero should always precede the decimal point, e.g., 0.250 rather than .250.
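
The ordering and decimal rules in guideline 21 can be applied mechanically. The sketch below is illustrative only; the helper name and sample values are hypothetical, not part of our production tooling.

```python
# Sketch: sort numeric options into ascending order and format each with a
# fixed number of decimals, which guarantees a leading zero before the
# decimal point (0.250, never .250) and aligned decimal points.
def format_numeric_options(values, decimals=3):
    ordered = sorted(values)  # ascending numerical order, per guideline 21
    return [f"{v:.{decimals}f}" for v in ordered]

print(format_numeric_options([0.5, 0.25, 0.125, 1.0]))
# → ['0.125', '0.250', '0.500', '1.000']
```

Because every option is padded to the same number of decimal places, the options also align vertically when listed one per line.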

22) Keep choices independent. Choices should not be overlapping. If options are overlapping, these options are likely to give a clue to the test taker about the correct answer and the distractors. If values are given, do not use overlapping options such as:

What age range represents the physical peak of life?
A) 11-15 years
B) 13-19 years
C) 18-25 years
D) 24-32 years

Avoid overlapping options in all cases.

23) Keep choices homogeneous in content and grammatical structure. If the correct answer is shorter, more specific, or stated in different language (perhaps more or less technical than the distractors), it may be easier to identify and the item easier to answer correctly.

24) Keep the length of choices about the same. One common fault in item writing is to make the correct answer the longest. Please do not do that.

25) Use "none of the above" sparingly. As the last option, "none of the above" is easy to construct, but research on this guideline is controversial, and authors are split over the use of this option as a distractor.

26) Avoid using "all of the above." The option "all of the above" seems a good device for items where one, two, or even three right answers exist; however, its use may help test-wise test takers.

27) Avoid negative words such as "not" or "except." Stems should be phrased positively, and the same advice applies to the options: negatives such as "not" and "except" should be avoided in the options as well as the stem.

28) Avoid options that give clues to the right answer. One such clue is the specific determiner. Specific determiners, such as "always," "never," "totally," "absolutely," or "completely," are so extreme that they are seldom the right answer. A specific determiner may occasionally be the right answer; in those instances its use is justified if the distractors also contain specific determiners.

Avoid clang associations. Sometimes a word or phrase that appears in the item’s stem will also appear in the list of choices and the word or phrase will be the correct answer. If a clang association exists and a word or phrase is not the correct answer, the item may be a trick question.

Options should be homogeneous with respect to grammar. Sometimes a grammatical error in an option may lead a test taker to the right answer.

Options should be homogeneous with respect to content. A test-wise student is likely to choose the heterogeneous option.

Avoid blatantly absurd or ridiculous options. When writing the third or fourth option, there is a temptation to develop a ridiculous choice, either as humor or out of desperation. In either case the ridiculous option will seldom be chosen, and in preparation for board examinations we do not use humor or want trick questions. Please avoid this at all costs.

29) Make all distractors plausible. Multiple-choice questions are used to measure knowledge and cognitive skills; therefore, the right answer must be right and the wrong answers must be wrong. The key to developing wrong answers is plausibility. Plausibility refers to the idea that an item should be answered correctly by those who possess a high degree of knowledge and incorrectly by those who possess a low degree of knowledge. A plausible distractor will look like a right answer to those who lack this knowledge. The effectiveness of a distractor can be analyzed statistically: if, for example, only 3% of students choose an option, one may conclude that the option is very implausible. Writing plausible distractors takes hard work and is the most difficult part of multiple-choice item writing.
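
The statistical check described in guideline 29 can be sketched as follows. This is an illustrative sketch only: the function name and response counts are hypothetical, and the 3% cutoff is taken from the guideline above, not from a fixed policy.

```python
# Sketch: flag implausible distractors from per-option response counts.
# An option (other than the key) chosen by fewer than `cutoff` of all
# examinees is flagged, per the 3% selection rate mentioned above.
def implausible_options(counts, key, cutoff=0.03):
    total = sum(counts.values())
    return [opt for opt, n in counts.items()
            if opt != key and n / total < cutoff]

# Hypothetical response counts for one item; "A" is the keyed answer.
responses = {"A": 62, "B": 21, "C": 15, "D": 2}
print(implausible_options(responses, key="A"))
# → ['D']
```

A flagged option such as "D" here (chosen by 2% of examinees) would be a candidate for rewriting, ideally around a common student error as guideline 30 suggests.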

30) Use typical student errors when writing distractors. A good plausible distractor comes from an understanding of common student errors, rewritten as a distractor.

31) Use humor only if it is compatible with the teacher and the setting, and avoid humor in formal testing situations. Ours is a formal testing situation; therefore, please avoid humor.

32) Specific guidelines for writing matching items. The 31 general guidelines enumerated above also apply to matching items. The set of choices for a matching item should be homogeneous in content; indeed, the benefit of the matching format is measurement of understanding of a single learning outcome, and homogeneity of content is the defining characteristic of a set of matching items. Also, the number of choices should not equal the number of stems. The seven rules for the matching format are:

1. Provide clear directions to the students about how to select an option for each stem.
2. Provide more stems than choices.
3. Make choices homogenous.
4. Put choices in logical order or numerical order.
5. Keep the stems longer than the options.
6. Number stems and use letters for options (A, B, C).
7. Keep all items on a single page or bordered section of the page.

V. Testing for the Application of Knowledge and Skills in a Complex Task

33) Conventional multiple-choice examinations for certification are constructed by many means. Item formats range from the conventional multiple-choice format, to conventional multiple choice with content-dependent graphic material, conventional multiple choice with generic options, multiple true/false formats, multiple-response formats, two-tiered item sets, and combination multiple-choice items. These examples show that multiple-choice items can measure more than recall. Multiple-choice items often serve as good proxies for performance items, which take more time to administer and involve human scoring that is fraught with inconsistency and bias. National board examination item banks are hard to develop and must be updated each year; in most professions they continuously evolve, with old items retiring and new items replacing them. Item writing in this context is expensive: subject matter experts must be paid, or may volunteer their valuable time. Regardless, the items must not only look good, they must perform. The medical arena requires high-quality multiple-choice items that test this application of knowledge and skills. A high-quality item of the kind typically encountered in certification tests for medical specialties gives a vignette of useful and non-useful information, followed by four or five options.

Item generation. Specific assignments will be provided to the item writer, with specific references for each item. However, there are some generic item shells. An item shell is a hollow item containing a syntactic structure that is useful for writing sets of similar items; it is, in effect, a generic multiple-choice test item. All item shells are derived from existing items known to perform as expected. A simple item shell is:

What is an example of ‘any concept’?
A) Example
B) Plausible non-example
C) Plausible non-example
D) Plausible non-example
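
An item shell like the one above is essentially a template with slots. The sketch below illustrates this; the function, the concept, and the example sentences are hypothetical placeholders, not items from our bank.

```python
# Sketch: fill the generic "example of a concept" shell shown above.
# The correct example is placed first, per guideline 20 (option "A" is
# always the key as written; positions are randomized by software later).
SHELL = "What is an example of {concept}?"

def fill_shell(concept, example, non_examples):
    stem = SHELL.format(concept=concept)
    options = [example] + list(non_examples)
    return stem, options

stem, options = fill_shell(
    "a metaphor",
    "Her voice was velvet.",                  # key (hypothetical)
    ["Her voice was like velvet.",            # plausible non-examples
     "Her voice was soft.",
     "Her voice was quiet."])
print(stem)
# → What is an example of a metaphor?
```

The same shell can generate a whole set of parallel items by varying only the concept and its example/non-example pool, which is the point of shell-based item writing.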

Other item stem shells include:

What is the definition of?
What is the best definition of?
What is the meaning of?
What is synonymous with?
Which is like?
Which is characteristic of?
What distinguishes?
Which is the reason for?
Which is the cause of?
What is the relationship between…and…?
Which is an example of the principle of…?
What would happen if?
What is the consequence of?
What is the cause of?
Which is the most or least important, significant, or effective…?
Which is better, worse, higher, lower, farther, nearer, heavier, lighter, darker etc.?
Which is most like, least like?
What is the difference between… and…?
What is a similarity…and…?
Which of the following principles best applies to…?
Which of the following procedures best applies to the problem of…?
What is the best way to?
How should one…?
Which is the best definition of?
What are the main symptoms of?
What is the most common cause or ‘symptom’ of a ‘patient problem’?
A patient's illness is diagnosed; which treatment is likely to be most effective?
Information is presented about a patient problem; how should the patient be treated?
Successfully performing items have the following in common: the type of cognitive behavior represented by the item is identified, and the content that the item tests is identified.

A series of item writing steps must be followed:

1. Identify the stem of a successfully performing test item.
2. Underline the key words or phrase representing the content of that item.
3. Identify variations for each key word or phrase.
4. Select an age, trauma, injury, or complication and the type of accident from personal experience; write the stem and the correct answer.
5. Write the required number of distractors, or as many plausible distractors as you can, with a limit of four.

In developing multiple-choice item models for clinical vignettes, several faceted dimensions exist.

The first facet includes the setting.
A) unscheduled patients/clinic visits
B) scheduled appointments
C) hospital rounds
D) emergency department

The second facet includes the tasks performed by the physician or acupuncturist:
A) Obtaining history and performing physical examinations.
B) Using laboratory diagnostic data
C) Formulating most likely diagnosis
D) Evaluating the severity of the patient’s problem
E) Managing the patients
F) Applying scientific concepts

The final facet of the vignette is the case cluster. This includes:
1A. Initial work-up of a new patient/new problem.
1B. Initial work-up of a known patient with a new problem.
2A. Continued care of a known patient/old problem.
2B. Continued care of a known patient/worsening old problem.
3. Emergency care.

Presenting a problem in professional training measures the problem-solving ability that is part of most professional practice. The professional in practice, when encountering a patient with a problem, must engage in a complex thought process that leads to successful resolution of the patient's problem. A testing procedure to engage this process, the key-feature problem, was developed in Canada. Unlike the previous item model, where many features are identified, the objective in the key-feature model is to identify those features that are most likely to discriminate among candidates with varying degrees of competence. A key-feature problem usually has a brief stem followed by several questions requesting actions from the candidate being tested. The test items may be short-answer or short-menu, which involves choosing an answer from a long list of possible right answers.

Steps in developing the key feature problems.

1. Define the domain of clinical problems to be sampled. The domain of problems consists of a list of patient complaints and a list of correct diagnoses. Note that the emphasis here is placed on defining the problems, complaints, and diagnoses clearly and specifically. For example, the problems might include near drowning, enuresis, dehydration, glomerulonephritis, adolescent diabetes, and/or foreign-body aspiration. Any resulting test is a representative sample from this domain.

2. Provide an examination blueprint. Once the domain is identified, test specifications, which typically help in selecting items for a test, are used in this instance to select the problems from the domain of clinical problems. They can refer to many relevant factors, such as medical specialty, body system, clinical setting, etc.

3. Present clinical situations. Each problem can be presented in various ways. There are reportedly five clinical situations that can be identified:
A) Undifferentiated problems or patient complaints.
B) A single typical or atypical problem.
C) A multiple problem or multiple system involvement.
D) Life threatening problem.
E) Preventative care and health promotion.

4. Select key features for each problem. A key feature is a critical step that will likely produce a variety of different choices by physicians; some of those choices will be good for the patient and some will not. Not all patient problems will necessarily have key features. A key feature must be difficult, or likely to produce a variety of effective and ineffective choices. Although a key feature may be identified by one expert, several subject matter experts have to agree about how critical it is. Key features vary from one to five per problem. Each key feature has initial information and an assigned task.


Problem 1: Four Associated Key Features
For a pregnant woman experiencing third-trimester bleeding with no abdominal pain, the physician should:
A) Generate placenta previa as a leading diagnosis
B) Avoid performing a pelvic examination, which may cause fatal bleeding
C) Avoid discharging the patient from an outpatient clinic or emergency department
D) Order coagulation tests and a cross-match

Problem 2: Three Associated Key Features
For an adult patient complaining of a painful, swollen leg, the physician should:
A) Include deep vein thrombosis in the differential diagnosis
B) Elicit risk factors for deep vein thrombosis through the patient's history
C) Order a venogram as the definitive test for deep venous thrombosis

5. Select or write a case scenario. Referring back to the five clinical situations stated in step 3, the developer of the problem selects the clinical situation and writes the scenario. The scenario contains all of the relevant information and includes several questions following the multiple-choice format guidelines.

VI. Develop scoring for the Results

Scoring keys are developed that have a single right answer or multiple right answers. In some instances, candidates select from a list in which some choices are correct and others incorrect. A subject-matter-expert committee develops the scoring key and scoring rules for each case scenario.
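
A scoring rule for a short-menu question with multiple right answers can be sketched as below. The aggregation shown (one point per correct selection, no penalty for incorrect picks, selections beyond the limit ignored) is an illustrative assumption, not the committee's actual rule, and the key is hypothetical.

```python
# Sketch: score a short-menu response against a key with multiple right
# answers. Scoring rule here is an assumption for illustration only.
def score_short_menu(selected, correct, max_select):
    picks = set(selected[:max_select])   # ignore selections beyond the limit
    return len(picks & set(correct))     # one point per correct selection

# Hypothetical key for a "select up to 7" history question
key = {"A", "F", "J", "L", "P", "X", "Z"}
print(score_short_menu(["A", "F", "C", "X"], key, max_select=7))
# → 3
```

In practice the committee's rules could instead weight certain key features more heavily or penalize dangerous choices; the point of the sketch is only that each case scenario needs an explicit, machine-applicable scoring rule.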

VII. Conduct Pilot Testing

In high stakes tests, such as national board examinations, pilot testing is critical. This information is used to validate the future use of the case in formal testing situations.

VIII. Set Standards

As with any high-stakes national board examination, a pass/fail decision standard should be set. A key-feature item example:

A 56-year-old male consults you in an outpatient clinic because of pain in his left leg, which began two days earlier and is getting progressively worse. He states that his leg is tender below the knee and around the ankle. He has never had similar problems, and his other leg is fine.

Question 1
What diagnosis would you consider? List up to three.
Question 2
With respect to your diagnosis, what elements of history would you particularly want to elicit? Select up to 7.
The multiple-choice answers are:
A) Activity at the onset of symptoms
B) Alcohol intake
C) Allergies
D) Angina pectoris
E) Anti-inflammatory therapies
F) Cigarette smoking
G) Color of stools
H) Headache
I) Hematemesis
J) Hormone therapy
K) Impotence
L) Intermittent claudication
M) Low back pain
N) Nocturia
O) Palpitations
P) Paresthesias
Q) Paroxysmal nocturnal dyspnea
R) Polydipsia
S) Previous knee problems
T) Previous back problem
U) Previous neoplasm
V) Previous urinary tract infection
W) Current dental procedure
X) Recent immobilization
Y) Recent sore throat
Z) Recent surgery
AA) Recent work environment
BB) Wounds on foot
CC) Wounds on hand
DD) Recent dental procedure

The benefits of the key-feature item are these. First, the patient problems are chosen on the basis of criticality; each is known to present a difficult, discriminating key feature that will differentiate among candidates for licensure of varying ability. Second, key-feature problems are short, so many can be administered; because reliability is a primary type of validity evidence, the key-feature items should be numerous and highly interrelated [or correlated]. Third, there is no restriction or limitation on the item format.