Achievement Tests



An achievement test (e.g., a bar exam) is an assessment of one or more persons' knowledge or skills, typically during a specific period of time. (In the remainder of this entry, "test" refers only to an achievement test.) In addition to evaluating individuals' competences, tests can assess (a) their learning, (b) teachers' impact on their learning, and (c) the school or school system's effects on their learning. Testing can also influence individuals' behaviors before, during, and after the exam. Designing suitable tests requires creating specific test questions (or test items) and testing them with advanced statistics. Country, family, school system, school, schoolmate, teacher, and individual characteristics all influence achievement test scores. Likewise, achievement test scores are linked to future academic performance, graduation rates, job status, and income.


Beginning with China's imperial civil service exam (or Keju in Chinese) in 206 b.c.e., people have used tests to select among candidates based on their knowledge rather than on favoritism, nepotism, or bribery. As tests typically enable the selection of some candidates with less wealth or social status, they highlight the system's openness to others beyond a closed elite. Hence, this test system encourages candidates to view it as fair and based on merit, resulting in its greater legitimacy.

Tests can assess a person's learning during the time between a pretest and a posttest (posttest score minus pretest score). To estimate the impact of teachers on student learning, evaluators can use the differences in students' annual test scores in a large, longitudinal data set of many teachers and students across many years, in an analysis that controls for the possible effects of the characteristics of students and their families. Similarly, cities and countries can use annual student tests to evaluate the effectiveness of their schools. Without extensive, longitudinal data and controls, however, estimates of the effectiveness of teachers, schools, and school systems can be biased and misleading.
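The gain-score idea above can be sketched in a few lines of code. This is a minimal, hypothetical illustration with invented data: it computes each student's gain (posttest minus pretest) and a naive per-teacher average gain. Real value-added models instead regress current scores on prior scores and student and family covariates across many years, as the paragraph notes.

```python
# Hypothetical sketch of gain scores and a naive per-teacher average gain.
# Data and teacher labels are invented for illustration; real value-added
# analyses control for student and family characteristics longitudinally.
from collections import defaultdict

students = [
    # (teacher, pretest, posttest)
    ("A", 52, 61),
    ("A", 48, 55),
    ("B", 50, 54),
    ("B", 47, 50),
]

def gain(pre, post):
    """Gain score: posttest score minus pretest score."""
    return post - pre

gains = defaultdict(list)
for teacher, pre, post in students:
    gains[teacher].append(gain(pre, post))

# Naive "value added": average gain of each teacher's students.
naive_value_added = {t: sum(g) / len(g) for t, g in gains.items()}
print(naive_value_added)  # {'A': 8.0, 'B': 3.5}
```

Without controls, such averages confound teacher effects with student intake, which is exactly the bias the paragraph warns about.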

Testing can influence individual behaviors before, during, and after the exam. When people are informed in advance about an exam whose result has consequences for them, they are more likely to prepare for the exam (by studying, practicing, etc.) and, thereby, perform better than otherwise. During a test, people who are concerned about its consequences or unfamiliar with its format might feel test anxiety and thus perform worse than otherwise. After an exam, its results can provide useful feedback about a person's performance and inform a plan for further study or instruction, thereby improving future performance. Thus, testing itself can change people's behaviors.

Creating Tests

Creating tests that accurately assess knowledge or skills requires selecting content appropriate to the target population and purpose, and then designing a suitable test for subsequent analysis. As most knowledge or skills are specific to a domain (e.g., geometry), test designers must select the content that they will assess. Ideally, the targeted content is a coherent, integrated set of knowledge and skills that are intimately related to one another (not just a list of disparate ideas and behaviors). If a test covers coherent content, its score can support meaningful interpretation of a person's competence in that content area.

Tests are also designed for specific populations and purposes. For example, a high school graduation test in biology focuses on basic, general concepts. In contrast, a biology test used to award college scholarships has more test items about advanced ideas and their relationships.

Last, a midterm biology test with open, specific questions about the human respiratory system can help a high school teacher assess students' understanding and inform her or his teaching of the human circulatory system to them. Hence, a test must suit both the population and the purpose.

Test designers aim to create tests that can be graded fairly and consistently at low cost—typically, analytic tests (rather than holistic tests) with objectively evaluated test items. Holistic tests ask students to address a major problem or question (e.g., What causes climate change?) to assess participants' executive skills (i.e., their ability to plan, organize, integrate, etc.). However, evaluators might not agree on a single score for a participant's holistic test, which can raise questions about its legitimacy. Hence, holistic tests are often scored along multiple dimensions (e.g., content, organization), using rubrics with exemplars for each score along each dimension and with explicit boundaries between adjacent scores.

Analytic tests have separate items covering different, specific knowledge (e.g., questions requiring short answers, true/false items, multiple-choice questions, matching answers to questions, etc.). If these analytic test items have clear, objective answers that can be unambiguously evaluated as correct or incorrect, they allow stakeholders to view and agree on the fair evaluation of participant responses, thereby enhancing a test's transparency and legitimacy. Short-answer questions, true/false items, and matching have critical weaknesses compared with multiple-choice questions. Evaluators might not know how to score unexpected short answers. Meanwhile, students who do not know the answer to a true/false question have a 50-percent probability of guessing the correct answer, but only a 20-percent probability for a multiple-choice question with five possible answers. Hence, each multiple-choice question is more likely than a true/false item to distinguish among students of different competences. Meanwhile, multiple-choice questions often have several choices that are nearly correct, but matching problems cannot include a comparable number of such answers without severely taxing participants' short-term memories.
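The 50-percent versus 20-percent guessing advantage compounds sharply across a whole test. The sketch below models pure guessing as a binomial distribution and compares the chance of "passing" a hypothetical 10-item test (6 or more correct, an invented threshold) on true/false versus five-option multiple-choice items.

```python
# Illustration of the guessing probabilities described above, using a
# binomial model. The 10-item test and 6-correct passing threshold are
# invented for this example.
from math import comb

def p_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): guessing at least k of n items."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance of guessing a single item correctly:
# true/false: 1/2 = 0.5; five-option multiple choice: 1/5 = 0.2.

# Chance of passing (6+ of 10 correct) by guessing alone:
print(round(p_at_least(6, 10, 0.5), 3))  # ~0.377 for true/false
print(round(p_at_least(6, 10, 0.2), 4))  # ~0.0064 for five-option items
```

A pure guesser passes the true/false version about 38 percent of the time but the five-option version well under 1 percent of the time, which is why multiple-choice items distinguish competence levels more reliably.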

Tests must be inexpensive to design, administer, and evaluate. While holistic tests are easy to design and administer, they are costly to evaluate, requiring extensive time to prepare a rubric, to train evaluators, and for them to grade the tests. In contrast, short-answer and multiple-choice questions require extensive time to design and slightly more time and cost to produce and administer, but they can be evaluated quickly and correctly, especially multiple-choice questions, when using computers. For a teacher assessing a classroom of students, a holistic test can yield more information than tests with only short-answer or multiple-choice questions, at low design and administration costs and at tolerable evaluation costs. However, tests with short-answer or multiple-choice questions are preferable when evaluating large populations (e.g., for school entrance exams). The remainder of this entry focuses on multiple-choice tests.

Test designers aim to create a bank of multiple-choice test items that cover the target content, range in difficulty, and are of high quality. Each test item evaluates a person's knowledge of specific target content. Typically, each test item has one correct answer, and the other choices receive no credit (in some tests, some choices can receive partial credit). Furthermore, each test item has a specific level of difficulty. Last, high-quality items distinguish reliably between participants who are above and those who are below a specific level of competence (i.e., those who can vs. those who cannot answer the questions correctly). Ineffective, low-quality test items might be misunderstood, too easy, too hard, misleading, or too easy to guess correctly.

To evaluate the quality of the test items, they are bundled into multiple tests and administered to people, whose responses are assessed. Each pair of tests has common test items (anchors) that allow scores on all tests to be calibrated to the same scoring scale. The people selected to take these preliminary tests should have competences similar to the target test population's range of competences. For example, new items on tests like the ACT (originally, American College Testing) or the SAT (Scholastic Aptitude Test) are introduced as experimental sections on tests given to current students.
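The anchor-item calibration described above can be illustrated with the simplest equating method, mean equating: shift one form's scores so that the shared anchor items have the same average on both forms. The numbers below are invented, and operational programs like the ACT and SAT use item response theory calibration rather than this simple shift.

```python
# Hypothetical sketch of mean equating with common (anchor) items.
# Anchor averages are invented proportion-correct values; real testing
# programs calibrate forms with item response theory models instead.

anchor_mean_a = 0.62   # anchors' average score among form A's examinees
anchor_mean_b = 0.55   # same anchor items among form B's examinees

# Form B's examinees scored lower on identical items, so form B scores
# are shifted up to land on form A's scale.
shift = anchor_mean_a - anchor_mean_b

def to_form_a_scale(form_b_score):
    """Place a form-B proportion-correct score on form A's scale."""
    return form_b_score + shift

print(round(to_form_a_scale(0.70), 2))  # 0.77
```

Because the anchors are the same physical items, any difference in their averages reflects the examinee groups, not the items, which is what licenses the adjustment.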

Advanced statistical analyses of test responses estimate the competence of each participant and the attributes of each test item. The competences of the participants indicate whether the test items (or a subset of them) cumulatively serve their function of distinguishing participants from one another along a single scale. For example, a scholarship test that results in high scores for most participants is too easy, so the easy test items should be dropped or redesigned to be more difficult.

The estimated attributes of each test item show its relative alignment with the target content, its difficulty level, its quality, its likelihood of guessing success, and its bias against subsamples of participants (through factor analyses, item response theory analysis, and differential item functioning analysis). First, the analysis determines whether the test items reflect one or more underlying target content competences. If most test items align along one competence, with a few items aligning along other competences, the latter items likely assess irrelevant competences and are discarded or revised (another possibility is that the target content requires substantial reconsideration). Second, items that are much easier or harder than expected are recategorized, revised, or discarded. Third, high-quality items are retained, and low-quality items are revised or discarded. Fourth, test items with high rates of guessing success by low-competence participants are revised or discarded. Last, among subgroups of participants with similar competence estimates (e.g., males vs. females; Asians vs. Latinos), test items that are much easier for one group than for another are revised, discarded, or tagged for use with only homogeneous samples (e.g., only females).
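Difficulty, discrimination (quality), and guessing success correspond to the parameters of a standard item response model. The sketch below, with invented parameter values, shows the three-parameter logistic (3PL) item characteristic curve: the probability that a person of ability theta answers an item correctly, given the item's discrimination a, difficulty b, and guessing floor c.

```python
# Illustrative 3PL item response model with invented parameters:
#   a = discrimination (item quality), b = difficulty,
#   c = guessing floor (e.g., 0.2 for a five-option item).
from math import exp

def p_correct(theta, a, b, c):
    """3PL item characteristic curve: P(correct | ability theta)."""
    return c + (1 - c) / (1 + exp(-a * (theta - b)))

# A sharp, medium-difficulty five-option item across three ability levels:
for theta in (-2, 0, 2):
    print(round(p_correct(theta, a=1.5, b=0.0, c=0.2), 3))
```

A high-quality item shows a steep rise in this curve near its difficulty level, cleanly separating examinees above and below that competence; a flat curve, or a curve that differs across demographic subgroups at equal ability, flags the item for revision or removal.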

Influences On Test Scores

Country, family, school, and individual characteristics influence test scores. Countries that are richer or more equal have higher test scores. People in countries with higher real gross domestic product per capita (e.g., Japan) often capitalize on their country's greater resources to learn more and score higher on tests. Furthermore, countries with a less equal distribution of family income often experience diminishing marginal returns; a poor student likely learns more from an extra book than a rich student would. Thus, in more equal countries (e.g., Norway), poorer students often have more resources and benefit more from them than richer students, resulting in higher achievement and test scores overall in these countries.

Some family members (e.g., parents) provide family resources, but others (e.g., siblings) compete for them. Children in families with more human capital (e.g., education), financial capital (wealth), social capital (social network resources), and cultural capital (knowledge of the dominant culture and society) often use these resources to learn more. When a person has more siblings (especially older ones), they compete for these limited resources, resulting in less use of shared resources, less learning, and lower test scores (resource dilution).

Students from privileged families often attend schools with privileged schoolmates, large budgets, and effective teachers. Privileged schoolmates' family capital, material resources, diverse experiences, and high academic expectations often help a student learn more and score higher on tests. These schools often have larger budgets, better physical conditions, and more educational materials, which can improve their students' learning and test scores compared with those of other schools. Students from privileged families often benefit from attending schools with higher teacher-to-student ratios and better-qualified teachers. Superior teachers often maintain better student discipline and better relationships with their students—both of which are linked to higher student achievement.

Studies of school competition show mixed results. Some natural experiments suggest that in schools facing greater competition (there are more schools in some districts because of natural phenomena such as rivers), students have higher test scores. When school closures are anticipated or announced, their students have lower test scores, but surviving schools show higher test scores. Meanwhile, studies of school choice and of traditional versus charter schools show mixed results.

Student genes, cognitive ability, gender, attitudes, motivation, and behaviors also influence test scores. Genes contribute to student cognitive ability and test scores, but studies of separated twins and siblings suggest that genetics account for less than 15 percent of the differences in people's test scores. Girls outperform boys on school tests at every age level in nearly every subject, in part because girls have better attitudes toward school, feel a greater sense of belonging at school, are more motivated, attend school more regularly, study more, and suffer fewer behavioral discipline problems—all of which are linked to higher test scores.

Girls also outperform boys on standardized reading tests, but boys score higher on standardized mathematics tests. This latter result stems from school tests' ceiling effects on boys with high mathematics ability and from girls having greater test anxiety during consequential, standardized tests.


Consequences Of Test Scores

Test scores are linked to future test scores, graduation rates, further study, and better jobs. People with higher test scores tend to score higher on future tests (with some regression to the mean). As graduation and further study largely depend on academic performance, students with high test scores are more likely to graduate from school and more likely to pursue higher degrees. Those who graduate from college or have advanced degrees have higher-status jobs and earn higher incomes. However, these relationships weaken over time. For example, people with high mathematics test scores one year are likely to have high mathematics test scores the next year too; however, they are only somewhat more likely to earn more than others in 10 years' time.


