Reliability, on the other hand, is not at all concerned with intent; it asks whether the test used to collect data produces accurate results. In this context, accuracy is defined by consistency (whether the results could be replicated).

Because reliability ignores intent, an instrument can be simultaneously reliable and invalid.

Returning to the push-up example, if we measure the number of push-ups the same students can do every day for a week (which, it should be noted, is not long enough to significantly increase strength) and each person does approximately the same number of push-ups each day, the test is reliable. But, clearly, the reliability of these results still does not render the number of push-ups per student a valid measure of intelligence.
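One common way to put a number on this kind of consistency is to correlate the scores from two administrations of the same test. The sketch below is a minimal illustration, not a method taken from the article: the push-up counts are invented, and `pearson_r` is a hypothetical helper written here for the example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from each mean.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return cov / sqrt(var_x * var_y)


# Hypothetical push-up counts for the same five students on two days.
day_1 = [10, 25, 18, 32, 7]
day_2 = [11, 24, 19, 30, 8]

test_retest_r = pearson_r(day_1, day_2)
print(f"test-retest correlation: {test_retest_r:.2f}")
```

A value close to 1 suggests the students rank in nearly the same order on both days, which is evidence of reliability only. It says nothing about whether push-ups are a valid measure of intelligence.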

Because reliability does not concern the actual relevance of the data in answering a focused question, validity will generally take precedence over reliability. Moreover, schools will often assess two levels of validity:

  1. the validity of the research question itself in quantifying the larger, generally more abstract goal

  2. the validity of the instrument chosen to answer the research question

See the diagram below as an example:


The validity of an instrument is the idea that the instrument measures what it intends to measure.

Validity pertains to the connection between the purpose of the research and which data the researcher chooses to quantify that purpose.

For example, imagine a researcher who decides to measure the intelligence of a sample of students. Some measures, like physical strength, possess no natural connection to intelligence. Thus, a test of physical strength, like how many push-ups a student could do, would be an invalid test of intelligence.

Differences Between Validity and Reliability

When creating a question to quantify a goal, or when deciding on a data instrument to secure the results to that question, two concepts are widely regarded by researchers as being of paramount importance.

These two concepts are called validity and reliability, and they refer to the quality and accuracy of data instruments.

Schools all over the country are beginning to develop a culture of data, which is the integration of data into the day-to-day operations of a school in order to achieve classroom, school, and district-wide goals. One of the biggest difficulties that comes with this integration is determining what data will provide an accurate reflection of those goals.

Such considerations are particularly important when the goals of the school aren’t put into terms that lend themselves to cut-and-dried analysis; school goals often describe the improvement of abstract concepts like “school climate.”

Schools interested in establishing a culture of data are advised to come up with a plan before going off to collect it. They need to first determine what their ultimate goal is and what achievement of that goal looks like. An understanding of the definition of success allows the school to ask focused questions to help measure that success, which may be answered with the data.

For example, if a school is interested in increasing literacy, one focused question might ask: which groups of students are consistently scoring lower on standardized English tests? If a school is interested in promoting a strong climate of inclusiveness, a focused question may be: do teachers treat different types of students unequally?

These focused questions are analogous to research questions asked in academic fields such as psychology, economics, and, unsurprisingly, education. However, the question itself does not always indicate which instrument (e.g. a standardized test or a student survey) is optimal.

If the wrong instrument is used, the results can quickly become meaningless or uninterpretable, thereby rendering them inadequate for determining a school’s standing in or progress toward its goals.


On the other hand, extraneous influences relevant to other agents in the classroom could affect the scores of an entire class.

If the grader of an assessment is sensitive to external factors, the grades they assign may reflect this sensitivity, making the results unreliable. Assessments that go beyond cut-and-dried responses place a responsibility on the grader to review the consistency of their results.

Some of this variability can be resolved through the use of clear and specific rubrics for grading an assessment. Rubrics limit the ability of any grader to apply normative criteria to their grading, thereby controlling for the influence of grader biases. However, rubrics, like tests, are imperfect tools and care must be taken to ensure reliable results.
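One way to check whether a rubric is actually producing consistent grades is to have two graders score the same set of work independently and measure how often they agree beyond what chance would predict. The sketch below uses Cohen's kappa, a standard agreement statistic that the article itself does not name, so its use here is an assumption. The rubric scores are invented, and `cohens_kappa` is a hypothetical helper written for this example.

```python
from collections import Counter


def cohens_kappa(grader_a, grader_b):
    """Cohen's kappa for two graders scoring the same items on the same scale."""
    if len(grader_a) != len(grader_b) or not grader_a:
        raise ValueError("both graders must score the same non-empty set of items")
    n = len(grader_a)
    # Observed agreement: share of items given the same score by both graders.
    observed = sum(a == b for a, b in zip(grader_a, grader_b)) / n
    # Chance agreement: based on how often each grader uses each score.
    counts_a = Counter(grader_a)
    counts_b = Counter(grader_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when both graders use one identical score")
    return (observed - expected) / (1 - expected)


# Hypothetical rubric scores (1-4) given by two graders to the same eight essays.
grader_a = [4, 3, 3, 2, 4, 1, 2, 3]
grader_b = [4, 3, 2, 2, 4, 1, 3, 3]

kappa = cohens_kappa(grader_a, grader_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

A kappa of 1 means the graders agree on every item and 0 means they agree no more often than chance. A low value may suggest that the rubric leaves too much room for individual interpretation.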

How does one ensure reliability? Measuring the reliability of assessments is often done with statistical computations.

The three measurements of reliability discussed above all have associated coefficients that standard statistical packages will calculate. However, schools that don’t have access to such tools shouldn’t simply throw caution to the wind and abandon these concepts when thinking about data.

Schillingburg advises that at the classroom level, educators can maintain reliability by:

  • Creating clear instructions for each assignment

  • Writing questions that capture the material taught

  • Seeking feedback regarding the clarity and thoroughness of the assessment from students and colleagues.

With such care, the average test given in a classroom will be reliable. Moreover, if any errors in reliability arise, Schillingburg notes that class-level decisions based on unreliable data are generally reversible; for example, assessments found to be unreliable may be rewritten based on the feedback provided.

However, reliability, or the lack thereof, can create problems for larger-scale projects, as the results of these assessments generally form the basis for decisions that could be costly for a school or district to either implement or reverse.


