Concepts of Validity and Reliability in Research


Introduction

Validity refers to the degree to which scores from a measurement represent the variable they are intended to measure. Reliability, in turn, denotes the consistency of a measure or test. These two concepts go hand in hand in research and contribute to the credibility of an investigation. This paper describes four types of validity and four types of reliability as they apply to research.

Types of Validity

Face Validity

Face validity is the extent to which a measure appears, at face value, to assess what it is intended to assess. It is judged by examining whether the measures used in a project look relevant to the concept under study. For example, it is expected that qualitative research on self-esteem should include questions about the interviewee’s perception of their self-worth. Consequently, a questionnaire containing such questions is likely to have good face validity. In contrast, attributes such as the length of toes appear irrelevant in matters to do with self-esteem, resulting in poor face validity. In most cases, face validity is determined informally.

Face validity is not a strong indicator that a measure is performing as intended because it rests on human judgment and intuition, which are not always right. Indeed, many well-known measures in the social sciences work well even though they lack face validity.

Content Validity

Content validity determines the degree to which a measure encompasses the concept under investigation (Macdonald, 2015). It considers factors such as measuring the correct things using an appropriate number of samples. For example, the sample size should be statistically representative of the population. Additionally, the correct target groups should be used in research, such as specific age groups or subjects possessing specific characteristics. Content validity is related to the design of an experiment. When using questions to obtain specific data from respondents in a study, a question with high content validity should elicit answers that directly address the research question.

Construct Validity

Construct validity is described as the degree to which hypothetical concepts of cause and effect truthfully correspond to the real-life scenarios they are meant to represent (Gumley, Taylor, Schwannauer, & MacBeth, 2014). Well-designed experiments should convert theoretical ideas into tangible, measurable things. Therefore, construct validity evaluates the quality of the experimental design. Experiments that lack construct validity are likely to lead to wrong conclusions. Four further concepts are used in assessing construct validity: convergent validity, discriminant validity, the nomological network, and the multi-trait multimethod matrix (MTMM).

Convergent validity is present when measures of constructs that are supposed to correlate agree with one another. Conversely, discriminant validity is present when there is no correlation between constructs that are supposed to dissociate, thereby facilitating their differentiation (Henseler, Ringle, & Sarstedt, 2015). The nomological network elucidates associations between constructs and between subsequent measures. There should be a connection between the measures and resultant observations. MTMM uses various approaches such as surveys, tests, and observations to quantify similar attributes and presents the correlations in a matrix to ascertain construct validity.
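
The contrast between convergent and discriminant validity can be illustrated with pairwise correlations. The sketch below is a minimal illustration only: the three measures, their names, and all scores are invented for this example, and the Pearson coefficient is assumed to be an acceptable index of agreement.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for six respondents (invented for illustration only).
measures = {
    "self_esteem_survey": [10, 14, 18, 22, 26, 30],
    "self_esteem_interview": [11, 13, 19, 21, 27, 29],
    "shoe_size": [41, 39, 40, 40, 39, 41],
}

# Correlate every pair of measures, as in a small correlation matrix.
correlations = {
    (a, b): pearson(measures[a], measures[b])
    for a, b in combinations(measures, 2)
}

for (a, b), r in correlations.items():
    print(f"{a} vs. {b}: r = {r:.2f}")
```

In this invented data set, the two self-esteem measures correlate strongly, which would be consistent with convergent validity, while neither correlates with the unrelated attribute, which would be consistent with discriminant validity.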

Criterion Validity

Criterion validity measures the extent to which results on a measure are related to other parameters with which they would ordinarily be expected to have a relationship (Heale & Twycross, 2015). These parameters are known as criteria and can be anything that is thought to have a relationship with the construct of interest (Cheung & Lucas, 2014).

For example, it is presumed that anxiety has a negative effect on performance in most cases. Therefore, in a study that measures anxiety and student scores in a test, it would be expected that anxiety scores would have a negative correlation with test scores. If an investigation revealed that high anxiety scores corresponded to low test scores, the outcomes would be an indication that the anxiety scores were a true representation of the levels of anxiety. However, if the outcomes were the other way around, the validity of the anxiety measure would be questionable. Concurrent validity applies when the criterion and construct are measured simultaneously. In contrast, if the criterion is assessed after the measurement of the construct, the resultant criterion validity is referred to as predictive validity.
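
The anxiety example can be checked numerically by correlating the measure with its criterion. The following is a minimal sketch under stated assumptions: the scores are invented for illustration, and the Pearson coefficient is assumed to be a suitable statistic for the relationship.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for six students (invented for illustration only).
anxiety_scores = [12, 18, 25, 30, 34, 40]  # the measure being validated
test_scores = [88, 84, 75, 70, 62, 55]     # the criterion

criterion_r = pearson(anxiety_scores, test_scores)
print(f"anxiety vs. test score: r = {criterion_r:.2f}")
```

A clearly negative coefficient, as these invented numbers produce, would match the expected relationship and support the criterion validity of the anxiety measure. A coefficient near zero or above zero would call it into question.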

Types of Reliability

Inter-Rater or Inter-Observer Reliability

This type of reliability determines the level of agreement between scores or measures produced by different people. It is also referred to as inter-coder reliability (Lampe, Mulder, Colins, & Vermeiren, 2017). Using humans during the measurement procedure poses a risk of inconsistencies in the data, which is attributed to human error and predisposition to distraction, particularly when doing repetitive tasks.

Therefore, it is necessary to establish inter-rater reliability in a setting or context that is different from the one under investigation. This process is akin to “standardizing” the raters. In such a case, each rater is required to rate a specified number of entities. Inter-rater reliability is then determined by computing the percentage agreement between the raters. When handling continuous variables, inter-observer reliability can be determined by establishing the correlation between the scores of the raters.
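
Percentage agreement is simple enough to compute by hand, as the sketch below shows. The two raters, the category labels, and the ratings are hypothetical and serve only to illustrate the calculation.

```python
def percent_agreement(rater_a, rater_b):
    """Percentage of entities to which both raters assigned the same rating."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must rate the same entities")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)


# Hypothetical ratings of eight observed episodes (invented for illustration).
rater_a = ["aggressive", "calm", "calm", "aggressive",
           "calm", "calm", "aggressive", "calm"]
rater_b = ["aggressive", "calm", "aggressive", "aggressive",
           "calm", "calm", "calm", "calm"]

agreement = percent_agreement(rater_a, rater_b)
print(f"percentage agreement: {agreement:.1f}%")  # 6 of 8 match: 75.0%
```

For continuous ratings, the same pair of lists would instead be passed to a correlation function, as described above.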

Test-Retest Reliability

Test-retest reliability is a way of determining the external consistency of a test. It is assessed by administering the same test to the same group of respondents at different times. The key assumption is that the concept being measured does not undergo substantial changes in the course of the time gap. However, the time between the two sets of measurements plays a crucial role because the likelihood of disparities is higher when the time gap between the two measurements increases, and vice versa (Polit, 2014). Therefore, the correlation between measures is higher when a shorter time has elapsed than after a longer interval.

Additionally, when measurements are taken closer in time, fewer factors contribute to discrepancies. Some of the common factors that can affect a subject’s response to a test include the time of day, mood, and food intake among others. A test with good test-retest reliability should manage such factors without leading to substantial variations.
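
Test-retest reliability is commonly summarized as the correlation between the two administrations. The sketch below assumes a Pearson coefficient and uses invented scores for the same six respondents. It is an illustration, not a prescribed procedure.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores, same respondents in the same order (invented data).
scores_time_1 = [21, 35, 28, 40, 17, 31]  # first administration
scores_time_2 = [23, 33, 29, 38, 19, 30]  # second administration

retest_r = pearson(scores_time_1, scores_time_2)
print(f"test-retest correlation: r = {retest_r:.2f}")
```

A coefficient close to 1 indicates that respondents kept roughly the same relative standing across the two occasions.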

Parallel-Forms Reliability

Parallel-forms reliability helps determine the most appropriate test to use to evaluate a given concept. It is assessed by administering two tests that target the same concept to the same subjects on separate occasions (Yarnold, 2014). The test that yields consistent outcomes is considered the best tool.

Internal Consistency Reliability

Internal consistency reliability appraises single questions and compares them with one another for their capacity to produce correct outcomes consistently (Heale & Twycross, 2015). The set of questions (or measurement tool) is administered to a group of people to evaluate its dependability. The reliability of the instrument is judged by assessing the degree to which items that indicate the same concept produce similar outcomes. Different internal consistency measures can be used, including average inter-item correlation, average item-total correlation, and split-half correlation. Average inter-item correlation compares all pairs of questions that appraise the same concept by estimating the average of all pairwise correlations.

Conversely, average item-total correlation computes a total score across all items for each respondent, correlates each item with that total, and determines the average of these item-total correlations. Split-half correlation separates items that gauge the same concept into two tests, applies them to the same group of people, and finds the correlation between the two sets of total scores.
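
The three internal consistency measures can be sketched side by side. The example below is illustrative only: the four items and six respondents are invented, the item-total correlation is the uncorrected form (each item is included in its own total), and the split into odd and even items is one assumed way of forming the two halves.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical responses on a 1-5 scale (invented for illustration only).
# One row per item, one column per respondent.
items = [
    [4, 2, 5, 3, 1, 4],
    [5, 2, 4, 3, 2, 4],
    [4, 1, 5, 2, 1, 5],
    [3, 2, 4, 3, 2, 5],
]

# Average inter-item correlation: mean of all pairwise item correlations.
pair_rs = [pearson(a, b) for a, b in combinations(items, 2)]
avg_inter_item = sum(pair_rs) / len(pair_rs)

# Average item-total correlation: each item against the respondents' totals.
totals = [sum(column) for column in zip(*items)]
avg_item_total = sum(pearson(item, totals) for item in items) / len(items)

# Split-half correlation: total of odd-numbered items vs. even-numbered items.
half_a = [sum(column) for column in zip(*items[0::2])]
half_b = [sum(column) for column in zip(*items[1::2])]
split_half = pearson(half_a, half_b)

print(f"average inter-item correlation: {avg_inter_item:.2f}")
print(f"average item-total correlation: {avg_item_total:.2f}")
print(f"split-half correlation:         {split_half:.2f}")
```

Higher values on all three indices suggest that the items are tapping the same underlying concept.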

References

Cheung, F., & Lucas, R. E. (2014). Assessing the validity of single-item life satisfaction measures: Results from three large samples. Quality of Life Research, 23(10), 2809-2818.

Gumley, A. I., Taylor, H. E. F., Schwannauer, M., & MacBeth, A. (2014). A systematic review of attachment and psychosis: Measurement, construct validity and outcomes. Acta Psychiatrica Scandinavica, 129(4), 257-274.

Heale, R., & Twycross, A. (2015). Validity and reliability in quantitative studies. Evidence-Based Nursing, 18(3), 66-67.

Henseler, J., Ringle, C. M., & Sarstedt, M. (2015). A new criterion for assessing discriminant validity in variance-based structural equation modeling. Journal of the Academy of Marketing Science, 43(1), 115-135.

Lampe, K. G., Mulder, E. A., Colins, O. F., & Vermeiren, R. R. (2017). The inter-rater reliability of observing aggression: A systematic literature review. Aggression and Violent Behavior, 37, 12-25.

Macdonald, S. (2015). Essentials of Statistics with SPSS (3rd ed.). Morrisville, NC: Lulu.com.

Polit, D. F. (2014). Getting serious about test-retest reliability: A critique of retest research and some recommendations. Quality of Life Research, 23(6), 1713-1720.

Yarnold, P. R. (2014). How to assess the inter-method (parallel-forms) reliability of ratings made on ordinal scales: Emergency severity index (version 3) and Canadian triage acuity scale. Optimal Data Analysis, 3(4), 50-54.