Is Retirement Good for Your Health? A Systematic Review
Methods
Search Strategy and Study Selection
For the present review, two authors (RR and SR) searched PubMed, Embase and Web of Science for peer-reviewed publications up to November 11, 2013. The keywords referred to the exposure (retirement), the outcome (health-related) and the study design (longitudinal designs) (see Table 1). Only studies published in English were included. Based on title and abstract, two reviewers (RR and SR) independently selected articles for full-text analysis. For final inclusion, articles had to fulfil all of the following criteria: I) the study had to use either a prospective or a retrospective longitudinal design; II) the study had to involve a non-patient population that had not retired because of health problems and was not receiving a disability pension; III) the study had to report on generic measures of health, such as mental health, perceived general health or physical health, before and after retirement. Consequently, studies that merely compared retirees with a control group, such as those of Bonsang and colleagues and of Behncke, were excluded from this review. Disagreements were resolved by consensus. Finally, the reference lists of all included studies were checked for other potentially relevant articles.
Data Extraction and Quality Assessment
One reviewer (IH) extracted the relevant data from the selected publications. The study characteristics extracted were target population (setting, age, sex), sample size, follow-up duration, assessment of retirement, type and measure of health outcomes, and key findings. Where the extracted data were uncertain, a second reviewer (RR) was consulted. Pairs of authors (drawn from IH, KP and RR) independently scored the quality of each study according to a standardised set of 14 predefined criteria (Table 2). These criteria distinguished between informativeness (I, n = 4) and validity/precision (V/P, n = 10). Each quality criterion was rated as positive (+), negative (-), or unknown (?), as clarified in Table 2. If the description of an item was unclear or incomplete, a question mark was assigned and the first author of the publication was contacted by e-mail to obtain additional information. Scoring agreement was expressed as a percentage of the total number of items scored (n = 308). Disagreements in scores between reviewers were resolved in a consensus meeting. If agreement could not be reached after discussion, a third author (KP or RR) was consulted to reach a final conclusion. The total quality score was the number of validity/precision (V/P) items scored positively. Studies with a minimum of six points (>50%) were regarded as high quality.
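As an illustration of the scoring rules described above, the following Python sketch counts the positive ratings on the ten validity/precision items, applies the six-point cut-off, and computes percentage agreement between two reviewers. The function names, the list-based input format and the example ratings are assumptions made for illustration only; they are not taken from the review or from Table 2.

```python
N_VP_ITEMS = 10          # number of validity/precision criteria (from the text)
HIGH_QUALITY_CUTOFF = 6  # minimum number of positive V/P items (from the text)


def quality_score(vp_ratings):
    """Count V/P items rated '+'. Items rated '-' or '?' add nothing."""
    if len(vp_ratings) != N_VP_ITEMS:
        raise ValueError(f"expected {N_VP_ITEMS} V/P ratings")
    if any(r not in ("+", "-", "?") for r in vp_ratings):
        raise ValueError("ratings must be '+', '-' or '?'")
    return sum(1 for r in vp_ratings if r == "+")


def is_high_quality(vp_ratings):
    """A study is high quality when it scores at least six V/P points."""
    return quality_score(vp_ratings) >= HIGH_QUALITY_CUTOFF


def percent_agreement(reviewer_a, reviewer_b):
    """Share of items (in %) that two reviewers rated identically."""
    if len(reviewer_a) != len(reviewer_b) or not reviewer_a:
        raise ValueError("both reviewers must rate the same, non-empty item list")
    same = sum(1 for a, b in zip(reviewer_a, reviewer_b) if a == b)
    return 100.0 * same / len(reviewer_a)


# Hypothetical ratings for one study (not real data from the review).
example_vp = ["+", "+", "+", "+", "+", "+", "-", "-", "-", "?"]
example_score = quality_score(example_vp)
example_high = is_high_quality(example_vp)
```

In this sketch an unknown rating (?) is treated like a negative one when counting points, which matches the rule that only positively scored items contribute to the total.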
Data Analysis
The data collected from the included studies were pooled when possible, that is, when the studies were sufficiently homogeneous and data were available from three or more studies. Homogeneity was assessed based on the type of health outcome and the type of measure of this health outcome. Pooling was done by calculating the mean difference (SD) based on percentages (percentage before retirement minus percentage after retirement) and by calculating the effect size (mean difference/SD). The 95% confidence intervals around the mean differences were calculated based on t-distributions. For studies on perceived general health, only the prevalence of good general health and of poor general health were included in the pooling (the prevalence of average health or an equivalent was not included). Evidence from all included studies was summarised using a best evidence synthesis, based on results from both high- and low-quality studies. The best evidence synthesis consists of three levels:
Strong evidence: consistent findings in multiple (≥ 2) high-quality studies;
Moderate evidence: consistent findings in one high-quality study and at least one low-quality study, or consistent findings in multiple low-quality studies;
Insufficient/conflicting evidence: only one study available/inconsistent findings in multiple (≥ 2) studies.
Results of the studies reporting on a particular relationship were considered consistent when at least 75% of the study results were in the same direction, as defined by p < 0.05.
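The pooling step and the best evidence synthesis described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes that the confidence interval is computed as mean ± t × SD/√n with n − 1 degrees of freedom, and that each study result is coded as one direction label ("improved", "worsened" or "no_change"), with significance at p < 0.05 decided beforehand. It also assumes that the levels are assigned by counting the high-quality studies that share the dominant direction. The function names, labels and example numbers are hypothetical.

```python
from collections import Counter
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def pool_differences(before_pct, after_pct):
    """Pool per-study changes in prevalence (before minus after, in points)."""
    if len(before_pct) != len(after_pct):
        raise ValueError("need one before and one after value per study")
    n = len(before_pct)
    if n < 3:
        raise ValueError("pooling requires data from three or more studies")
    if n - 1 not in T_CRIT_95:
        raise ValueError("extend T_CRIT_95 for this number of studies")
    diffs = [b - a for b, a in zip(before_pct, after_pct)]
    m, sd = mean(diffs), stdev(diffs)
    half_width = T_CRIT_95[n - 1] * sd / sqrt(n)  # assumed CI formula
    return {
        "mean_diff": m,
        "sd": sd,
        "effect_size": m / sd if sd > 0 else None,  # mean difference / SD
        "ci95": (m - half_width, m + half_width),
    }


def best_evidence(studies, threshold=0.75):
    """Classify evidence from (quality, direction) pairs.

    quality is "high" or "low"; direction is a label such as "improved".
    """
    if len(studies) < 2:
        return "insufficient"  # only one study available
    counts = Counter(direction for _, direction in studies)
    top_direction, top_count = counts.most_common(1)[0]
    if top_count / len(studies) < threshold:
        return "insufficient"  # inconsistent (conflicting) findings
    n_high = sum(1 for quality, direction in studies
                 if quality == "high" and direction == top_direction)
    return "strong" if n_high >= 2 else "moderate"


# Hypothetical prevalences (%) of good general health in three studies.
pooled = pool_differences([12.0, 14.0, 16.0], [10.0, 10.0, 10.0])
```

The critical values are hard-coded only to keep the sketch free of third-party dependencies; a statistics library could supply them for larger numbers of studies.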