2.1. Study Design
This study employs a real-world data design that integrates observational data from multiple sources, including information obtained from different providers through surveys and through clinical, epidemiological, population, and environmental registries. The surveys follow an overlapping panel design, which yields both cross-sectional and longitudinal estimates [54], and are based on population-based probability samples interviewed by telephone.
2.2. Geographical, Population, and Temporal Scopes
The geographical scope is the Autonomous Region of Andalusia (Comunidad Autónoma de Andalucía), Spain, and the population scopes are the general population over the age of 16 (ESSOCgeneral), the population residing in disadvantaged areas (ESSOCzones)[55], and the population over the age of 55 (ESSOC+55). Collective households (e.g., hospitals, nursing homes, and barracks) are not considered in this study. That said, the study sample includes families who, as an independent group, reside in these collective establishments (e.g., the family of a director or janitor of a centre). The temporal scopes of each sample (Figure 1) are:
- ESSOCgeneral: five measurements taken between 2020 and 2023, at baseline (beginning of the Spanish State of Alarm), at one month from the first interview, at six months, at 12 months, and at 36 months.
- ESSOCzones: two measurements, taken at baseline (12 months from the beginning of the State of Alarm) and at 24 months from the first interview.
- ESSOC+55: two measurements, taken at baseline (six months from the beginning of the State of Alarm) and at 30 months from the first interview.
2.3. Sampling Frame
The sampling frame used to extract the ESSOCgeneral and ESSOCzones samples is obtained from the Longitudinal Andalusian Population Database (BDLPA, Base Longitudinal de Datos de Población de Andalucía)[56]. The BDLPA integrates information on stocks, flows, and variations extracted from the census coordination system with data from the Civil Registries on births, deaths, and marriages (i.e., vital statistics [MNP, Movimiento Natural de la Población]) and with the population and housing censuses, giving rise to an integrated longitudinal frame for population and territorial statistics in Andalusia[57]. The sampling frame is extracted from the BDLPA longitudinal file as a cross-section with a reference date of 1 January 2019. The selected samples are linked to the User Database (BDU, Base de Datos de Usuarios)[58] of the Andalusian Public Health Care System in order to obtain the telephone numbers of the selected sample units. BDU coverage of contact telephone numbers for the selected samples is usually above 96%. The ESSOC+55 sampling frame, by contrast, corresponds to the user population of the Andalusian Guadalinfo public network aged 55 years and over [59].
2.4. Sample Size
For the first measurement (M1), the ESSOCgeneral scope started with a random sample of 5,000 people, calculated under the assumptions of maximum variability in the estimate (p = q = 0.5), a design effect of 1.8, a precision of 2.4 percentage points for estimates in Andalusia, a confidence level of 95%, and a non-response rate for the theoretical sample of 40%. The subsequent measurements (M2–M5) comprise the longitudinal samples of the previous measurements (n_i_theoretical_longitudinal) plus a new sample in each measurement (n_i_theoretical_new). The new sample is selected according to the design of the first measurement, except in M5, which will incorporate a new stratum of ‘residing or not in disadvantaged areas’, where DA1 = non-disadvantaged area and DA2 = disadvantaged area[60]. For the previous measurements (M1, M2, M3, and M4), a post-stratification by disadvantaged area will be carried out in order to improve the estimates. Because of non-response, these two theoretical samples (longitudinal + new) yield the total effective sample for each measurement (n_i_effective = n_i_effective_longitudinal + n_i_effective_new). The theoretical size of the longitudinal sample in measurement i is the total effective sample of the previous measurement: n_i_theoretical_longitudinal = n_i-1_effective_longitudinal + n_i-1_effective_new (i = 2, …, 5). The aim is to reach an effective sample of 3,000 units per measurement for ESSOCgeneral, 2,750 for ESSOCzones, and at least 2,400 for ESSOC+55, assuming a total response rate of 60%.
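The M1 figures can be checked with the usual sample size formula for a proportion. The following is a minimal sketch, assuming the standard normal-approximation formula inflated by the design effect and by the expected response rate; the function name and its arguments are illustrative and are not taken from the protocol.

```python
import math
from statistics import NormalDist


def required_sample(precision, p=0.5, deff=1.8, confidence=0.95, response_rate=0.60):
    """Return (effective, theoretical) sample sizes for estimating a proportion."""
    # Two-sided normal quantile, about 1.96 for a 95% confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Size needed under simple random sampling
    n_srs = z**2 * p * (1 - p) / precision**2
    # Inflate by the design effect to obtain the effective sample
    n_effective = n_srs * deff
    # Inflate again for the expected non-response to obtain the theoretical sample
    n_theoretical = n_effective / response_rate
    return math.ceil(n_effective), math.ceil(n_theoretical)


# Assumptions stated in the text: p = q = 0.5, deff = 1.8, 2.4 points, 95%, 40% non-response
effective, theoretical = required_sample(precision=0.024)
print(effective, theoretical)
```

Under these assumptions the result is close to the 3,000 effective and 5,000 theoretical units quoted above.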
Thus, the ESSOC is made up of a series of measurements broken down into a new sample and a longitudinal sample for each measurement which, in turn, are categorized into theoretical and effective samples for each population study group (general, zones and +55), with a total of over 22,000 effective interviews being carried out over three years.
2.5. Sample Allocation
Allocation of the new samples (including the first sample) is mixed: it is partly uniform by province (150 sample units for eight provinces) and partly proportional to the population size of each province and degree of urbanization (urban, intermediate‑density, and rural area)[61]. For the M5 measurement, the new sample will first be distributed between the two DA strata and then allocated as in the previous measurements.
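A mixed allocation of this kind can be sketched as follows. This is only an illustration under stated assumptions: the uniform component is read as 150 units in each province, the proportional step is shown at province level only (the degree of urbanization is omitted), rounding uses a largest-remainder rule, and the province labels and population figures are hypothetical.

```python
def mixed_allocation(total, populations, fixed_per_province=150):
    """Allocate `total` units: a fixed block per province plus a proportional share."""
    fixed_total = fixed_per_province * len(populations)
    if total < fixed_total:
        raise ValueError("total is smaller than the uniform component")
    remainder = total - fixed_total
    pop_total = sum(populations.values())
    # Exact (fractional) proportional shares of the remainder
    shares = {p: remainder * n / pop_total for p, n in populations.items()}
    # Round down, then hand out the leftover units by largest fractional part
    allocation = {p: fixed_per_province + int(s) for p, s in shares.items()}
    leftover = total - sum(allocation.values())
    by_fraction = sorted(shares, key=lambda p: shares[p] - int(shares[p]), reverse=True)
    for p in by_fraction[:leftover]:
        allocation[p] += 1
    return allocation


# Hypothetical province populations (illustrative values, not official figures)
populations = {"P1": 700_000, "P2": 1_200_000, "P3": 800_000, "P4": 900_000,
               "P5": 500_000, "P6": 650_000, "P7": 1_650_000, "P8": 1_950_000}
allocation = mixed_allocation(5000, populations)
print(allocation)
```

The uniform block guarantees a minimum sample in the smallest provinces, while the proportional share keeps the overall sample close to the population distribution.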
2.6. Sample Selection
The new theoretical sample in each measurement is selected by simple random sampling within each stratum defined by province and degree of urbanization and, in the case of measurement M5, within each DA stratum, thus obtaining self-weighted samples in each stratum. The theoretical longitudinal sample of a measurement is composed of the effective sample of the previous measurement; measurement M1, being the first, has no longitudinal sample.
In the ESSOC+55 sample, the first measurement was stratified by clusters (Guadalinfo Centers, N = 651), with sub-sampling to 1,200 users. These centres are stratified by Andalusian province and municipality size (<10,000 inhabitants, 10,000–19,999, ≥20,000), with sex and age quotas (55–64 years, 65–74 years, and >75 years). As with the ESSOCgeneral sample, the second ESSOC+55 measurement will be made up of a longitudinal sample (that of the first measurement) and a new sample, until a total of 1,200 interviews are completed.
2.7. Fieldwork
The survey information is collected through computer-assisted telephone interviews (CATI). The management, control, and monitoring of data collection for measurement M1 is carried out through Pl@teA, the IECA’s survey collection platform, whereas for the remaining measurements these tasks are performed using the Mobinet-Gandia Integra software. Data collection is carried out by a team of between eight (in M1) and 12 (remaining ESSOCgeneral measurements) interviewers assigned solely to this study. This ensures team stability, which is of fundamental importance in regular, longitudinal surveys. Before the study starts, the interviewers receive the necessary training on the content of the survey. To this end, in addition to virtual meetings held before the fieldwork begins, they are provided with interviewer and questionnaire manuals which, besides explaining the content of the questionnaire and the study, describe the survey platform, the possible incidences, and the protocol to be followed in each case to guarantee the maximum quality of the samples and of the information collected. Prior to the fieldwork, each interviewer performs several pilot tests to measure interview times and identify the difficult points of the questionnaire.
Interviews are conducted from Monday to Friday from 10:00 to 21:00 and on Saturdays from 10:00 to 15:00 for the first measurement, and from Monday to Saturday from 14:00 to 21:00 for the remaining measurements, although deferred appointments can be scheduled without time limits. Furthermore, a telephone line with a 900 prefix, staffed by telephone agents, is made available to the public. The number is given to the selected persons via text message or through the CSyF website, where the characteristics of this study are also published. This call centre also receives calls from people who, having been contacted by CATI agents, wish to confirm the official nature of the survey. In fact, many of these calls become completed interviews.
2.8. Quality Control
For the ESSOCgeneral, quality control involves cross-checking the interviews both internally, in the call centre itself, and at the IECA and EASP. Each interviewer is monitored to ensure that they follow the established protocols and that they record each type of incidence correctly. The intervals between calls and the duration of each call are also monitored.
In addition to recording the calls made by each interviewer, a listening check is performed on a share of the supervised surveys (10% of the calls made in the first measurement and 25% of the calls made in the remaining measurements) to review both the positive aspects and those to be reinforced. These checks assess aspects such as the interviewer’s self-presentation, their presentation of the study, the offer to call the 900-prefix telephone number, the confirmation of the place of residence, and the correct delivery of the questionnaire’s questions and all response options.
Quality control, data cleansing, and data coding are carried out simultaneously with the fieldwork with the aid of the software used in the study. Each interviewer is provided with a space on the platform to record observations during the interview. This allows the supervision team to cleanse those interviews in which the interviewer detected an inconsistency in the respondent’s answers, or in which the interviewer made a mistake when completing the questionnaire. Likewise, the values of the variables are reviewed and invalid ones cleansed. The variables corresponding to open-ended questions, such as the respondent's occupation and educational level, are also coded in tandem with the fieldwork. For the remaining open‑ended questions, the answers are first cleansed and the most frequent categories are then coded.
During the telephone surveys, different situations may arise that prevent the survey from being completed. These are known as field incidences (Table 2). The most important are final incidences, i.e., those that, after several attempts, ultimately make it impossible to complete the survey.
Table 2. ESSOC: Interview incidences and protocol to be followed

| Incidence type | Incidence | Description | Protocol |
| --- | --- | --- | --- |
| Frame incidence (reasons that make it impossible to complete the survey due to problems related to the sampling frame; for example, a telephone number with which to contact the sample person could not be obtained or the housing frame was not sufficiently up to date) | The telephone number does not exist | Wrong number: the telephone number dialled does not exist, corresponds to a fax, or has restricted calls. | Direct removal |
| | Not contactable | Out-of-date frame: a selected person is living in a different municipality, a telephone frame without a telephone number, a person unreachable through the telephone number/home address provided due to circumstances such as death, divorce/separation, etc. | Direct removal |
| Relationship-situation incidences (reasons that make it impossible to complete the survey due to several types of situations affecting the surveyed people, for instance, they cannot be located, they refuse to participate in the survey, or any other aspect that prevents the survey being conducted) | No contact | The household cannot be contacted (e.g., nobody answers the telephone, or the answering machine goes off) | Removal after four attempts performed on two different days and at two different times |
| | Absent | The selected person cannot be contacted | Removal after four attempts performed on two different days and at two different times |
| | Inability to answer | The selected person cannot complete the survey due to an inability to respond to it because of disability, age, illness, lack of knowledge of the language, or any other circumstance. If possible, the survey should be completed by a close relative. | Direct removal |
| | Refusal | The selected person refuses to complete the survey or refuses to continue it after it has begun. | Direct removal |

As this is a longitudinal study, one of the most significant causes of non-response is that the potential interviewee recognizes the incoming call number and does not answer the phone. To mitigate this as far as possible, the telephone number from which calls are made is changed periodically, so that even if a number were to be identified and blocked, we could continue trying to contact that person from a new telephone number.
In addition to the quality of the sample, other factors are of interest when assessing how fieldwork has developed during a survey operation. One such factor is how effectively and efficiently a survey has been carried out. The most direct way to measure this is to calculate the number of attempts or calls that were needed to complete each survey. This information is also very useful for designing strategies aimed at optimizing attempts and, therefore, increasing sample levels in future operations.
2.9. Sampling Weights
The original sampling weight for the new samples is obtained as the inverse of the effective sampling fraction in each stratum and is used to calculate the Hájek estimator[62]. This weight is subsequently calibrated to obtain more reliable estimates based on the demographic characteristics of Andalusia. To this end, we use a truncated linear calibration method[63] and, as auxiliary information, the marginals of the Andalusian population by sex and age (16–19, 20–24, 25–29, ..., 75–79, and ≥80 years old), sex and province, sex and nationality (Spanish or dual nationality, and foreign), and sex and degree of urbanization, with these data obtained from the Continuous Municipal Register (Padrón Continuo de Municipios)[64]. Regarding non-response bias in the longitudinal samples, non-contact and non-cooperation can be predicted from auxiliary information and from information already known about the sample subjects. Thus, the original weights used in the estimates of longitudinal sample M_t are corrected in a first phase by modelling the non-response with respect to the longitudinal effective sample obtained in M_t-1 using machine learning techniques[65]. This non-response is estimated using an XGBoost model[66], a gradient-boosted decision tree method. All the data and variables from the M_t-1 sample are used for training, so the algorithm has all the available information to learn from. Likewise, the hyperparameters of the model are optimized using cross-validation to ensure reliable estimates. Then, in a second phase, the weights are calibrated following the method described for the new samples. As auxiliary variables, we use those extracted from the Continuous Municipal Register (e.g., nationality, sex, age, province, and degree of urbanization) and those recorded by the ESSOC itself in M_t-1 (Table 3).
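The calibration step can be illustrated with a small numerical sketch. It assumes a generic Newton-type solver for linear calibration in which the adjustment factors are truncated to the interval [0.1, 10]; it is not the software used in the study, and the function name, the six-unit sample, and the population totals are hypothetical.

```python
import numpy as np


def truncated_linear_calibration(d, X, totals, lower=0.1, upper=10.0,
                                 max_iter=50, tol=1e-8):
    """Return calibrated weights w = d * g with g truncated to [lower, upper]."""
    d = np.asarray(d, dtype=float)
    X = np.asarray(X, dtype=float)
    totals = np.asarray(totals, dtype=float)
    lam = np.zeros(X.shape[1])
    for _ in range(max_iter):
        u = 1.0 + X @ lam                # untruncated linear adjustment factor
        g = np.clip(u, lower, upper)     # truncation to the chosen limits
        resid = totals - X.T @ (d * g)   # distance to the known population totals
        if np.max(np.abs(resid)) <= tol * max(1.0, np.max(np.abs(totals))):
            return d * g
        free = (u > lower) & (u < upper)  # only untruncated units can still move
        J = (X[free] * d[free, None]).T @ X[free]
        lam = lam + np.linalg.lstsq(J, resid, rcond=None)[0]
    raise RuntimeError("calibration did not converge")


# Hypothetical sample: columns are indicators for male, female, and aged 16-29
X = np.array([[1, 0, 1], [1, 0, 0], [1, 0, 0],
              [0, 1, 1], [0, 1, 1], [0, 1, 0]], dtype=float)
d = np.full(6, 100.0)                     # design weights (inverse sampling fractions)
totals = np.array([520.0, 480.0, 450.0])  # hypothetical population margins
weights = truncated_linear_calibration(d, X, totals)
print(weights.round(2), X.T @ weights)
```

After calibration, the weighted sample margins reproduce the population totals while each weight stays within 0.1 and 10 times its design value.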
Table 3. ESSOC: Adjustment of the design sampling weight in each measurement

| Sample type (effective) | Type of adjustment: 1st phase | Type of adjustment: 2nd phase |
| --- | --- | --- |
| New | Non-response adjusted by proxy based on the effective sample size in each stratum. | Representativeness by truncated linear calibration with 0.1 and 10 limits based on the auxiliary variables |
| Longitudinal | Non-response adjusted using an XGBoost model based on variables from the previous measurement. | Representativeness by truncated linear calibration with 0.1 and 10 limits based on the auxiliary variables |

In addition to these adjustments, other methodological alternatives not yet explored in this type of sample design will also be investigated. These include double calibration, which considers different sets of variables to model non-response and to correct the representativeness bias[67,68], and machine learning techniques for adjusting non-response with the aid of Propensity Score Adjustment (PSA)[69,70].
2.10. Variables
The study variables will mainly be extracted from the following sources: BDLPA; the Andalusian Population Health Database (BPS, Base Poblacional de Salud)[71]; the Andalusian Environmental Information Network (REDIAM)[72]; the Andalusian Epidemiological Surveillance System (SVEA)[73] and the ESSOCgeneral, ESSOCzones, and ESSOC+55 surveys.
The personal data of the participants selected for the interview (name, surname, and telephone number) are extracted from the BDLPA. In addition, the BDLPA is linked annually with a repository of georeferenced buildings so that the postal address and coordinates (250 m × 250 m grid) in the territory can be extracted. This will allow us to extract geographical factors (urbanization degree and population density, among others) via other IECA registries, and environmental factors (pollution and temperature, among others) via the REDIAM registry from the Andalusian Regional Government’s Department of Agriculture, Livestock, Fisheries, and Sustainable Development (Consejería de Agricultura, Ganadería, Pesca y Desarrollo Sostenible de la Junta de Andalucía).
From the SVEA registry, epidemiological information related to COVID-19 will be extracted: the date and result of the diagnostic test for active infection (PDIA) used to detect an active SARS-CoV-2 infection, which includes both real-time reverse transcription–polymerase chain reaction (RT-PCR) and antigen (Ag) rapid tests; date of the onset of symptoms; close contact with a PDIA-confirmed case; local or imported case; occupation as a health or social health professional; need for hospitalization or admission to an intensive care unit; and dates of admission and discharge.
In addition, the clinical information related to chronic diseases[74], functional and cognitive assessments, health resources (volume and cost), population stratification, and drugs consumed, which is obtained from the BPS, will also be added to the valid samples (Table 4). Further information about the variables and the main features of the abovementioned registers can be found in Supplementary Material 3.
Table 4. ESSOC: Auxiliary sources and variables

| Registry | Description | Information | Variables extracted |
| --- | --- | --- | --- |
| BDLPA – Longitudinal Andalusian Population Database[71] | Information from the census coordination system and civil registries that give rise to a consolidated framework of the Andalusian population | Personal data | Name, surname, identification health number (NUHSA), geographical coordinates |
| BDU – User Database of the Andalusian Public Health Care System[58] | Contact Information of the Andalusian Public Health Care System | Personal contact information | Telephone numbers |
| BPS – Andalusian Population Health Database[71] | Personal health information from the Andalusian Population Health Database and Healthcare information | Health and healthcare information | Chronic diseases, functional and cognitive assessments, health resources (volume and cost), population stratification, drugs consumed |
| REDIAM – Andalusian Environmental Information Network[72] | Daily averages by collecting/meteorological station and at the census section level | Pollution, temperature | Mean daily values from pollution, air quality and temperature |
| SVEA – Andalusian Epidemiological Surveillance System[73] | Functional organization for health surveillance that collects, among other things, epidemiological information related to SARS-CoV-2 infection | Epidemiological information of COVID-19 | PCR result, symptoms date, close contact, healthcare professional, hospitalization unit (specifying ICU), date of admission and discharge, need of mechanical ventilation and clinical data |

For further details, see Supplementary Material 3
With regard to the surveys, each measurement is associated with a questionnaire that largely coincides with those of previous measurements, to enable an analysis of changes over time, and that incorporates new information to analyse specific characteristics present at each moment of the pandemic. For longitudinal samples, information that does not change is not asked again in subsequent measurements. The questionnaire used for each measurement is organized into blocks of information, as shown in Table 5.
Table 5. ESSOC: Information blocks and variables entered in each measurement.

| Subject area | 1st measurement (M1) | 2nd measurement (M2) | 3rd measurement (M3) | 4th and 5th measurements (M4 and M5) |
| --- | --- | --- | --- | --- |
| Household and housing characteristics | Municipality, usual household, type of household, surface area, facilities, household changes, number of cohabitants (<6/<16/>60), type of household, and equipment. | Municipality, usual household, type of household, surface area, facilities, household changes^b, number of cohabitants (<6/<16/>60), equipment, number of rooms, and number of inhabitants with disabilities or requiring care. | Municipality, usual household, type of household, surface area, household changes, number of cohabitants (<6/<16/>60), number of rooms, and number of inhabitants with disabilities or requiring care. | Municipality, usual household, type of household, surface area, household changes, number of cohabitants (<6/<16/>60), number of rooms, and number of inhabitants with disabilities or requiring care. |
| Time use and cohabitation | Household chores, care tasks, daily activities during the confinement period (at home and outside), cohabitation and relationships, and causes for optimism. | | | Household chores, care tasks, daily activities^b during the confinement period (at home and outside), cohabitation and relationships, and causes for optimism. |
| Health and emotional well‑being | COVID-19 diagnosis, severity, diagnosis within the person’s settings, self‑perception of general and mental health (current and last year), emotional well-being^c, difficulty to withstand the confinement, malaise, chronic illness, and change of medication. | COVID-19 diagnosis, severity, diagnostic tests, diagnosis within the person’s settings, self‑perception of general and mental health, emotional well-being^b, cohabitation, difficulty to withstand the confinement, happiness, social and emotional support^c, malaise^b, chronic diseases (suffering and limitations), and medication (use and change of use^b). | COVID-19 diagnosis^b, severity, diagnostic tests, diagnosis within the person’s settings, self‑perception of general and mental health, emotional well-being^b,c, happiness, social and emotional support^b,c, malaise^b, chronic diseases (suffering and limitations), and medication (use and change of use^b). | COVID-19 diagnosis^b, severity, diagnostic tests, diagnosis within the person’s settings, self‑perception of general and mental health, emotional well-being^b,c, happiness, social and emotional support^b,c, malaise^b, chronic diseases (suffering and limitations), and medication (use and change of use^b). |
| Habits and lifestyle | Habit modification (exercising, smoking, alcohol consumption, sleep, and diet). | Habit modification^b: exercising, drinking, smoking, sleep, food, daily intake of vegetables and fruit, exercising, weight and height^c, smoking, alcohol consumption, sleep, and flu vaccination. | Habit modification^b: exercising, drinking, smoking, sleep, food, daily intake of vegetables and fruit, exercising, weight and height^c, smoking, alcohol consumption, sleep, and flu vaccination^b | Habit modification^b: exercising, drinking, smoking, sleep, food, daily intake of vegetables and fruit, exercising, weight and height^c, smoking, alcohol consumption, sleep, and flu and COVID-19 vaccination^b |
| Economic situation and socio-demographic characteristics | Educational level, employment situation, working from home, type of contract, occupation^c, cohabitation with a partner, identification of the cohabitant with the greater income (educational level, employment situation, type of contract, occupation), difficulty in making ends meet, late payments, income, future worries, and degree of confidence in public institutions. | Employment situation, educational level, occupation^c, development^b, ability to work, identification of the cohabitant with the greater income (educational level, employment situation, occupation), difficulty in making ends meet, late payments, change in economic situation, parents' educational level, and future worries. | Employment situation, educational level, occupation^c, development, ability to work, identification of the cohabitant with the greater income (educational level, employment situation, occupation), difficulty in making ends meet, late payments^b, change in economic situation, parents' educational level, and future worries. | Employment situation, educational level, occupation^c, development, ability to work, identification of the cohabitant with the greater income (educational level, employment situation, occupation), difficulty in making ends meet, late payments^b, change in economic situation, parents' educational level, and future worries. |

a Questionnaires are provided as Supplementary Material 2
b Variables that present modifications in their temporal scope in relation to the previous measurement.
c Composite variables: emotional well-being[75], social and emotional support[76,77], body mass index [78], social class[79,80]
2.11. Data Analysis
The analyses take advantage of all the information available from the measurements and the auxiliary information sources and will be carried out with the free software environment R[81], considering the sample design, as well as the calibration and inference methods described in the previous sections. The use of free software will guarantee transparency and facilitate the replicability of the study.
A table will be prepared for each variable of each measurement, together with the variable’s original response categories, including the valid sample size (n), the percentage of lost samples, the population size estimate (N), the relevant statistic (mean or percentage), the 95% CI, and the coefficient of variation (CV), for both the total and the cross‑disaggregation per sex and age (16–29/30–44/45–64/65+), as well as per province and urbanization degree. The sample size is recorded for the total and the categories of the segmentation variable. In the case of cells with CV estimates >20%, the CV will be indicated in a footnote to the table.
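The quantities reported in each table cell can be sketched as follows. This is a simplified illustration: the variance uses a linearization that treats units as drawn with replacement and ignores stratification and calibration, which the actual analyses are described as taking into account. The function name and the toy data are hypothetical.

```python
import numpy as np


def hajek_proportion(y, w):
    """Weighted (Hajek) proportion with linearized SE, 95% CI, and CV in percent."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(y)
    p_hat = np.sum(w * y) / np.sum(w)
    # Linearized scores of the ratio estimator (they sum to zero by construction)
    z = w * (y - p_hat) / np.sum(w)
    # With-replacement variance approximation; strata and calibration are ignored here
    se = np.sqrt(n / (n - 1) * np.sum(z**2))
    ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)
    cv = 100 * se / p_hat
    return p_hat, ci, cv


# Toy data: 1 = has the characteristic; weights are hypothetical calibrated weights
y = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
w = [120, 95, 110, 130, 100, 90, 105, 115, 98, 102]
p_hat, ci, cv = hajek_proportion(y, w)
flag_cv = cv > 20  # cells with CV > 20% are indicated in a table footnote
print(round(p_hat, 3), ci, round(cv, 1), flag_cv)
```

With equal weights the point estimate reduces to the sample proportion, and the variance reduces to the usual p(1 − p)/(n − 1).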
The variables shared by all measurements are dichotomized based on the results reported in the previous tables, identifying in each case the most convenient category to highlight. In addition, a table describing the specific estimates with their corresponding CV will also be created.
In addition, to evaluate the changes in each measurement with respect to the first one, both the population affected by the change and the corresponding percentage, segmented by demographic and territorial variables, will be estimated. The p value will be calculated to evaluate the effect of such change and will be indicated in a footnote to the table using three categories: p < 0.001, p < 0.05, and p < 0.1.
For variables that coincide in each pair of consecutive measurements (M2–M1, M3–M2, and M4–M3), the estimated percentage difference between one measurement and the previous one will be calculated, together with the estimate of the population size, segmented by the demographic and territorial variables and flagged when the CV is greater than 20%.
To analyse factors associated with variables of a given measurement, or with variables measuring the change between one measurement and another, we will use multivariate explanatory models adapted to the characteristics and nature of the variables. These are specified as generalized linear mixed models (GLMM) with a family that depends on the type of dependent variable: Gaussian, when the variable is continuous (equivalent to a linear regression); binomial, when the variable is dichotomous (equivalent to a logistic regression); or Poisson, when the variable is discrete (equivalent to a Poisson regression). Random effects will be included in these models to capture the effects of unobserved confounders. Inferences will be made following a Bayesian perspective, using the integrated nested Laplace approximation (INLA)[82,83]. We will use penalized complexity priors, known as PC priors. These priors are robust in the sense that the results are not sensitive to them and, in addition, they allow for an epidemiological interpretation[84]. The analyses will be carried out using the free software R (version 4.0.4 or greater)[81], through the INLA package[82,83,85].
Finally, advanced data visualizations will be used to allow an in-depth exploratory analysis of the evolution of the study variables and a representation of the main results of the fitted models. These visualizations will be developed using the Python[86] programming language and integrated into software and web solutions that allow for interaction and dissemination.
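In generic notation, a model of this kind could be written as below. The symbols are illustrative and are not taken from the protocol: y_ij is the response of individual i in group j, g is the link function (identity, logit, or log for the Gaussian, binomial, and Poisson families, respectively), u_j is a group-level random effect whose grouping structure is left unspecified here, and the PC prior on its standard deviation is set through a user-chosen threshold U and tail probability α.

```latex
\begin{aligned}
y_{ij} \mid \mu_{ij} &\sim F(\mu_{ij}), \\
g(\mu_{ij}) &= \beta_0 + \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_j, \\
u_j &\sim \mathcal{N}\!\left(0, \sigma_u^{2}\right), \\
\Pr(\sigma_u > U) &= \alpha .
\end{aligned}
```

The last line is the usual way of specifying a PC prior for a standard deviation: it states how unlikely values of σ_u above U are considered a priori, which shrinks the model towards the simpler one without random effects.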
2.12. Data Management Plan
The data management plan is provided as Supplementary Material 1. It describes the type and format of the data that will be collected and generated within the scope of this project, together with the procedure for accessing the data (by whom, how, and when they can be accessed), data ownership, the repository in which the data will be deposited, and the procedure planned to guarantee compliance with the specific ethical and legal requirements.
Details of the Data Protection Impact Assessment (DPIA) are also presented there, in accordance with the specific adaptation of this methodology to research projects in the health care sector[87,88,89]. The need for a DPIA was confirmed from the outset (Table S1). Subsequently, the data lifecycle was defined (Table S2), the need for and proportionality of the processing were analysed (Table S3) and, finally, a risk assessment and action plan were developed (Table S4).
2.13. Scoping Review
The final objective will be achieved through a Scoping Review summarizing the existing evidence on survey-based research that measures the impact COVID‑19 has had on health and its determinants since the outset of the pandemic. The review will be carried out using free terms and controlled language in the databases PubMed, Scopus, WoS, EMBASE, CINAHL, PsycINFO, LILACS, OpenGrey, and Grey Literature Report, as well as through a free search in Google and on institutional websites to locate institutional documents, abstracts, conference contributions, or any other format in which studies and research work can be found. In a first phase, the research work will be selected independently and blindly in pairs, and the study populations, sample size, and main objective will be identified in order to collect the methodological characteristics, including elements on epidemiological design, auxiliary sources of information, methodologies, and topics addressed. This review will be carried out at the beginning of the project and updated throughout its duration. It will guide the development of the remaining activities, especially the identification of hypotheses and the application of other methodologies, and will help define the lines of research and propose more appropriate methods for future research.