4.1. Visual stimuli
Nine restaurants from Trip Advisor were selected as stimuli for this study. Based on Trip Advisor ratings, three restaurants were selected to represent low-cost (low-budget) restaurants, three to represent mid-range restaurants, and three to represent exclusive restaurants. The aim was to achieve variability among the restaurants in order to elicit different general-preference ratings from the participants and the LLMs. This variability in responses was necessary to assess the covariation between the ratings of the LLMs and humans. With similarly rated restaurants, we would obtain more homogeneous ratings and would not be able to determine whether the LLMs are able to distinguish restaurants in a meaningful way. Furthermore, since familiarity with restaurants may contribute to restaurant ratings, we randomly selected restaurants from Barcelona, a city that was also randomly selected, as the stimuli for our study. The participants rated the selected restaurants based solely on photographs and not on their customer experience with these restaurants. In this way, all raters (humans, GPT-4V and Gemini Pro Vision) rated unfamiliar restaurants.
The photographs were rated on a five-point Likert-type scale. To assess the general preference for the restaurant, we asked participants: “On a scale from 1 to 5 (1 = low, 5 = high), how much do you like this restaurant overall?”. In addition, we asked participants, “On a scale from 1 to 5 (1 = low-cost, 5 = expensive), which category do you think the restaurant belongs to according to the interior photographs?”, and “On a scale from 1 to 5 (1 = low-cost, 5 = expensive), which category do you think the restaurant belongs to according to the food photographs?”. These two questions were included in the study to compare the degree of agreement in classification with the degree of agreement in general preferences between LLMs and participants. Hence, each restaurant was assessed on the basis of three ratings: general preference, interior design and food. All visual materials used in this study can be found in the Supplementary information (see S4. Restaurants photographs).
4.2. GPT-4V and Gemini Pro Vision prompts (March 2024)
With slight differences in the implementation, we prompted both the GPT-4V and Gemini APIs in the same fashion. Both APIs accept text prompts together with URLs of restaurant photographs. In addition to the textual and visual parts of the prompt, it is essential to mention that we left the temperature, the parameter that regulates the randomness of an LLM's output, at its default value.
From the survey data, we selected the age, socioeconomic status, gender and work status of each participant, resulting in 505 four-dimensional data points. First, we wanted to ask an LLM, given a CSV of these 505 × 4 data, to rate the restaurants overall, their interior design and their food for each participant described in the provided CSV. The prompt looked like this:
„I’ll provide you with 4 images of a restaurant. Also, I’ll provide you with data that consists of 505 rows. Each row represents a person. The first cell in a row represents age, the second socioeconomic status, the third cell represents sex, and the last cell represents working status. Here are the data:
21,2,F,S
17,2,M,S
...
19,3,M,S
21,1,F,S
Next, I’ll define 3 questions:
On a scale from 1 to 5 (1 = low, 5 = high), how much do you like this restaurant overall?”
On a scale from 1 to 5 (1 = low-cost, 5 = expensive), which category do you think the restaurant belongs to according to the interior images?”
On a scale from 1 to 5 (1 = low-cost, 5 = expensive), which category do you think the restaurant belongs to according to the food images?
Please, for each row in data, i.e. for each person with their personal data, answer these three question. Let the final output be CSV. Be very concise and don’t describe given images. Don’t give reasons and just focus on the answering task!“
Due to the limit on the number of input tokens, we could not submit this prompt in a single request. Instead, we split the data into groups of 10 data points (the last group consisting of 5 data points) and submitted the aforementioned prompt group by group. The response obtained for each prompt was parseable, yielding the original information (e.g., 17,2,M,S) and the simulated ratings (e.g., 4,3,4) for each data point.
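The batching and parsing steps described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names are hypothetical, and the assumed response layout (four demographic cells followed by three ratings on one CSV line) is inferred from the examples in the text.

```python
from typing import Dict, List

N_DEMOGRAPHIC_CELLS = 4  # age, socioeconomic status, sex, working status
N_QUESTIONS = 3          # overall, interior, food


def chunk_rows(rows: List[str], size: int = 10) -> List[List[str]]:
    """Split the demographic rows into consecutive groups of at most `size`."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def parse_response(text: str) -> List[Dict[str, object]]:
    """Parse lines such as '17,2,M,S,4,3,4' (assumed layout) into records."""
    records = []
    for line in text.strip().splitlines():
        cells = [cell.strip() for cell in line.split(",")]
        if len(cells) != N_DEMOGRAPHIC_CELLS + N_QUESTIONS:
            continue  # skip lines that do not match the assumed layout
        try:
            ratings = [int(cell) for cell in cells[N_DEMOGRAPHIC_CELLS:]]
        except ValueError:
            continue  # skip header lines or non-numeric ratings
        records.append({
            "demographics": cells[:N_DEMOGRAPHIC_CELLS],
            "ratings": ratings,
        })
    return records
```

With 505 rows and a group size of 10, `chunk_rows` yields 50 full groups and a final group of 5, which matches the split described above.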
4.3. Participants and procedure
A community sample of N = 505 participants (222 males and 283 females) from HIDDEN INFORMATION participated in the study. They were recruited by students in exchange for course credits. The average age of the participants was M = 26.59 (SD = 10.85). Of the sample, 62.38% were students and 37.62% were employed. A further 41 participants in the retired and unemployed categories were excluded from the analysis, as these categories were highly underrepresented in the original sample (consisting of 546 participants). In addition, we assessed self-reported socioeconomic status (SES): 8.91% of participants placed themselves in a below-average group, 61.98% in an average group and 29.11% in an above-average group. The ethics board of HIDDEN INFORMATION approved the study.
4.4. Statistical analysis
Three different raters (participants, GPT-4V and Gemini) rated nine restaurants on three criteria (general preference, interior design and food). When data collection with participants was completed, GPT-4V and Gemini were prompted to provide samples of 505 ratings on all three questions for all nine restaurants, taking on the roles of the observed participants as defined by gender, work status, age and socioeconomic status, which is feasible according to a previous study [22]. In this way, we obtained two mirrored datasets that matched the original data collected from the participants. Hence, the participants and the LLMs provided their ratings on the same rating scale, with the same sample structure based on the demographic description of the participants and with the same number of observations, i.e. 505 observations each. This was important because it would not be justified to estimate the level of agreement between the arithmetic means of the participants' ratings and a single rating from each LLM. Moreover, the LLMs' temperature parameter allows responses to the same prompt to differ, which increases the variability of responses. If only one rating per LLM were used, the likelihood of replicating the findings would decrease, as an LLM is capable of providing different ratings for the same repeated prompt. Therefore, in this study, the only difference between the three datasets was the answers (i.e. ratings) to the three questions mentioned above, while all other aspects were the same (the number of prompts corresponds to the number of participants, the demographic structure of participants/roles, and the items/prompts). The ratings obtained from the three sources represented the dependent variables in our study.
The ratings in the mirrored datasets obtained with GPT-4V and Gemini were compared with the participants' ratings. A comparison of the degree of agreement in the ratings between all three sources (i.e. the raters) is not suitable at the individual level, as it is not reasonable to expect that an LLM can account for all inter-individual differences between participants in restaurant ratings. Therefore, the degree of agreement must be assessed at a more general, i.e. group, level. The arithmetic means were calculated for each rater (participants, GPT-4V and Gemini Pro Vision) and were used to assess the level of agreement in the ratings for all three criteria (general preference, interior design, food). The alternative statistical approach, which was considered but not applied in this study, is elaborated in more detail in S1. Discussion in the Supplementary information.
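The group-level aggregation can be illustrated with a short sketch. The record layout `(rater, restaurant, rating)` and the function name are assumptions made for this example, and the ratings in the usage note are invented toy values, not study data.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def restaurant_means(
    records: Iterable[Tuple[str, str, int]]
) -> Dict[str, Dict[str, float]]:
    """Average the ratings per rater and restaurant.

    `records` holds (rater, restaurant, rating) triples for one criterion,
    e.g. general preference. The result maps rater -> restaurant -> mean.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for rater, restaurant, rating in records:
        buckets[rater][restaurant].append(rating)
    return {
        rater: {rest: mean(vals) for rest, vals in by_rest.items()}
        for rater, by_rest in buckets.items()
    }


# Toy usage with invented values (hypothetical, for illustration only).
toy = [
    ("participants", "R1", 4), ("participants", "R1", 2),
    ("GPT-4V", "R1", 3), ("GPT-4V", "R1", 5),
]
toy_means = restaurant_means(toy)
```

In the study design, each rater contributes 505 ratings per restaurant and criterion, so the output for one criterion would contain nine means per rater.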
The degree of agreement between the LLMs' ratings and the human ratings was analyzed using intraclass correlation coefficients (ICC). This statistical index can evaluate the agreement between two or more raters (in our case participants, GPT-4V and Gemini Pro Vision) of the same group of objects (in this case restaurants). Since we have the same set of raters (participants, GPT-4V and Gemini Pro Vision) for the same sample of observations (9 restaurants), we calculated the ICC(3, k) according to the classification of ICC models [23]. We estimated the degree of agreement for the ratings of general preferences, interior design and food. According to Koo & Li [24], ICC values below .50 indicate poor agreement, values between .50 and .75 indicate moderate agreement, values between .75 and .90 indicate good agreement, and values above .90 indicate excellent agreement. As the ratings represent average values obtained from 505 participants or prompts, we increased the generalizability of the ratings but decreased the number of observations (role of participants). Nevertheless, the ICC can be assessed with a smaller number of observations. Our sample of three raters with nine observations is sufficient to detect an agreement level of approximately 0.70 with a predetermined value of alpha = 0.05 and a power of 0.8 [21].
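For readers who want to see how such a coefficient is obtained, the sketch below computes ICC(3, k) from a restaurants-by-raters table of mean ratings using the standard two-way ANOVA decomposition, ICC(3, k) = (MSR - MSE) / MSR, where MSR is the between-restaurant mean square and MSE the residual mean square. It is an illustrative implementation under that textbook formula, not necessarily the software used in the study. The function names are hypothetical, and the handling of values falling exactly on a threshold in the labelling helper is an assumption.

```python
from typing import List, Sequence


def icc3k(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(3, k) for an n x k table (rows = restaurants, columns = raters)."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means: List[float] = [sum(row) / k for row in ratings]
    col_means: List[float] = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares of the two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-restaurant mean square
    mse = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (msr - mse) / msr


def koo_li_label(icc: float) -> str:
    """Map an ICC value to the verbal categories of Koo & Li [24]."""
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.90:  # boundary handling is an assumption of this sketch
        return "good"
    return "excellent"
```

In the present design, the input would be a 9 x 3 table per criterion: nine restaurants in the rows and the mean ratings of participants, GPT-4V and Gemini Pro Vision in the columns. Because ICC(3, k) is a consistency measure, a constant offset between raters does not lower the coefficient.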