Evaluating the Agreement between Human Preferences, GPT-4V and Gemini Pro Vision Assessments: Can AI Recognise Which Restaurants People Might Like?

doi:10.21203/rs.3.rs-4257623/v1




This work is licensed under a CC BY 4.0 License

Version 1


The study aims to introduce a methodology for assessing agreement between AI and human ratings, specifically focusing on visual large language models (LLMs). It presents empirical findings on the alignment between ratings generated by GPT-4 Vision (GPT-4V) and Gemini Pro Vision with human subjective evaluations of environmental visuals. Using photographs of restaurant interior design and food, the study estimates the degree of agreement with human preferences. The intraclass correlation reveals that GPT-4V, unlike Gemini Pro Vision, achieves moderate agreement with participants’ general restaurant preferences. Similar results are observed for rating food photos. Additionally, there is good agreement in categorizing restaurants into low cost-exclusive categories based on interior quality. Overall, GPT-4V currently demonstrates limited ability in providing meaningful ratings of visual stimuli compared to human ratings and performs better in this task compared to Gemini Pro Vision.

Scientific community and society/Social sciences/Psychology/Human behaviour

Physical sciences/Mathematics and computing/Computer science

At the time of writing, there is great interest in assessing how well artificial intelligence (AI) can perform some operations compared to humans [1] (Wu et al., 2023). Studies have explored AI’s capacity for creativity [2], verbal problem-solving [3], emotional awareness [4], and many more. These types of studies pose a methodological challenge since they are situated between computer science and psychology, two scientific fields that differ greatly in their methodological approach. Recently, Hagendorff et al. [5] recognized the importance of applying a methodology typical of psychological research when studying AI performance. In this paper, we propose a methodological framework for assessing the correspondence between people’s subjective judgments and AI’s ratings to examine whether large language models (LLMs) can successfully predict what people might like and how they perceive and interpret visual stimuli without additional fine-tuning. We hope this offers insight into how well current LLMs understand people’s choices in everyday situations.

This subject is significant because individuals routinely make countless judgments and decisions, ranging from minor to major, every day. It is therefore important to investigate whether AI systems can forecast human behavior. This is also a challenge since people do not make decisions in the way computer algorithms usually operate. Namely, people use heuristics, i.e., cognitive shortcuts, to make a decision when faced with a situation in which they have to decide whether to perform a certain action [6]. For instance, when choosing a restaurant for dining out, without using heuristics, one would have to review and compare all existing restaurants in the area and then choose the best one based on numerous criteria. This can be impossible in large cities with a couple of thousand restaurants, while in smaller cities it would at least be time-consuming. Therefore, individuals often opt for decisions with minimal cognitive effort, forgoing the exhaustive comparison of numerous alternatives, and generally relying on their subjective feelings - an ability that computer programs lack. We must note, however, that some preliminary studies indicate that LLMs may reason according to heuristics [7] that are more typical of human behavior, although more studies are needed to substantiate this claim.

Emotions are essential for understanding human motivation because they can trigger the cognitive mechanisms underlying human behavior [8]. According to subjective utility theory [9] (Tversky, 1967), people make decisions about whether to perform an action based on two main assessments. The first component in decision-making is the assessment of general liking (if something is positive) or disliking (or aversion). For example, when people are in a situation where they have to decide whether to enter a restaurant, they can form an initial impression based on the appearance of the interior and the food served there. This first stage in making a decision to enter the restaurant is referred to as valuation or desirability. It is usually saturated with an affective impression, which in psychological research can typically be rated on a five-point Likert-type scale. The second aspect of our decision to act is the likelihood of carrying out the desired action. It can be represented in the form of percentages ranging from 0 (i.e., a person is certain that they will not take that action) to 100 (i.e., a person will undoubtedly act accordingly). These calculations are based entirely on a subjective level or feeling [9]. When choosing a restaurant, this second component can be influenced by subjective judgments, such as the affordability of a meal at that particular restaurant. For instance, if a person cannot afford to dine at an exclusive restaurant, they will not enter the restaurant irrespective of their desire to have dinner there (e.g., even if they rate it five on a five-point Likert scale).

This study examines the degree of agreement between human ratings and those of GPT-4 with Vision (GPT-4V) and Google Gemini Pro Vision for general preferences (i.e., general liking) of restaurants based exclusively on visual stimuli. The degree of agreement between the AIs’ and participants’ ratings will give us information on how well the aforementioned LLMs can reflect human subjective preferences.

This type of research is now possible with two flagship multimodal large language models that represent a natural evolution in the capabilities of LLMs - GPT-4V and Gemini Pro Vision. They enable the standard LLM to integrate visual information by processing images and simultaneously responding to associated text queries. Such integration of vision capabilities significantly expands the scope of potential use in practical scenarios while bringing AI systems closer to the human experience of the world.

Previous studies suggest that we might expect a fair level of agreement between people’s ratings of objects and LLMs. In the realm of LLM approaches, research in image analysis has focused on the automatic evaluation of the esthetic quality of photographs and has attempted to evaluate the attractiveness of images based on user ratings (see [10]). In the domain of multimodal LLM applications, especially in traffic scenarios, recent advances address the challenges of interpretability in autonomous driving systems. DriveGPT4 represents a significant step towards interpretable, end-to-end autonomous driving [11]. Harnessing LLMs, DriveGPT4 processes visual data and draws conclusions from it, demonstrating its potential for use in the real world.

In addition, the Distracted Driving Language Model (DDLM) [12] improves visual LLMs with a reasoning chain framework, achieving better performance in driver behavior analysis and risk assessment. Some studies also address the capabilities of large vision-language models such as GPT-4V in recognizing and understanding traffic accidents [13, 14]. While these models show remarkable cognitive capabilities in classical traffic events, they encounter challenges in complex scenarios. The findings from these studies pave the way for further research, elucidating the strengths and limitations of employing large vision-language models in recognizing traffic events. Nevertheless, these studies show that existing AI systems can process visual stimuli and make decisions in complex situations.

In the domain of art descriptions, ArtGPT-4 (i.e., an open AI LLM specialized for art images) stands out, aiming at understanding and generating descriptions of art images [15]. The authors of [6] noted that the available open-sourced visual LLMs are still not able to capture emotional and esthetic information from art images as humans can. By applying improved training techniques and introducing new evaluation datasets (ArtEmis), ArtGPT-4 achieves peak performance that enhances the understanding of artistic images. Further, IQAGPT introduces a system using large language and vision-language models in image quality assessment. IQAGPT outperforms other models in assessing image quality by fine-tuning vision-language models on specific datasets, demonstrating its feasibility and potential for various applications, particularly in medical imaging.

Following similar scenarios, several studies use machine learning techniques to predict restaurant ratings. The paper [16] utilizes a DistilBERT model for analyzing OpenTable reviews, achieving improved accuracy compared to traditional machine learning models. Similarly, [17] predicts restaurant ratings on Yelp using text and non-text features, highlighting the effectiveness of decision trees and neural network algorithms. Finally, Dining on Details (DoD) [18] introduces a novel expert learning framework for fine-grained food recognition, leveraging large language models and multi-modality embedding spaces. This approach achieves state-of-the-art results on various food datasets, demonstrating its effectiveness in fine-grained food recognition tasks. In [19], the authors propose a novel approach to infer restaurant types or styles (such as ambiance, dish styles, and suitability for different occasions) based on user-uploaded photos from restaurant review websites. They collect a restaurant photo dataset associating user-contributed photos with restaurant styles and employ a deep multi-instance, multi-label learning framework to address the unique problem setting of restaurant-style classification. The approach effectively profiles restaurant styles when there are sufficient user-uploaded photos for a given restaurant.

Taken together, these studies highlight the adaptability and potential of large language and vision-language models across a range of domains that include esthetic quality assessment, traffic analysis, art comprehension, and image quality evaluation. However, to our knowledge, no other study has directly compared the evaluation of visual stimuli by humans and visual LLMs.

Therefore, this study aims to extend the existing research on the application of AI in the field of subjective ratings of visual stimuli. In particular, we investigate whether the ratings of photographs by LLMs without additional fine-tuning correspond to human subjective preferences. To achieve the goal of this study, we will analyze the degree of agreement between human and LLM ratings of restaurants.

The descriptive statistics for all measurements can be found in Table 1. The data for the general preferences are presented graphically in Fig. 1, while the figures for the interior design and food ratings can be found in Supplemental materials (Fig. S1 and S2). The Bland-Altman plots show the tendency of LLMs to overestimate ratings in cases where participants gave lower ratings to the so-called low-cost restaurants (ratings R1, R4 and R7 in Table 1) and to underestimate ratings for exclusive restaurants (R3, R8 and R9 in Table 1). The differences in ratings (values on the y-axis) were smaller for mid-range restaurants (R2, R5 and R6 in Table 1). All of this suggests that LLMs tend to give conservative ratings, i.e. avoid extreme responses, which is also evident from the restricted range of values on the x-axis in the figure showing the agreement between the GPT-4V and Gemini Pro Vision ratings (see Fig. 1). In the case of general preferences for all nine restaurants, the GPT-4V gave average ratings between 3.17 and 3.85, while the Gemini Pro Vision ratings ranged from 4.07 to 4.34 for very different restaurants (see Table 1). Based on units on the y-axis in the figures (Fig. 1), we can see that the measurement distances of participants’ preferences are higher for Gemini Pro Vision (Fig. 1.b) than for the GPT-4V ratings (Fig. 1.a). The same tendency can be seen for the food ratings (Fig. S1), while lower measurement distances were generally observed for the restaurant interior ratings (Fig. S2).

Table 1

Descriptive statistics for ratings of general preferences, interior design and food served in restaurants for all three raters
	General preference						Interior design						Food
	Human		GPT		Gemini		Human		GPT		Gemini		Human		GPT		Gemini
	M	SD	M	SD	M	SD	M	SD	M	SD	M	SD	M	SD	M	SD	M	SD
R1^c	2.53	0.91	3.74	0.47	4.16	0.52	2.21	0.75	2.68	0.56	3.03	0.60	2.08	0.77	2.82	0.49	3.89	0.53
R2^m	3.76	0.83	3.79	0.42	4.12	0.50	3.57	0.70	3.01	0.48	3.47	0.72	3.37	0.77	2.97	0.40	3.82	0.60
R3^e	4.25	0.88	3.78	0.44	4.34	0.58	4.65	0.58	3.73	0.52	4.11	0.65	4.61	0.64	3.71	0.49	4.18	0.61
R4^c	2.19	1.02	3.17	0.47	4.19	0.53	1.68	0.76	2.20	0.52	3.09	0.58	1.77	0.85	2.84	0.58	3.84	0.61
R5^m	4.02	0.85	3.82	0.40	4.21	0.56	3.66	0.72	3.01	0.42	3.35	0.70	3.68	0.75	3.04	0.42	3.87	0.67
R6^m	4.04	0.83	3.75	0.45	4.07	0.50	3.96	0.76	3.58	0.57	3.67	0.59	3.92	0.79	3.13	0.54	3.83	0.53
R7^c	2.69	0.97	3.48	0.55	4.12	0.58	2.44	0.83	2.81	0.57	3.07	0.66	2.57	0.86	3.01	0.49	3.87	0.68
R8^e	4.42	0.82	3.80	0.44	4.08	0.43	4.52	0.66	3.73	0.52	3.93	0.54	4.40	0.76	3.55	0.58	3.98	0.47
R9^e	4.40	0.74	3.85	0.39	4.10	0.58	4.18	0.78	3.15	0.51	3.86	0.65	4.24	0.79	3.47	0.58	3.89	0.63
Total	3.59	0.87	3.69	0.45	4.15	0.53	3.43	0.72	3.10	0.52	3.51	0.63	3.40	0.78	3.17	0.51	3.91	0.59
Note: R – restaurant; c – low-cost restaurant; m – mid-range restaurant; e – exclusive restaurant. Restaurant numbers indicate the order in which the pictures of the restaurants were presented in the survey; M – Mean; SD – Standard deviation
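
The over- and underestimation pattern described for the Bland-Altman plots can be illustrated directly from the general-preference means in Table 1. The following is a minimal sketch, assuming the comparison is made on the nine restaurant-level means for participants and GPT-4V (the published plots may have been built differently); the function name `bland_altman` and the variable names are ours.

```python
from statistics import mean, stdev

# Mean general-preference ratings copied from Table 1 (restaurants R1..R9).
labels = ["R1", "R2", "R3", "R4", "R5", "R6", "R7", "R8", "R9"]
human = [2.53, 3.76, 4.25, 2.19, 4.02, 4.04, 2.69, 4.42, 4.40]
gpt4v = [3.74, 3.79, 3.78, 3.17, 3.82, 3.75, 3.48, 3.80, 3.85]


def bland_altman(a, b):
    """Return per-item (mean, difference) pairs, the bias (mean of b - a),
    and the 95% limits of agreement (bias +/- 1.96 SD of the differences)."""
    pairs = [((x + y) / 2, y - x) for x, y in zip(a, b)]
    diffs = [d for _, d in pairs]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return pairs, bias, (bias - spread, bias + spread)


pairs, bias, limits = bland_altman(human, gpt4v)
diff_by_label = {name: d for name, (_, d) in zip(labels, pairs)}

for name, (avg, diff) in zip(labels, pairs):
    # x-axis value: mean of the two raters; y-axis value: GPT-4V minus human.
    print(f"{name}: mean={avg:.2f}, GPT-4V - human={diff:+.2f}")
print(f"bias={bias:+.3f}, limits of agreement=({limits[0]:+.2f}, {limits[1]:+.2f})")
```

With these values, the differences are positive for the low-cost restaurants (R1, R4, R7) and negative for the exclusive ones (R3, R8, R9), which matches the tendency described in the text.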

Table 2 shows the degree of agreement, assessed by intraclass correlation coefficients (ICC), between all three raters (humans, GPT-4V and Gemini Pro Vision) for all three ratings (general preference, interior design and food) of nine restaurants based solely on the photographs. Moderate agreement was found between participants and GPT-4V for general preference, while Gemini Pro Vision shows a complete lack of agreement with both participants and GPT-4V for general preferences. This is also evident from the arithmetic means in Table 1, where Gemini Pro Vision did not differentiate between restaurants from different categories. As can be seen from the photos of the restaurants (see the URLs of the photographs in S4. Restaurant photographs in Supplementary Information), the visual differences between the restaurants were quite obvious. In addition, all three raters showed a good level of agreement in their ratings of the restaurants’ interiors (Table 2). The GPT-4V ratings were slightly more consistent with the participants’ ratings than those of Gemini Pro Vision, and near perfect agreement on the interior ratings was achieved between GPT-4V and Gemini Pro Vision. Finally, the GPT-4V and participants’ ratings show moderate agreement on restaurant food ratings, while Gemini Pro Vision shows poor agreement with participants (ICC = .205). Agreement between the two LLMs on the food ratings is moderate.

Thus, compared to Gemini Pro Vision, GPT-4V provides more congruent ratings to participants in general. However, the level of agreement varies across the three criteria.

Table 2

Intraclass correlation coefficients between Human, GPT-4V, and Gemini Pro Vision in ratings of general preferences, interior design, and food
Raters	General preferences	Interior design	Food
Human vs GPT-4V vs Gemini	.405	.875	.589
Human vs GPT-4V	.593	.840	.673
Human vs Gemini	.000	.791	.205
GPT-4V vs Gemini	.000	.921	.651
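
The verbal labels used for the coefficients in Table 2 follow the Koo & Li cut-offs cited in the statistical analysis section (below .50 poor, .50 to .75 moderate, .75 to .90 good, above .90 excellent). The sketch below simply applies those cut-offs to the Table 2 values; the function name `koo_li_label` and the handling of values that fall exactly on a boundary are our assumptions.

```python
def koo_li_label(icc):
    """Map an ICC value to the verbal label from the Koo & Li cut-offs.

    Boundary handling (e.g. exactly .50 counted as moderate) is an assumption.
    """
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.90:
        return "good"
    return "excellent"


# ICC values copied from Table 2: (general preferences, interior design, food).
table2 = {
    "Human vs GPT-4V vs Gemini": (0.405, 0.875, 0.589),
    "Human vs GPT-4V": (0.593, 0.840, 0.673),
    "Human vs Gemini": (0.000, 0.791, 0.205),
    "GPT-4V vs Gemini": (0.000, 0.921, 0.651),
}

labels = {
    raters: tuple(koo_li_label(value) for value in values)
    for raters, values in table2.items()
}

for raters, (general, interior, food) in labels.items():
    print(f"{raters}: general={general}, interior={interior}, food={food}")
```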

As previously noted, LLMs offer a limited scope of evaluations. Hence, it is important to take a closer look at the differences in the distribution of ratings between the three raters. Humans, GPT-4V and Gemini show significant differences in the distribution of average ratings of restaurant categories (low-cost, mid-range and exclusive restaurants). These apparent differences result from the original ratings of all nine restaurants on a Likert-type scale ranging from 1 (lowest score) to 5 (highest score). People often tend to spread their ratings widely due to their subjective experiences, as they reflect different preferences and experiences. The GPT-4V and Gemini ratings, on the other hand, tend to have a more limited number of ratings spread across only a subset of all possible ratings.

Therefore, understanding and interpreting the distributions of ratings given by humans, GPT-4V and Gemini can provide deeper insights into the differences in perception and evaluation of restaurants between humans and artificial intelligence. Some basic measures of descriptive statistics (medians, means, bootstrap 95% CI for means and standard deviations) are provided along with the boxplots reflecting the corresponding distributions of average ratings as supplementary material to this article (section S2 in the Supplementary Information).

Visual perception of the environment is essential for the formation of impressions that can subsequently facilitate human decision making. Our results suggest that GPT-4V and Gemini Pro Vision have limited ability to elicit general preferences from visual stimuli in a similar way to humans. This is evident from the ICC values, which show a moderate level of agreement for general preferences. In addition, ICC values are higher for categorizing restaurants based on photographs of the interior design, indicating good agreement, and slightly lower for ratings based on photographs of the food.

Regarding the degree of agreement in the general preference of restaurants, GPT-4V provides meaningful responses, albeit at a moderate level. In contrast, Gemini Pro Vision shows a lack of agreement on this criterion. All this leads to the conclusion that the current LLMs have limited ability to infer what people might like from visual stimuli. However, it is important to emphasize that we used LLMs off-the-shelf without any additional fine-tuning. We would like to note that additional fine-tuning could increase the degree of agreement with human ratings. The most important aspect of additional fine-tuning would be to train the LLMs to provide a wider range of ratings that would align more with the distributions of participants’ ratings.

The importance of this study resides in the presentation of the methodology used to assess the correspondence between people’s subjective preferences and LLMs. Although there are a growing number of studies in this area, to our knowledge this is the first to assess similarities in responses coded as Likert-type scales between humans and AIs. Previous studies used textual output from LLMs. Here, we obtained numerical output from LLMs and compared it with quantitative human ratings. Furthermore, the assessment of aptitudes (e.g., knowledge or abilities) is more straightforward than the assessment of attitudes because aptitude tasks can be coded as true or false answers. A more delicate task is assessing the validity of test items that do not contain true or false answers, such as personality questionnaires [20], which usually contain this type of response format. In this case, the similarity of responses between human subjects and AI does not reflect a result that could be considered a valid response. By means of such a methodological approach, this study suggests that LLMs can provide a meaningful response on a Likert-type scale, but their ability to evaluate visual stimuli similarly to human subjects is currently limited.

The direct implication of the study on the case of restaurant ratings can be seen in considering the potential use of these LLMs as a recommendation system for finding restaurants on specialized travel-related services such as TripAdvisor, Expedia or Lonely Planet, if future developments increase the degree of agreement in ratings. Instead of the usual filters where travelers specify criteria such as cuisine or location, LLMs may act as a more intuitive guide by analyzing photographs and recommending restaurants based on their visual appeal. In simpler terms, it is like having an AI companion that understands and appreciates photographs, much like a human. Future studies are needed to investigate whether LLM’s suggestions for restaurants match what humans find visually appealing, adding a new layer of assistance to traditional search filters when looking for a place to eat. Such an upgrade to conventional filter search could be a way to make the process more user-friendly and personalized by relying on AI’s ability to understand visual cues and preferences. However, this should not just be limited to restaurants. This AI capability (once improved) can also be helpful in many other recommendation systems for any other commercial item or service.

More generally, if future studies show that AI is able to provide meaningful ratings on various aspects similar to those of human subjects, this could open up the possibility of using LLMs as a tool to assess general human preferences on various topics, which could be very useful in social science, but also for commercial and industrial purposes. Namely, it is sometimes difficult and costly to collect a random sample of participants in social science. If LLMs can mimic true participants’ responses on any subject (on a group level), this might be useful for simulation studies on a wide range of topics. As our data indicate, this is not yet possible, but it seems likely in the near future.

The findings should be interpreted in light of the limitations of the study. Human behavior is highly context-dependent and multi-determined. Our data reveal that LLMs have a limited capability, reaching at most a moderate level, to rate restaurants from photographs of the interior design and food in a manner consistent with humans. However, this does not imply that LLMs would perform equally with objects other than restaurants (e.g. cars, clothes, travel destinations, etc.). Therefore, our findings represent only part of the information about the abilities of LLMs to predict people’s preferences. Future studies should expand the choice of objects to examine the level of agreement of LLMs with humans before we can speak about the general ability of LLMs to reflect human preferences. In addition, our study relates to only one component of decision making – valuation, i.e. general preference. Future studies should investigate the ability of LLMs to predict people’s real behavior, which should take into account the likelihood of performing actions to obtain the desired goal. Finally, we experimented with nine objects to assess the agreement in ratings between humans and LLMs. Since the number of raters is fixed (as there are only a limited number of LLMs with the ability to process visual stimuli), future studies should increase the number of objects per rater, as this might increase the statistical power of the study (see [21]).

To conclude, the findings of this study suggest that LLM systems currently have limited ability to predict human preferences based solely on photographs of objects. We found moderate agreement between GPT-4V (but not Gemini Pro Vision) and human ratings of the general preferences of the restaurants. In contrast, agreement was higher for restaurant impressions based solely on photographs of the interior, and to a lesser degree, in the case of photographs of the restaurant’s food. Overall, GPT-4V achieved slightly higher levels of agreement compared to Gemini Pro Vision. The main implication of this study is that we have presented an objective method for assessing agreement between LLMs and human ratings, which could be useful in planning future studies in this dynamic field of AI research.

4.1. Visual stimuli

Nine restaurants from TripAdvisor were selected as stimuli for this study. Based on TripAdvisor ratings, three restaurants were selected as representatives of low-cost or low-budget, three as mid-range, and three as exclusive restaurants. The aim was to achieve variability in the ratings of the restaurants in order to elicit different ratings of general preferences from the participants and the LLMs. This was necessary to obtain greater variability in the responses to assess the covariation between the ratings of the LLMs and humans. With similarly rated restaurants, we would obtain more homogeneous ratings, so we would not be able to determine whether the LLMs are able to distinguish restaurants in a meaningful way. Furthermore, since familiarity with restaurants may contribute to restaurant ratings, we randomly selected restaurants from Barcelona as the stimuli for our study, a city that was also randomly selected. The participants rated the selected restaurants based solely on photographs and not on their customer experience with these objects. In this way, all raters (humans, GPT-4V and Gemini Pro Vision) rated unfamiliar restaurants.

The photographs were rated on a five-point Likert-type scale. To assess the general preference of the restaurant, we asked participants: “On a scale from 1 to 5 (1 = low, 5 = high), how much do you like this restaurant overall?”. In addition, we asked participants, “On a scale from 1 to 5 (1 = low-cost, 5 = expensive), which category do you think the restaurant belongs to according to the interior photographs?”, and “On a scale from 1 to 5 (1 = low-cost, 5 = expensive), which category do you think the restaurant belongs to according to the food photographs?”. These two questions were included in the study to compare the degree of agreement in classification with the degree of agreement in general preferences between LLMs and participants. Hence, each restaurant was assessed on the basis of three ratings: general preferences, interior design and food. All visual materials used in this study can be found in the Supplemental information (see S4. Restaurants photographs in Supplementary information).

4.2. GPT-4V and Gemini Pro Vision prompts (March 2024)

With slight differences in the implementation, we prompted both the GPT-4V and Gemini APIs in the same fashion. In both APIs, passing text prompts and URLs of restaurant photographs is possible. In addition to the textual and visual part of a prompt to an LLM, it is essential to mention that we left the parameter that regulates the randomness of an LLM, the temperature, at its default value.

From the survey data, we selected the age, socioeconomic status, gender and work status cells for each participant, resulting in 505 four-dimensional data points. First, we wanted to ask an LLM, given a CSV of these 505 x 4 values, how it would rate the restaurants overall, their interior design and their food from the perspective of each participant in the CSV. The prompt looked like this:
„I’ll provide you with 4 images of a restaurant. Also, I’ll provide you with data that consists of 505 rows. Each row represents a person. The first cell in a row represents age, the second socioeconomic status, the third cell represents sex, and the last cell represents working status. Here are the data:

21,2,F,S

17,2,M,S

...

19,3,M,S

21,1,F,S

Next, I’ll define 3 questions:

On a scale from 1 to 5 (1 = low, 5 = high), how much do you like this restaurant overall?”

On a scale from 1 to 5 (1 = low-cost, 5 = expensive), which category do you think the restaurant belongs to according to the interior images?”

On a scale from 1 to 5 (1 = low-cost, 5 = expensive), which category do you think the restaurant belongs to according to the food images?

Please, for each row in data, i.e. for each person with their personal data, answer these three question. Let the final output be CSV. Be very concise and don’t describe given images. Don’t give reasons and just focus on the answering task!“

Due to the limited number of input tokens, we could not submit such a prompt in a single shot. Instead, we split the data into groups of 10 data points (the last group consisted of 5) and submitted the aforementioned prompt group by group. The responses obtained for each prompt were parseable, yielding the original information (e.g., 17,2,M,S) and the simulated ratings (e.g., 4,3,4) for each data point.
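
The batching step can be sketched as follows. This is a minimal illustration, assuming the participant rows are already available as CSV strings; the helper names `split_into_batches` and `build_prompt`, the placeholder rows, and the placeholder intro and question strings are ours, and no API call is shown.

```python
def split_into_batches(rows, size=10):
    """Split the participant rows into consecutive groups of at most `size`."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def build_prompt(intro, batch, questions):
    """Assemble one text prompt: intro text, one CSV row per line, then the questions."""
    return "\n".join([intro, *batch, questions])


# Placeholder rows standing in for the 505 real participants
# (age, socioeconomic status, sex, working status); not the study's data.
rows = [f"{18 + i % 50},2,F,S" for i in range(505)]

batches = split_into_batches(rows)
prompts = [
    build_prompt("<intro text of the prompt>", batch, "<the three questions>")
    for batch in batches
]

print(len(batches), len(batches[-1]))
```

With 505 rows and groups of 10, this yields 51 prompts per restaurant, the last of which carries 5 rows, as described above.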

4.3. Participants and procedure

A community sample of N = 505 participants (222 males and 283 females) from HIDDEN INFORMATION participated in the study. They were recruited by students in exchange for course credits. The average age of the participants was M = 26.59 (SD = 10.85). Of the sample, 62.38% were students and 37.62% were employed. A further 41 participants in the retired and unemployed categories were excluded from the analysis because these categories were highly underrepresented in the original sample (consisting of 546 participants). In addition, we assessed self-reported socioeconomic status (SES), where 8.91% of participants grouped themselves in a below-average, 61.98% in an average and 29.11% in an above-average group. The ethical board of HIDDEN INFORMATION approved the study.

4.4. Statistical analysis

Three different raters (participants, GPT-4V and Gemini) rated nine restaurants on three criteria (general preference, interior design and food). When data collection with participants was completed, GPT-4V and Gemini were prompted to provide samples of 505 ratings on all three questions for all nine restaurants in the observed roles of participants defined by gender, work status, age, and socioeconomic status, which is feasible according to a previous study [22]. In this way, we obtained two mirrored datasets that matched the original data collected from the participants. Hence, the participants and the LLMs provided their ratings on the same rating scale, with the same sample structure based on the demographic description of the participants and with the same number of observations, i.e. 505 observations each. This was important because it would not be justified to estimate the level of agreement between the arithmetic means of the participants’ ratings and only a single rating from each LLM. Moreover, the LLMs’ temperature parameter allows responses to vary, which increases the variability of responses. If only one rating from each LLM were used, this would decrease the likelihood of replicating the findings, as an LLM can provide different ratings for the same repeated prompt. Therefore, in this study, the only difference between these three datasets was the answers (i.e. ratings) to the three questions mentioned above, while all other aspects were the same (number of prompts corresponding to the number of participants, demographic structure of participants/roles, items/prompts). The ratings obtained from the three sources represented the dependent variables in our study.

The ratings in the mirrored datasets obtained with GPT-4V and Gemini were compared with the participants’ ratings. The comparison, i.e. the degree of agreement in the ratings between all three sources (i.e. the raters), is not suitable at an individual level, as it is not reasonable to expect that an LLM can account for all inter-individual differences between participants in restaurant ratings. Therefore, the degree of agreement must be assessed at a more general, i.e. group level. The arithmetic means were calculated for each rater (participants, GPT-4V and Gemini Pro Vision) and were used to assess the level of agreement in the ratings for all three criteria (general preference, interior design, food). The alternative statistical approach, which was considered but not applied in this study, is elaborated in more detail in S1. Discussion in the Supplementary information.
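
The group-level aggregation can be sketched as below. This is only an illustration: it assumes each parsed response line carries the four demographic cells followed by the three ratings (e.g. 17,2,M,S,4,3,4), which is our reading of the response format described in Section 4.2, and the example lines and the names `parse_line` and `group_means` are hypothetical.

```python
from statistics import mean

CRITERIA = ("overall", "interior", "food")


def parse_line(line):
    """Parse one response row: age, SES, sex, work status, then three ratings.

    The seven-field layout is an assumption about the response format.
    """
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    record = {"age": int(fields[0]), "ses": int(fields[1]),
              "sex": fields[2], "work": fields[3]}
    for name, value in zip(CRITERIA, fields[4:]):
        record[name] = int(value)
    return record


def group_means(lines):
    """Average each of the three ratings over all parsed rows for one restaurant."""
    records = [parse_line(line) for line in lines if line.strip()]
    return {name: mean(r[name] for r in records) for name in CRITERIA}


# Two made-up response rows for a single restaurant (not the study's data).
example = ["17,2,M,S,4,3,4", "21,2,F,S,3,3,2"]
means = group_means(example)
print(means)
```

Repeating this for each restaurant and each rater gives the nine means per rater and criterion that enter the agreement analysis.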

The degree of agreement between the LLMs’ ratings and the human ratings was analyzed using intraclass correlation coefficients (ICC). This statistical index can evaluate the agreement between two or more raters (in our case participants, GPT-4V and Gemini Pro Vision) of the same group of objects (in this case restaurants). Since we have the same set of raters (participants, GPT-4V and Gemini Pro Vision) for the same sample of observations (9 restaurants), we calculated the ICC(3, k) according to the classification of ICC models [23]. We estimated the degree of agreement for the ratings of general preferences, interior design and food. According to Koo & Li [24], ICC values below .50 indicate poor agreement, values between .50 and .75 indicate moderate agreement, values between .75 and .90 indicate good agreement, and values above .90 indicate excellent agreement. Because the ratings are average values obtained from 505 participants or prompts, we increased the generalizability of the ratings but reduced the number of observations (the nine restaurants rather than the individual participants). Nevertheless, the ICC can be assessed with a smaller number of observations. Our sample of three raters with nine observations is sufficient to detect an agreement level of approximately .70 with a predetermined alpha of .05 and a power of .80 [21].
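For readers who want to see the computation, the sketch below implements ICC(3, k) from the two-way ANOVA mean squares, ICC(3, k) = (BMS − EMS) / BMS, where BMS is the between-targets mean square and EMS the residual mean square, following the Shrout & Fleiss formulation. It is a plain NumPy illustration rather than the software used in the study. The demonstration matrix is the worked example from Shrout & Fleiss (six targets, four judges), not the restaurant ratings, and the handling of values that fall exactly on a threshold in `koo_li_label` is an assumption.

```python
import numpy as np


def icc3k(x):
    """ICC(3, k): two-way model with fixed raters, average of k raters.

    x has one row per rated object (target) and one column per rater.
    """
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)  # between targets
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)  # between raters
    ss_total = np.sum((x - grand) ** 2)
    ss_err = ss_total - ss_rows - ss_cols                # residual
    bms = ss_rows / (n - 1)
    ems = ss_err / ((n - 1) * (k - 1))
    return (bms - ems) / bms


def koo_li_label(icc):
    """Verbal label using the cut-offs cited in the text.

    Assumption: exactly .50 and .75 go to the higher band, exactly .90 to 'good'.
    """
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.90:
        return "good"
    return "excellent"


# Worked example from Shrout & Fleiss (1979): 6 targets rated by 4 judges.
sf_example = np.array([
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
])

icc = icc3k(sf_example)    # roughly .91, matching the published example
label = koo_li_label(icc)  # "excellent"
```

For the design described here, the input would instead be a 9 x 3 matrix of mean ratings (restaurants by raters), computed separately for general preference, interior design and food.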

Wu, T., He, S., Liu, J., Sun, S., Liu, K., Han, Q. L., & Tang, Y. A brief overview of ChatGPT: The history, status quo and potential future development. IEEE/CAA Journal of Automatica Sinica, 10, 1122-1136 (2023). https://doi.org/10.1109/JAS.2023.123618
Breithaupt, F., Otenen, E., Wright, D. R., Kruschke, J. K., Li, Y., & Tan, Y. Humans create more novelty than ChatGPT when asked to retell a story. Scientific Reports, 14, 875 (2024). https://doi.org/10.1038/s41598-023-50229-7
Orrù, G., Piarulli, A., Conversano, C., & Gemignani, A. Human-like problem-solving abilities in large language models using ChatGPT. Frontiers in Artificial Intelligence, 6, 1199350 (2023). https://doi.org/10.3389/frai.2023.1199350
Elyoseph, Z., Hadar-Shoval, D., Asraf, K., & Lvovsky, M. ChatGPT outperforms humans in emotional awareness evaluations. Frontiers in Psychology, 14, 1199058 (2023). https://doi.org/10.3389/fpsyg.2023.1199058
Hagendorff, T., Fabi, S., & Kosinski, M. Human-like intuitive behavior and reasoning biases emerged in large language models but disappeared in ChatGPT. Nature Computational Science, 3, 833-838 (2023). https://doi.org/10.1038/s43588-023-00527-x
Tversky, A., & Kahneman, D. Availability: A heuristic for judging frequency and probability. Cognitive Psychology, 5, 207-232 (1973). https://doi.org/10.1016/0010-0285(73)90033-9
Suri, G., Slater, L. R., Ziaee, A., & Nguyen, M. Do large language models show decision heuristics similar to humans? A case study using GPT-3.5. Journal of Experimental Psychology: General, 153, 1066–1075 (2024). https://doi.org/10.1037/xge0001547
Reisenzein, R., Corr, P. J., & Krupić, D. Motivation, Emotions and Personality. In P. J. Corr & D. Krupić (Eds.). Personality & Intelligence: The Psychology of Individual Differences. Oxford: Oxford University Press. (2024).
Tversky, A. Additivity, utility, and subjective probability. Journal of Mathematical Psychology, 4, 175-201 (1967). https://doi.org/10.1016/0022-2496(67)90049-1
Rubio, F., Flores, M. J., & Puerta, J. M. Ranking-based scores for the assessment of aesthetic quality in photography. Signal Processing: Image Communication, 108, 116803 (2022). https://doi.org/10.1016/j.image.2022.116803
Xu, Z., Zhang, Y., Xie, E., Zhao, Z., Guo, Y., Wong, K.-Y. K., Li, Z., & Zhao, H. DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model. arXiv preprint arXiv:2310.01412 (2024). https://doi.org/10.48550/arXiv.2310.01412
Zhou, X., & Knoll, A. C. GPT-4V as Traffic Assistant: An In-depth Look at Vision Language Model on Complex Traffic Events. arXiv preprint arXiv:2402.02205 (2024). https://doi.org/10.48550/arXiv.2402.02205
Driessen, T., Dodou, D., Bazilinskyy, P., & De Winter, J. C. F. Putting ChatGPT Vision (GPT-4V) to the test: Risk perception in traffic images. Preprint (2023). https://bazilinskyy.github.io/publications/driessen2023putting.pdf
Yuan, Z., Wang, X., Wang, K., & Sun, L. ArtGPT-4: Towards Artistic-understanding Large Vision-Language Models with Enhanced Adapter. arXiv preprint arXiv:2305.07490 (2024). https://doi.org/10.48550/arXiv.2305.07490
Lee, S., Lin, H. P., Park, J., Lim, E., & Woo, J. NLP Models Classifying Helpful Ratings in OpenTable Dataset. International Conference on Internet (ICONI) 2023. (2023). https://www.calstatela.edu/sites/default/files/opentableRatingNLP_ICONI_2023.pdf
Chen, Y., & Xia, F. Restaurants’ Rating Prediction Using Yelp Dataset. In 2020 IEEE International Conference on Advances in Electrical Engineering and Computer Applications (AEECA) (pp. 113-117). Dalian, China. (2020). https://doi.org/10.1109/AEECA49918.2020.9213704
Rodríguez-de-Vera, J. M., Villacorta, P., Estepa, I. G., Bolaños, M., Sarasúa, I., Nagarajan, B., & Radeva, P. Dining on Details: LLM-Guided Expert Networks for Fine-Grained Food Recognition. In Proceedings of the 8th International Workshop on Multimedia Assisted Dietary Management (MADiMa '23) (pp. 43–52). Association for Computing Machinery. (2023). https://doi.org/10.1145/3607828.3617797
Liao, H., Li, Y., Hu, T., & Luo, J. Inferring restaurant styles by mining crowd-sourced photos from user-review websites. In 2016 IEEE International Conference on Big Data (IEEE BigData 2016) (pp. 937–944). Washington, DC, USA: IEEE Computer Society. (2016). https://doi.org/10.1109/BIGDATA.2016.7840690
Krupić, D., Corr, P. J., & Satchell, L. Assessment: Methods, Data, and Interpretation. In P. J. Corr & D. Krupić (Eds.). Personality & Intelligence: The Psychology of Individual Differences. Oxford: Oxford University Press. (2024).
Bujang, M. A., & Baharum, N. A simplified guide to determination of sample size requirements for estimating the value of intraclass correlation coefficient: a review. Archives of Orofacial Science, 12, 1-11. (2017). https://pesquisa.bvsalud.org/portal/resource/pt/wpr-625452
Aher, G. V., Arriaga, R. I., & Kalai, A. T. Using large language models to simulate multiple humans and replicate human subject studies. In International Conference on Machine Learning (pp. 337-371). PMLR. (2023). https://proceedings.mlr.press/v202/aher23a.html
Shrout, P. E., & Fleiss, J. L. Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin, 86, 420-428. (1979). https://doi.org/10.1037/0033-2909.86.2.420
Koo, T. K., & Li, M. Y. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of chiropractic medicine, 15, 155-163. (2016). https://doi.org/10.1016/j.jcm.2016.02.012

There is NO Competing Interest.

Supplementaryinformation.docx
