On social media, users frequently convey personal sentiments, some of which may indicate cognitive distortions or suicidal tendencies. Timely recognition of such signals is pivotal for effective intervention. In response, we introduce two novel annotated datasets from Chinese social media, focused on cognitive distortion and suicide risk classification. We propose a comprehensive benchmark that evaluates both supervised learning methods and large language models, particularly the GPT series, on these datasets. To assess the capabilities of the large language models, we employ three strategies: zero-shot prompting, few-shot prompting, and fine-tuning. Furthermore, we analyze the performance of these models in depth from a psychological perspective, shedding light on their strengths and limitations in identifying and understanding complex human emotions. Our evaluations reveal a performance gap between the two approaches, with the models often struggling to separate subtly different categories. While GPT-4 consistently delivered strong results, GPT-3.5 showed marked improvement in suicide risk classification after fine-tuning. To our knowledge, this is among the first evaluations of large language models on Chinese social media tasks of this kind, highlighting the models' potential in psychological contexts. All datasets and code are made available at: \url{https://github.com/HongzhiQ/SupervisedVsLLM-EfficacyEval}.