The rapid proliferation of internet data, particularly through social media, has amplified the need for effective sentiment analysis, including the complex task of sarcasm detection. This paper presents a novel multi-modal sarcasm detection model that leverages cue learning to address the challenges posed by data scarcity, especially in low-resource languages. The proposed model builds upon the CLIP architecture, integrating the text and image modalities to co-learn sarcasm cues. The methodology encompasses discrete prompt generation, learnable continuous vectors, and multi-modal fusion to enhance detection accuracy; the fusion step integrates text and image features symmetrically, so that neither modality dominates, which contributes to the improved performance. Experimental results on the Twitter Multi-modal Sarcasm Detection Dataset (MSD) show significant gains over traditional models, highlighting the model's robustness and adaptability in small-sample scenarios. This research contributes a practical solution for nuanced sentiment analysis, paving the way for advanced applications in public opinion monitoring and AI-driven decision-making.
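To make the components named above concrete, the following is a minimal sketch of learnable continuous prompt vectors combined with a symmetric text-image fusion head. It is an illustration under stated assumptions, not the paper's actual architecture: the module name `PromptedSarcasmClassifier`, the embedding dimension, the mean-pooled prompt handling, and the averaged two-order fusion are all assumptions, and plain linear layers stand in for frozen CLIP encoders so the sketch runs without pretrained weights.

```python
# Minimal sketch: CLIP-style soft prompts + symmetric text-image fusion
# for binary sarcasm classification. All names and dimensions are illustrative.
import torch
import torch.nn as nn

class PromptedSarcasmClassifier(nn.Module):
    def __init__(self, embed_dim=512, n_prompt_tokens=8, n_classes=2):
        super().__init__()
        # Learnable continuous prompt vectors prepended to the text tokens.
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, embed_dim) * 0.02)
        # Stand-ins for frozen CLIP text/image encoders (simple projections),
        # so the example runs without downloading pretrained weights.
        self.text_encoder = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU())
        self.image_encoder = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU())
        # Fusion MLP applied to the concatenated modality features.
        self.fusion = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, n_classes),
        )

    def forward(self, text_tokens, image_feats):
        # text_tokens: (batch, seq_len, embed_dim); image_feats: (batch, embed_dim)
        prompts = self.prompt.unsqueeze(0).expand(text_tokens.size(0), -1, -1)
        prompted = torch.cat([prompts, text_tokens], dim=1).mean(dim=1)
        t = self.text_encoder(prompted)
        v = self.image_encoder(image_feats)
        # Symmetric fusion: fuse in both orders and average the logits,
        # so text and image are treated on equal footing.
        logits_tv = self.fusion(torch.cat([t, v], dim=-1))
        logits_vt = self.fusion(torch.cat([v, t], dim=-1))
        return (logits_tv + logits_vt) / 2

# Toy usage: random features standing in for CLIP embeddings of tweets and images.
model = PromptedSarcasmClassifier()
text = torch.randn(4, 16, 512)   # batch of 4 tweets, 16 tokens each
image = torch.randn(4, 512)      # matching image embeddings
print(model(text, image).shape)  # -> torch.Size([4, 2])
```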