Stress can be considered a mental and physiological reaction to conditions of high discomfort and challenging situations. A person's stress level is reflected in both their physiological responses and their speech. In this work, we introduce a novel decision-level fusion framework for multimodal stress level detection based on physiological signals from wearable devices and recordings of user speech. The physiological signals include Electrocardiogram (ECG), Respiration (RSP), and Inertial Measurement Unit (IMU) signals from sensors embedded in a smart vest. A data collection protocol was designed to gather training data for both the sensor-based and audio-based stress detection modules. Five subjects participated in the data collection, during which both their physiological and audio signals were recorded. The analysis of the physiological signals involves extensive feature extraction combined with various fusion and feature selection methods. The audio analysis comprises state-of-the-art feature extraction followed by a classifier that predicts stress levels. The outputs of the audio and physiological analyses are fused at the decision level by a machine learning algorithm to produce the final stress level prediction. The whole framework was also tested in a pilot scenario as part of the XR4DRAMA project, and we have made the training dataset publicly available.