The use of wearable devices in daily life has grown rapidly with advances in sensor technologies. These devices primarily rely on optical sensors to capture video from an egocentric perspective, known as First Person Vision (FPV). FPV videos possess distinct characteristics compared to third-person videos, such as significant ego-motion and frequent scene changes. Consequently, vision-based methods designed for third-person videos, where the camera is positioned away from events and actors, cannot be directly applied to egocentric videos. New approaches are therefore needed that can effectively analyze egocentric videos and accurately integrate inputs from multiple sensors for specific tasks. In this study, we propose an audio-visual decision fusion framework for egocentric activity recognition. Our framework leverages deep features and incorporates a two-stage decision fusion mechanism. In addition, we introduce a new publicly available dataset, the Egocentric Outdoor Activity Dataset (EOAD), which comprises 30 different egocentric activities and 1392 video clips with audio. Experimental evaluations show that combining audio and visual information enhances activity recognition performance, yielding promising results compared to using a single modality or equally weighted decisions from multiple modalities.
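To make the idea of weighted decision-level fusion concrete, the sketch below combines per-class probability vectors from a visual and an audio classifier using modality weights. This is only a minimal illustration under assumed names and weights (`fuse_decisions`, `w_visual`, `w_audio`, a 3-class example), not the paper's exact two-stage mechanism; setting the weights equal reduces to the equally weighted baseline mentioned above.

```python
import numpy as np

def fuse_decisions(visual_probs, audio_probs, w_visual=0.6, w_audio=0.4):
    """Weighted decision-level (late) fusion of per-class probability vectors.

    visual_probs, audio_probs: 1-D arrays of class probabilities (same length).
    w_visual, w_audio: illustrative modality weights; equal weights correspond
    to the equally weighted fusion baseline.
    Returns the index of the predicted activity class.
    """
    visual_probs = np.asarray(visual_probs, dtype=float)
    audio_probs = np.asarray(audio_probs, dtype=float)
    fused = w_visual * visual_probs + w_audio * audio_probs
    return int(np.argmax(fused))

# Illustrative usage with 3 classes (EOAD itself has 30 activities):
visual = [0.2, 0.5, 0.3]   # e.g., softmax output of a visual model
audio = [0.1, 0.3, 0.6]    # e.g., softmax output of an audio model
print(fuse_decisions(visual, audio))  # fused class index
```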