Video understanding identifies and classifies actions and events in video. Many previous works, such as video annotation, have shown promising success at general video understanding. However, producing a fine-grained summary of human activities and their interactions with state-of-the-art video captioning techniques remains difficult. A comprehensive account of individual actions and collective behavior is valuable for real-time CCTV monitoring, medical treatment, sports video analysis, and similar applications. This research proposes a form of video understanding that focuses on recognizing group activity by learning pairwise similarities between actor appearances. To measure the similarity between paired appearance features and construct an actor relation graph, Zero-Mean Normalized Cross-Correlation (ZNCC) and the Zero-Mean Sum of Absolute Differences (ZSAD) are proposed, allowing a graph convolutional network (GCN) to learn to distinguish group actions. We adopt MNASNet as the backbone to extract features from each video frame. A visualization module is also developed that renders every input frame and predicts individual actions or collective activity, drawing projected bounding boxes on each human object.
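The abstract does not spell out how the similarity scores are normalized into a relation graph, so the following is a minimal sketch of how ZNCC and ZSAD could score two actors' appearance feature vectors and how those scores could populate a softmax-normalized actor relation graph consumed by a GCN; the function names, the softmax row normalization, and the single propagation step are illustrative assumptions, not the paper's confirmed implementation.

```python
import numpy as np

def zncc(a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> float:
    # Zero-Mean Normalized Cross-Correlation: subtract each vector's mean,
    # then take the cosine similarity of the centered vectors.
    # Range [-1, 1]; higher means more similar appearance.
    a0, b0 = a - a.mean(), b - b.mean()
    return float(a0 @ b0 / (np.linalg.norm(a0) * np.linalg.norm(b0) + eps))

def zsad(a: np.ndarray, b: np.ndarray) -> float:
    # Zero-Mean Sum of Absolute Differences: lower means more similar.
    return float(np.abs((a - a.mean()) - (b - b.mean())).sum())

def actor_relation_graph(feats: np.ndarray) -> np.ndarray:
    # feats: (N, D) appearance features, one row per detected actor.
    # Returns an (N, N) adjacency matrix whose rows are softmax-normalized
    # ZNCC scores (an assumed normalization) for use by a GCN layer.
    n = feats.shape[0]
    g = np.array([[zncc(feats[i], feats[j]) for j in range(n)]
                  for i in range(n)])
    e = np.exp(g - g.max(axis=1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=1, keepdims=True)

def gcn_step(g: np.ndarray, feats: np.ndarray, w: np.ndarray) -> np.ndarray:
    # One graph convolution: aggregate neighbor features along the relation
    # weights, project with a learned weight matrix w, apply ReLU.
    return np.maximum(g @ feats @ w, 0.0)
```

ZSAD is shown alongside ZNCC because the abstract proposes both; either (or a combination) could fill the adjacency matrix, with ZSAD scores negated or inverted first since lower ZSAD means higher similarity.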
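For the frame-level backbone, a minimal sketch using torchvision's MNASNet is given below; the `mnasnet1_0` variant, the pretrained weight name, and the global-average pooling are assumptions (the abstract only names MNASNet as the backbone).

```python
import torch
import torchvision

# Load an ImageNet-pretrained MNASNet as the feature backbone
# (variant and weights are assumptions; torchvision >= 0.13 syntax).
backbone = torchvision.models.mnasnet1_0(weights="IMAGENET1K_V1")
backbone.eval()

frames = torch.randn(4, 3, 224, 224)  # placeholder batch of 4 RGB frames

with torch.no_grad():
    fmap = backbone.layers(frames)   # (4, 1280, 7, 7) spatial feature map
    feats = fmap.mean(dim=(2, 3))    # (4, 1280) global-average-pooled features
```

In the full pipeline, per-actor features would typically be cropped from the spatial feature map using the detected bounding boxes (e.g., with RoIAlign) before the pairwise similarity step, rather than pooled over the whole frame as in this sketch.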