Object reconstruction from 3D point clouds has been a long-standing research topic in computer vision and computer graphics, and has achieved impressive progress. However, reconstruction from time-varying point clouds (a.k.a. 4D point clouds) remains largely overlooked. In this paper, we propose a new network architecture, namely RFNet-4D++, that jointly reconstructs 3D objects and their motion flows from 4D point clouds. The key insight is that performing both tasks simultaneously allows each task to benefit from the other, leading to improved overall performance. To achieve this, we design a compositional encoder that learns spatio-temporal representations of 4D point clouds via a dual cross-attention mechanism. In addition, we devise a joint-learning scheme combining unsupervised learning of temporal vector fields with supervised learning of occupancy fields. This multi-task learning is realised by jointly optimising the two loss functions over shared spatio-temporal features. Experiments and analyses on benchmark datasets validate the effectiveness and efficiency of our method: it achieves state-of-the-art performance on both flow estimation and object reconstruction while being much faster than existing methods in both training and inference. Our code and data are available at \url{https://github.com/hkust-vgd/RFNet-4D}.
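To make the joint-learning scheme concrete, below is a minimal PyTorch sketch of such a combined objective. All names (`flow_decoder`, `occ_decoder`, `lambda_flow`) are illustrative assumptions, and the unsupervised flow term is sketched here as a Chamfer-style warping consistency; the paper's actual loss terms and network details may differ.

```python
# Hypothetical sketch of a joint flow/occupancy objective (illustrative
# names; not the authors' actual implementation).
import torch
import torch.nn.functional as F

def joint_loss(shared_feats, flow_decoder, occ_decoder,
               points_t, points_t1, query_points, occ_gt, lambda_flow=1.0):
    """Combine an unsupervised flow loss with a supervised occupancy loss.

    shared_feats : (B, C) spatio-temporal features from the shared encoder
    points_t     : (B, N, 3) point cloud at frame t
    points_t1    : (B, N, 3) point cloud at frame t+1
    query_points : (B, M, 3) query locations for occupancy supervision
    occ_gt       : (B, M) ground-truth occupancy labels in {0, 1}
    """
    # Unsupervised flow: warp frame-t points forward by the predicted flow
    # and penalise the Chamfer distance to the observed frame t+1 cloud
    # (no ground-truth flow labels are required).
    flow = flow_decoder(points_t, shared_feats)            # (B, N, 3)
    warped = points_t + flow
    d = torch.cdist(warped, points_t1)                     # (B, N, N)
    chamfer = d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

    # Supervised occupancy: binary cross-entropy against GT occupancy,
    # decoded from the same shared spatio-temporal features.
    occ_logits = occ_decoder(query_points, shared_feats)   # (B, M)
    occ_loss = F.binary_cross_entropy_with_logits(occ_logits, occ_gt)

    # Jointly optimising both terms couples the two tasks through the
    # shared encoder features.
    return occ_loss + lambda_flow * chamfer
```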