3D-aware image synthesis encompasses a variety of tasks, such as scene generation and novel view synthesis from images. Despite the abundance of task-specific methods, developing a comprehensive model remains challenging. In this paper, we present SSDNeRF, a unified approach that employs an expressive diffusion model to learn a generalizable prior of neural radiance fields (NeRF) from multi-view images of diverse objects. Previous studies have adopted two-stage approaches that rely on pretrained NeRFs as real data to train diffusion models. In contrast, we propose a new single-stage training paradigm with an end-to-end objective that jointly optimizes a NeRF auto-decoder and a latent diffusion model, enabling simultaneous 3D reconstruction and prior learning even from sparse-view observations. At test time, we can directly sample the diffusion prior for unconditional generation, or combine it with arbitrary observations of unseen objects for NeRF reconstruction. SSDNeRF demonstrates robust results comparable to or better than leading task-specific methods in unconditional generation and single/sparse-view 3D reconstruction.
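To make the single-stage objective concrete, the following is a minimal PyTorch-style sketch of one joint training step, in which per-scene latent codes (the auto-decoder), the NeRF decoder, and the latent diffusion model share a single end-to-end loss. All names (scene_codes, decoder.render, diffusion.loss, lambda_diff, the batch keys) are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def training_step(scene_codes, decoder, diffusion, batch, optimizer, lambda_diff=1.0):
    """One joint update of the per-scene latent codes, the NeRF decoder,
    and the latent diffusion model against a single end-to-end objective."""
    codes = scene_codes(batch["scene_ids"])            # per-scene latent codes (auto-decoder)
    rgb_pred = decoder.render(codes, batch["rays"])    # volume-render the decoded radiance fields
    l_rend = torch.nn.functional.mse_loss(rgb_pred, batch["rgb_gt"])  # photometric loss on observed views
    l_diff = diffusion.loss(codes)                     # denoising loss of the diffusion prior on the same codes
    loss = l_rend + lambda_diff * l_diff               # end-to-end objective: reconstruction + prior learning
    optimizer.zero_grad()
    loss.backward()                                    # gradients reach codes, decoder, and diffusion jointly
    optimizer.step()
    return loss.item()
```

Under this sketch, the rendering term fits the latent codes to whatever views are available (possibly only sparse ones), while the diffusion term regularizes those same codes toward the learned prior; at test time the prior can be sampled on its own or combined with the rendering loss on new observations.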