We consider the task of animating 3D facial geometry from speech signals. Existing works are primarily deterministic, focusing on learning a one-to-one mapping from speech to 3D face meshes on small datasets with a limited number of speakers. While these models can achieve high-quality lip articulation for speakers in the training set, they are unable to capture the full and diverse range of 3D facial motions that accompany speech in the real world. The relationship between speech and facial motion is one-to-many, encompassing both inter-speaker and intra-speaker variation, which necessitates a probabilistic approach. In this work, we identify and address key challenges that have so far limited the development of probabilistic models: the lack of datasets and metrics suitable for training and evaluating them, and the difficulty of designing a model that generates diverse results while remaining faithful to the strong conditioning signal provided by speech. We first propose large-scale benchmark datasets and metrics suitable for probabilistic modeling. We then present a probabilistic model that achieves both diversity and fidelity to speech, outperforming other methods on the proposed benchmarks. Finally, we demonstrate useful applications of probabilistic models trained on these large-scale datasets: we can generate diverse speech-driven 3D facial motion that matches the styles of unseen speakers extracted from reference clips, and our synthetic meshes can be used to improve the performance of downstream audio-visual models.