Seismic event detection and phase picking are the base of many seismological workflows. In recent years, several publications demonstrated that deep learning approaches significantly outperform classical approaches, achieving human-like performance under certain circumstances. However, as studies differ in the datasets and evaluation tasks, it is unclear how the different approaches compare to each other. Furthermore, there are no systematic studies about model performance in cross-domain scenarios, that is, when applied to data with different characteristics. Here, we address these questions by conducting a large-scale benchmark. We compare six previously published deep learning models on eight data sets covering local to teleseismic distances and on three tasks: event detection, phase identification and onset time picking. Furthermore, we compare the results to a classical Baer-Kradolfer picker. Overall, we observe the best performance for EQTransformer, GPD and PhaseNet, with a small advantage for EQTransformer on teleseismic data. Furthermore, we conduct a cross-domain study, analyzing model performance on data sets they were not trained on. We show that trained models can be transferred between regions with only mild performance degradation, but models trained on regional data do not transfer well to teleseismic data. As deep learning for detection and picking is a rapidly evolving field, we ensured extensibility of our benchmark by building our code on standardized frameworks and making it openly accessible. This allows model developers to easily evaluate new models or performance on new data sets. Furthermore, we make all trained models available through the SeisBench framework, giving end-users an easy way to apply these models.