Multimodal remote sensing data, including spectral imagery and LiDAR or photogrammetric data, are crucial for achieving satisfactory land-use/land-cover classification results in urban scenes. So far, most studies have been conducted in a 2-D context. When 3-D information is available in a dataset, it is typically integrated with the 2-D data by rasterizing it into 2-D formats. Although this approach yields satisfactory classification results, it falls short of fully exploiting the potential of 3-D data, as it restricts the model's ability to learn 3-D spatial features directly from raw point clouds. It also limits the generation of 3-D predictions, since the dimensionality of the input data has been reduced. In this study, we propose a fully 3-D method that fuses all modalities within the 3-D point cloud and employs a dedicated dual-branch Transformer model to learn geometric and spectral features simultaneously. To enhance the fusion process, we introduce a cross-attention-based mechanism that operates entirely on 3-D points, effectively integrating features from the different modalities across multiple scales. Cross-attention allows one modality to assess the importance of the other by weighting its relevant features. We evaluated our method against both 3-D and 2-D methods on the 2018 IEEE GRSS Data Fusion Contest (DFC2018) dataset. Our findings indicate that 3-D fusion delivers results competitive with 2-D methods and offers more flexibility by providing 3-D predictions; these predictions can be projected onto 2-D maps, whereas the reverse is not possible. We also evaluated our method on additional datasets, namely the ISPRS Vaihingen 3-D and the IEEE DFC2019.
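To make the cross-attention idea above concrete, the following is a minimal sketch of a cross-modal fusion block in which per-point features from one branch attend to the other branch (and vice versa), applied at a single scale. It is an illustration of the general mechanism only, not the architecture proposed in the paper: the module name, feature dimensions, head count, and the residual/concatenation choices are all hypothetical.

```python
# Illustrative sketch only: a generic cross-attention fusion block for two
# per-point feature streams (geometric and spectral), written with PyTorch.
# Dimensions, layer choices, and names are hypothetical, not taken from the
# paper; they merely show how one modality can weight the other's features.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        # Geometric features attend to spectral features, and vice versa.
        self.geo_to_spec = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spec_to_geo = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_geo = nn.LayerNorm(dim)
        self.norm_spec = nn.LayerNorm(dim)
        # Simple projection to merge the two attended streams.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, geo: torch.Tensor, spec: torch.Tensor) -> torch.Tensor:
        # geo, spec: (batch, num_points, dim) per-point features from the
        # geometric and spectral branches at one scale.
        geo_att, _ = self.geo_to_spec(query=geo, key=spec, value=spec)
        spec_att, _ = self.spec_to_geo(query=spec, key=geo, value=geo)
        geo_out = self.norm_geo(geo + geo_att)      # residual + normalization
        spec_out = self.norm_spec(spec + spec_att)
        return self.fuse(torch.cat([geo_out, spec_out], dim=-1))


if __name__ == "__main__":
    fusion = CrossModalFusion(dim=128, num_heads=4)
    geo_feats = torch.randn(2, 1024, 128)   # toy geometric point features
    spec_feats = torch.randn(2, 1024, 128)  # toy spectral point features
    fused = fusion(geo_feats, spec_feats)
    print(fused.shape)  # torch.Size([2, 1024, 128])
```

In a multi-scale setting, a block of this kind would typically be instantiated once per resolution level so that fused features are produced at every scale of the two branches.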