A quantum thermal machine is an open quantum system that enables the conversion between heat and work at the micro or nano-scale. Optimally controlling such out-of-equilibrium systems is a crucial yet challenging task with applications to quantum technologies and devices. We introduce a general model-free framework based on reinforcement learning to identify out-of-equilibrium thermodynamic cycles that are Pareto optimal tradeoffs between power and efficiency for quantum heat engines and refrigerators. The method does not require any knowledge of the quantum thermal machine, nor of the system model, nor of the quantum state. Instead, it only observes the heat fluxes, so it is both applicable to simulations and experimental devices. We test our method on a model of an experimentally realistic refrigerator based on a superconducting qubit, and on a heat engine based on a quantum harmonic oscillator. In both cases, we identify the Pareto-front representing optimal power-efficiency tradeoffs, and the corresponding cycles. Such solutions outperform previous proposals made in the literature, such as optimized Otto cycles, reducing quantum friction.