UniTS: Unified Time Series Generative Model for Remote Sensing
Abstract
One of the primary objectives of satellite remote sensing is to capture the complex dynamics of the Earth environment, which encompasses tasks such as reconstructing continuous cloud-free time series images, detecting land cover changes, and forecasting future surface evolution. However, existing methods typically require specialized models tailored to different tasks, lacking unified modeling of spatiotemporal features across multiple time series tasks. In this paper, we propose a Unified Time Series Generative Model (UniTS), a general framework applicable to various time series tasks, including time series reconstruction, time series cloud removal, time series semantic change detection, and time series forecasting. Based on the flow matching generative paradigm, UniTS constructs a deterministic evolution path from noise to targets under the guidance of task-specific conditions, achieving unified modeling of spatiotemporal representations for multiple tasks. The UniTS architecture consists of a diffusion transformer with spatio-temporal blocks, where we design an Adaptive Condition Injector (ACor) to enhance the model's conditional perception of multimodal inputs, enabling high-quality controllable generation. Additionally, we design a Spatiotemporal-aware Modulator (STM) to improve the ability of spatio-temporal blocks to capture complex spatiotemporal dependencies. Furthermore, we construct two high-quality multimodal time series datasets, TS-S12 and TS-S12CR, filling the gap of benchmark datasets for time series cloud removal and forecasting tasks. Extensive experiments demonstrate that UniTS exhibits exceptional generative and cognitive capabilities in both low-level and high-level time series tasks. It significantly outperforms existing methods, particularly when facing challenges such as severe cloud contamination, modality absence, and forecasting complex phenological variations.
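To make the flow matching idea concrete, the following is a minimal sketch of a deterministic path from noise to a target, not the UniTS implementation. It assumes a straight-line (linear interpolation) path, which the text does not specify. The names `interpolate`, `target_velocity`, `flow_matching_loss`, `euler_sample`, and `oracle_velocity`, as well as the array shape, are hypothetical. The oracle stands in for a trained, condition-guided network.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Point at time t on the straight path from noise x0 (t=0) to target x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0, x1):
    """Velocity of the straight path, which is constant in t."""
    return x1 - x0


def flow_matching_loss(velocity_fn, x0, x1, t, cond):
    """Mean squared error between the predicted and the target velocity at time t."""
    x_t = interpolate(x0, x1, t)
    return float(np.mean((velocity_fn(x_t, t, cond) - target_velocity(x0, x1)) ** 2))


def euler_sample(velocity_fn, x0, cond, num_steps=10):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with fixed-size Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        x = x + dt * velocity_fn(x, step * dt, cond)
    return x


rng = np.random.default_rng(0)
shape = (4, 10, 8, 8)  # hypothetical (frames, bands, height, width)
noise = rng.standard_normal(shape)
target = rng.standard_normal(shape)
condition = rng.standard_normal(shape)  # placeholder for task-specific conditions


def oracle_velocity(x, t, cond):
    # Stand-in for a trained network: returns the exact straight-path velocity
    # and ignores its inputs. A real model would predict it from (x, t, cond).
    return target - noise


loss = flow_matching_loss(oracle_velocity, noise, target, t=0.3, cond=condition)
sample = euler_sample(oracle_velocity, noise, condition)
```

With the oracle velocity the loss is zero and the Euler integration lands on the target. A learned velocity field would only approximate this, and the number of integration steps would then trade speed against accuracy.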
Motivation
Based on the objectives and processing levels of time series analysis tasks, we categorize them into two broad classes: low-level (time series reconstruction & time series cloud removal) and high-level (time series semantic change detection & time series forecasting) vision tasks. The limitations of existing low-level and high-level time series tasks are as follows:
- Limited exploration in time series cloud removal tasks and challenges in dataset construction. Most existing studies focus on time-series reconstruction tasks based on simulated cloud masks, while there is limited research on time-series cloud removal under real and complex cloud-contaminated scenarios. Currently, the lack of high-quality benchmark datasets for time-series cloud removal severely restricts the performance evaluation of models in this field.
- Limited exploration in time series forecasting tasks. Current research on time series forecasting mainly relies on discriminative models (e.g., Pangu-Weather). In contrast, generative models have significant advantages in fitting the complex dynamic distributions of geospatial processes, yet they have not been fully explored. Furthermore, time series forecasting of raw reflectance remains largely unexplored, particularly for multi-spectral imagery with high spatiotemporal resolution (e.g., Sentinel-2).
- The absence of a unified framework capable of handling multiple remote sensing time-series tasks. Current research remains largely in the stage of developing specialized models for specific tasks, lacking a unified framework that can effectively address various remote sensing time series tasks.
Dataset
We construct two high-quality multimodal time series datasets, TS-S12 and TS-S12CR. Both contain Sentinel-1 imagery with 2 channels (VV and VH) and Sentinel-2 imagery with 10 spectral bands (excluding B1 Aerosols, B9 Water Vapor, and B10 Cirrus), drawn from 14,973 and 12,126 ROIs around the world, respectively. Together, the two datasets occupy approximately 2.2 TB.
- TS-S12 provides aligned sample pairs of Sentinel-1 and cloud-free Sentinel-2 for time series reconstruction and forecasting tasks.
- TS-S12CR offers aligned triplet samples of Sentinel-1, cloud-covered Sentinel-2, and cloud-free Sentinel-2, specifically designed for the time series cloud removal task. TS-S12CR provides an extreme scenario with an average cloud coverage of 84.02%, serving as an important benchmark for developing robust time series cloud removal methods.
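The channel layout described above can be written down as a short sketch. The band names are the standard Sentinel-2 MSI identifiers. The channel ordering, the dictionary keys, the helper `sample_shapes`, and the frame count and patch size in the usage line are assumptions for illustration, not the datasets' actual file format.

```python
# Sentinel-2 MSI band names; the three excluded bands follow the text above.
S2_ALL_BANDS = ["B1", "B2", "B3", "B4", "B5", "B6", "B7",
                "B8", "B8A", "B9", "B10", "B11", "B12"]
S2_EXCLUDED = {"B1", "B9", "B10"}
S2_BANDS = [b for b in S2_ALL_BANDS if b not in S2_EXCLUDED]
S1_CHANNELS = ["VV", "VH"]


def sample_shapes(dataset, num_frames, height, width):
    """Array shapes (frames, channels, height, width) for one sample (assumed layout)."""
    s1 = (num_frames, len(S1_CHANNELS), height, width)
    s2 = (num_frames, len(S2_BANDS), height, width)
    if dataset == "TS-S12":
        return {"s1": s1, "s2_cloud_free": s2}
    if dataset == "TS-S12CR":
        return {"s1": s1, "s2_cloudy": s2, "s2_cloud_free": s2}
    raise ValueError(f"unknown dataset: {dataset}")


# Indices of red, green, blue (B4, B3, B2) for RGB visualization,
# valid only under the band ordering assumed in S2_BANDS.
RGB_INDICES = [S2_BANDS.index(b) for b in ("B4", "B3", "B2")]

# Hypothetical frame count and patch size.
shapes = sample_shapes("TS-S12CR", num_frames=8, height=256, width=256)
```

Removing B1, B9, and B10 from the 13 Sentinel-2 bands leaves the 10 bands stated in the text, so a stacked Sentinel-1 plus Sentinel-2 frame would carry 12 channels under this layout.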
Method
UniTS is built on the standard Diffusion Transformer (DiT). Within the DiT framework, we introduce a spatio-temporal block and design two novel components: the Adaptive Condition Injector (ACor) and the Spatiotemporal-aware Modulator (STM). ACor adaptively injects multimodal conditional information (e.g., SAR and optical imagery) by dynamically generating affine transformation parameters, significantly enhancing the model's conditional perception of multimodal inputs across various time series tasks. Meanwhile, STM modulates attention weights in the spatio-temporal block using dynamic bias terms generated from spatiotemporal priors, thereby strengthening the model's capacity to capture complex spatiotemporal dependencies.
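The idea of injecting conditions through generated affine parameters can be illustrated with a FiLM-style modulation. This is a minimal sketch under assumptions, not the actual ACor design: the class `AffineConditionInjector`, the single linear projection, the per-token condition features, and all dimensions are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize each token over its feature dimension (no learned parameters)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class AffineConditionInjector:
    """Hypothetical injector: maps condition features to a scale and a shift."""

    def __init__(self, cond_dim, hidden_dim, rng):
        # One linear projection producing both scale and shift (an assumption).
        self.weight = rng.standard_normal((cond_dim, 2 * hidden_dim)) * 0.02
        self.bias = np.zeros(2 * hidden_dim)

    def __call__(self, tokens, cond):
        # tokens: (num_tokens, hidden_dim), cond: (num_tokens, cond_dim)
        scale, shift = np.split(cond @ self.weight + self.bias, 2, axis=-1)
        # Affine modulation of the normalized tokens, conditioned per token.
        return layer_norm(tokens) * (1.0 + scale) + shift


rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 32))       # 16 tokens with hidden size 32
sar_features = rng.standard_normal((16, 8))  # hypothetical SAR condition features
injector = AffineConditionInjector(cond_dim=8, hidden_dim=32, rng=rng)
modulated = injector(tokens, sar_features)
```

Because the scale and shift are computed from the condition rather than stored as fixed weights, the same block can respond differently to SAR, optical, or other inputs. When the projection outputs zeros, the tokens pass through as plain layer-normalized features.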
(a) UniTS architecture, (b) Adaptive Condition Injector (ACor), (c) Spatiotemporal-aware Modulator (STM).
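Modulating attention with a bias derived from a spatiotemporal prior can be sketched as an additive term on the attention logits. The example below is an illustration under assumptions, not the STM module itself: it uses a fixed temporal-distance prior where the described bias is dynamic and generated, and the functions `temporal_distance_bias` and `biased_attention`, the decay constant `tau`, and the acquisition days are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_distance_bias(days, tau=30.0):
    """Bias that penalizes attention between frames acquired far apart in time."""
    days = np.asarray(days, dtype=float)
    return -np.abs(days[:, None] - days[None, :]) / tau


def biased_attention(q, k, v, bias):
    """Scaled dot-product attention with an additive bias on the logits."""
    logits = q @ k.T / np.sqrt(q.shape[-1]) + bias
    weights = softmax(logits, axis=-1)
    return weights @ v, weights


days = [0, 5, 40, 45]   # hypothetical acquisition days of four frames
q = np.zeros((4, 8))    # zero queries and keys isolate the effect of the bias
k = np.zeros((4, 8))
v = np.eye(4, 8)
bias = temporal_distance_bias(days)
out, weights = biased_attention(q, k, v, bias)
```

With zero queries and keys, the attention weights depend only on the bias, so each frame attends most strongly to itself and to its temporal neighbors. In a trained block the content-based logits and the bias would be combined.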
Results
Time Series Reconstruction
Table IV provides a quantitative comparison of time series reconstruction performance on the TS-S12 dataset.
Visualization of Time Series Reconstruction
Qualitative comparison of time series reconstruction on the TS-S12 dataset, showing the RGB bands of Sentinel-2. The SSIM value of each frame in the time series is marked. (Estado do Mato Grosso, Brasil, (16°05'21.1''S, 60°09'01.4''W))
Qualitative comparison of time series reconstruction on the TS-S12 dataset, showing the RGB bands of Sentinel-2. The SSIM value of each frame in the time series is marked. (Tandil, Argentina, (37°18'08''S, 59°06'52''W))
Time Series Cloud Removal
Table V provides a quantitative comparison of time series cloud removal performance on the TS-S12CR dataset.
Visualization of Time Series Cloud Removal
Qualitative comparison of time series cloud removal on the TS-S12CR dataset, showing the RGB bands of Sentinel-2. The SSIM value of each frame in the time series is marked. (Estado do Mato Grosso, Brasil, (16°05'21.1''S, 60°09'01.4''W))
Visualization of Time Series Cloud Removal with Missing S1 Modality
To evaluate the robustness of the proposed method to modality absence, we compare the cloud removal performance under different training and inference configurations in Table VI.
Qualitative comparison of time series cloud removal with the S1 modality missing on the TS-S12CR dataset, showing the RGB bands of Sentinel-2. The SSIM value of each frame in the time series is marked. (Estado do Mato Grosso, Brasil, (16°05'21.1''S, 60°09'01.4''W))
Qualitative comparison of time series cloud removal with the S1 modality missing on the TS-S12CR dataset, showing the RGB bands of Sentinel-2. The SSIM value of each frame in the time series is marked. (Tandil, Argentina, (37°18'08''S, 59°06'52''W))
Time Series Semantic Change Detection
Table VII provides a quantitative comparison of time series semantic change detection performance on the DynamicEarthNet dataset.
Table VIII provides a quantitative comparison of time series semantic change detection performance on the MUDS dataset.
Visualization of Time Series Semantic Change Detection
Qualitative comparison of time series semantic change detection on the DynamicEarthNet dataset, showing the RGB bands of Planet imagery. Left: semantic segmentation maps from T1 to T11, with the mIoU value of each frame marked. Right: Binary Change Detection (BCD) map and Semantic Change Detection (SCD) map.
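The per-frame mIoU and the two change maps named in the caption can be illustrated as follows. This is a minimal sketch under assumptions: the mIoU follows the standard per-class intersection-over-union definition, while deriving the BCD and SCD maps by comparing the first and last label maps is one common convention, not necessarily the procedure used for the figure. The function names and the `ignore` value are hypothetical.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps, so it is skipped
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")


def binary_change_map(labels_first, labels_last):
    """True where the class label differs between the two dates."""
    return labels_first != labels_last


def semantic_change_map(labels_first, labels_last, ignore=-1):
    """Final class where a change occurred, `ignore` elsewhere (assumed convention)."""
    return np.where(labels_first != labels_last, labels_last, ignore)


pred = np.array([[0, 0], [1, 1]])
truth = np.array([[0, 1], [1, 1]])
miou = mean_iou(pred, truth, num_classes=2)  # (1/2 + 2/3) / 2 = 7/12

labels_t1 = np.array([[0, 0], [1, 1]])
labels_t11 = np.array([[0, 2], [1, 1]])
bcd = binary_change_map(labels_t1, labels_t11)
scd = semantic_change_map(labels_t1, labels_t11)
```

In the toy example, class 0 has an IoU of 1/2 and class 1 an IoU of 2/3, so the mIoU is 7/12. Only the top-right pixel changes class, so it is the only pixel set in the BCD map and the only one carrying a class label in the SCD map.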
Time Series Forecasting
Table IX provides a quantitative comparison of time series forecasting performance on the TS-S12 dataset.
Visualization of Time Series Forecasting
Qualitative comparison of time series forecasting on the TS-S12 dataset, showing the RGB bands of Sentinel-2. The SSIM value of each frame in the time series is marked. (Boston, USA, (42°18'37.8''N, 71°06'31.0''W))
Qualitative comparison of time series forecasting on the TS-S12 dataset, showing the RGB bands of Sentinel-2. The SSIM value of each frame in the time series is marked. (Szentdenes, Hungary, (46°00'35.6''N, 17°55'12.0''E))
BibTeX
@article{zhang2025unitsunifiedtimeseries,
title={UniTS: Unified Time Series Generative Model for Remote Sensing},
author={Yuxiang Zhang and Shunlin Liang and Wenyuan Li and Han Ma and Jianglei Xu and Yichuan Ma and Jiangwei Xie and Wei Li and Mengmeng Zhang and Ran Tao and Xiang-Gen Xia},
year={2025},
eprint={2512.04461},
archivePrefix={arXiv},
url={https://arxiv.org/abs/2512.04461},
}