CVPR 2024: 16 Papers on Satellite Remote Sensing Imagery
GeoChat: Grounded Large Vision-Language Model for Remote Sensing
Paper analysis: http://www.studyai.com/xueshu/paper/detail/00ffce4794
Abstract
Recent advancements in Large Vision-Language Models (VLMs) have shown great promise in natural image domains, allowing users to hold a dialogue about given visual content.
However, such general-domain VLMs perform poorly in Remote Sensing (RS) scenarios, leading to inaccurate or fabricated information when presented with RS domain-specific queries.
Such behavior emerges due to the unique challenges introduced by RS imagery.
For example, to handle high-resolution RS imagery with diverse scale changes across categories and many small objects, region-level reasoning is necessary alongside holistic scene interpretation.
Furthermore, the lack of domain-specific multimodal instruction-following data, as well as of strong backbone models for RS, makes it hard for the models to align their behavior with user queries.
To address these limitations, we propose GeoChat, the first versatile remote sensing VLM that offers multitask conversational capabilities with high-resolution RS images.
Specifically, GeoChat not only answers image-level queries but also accepts region inputs to hold region-specific dialogue.
Furthermore, it can visually ground objects in its responses by referring to their spatial coordinates.
To address the lack of domain-specific datasets, we generate a novel RS multimodal instruction-following dataset by extending image-text pairs from existing diverse RS datasets.
Leveraging this rich dataset, we fine-tune our remote sensing VLM based on the LLaVA-1.5 architecture.
We establish a comprehensive benchmark for RS multitask conversations and compare with a number of baseline methods.
GeoChat demonstrates robust zero-shot performance on various remote sensing tasks, e.g., image and region captioning, visual question answering, scene classification, visually grounded conversations, and referring object detection.
Our code will be open-sourced…
Building Bridges across Spatial and Temporal Resolutions: Reference-Based Super-Resolution via Change Priors and Conditional Diffusion Model
Paper analysis: http://www.studyai.com/xueshu/paper/detail/0d40aa4b71
Abstract
Reference-based super-resolution (RefSR) has the potential to build bridges across spatial and temporal resolutions of remote sensing images.
However, existing RefSR methods are limited by the faithfulness of content reconstruction and the effectiveness of texture transfer at large scaling factors.
Conditional diffusion models have opened up new opportunities for generating realistic high-resolution images, but effectively utilizing reference images within these models remains an area for further exploration.
Furthermore, content fidelity is difficult to guarantee in areas without relevant reference information.
To solve these issues, we propose a change-aware diffusion model named Ref-Diff for RefSR, using land cover change priors to explicitly guide the denoising process.
Specifically, we inject the priors into the denoising model to improve the utilization of reference information in unchanged areas and to regulate the reconstruction of semantically relevant content in changed areas.
With this powerful guidance, we decouple the semantics-guided denoising and reference texture-guided denoising processes to improve model performance.
Extensive experiments demonstrate the superior effectiveness and robustness of the proposed method compared with state-of-the-art RefSR methods in both quantitative and qualitative evaluations.
The code and data are available at https://github.com/dongrunmin/RefDiff…
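A minimal sketch of the change-prior idea described above (an illustrative reading, not the released RefDiff code): the land cover change prior gates each denoising step, pulling unchanged areas toward a reference-texture branch and changed areas toward a semantics-guided branch. The branch functions and the soft change mask below are assumptions.

```python
# Illustrative only: a soft change mask blends two hypothetical denoising branches.
import torch

def change_guided_denoise(x_t, t, ref_branch, sem_branch, change_mask):
    """x_t: noisy image (B,C,H,W); change_mask: (B,1,H,W) in [0,1], 1 = changed."""
    eps_ref = ref_branch(x_t, t)   # guided by reference texture (unchanged areas)
    eps_sem = sem_branch(x_t, t)   # guided by semantic content (changed areas)
    return (1.0 - change_mask) * eps_ref + change_mask * eps_sem
```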
SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation
Paper analysis: http://www.studyai.com/xueshu/paper/detail/2cda20165c
Abstract
In recent years, semantic segmentation has become a pivotal tool in processing and interpreting satellite imagery.
Yet a prevalent limitation of supervised learning techniques remains the need for extensive manual annotations by experts.
In this work, we explore the potential of generative image diffusion to address the scarcity of annotated data in earth observation tasks.
The main idea is to learn the joint data manifold of images and labels, leveraging recent advancements in denoising diffusion probabilistic models.
To the best of our knowledge, we are the first to generate both images and corresponding masks for satellite segmentation.
We find that the obtained pairs not only display high quality in fine-scale features but also ensure a wide sampling diversity.
Both aspects are crucial for earth observation data where semantic classes can vary severely in scale and occurrence frequency.
We employ the novel data instances for downstream segmentation as a form of data augmentation.
In our experiments, we provide comparisons to prior works based on discriminative diffusion models or GANs.
We demonstrate that integrating generated samples yields significant quantitative improvements for satellite semantic segmentation, both compared to baselines and compared to training only on the original data…
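A rough sketch of the joint image-label modeling idea (an assumption-laden reading, not the authors' implementation): the RGB image and a one-hot segmentation mask are concatenated channel-wise into a single sample and trained with a standard DDPM noise-prediction objective. The helper names, the [-1, 1] scaling, and `model` (a hypothetical noise-prediction network) are illustrative choices.

```python
# Illustrative only: treat (image, mask) as one joint sample for a DDPM.
import torch
import torch.nn.functional as F

def joint_sample(image, mask, num_classes):
    """image: (B,3,H,W) in [-1,1]; mask: (B,H,W) integer labels."""
    onehot = F.one_hot(mask, num_classes).permute(0, 3, 1, 2).float()
    onehot = onehot * 2.0 - 1.0                 # map {0,1} to [-1,1] like the image
    return torch.cat([image, onehot], dim=1)    # (B, 3+num_classes, H, W)

def ddpm_training_step(model, x0, alphas_cumprod):
    """One denoising-diffusion training step on the joint sample x0.
    alphas_cumprod: (T,) tensor on the same device as x0."""
    b = x0.size(0)
    t = torch.randint(0, alphas_cumprod.numel(), (b,), device=x0.device)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise   # forward diffusion
    pred = model(x_t, t)                                      # predict the noise
    return F.mse_loss(pred, noise)
```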
Content-Adaptive Non-Local Convolution for Remote Sensing Pansharpening
Paper analysis: http://www.studyai.com/xueshu/paper/detail/36cc857c25
Abstract
Currently, machine learning-based methods for remote sensing pansharpening have progressed rapidly.
However, existing pansharpening methods often do not fully exploit differentiating regional information in non-local spaces, thereby limiting their effectiveness and resulting in redundant learning parameters.
In this paper, we introduce content-adaptive non-local convolution (CANConv), a novel method tailored for remote sensing image pansharpening.
Specifically, CANConv employs adaptive convolution, ensuring spatial adaptability, and incorporates non-local self-similarity through the similarity relationship partition (SRP) and partition-wise adaptive convolution (PWAC) sub-modules.
Furthermore, we propose a corresponding network architecture, called CANNet, which mainly utilizes multi-scale self-similarity.
Extensive experiments demonstrate the superior performance of CANConv compared with recent promising fusion methods.
Besides, we substantiate the method’s effectiveness through visualization, ablation experiments, and comparisons with existing methods on multiple test sets.
The source code is publicly available at https://github.com/duanyll/CANConv…
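A simplified sketch of the SRP/PWAC idea as described above (an illustration under assumptions, not the released CANConv code): spatial locations are partitioned by feature similarity with a few k-means steps, and each partition is then processed by its own generated 1x1 kernel. The cluster count, k-means initialization, and the 1x1 kernel form are simplifications.

```python
# Illustrative only: similarity-based partition + partition-wise adaptive 1x1 conv.
import torch
import torch.nn as nn

class PartitionWiseAdaptiveConv(nn.Module):
    def __init__(self, channels, num_partitions=4):
        super().__init__()
        self.k = num_partitions
        # turns each partition's mean feature into its own 1x1 kernel (C x C matrix)
        self.kernel_gen = nn.Linear(channels, channels * channels)

    def forward(self, x):
        b, c, h, w = x.shape
        feats = x.flatten(2).transpose(1, 2)                     # (B, HW, C)
        # similarity relationship partition: a few k-means steps (non-differentiable)
        with torch.no_grad():
            centers = feats[:, torch.randperm(h * w)[: self.k], :].clone()
            for _ in range(3):
                assign = torch.cdist(feats, centers).argmin(-1)  # (B, HW)
                for j in range(self.k):
                    m = (assign == j).unsqueeze(-1).float()
                    centers[:, j] = (feats * m).sum(1) / m.sum(1).clamp(min=1)
        # partition-wise adaptive convolution: one generated kernel per partition
        out = torch.zeros_like(feats)
        for j in range(self.k):
            m = (assign == j).unsqueeze(-1).float()
            mean_j = (feats * m).sum(1) / m.sum(1).clamp(min=1)  # (B, C)
            kernel_j = self.kernel_gen(mean_j).view(b, c, c)     # (B, C, C)
            out = out + torch.einsum('bnc,bcd->bnd', feats * m, kernel_j)
        return out.transpose(1, 2).reshape(b, c, h, w)

conv = PartitionWiseAdaptiveConv(32)
y = conv(torch.randn(2, 32, 16, 16))   # output has the same shape as the input
```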
Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization
Paper analysis: http://www.studyai.com/xueshu/paper/detail/40f2ce0e99
Abstract
This paper investigates the effective utilization of unlabeled data for large-area cross-view geo-localization (CVGL), encompassing both unsupervised and semi-supervised settings.
Common approaches to CVGL rely on ground-satellite image pairs and employ label-driven supervised training.
However, the cost of collecting precise cross-view image pairs hinders the deployment of CVGL in real-life scenarios.
Without such pairs, CVGL becomes more challenging, as the model must handle the significant imaging and spatial gaps between ground and satellite images.
To this end, we propose an unsupervised framework that includes a cross-view projection to guide the model in retrieving initial pseudo-labels, and a fast re-ranking mechanism to refine the pseudo-labels by leveraging the fact that “the perfectly paired ground-satellite image is located in a unique and identical scene”.
The framework exhibits competitive performance compared with supervised works on three open-source benchmarks.
Our code and models will be released on https://github.com/liguopeng0923/UCVGL…
SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery
Paper analysis: http://www.studyai.com/xueshu/paper/detail/5bd4f6ebb5
Abstract
Prior studies on Remote Sensing Foundation Model (RSFM) reveal immense potential towards a generic model for Earth Observation.
Nevertheless, these works primarily focus on a single modality without temporal and geo-context modeling, hampering their capabilities for diverse tasks.
In this study, we present SkySense, a generic billion-scale model pre-trained on a curated multi-modal Remote Sensing Imagery (RSI) dataset with 21.5 million temporal sequences.
SkySense incorporates a factorized multi-modal spatiotemporal encoder that takes temporal sequences of optical and Synthetic Aperture Radar (SAR) data as input.
This encoder is pre-trained by our proposed Multi-Granularity Contrastive Learning to learn representations across different modal and spatial granularities.
To further enhance the RSI representations with geo-context cues, we introduce Geo-Context Prototype Learning to learn region-aware prototypes on top of RSI’s multi-modal spatiotemporal features.
To the best of our knowledge, SkySense is the largest multi-modal RSFM to date, whose modules can be flexibly combined or used individually to accommodate various tasks.
It demonstrates remarkable generalization capabilities in a thorough evaluation encompassing 16 datasets over 7 tasks, from single- to multi-modal, static to temporal, and classification to localization.
SkySense surpasses 18 recent RSFMs in all test scenarios.
Specifically, it outperforms the latest models such as GFM, SatLas, and Scale-MAE by a large margin, i.e., 2.76%, 3.67%, and 3.61% on average, respectively.
We will release the pre-trained weights to facilitate future research and Earth Observation applications…
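A generic sketch of contrastive learning applied at more than one granularity, loosely illustrating the Multi-Granularity Contrastive Learning mentioned above; the actual SkySense objective, pairing strategy, and granularity definitions are not reproduced here, and everything below is an assumption for illustration.

```python
# Illustrative only: an InfoNCE loss computed at image level and at patch level.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """a, b: (N, D) paired embeddings from two views/modalities."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

def multi_granularity_loss(feat_a, feat_b):
    """feat_*: (B, N_patches, D) token features from two modalities/views."""
    img_loss = info_nce(feat_a.mean(dim=1), feat_b.mean(dim=1))        # image level
    patch_loss = info_nce(feat_a.flatten(0, 1), feat_b.flatten(0, 1))  # patch level
    return img_loss + patch_loss
```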
Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation
Paper analysis: http://www.studyai.com/xueshu/paper/detail/636ee1117d
Abstract
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery, leading to suboptimal segmentation results.
To address these challenges, we introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
RMSIN incorporates an Intra-scale Interaction Module (IIM) to effectively address the fine-grained detail required at multiple scales, and a Cross-scale Interaction Module (CIM) for integrating these details coherently across the network.
Furthermore, RMSIN employs an Adaptive Rotated Convolution (ARC) to account for the diverse orientations of objects, a novel contribution that significantly enhances segmentation accuracy.
To assess the efficacy of RMSIN, we have curated an expansive dataset comprising 17,402 image-caption-mask triplets, which is unparalleled in terms of scale and variety.
This dataset not only presents the model with a wide range of spatial and rotational scenarios but also establishes a stringent benchmark for the RRSIS task, ensuring a rigorous evaluation of performance.
Experimental evaluations demonstrate the exceptional performance of RMSIN, surpassing existing state-of-the-art models by a significant margin.
Datasets and code are available at https://github.com/Lsan2401/RMSIN…
Parameter Efficient Self-Supervised Geospatial Domain Adaptation
Paper analysis: http://www.studyai.com/xueshu/paper/detail/72d2834c58
Abstract
As large-scale foundation models become publicly available for different domains, efficiently adapting them to individual downstream applications and additional data modalities has turned into a central challenge.
For example, foundation models for geospatial and satellite remote sensing applications are commonly trained on large optical RGB or multi-spectral datasets, although data from a wide variety of heterogeneous sensors are available in the remote sensing domain.
This leads to significant discrepancies between pre-training and downstream target data distributions for many important applications.
Fine-tuning large foundation models to bridge that gap incurs high computational cost and can be infeasible when target datasets are small.
In this paper, we address the question of how large pre-trained foundational transformer models can be efficiently adapted to downstream remote sensing tasks involving different data modalities or limited dataset size.
We present a self-supervised adaptation method that boosts downstream linear evaluation accuracy of different foundation models by 4-6% (absolute) across 8 remote sensing datasets while outperforming full fine-tuning when training only 1-2% of the model parameters.
Our method significantly improves label efficiency and increases few-shot accuracy by 6-10% on different datasets…
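A minimal sketch of the parameter-efficient pattern described above, freezing a pre-trained transformer backbone and training only small bottleneck adapters; the adapter design, its placement, and the ViT-style `backbone.blocks` / `backbone.patch_embed` attributes are assumptions, not the paper's exact method.

```python
# Illustrative only: freeze the backbone, train only lightweight adapters (~1-2% of params).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)      # start as an identity (residual) mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

def make_adapters(backbone, dim):
    """Freeze the backbone and create one adapter per transformer block."""
    for p in backbone.parameters():
        p.requires_grad = False
    return nn.ModuleList(Adapter(dim) for _ in backbone.blocks)

def forward_with_adapters(backbone, adapters, x):
    """Run the frozen backbone, applying an adapter after each block (assumed ViT layout)."""
    x = backbone.patch_embed(x)
    for block, adapter in zip(backbone.blocks, adapters):
        x = adapter(block(x))
    return x

# only the adapters are trainable:
# optimizer = torch.optim.AdamW(adapters.parameters(), lr=1e-4)
```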
Sat2Scene: 3D Urban Scene Generation from Satellite Images with Diffusion
Paper analysis: http://www.studyai.com/xueshu/paper/detail/7f4d5401e8
Abstract
Directly generating scenes from satellite imagery offers exciting possibilities for integration into applications like games and map services.
However, challenges arise from the significant view changes and scene scale.
Previous efforts mainly focused on image or video generation, lacking exploration into the adaptability of scene generation for arbitrary views.
Existing 3D generation works either operate at the object level or struggle to utilize the geometry obtained from satellite imagery.
To overcome these limitations, we propose a novel architecture for direct 3D scene generation by introducing diffusion models into 3D sparse representations and combining them with neural rendering techniques.
Specifically, our approach first generates texture colors at the point level for a given geometry using a 3D diffusion model, which are then transformed into a scene representation in a feed-forward manner.
The representation can be utilized to render arbitrary views, excelling in both single-frame quality and inter-frame consistency.
Experiments on two city-scale datasets show that our model demonstrates proficiency in generating photo-realistic street-view image sequences and cross-view urban scenes from satellite imagery…
Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery
Paper analysis: http://www.studyai.com/xueshu/paper/detail/9114631deb
Abstract
Recent advances in unsupervised learning have demonstrated the ability of large vision models to achieve promising results on downstream tasks by pre-training on large amounts of unlabelled data.
Such pre-training techniques have also been explored recently in the remote sensing domain due to the availability of large amounts of unlabelled data.
Different from standard natural image datasets, remote sensing data are acquired from various sensor technologies and exhibit a diverse range of scale variations as well as modalities.
Existing satellite image pre-training methods either ignore the scale information present in the remote sensing imagery or restrict themselves to a single type of data modality.
In this paper, we revisit transformer pre-training and leverage multi-scale information that is effectively utilized with multiple modalities.
Our proposed approach, named SatMAE++, performs multi-scale pre-training and utilizes convolution-based upsampling blocks to reconstruct the image at higher scales, making it extensible to include more scales.
Compared to existing works, the proposed SatMAE++ with multi-scale pre-training is equally effective for both optical and multi-spectral imagery.
Extensive experiments on six datasets reveal the merits of the proposed contributions, leading to state-of-the-art performance on all datasets.
SatMAE++ achieves a mean average precision (mAP) gain of 2.5% for the multi-label classification task on the BigEarthNet dataset…
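An illustrative sketch of the multi-scale reconstruction idea (a reading under assumptions, not the released SatMAE++ code): convolution-based upsampling blocks lift a decoder feature map to progressively higher resolutions, and a reconstruction loss is applied at every scale against a resized target. Channel sizes, the number of scales, and the loss weighting are assumptions.

```python
# Illustrative only: multi-scale reconstruction heads with a per-scale L2 loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampleBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )
    def forward(self, x):
        return self.block(x)

class MultiScaleHead(nn.Module):
    """Reconstruct the image at 1x, 2x and 4x the decoder's spatial resolution."""
    def __init__(self, feat_ch=512, img_ch=3):
        super().__init__()
        self.up1 = UpsampleBlock(feat_ch, feat_ch // 2)
        self.up2 = UpsampleBlock(feat_ch // 2, feat_ch // 4)
        self.heads = nn.ModuleList(
            nn.Conv2d(c, img_ch, 1) for c in (feat_ch, feat_ch // 2, feat_ch // 4)
        )
    def forward(self, feat):
        f1 = feat
        f2 = self.up1(f1)
        f3 = self.up2(f2)
        return [h(f) for h, f in zip(self.heads, (f1, f2, f3))]

def multiscale_loss(recons, target):
    """Average L2 loss between each reconstruction and the resized target."""
    loss = 0.0
    for r in recons:
        t = F.interpolate(target, size=r.shape[-2:], mode='bilinear', align_corners=False)
        loss = loss + F.mse_loss(r, t)
    return loss / len(recons)
```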
S2MAE: A Spatial-Spectral Pretraining Foundation Model for Spectral Remote Sensing Data
Paper analysis: http://www.studyai.com/xueshu/paper/detail/97d732ecc4
Abstract
In the expansive domain of computer vision, a myriad of pre-trained models are at our disposal.
However, most of these models are designed for natural RGB images and prove inadequate for spectral remote sensing (RS) images.
Spectral RS images have two main traits: (1) multiple bands capturing diverse feature information; (2) spatial alignment and consistent spectral sequencing within the spatial-spectral dimension.
In this paper, we introduce Spatial-SpectralMAE (S2MAE), a specialized pre-trained architecture for spectral RS imagery.
S2MAE employs a 3D transformer for masked autoencoder modeling, integrating learnable spectral-spatial embeddings with a 90% masking ratio.
The model efficiently captures local spectral consistency and spatial invariance using compact cube tokens, demonstrating versatility to diverse input characteristics.
This adaptability facilitates progressive pretraining on extensive spectral datasets.
The effectiveness of S2MAE is validated through continuous pretraining on two sizable datasets totaling over a million training images.
The pre-trained model is subsequently applied to three distinct downstream tasks with in-depth ablation studies conducted to emphasize its efficacy…
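A simplified sketch of the cube-token masking described above (the cube size, embedding width, and encoder depth are illustrative assumptions, not the S2MAE release): a spectral image is split into non-overlapping spatial-spectral cubes, 90% of the cube tokens are randomly masked, and only the visible tokens are passed to a transformer encoder.

```python
# Illustrative only: spatial-spectral cube tokens with 90% random masking.
import torch
import torch.nn as nn

def cube_tokens(x, cube=(4, 8, 8)):
    """x: (B, Bands, H, W) -> (B, N, cube_volume) non-overlapping 3D cubes."""
    cb, ch, cw = cube
    b, bands, h, w = x.shape
    x = x.view(b, bands // cb, cb, h // ch, ch, w // cw, cw)
    x = x.permute(0, 1, 3, 5, 2, 4, 6).reshape(b, -1, cb * ch * cw)
    return x

def random_masking(tokens, mask_ratio=0.9):
    """Keep a random (1 - mask_ratio) fraction of the tokens, per sample."""
    b, n, d = tokens.shape
    keep = max(1, int(n * (1.0 - mask_ratio)))
    order = torch.rand(b, n, device=tokens.device).argsort(dim=1)
    keep_idx = order[:, :keep]
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    return visible, keep_idx

# usage: embed the visible cubes and encode them with a standard transformer
x = torch.randn(2, 32, 64, 64)                  # 32 spectral bands
tok = cube_tokens(x)                            # (2, 512, 256)
vis, idx = random_masking(tok, mask_ratio=0.9)  # (2, 51, 256)
embed = nn.Linear(tok.shape[-1], 192)
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(192, 4, batch_first=True), 2)
latent = encoder(embed(vis))                    # encoder sees only visible tokens
```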
Domain Prompt Learning with Quaternion Networks
Paper analysis: http://www.studyai.com/xueshu/paper/detail/a5eaa3592b
Abstract
Prompt learning has emerged as an effective and data-efficient technique in large Vision-Language Models (VLMs).
However, when adapting VLMs to specialized domains such as remote sensing and medical imaging, domain prompt learning remains underexplored.
While large-scale domain-specific foundation models can help tackle this challenge, their concentration on a single vision level makes it challenging to prompt both vision and language modalities.
To overcome this, we propose to leverage domain-specific knowledge from domain-specific foundation models to transfer the robust recognition ability of VLMs from generalized to specialized domains using quaternion networks.
Specifically, the proposed method involves using domain-specific vision features from domain-specific foundation models to guide the transformation of generalized contextual embeddings from the language branch into a specialized space within the quaternion networks.
Moreover, we present a hierarchical approach that generates vision prompt features by analyzing intermodal relationships between hierarchical language prompt features and domain-specific vision features.
In this way, quaternion networks can effectively mine the intermodal relationships in the specific domain, facilitating domain-specific vision-language contrastive learning.
Extensive experiments on domain-specific datasets show that our proposed method achieves new state-of-the-art results in prompt learning…
Multi-modal Learning for Geospatial Vegetation Forecasting
Paper analysis: http://www.studyai.com/xueshu/paper/detail/c5a7f32810
Abstract
Precise geospatial vegetation forecasting holds potential across diverse sectors, including agriculture, forestry, humanitarian aid, and carbon accounting.
To leverage the vast availability of satellite imagery for this task, various works have applied deep neural networks for predicting multispectral images in photorealistic quality.
However, the important area of vegetation dynamics has not been thoroughly explored.
Our study introduces GreenEarthNet, the first dataset specifically designed for high-resolution vegetation forecasting, and Contextformer, a novel deep learning approach for predicting vegetation greenness from Sentinel-2 satellite images at fine resolution across Europe.
Our multi-modal transformer model, Contextformer, leverages spatial context through a vision backbone and predicts the temporal dynamics on local context patches, incorporating meteorological time series in a parameter-efficient manner.
The GreenEarthNet dataset features a learned cloud mask and an appropriate evaluation scheme for vegetation modeling.
It also maintains compatibility with the existing satellite imagery forecasting dataset EarthNet2021, enabling cross-dataset model comparisons.
Our extensive qualitative and quantitative analyses reveal that our methods outperform a broad range of baseline techniques.
This includes surpassing previous state-of-the-art models on EarthNet2021, as well as adapted models from time series forecasting and video prediction.
To the best of our knowledge, this work presents the first models for continental-scale vegetation modeling at fine resolution that are able to capture anomalies beyond the seasonal cycle, thereby paving the way for predicting vegetation health and behaviour in response to climate variability and extremes.
We provide open-source code and pre-trained weights to reproduce our experimental results at https://github.com/vitusbenson/greenearthnet…
Poly Kernel Inception Network for Remote Sensing Detection
Paper analysis: http://www.studyai.com/xueshu/paper/detail/cef0836ee4
Abstract
Object detection in remote sensing images (RSIs) often suffers from several increasing challenges, including the large variation in object scales and the diverse-ranging context.
Prior methods tried to address these challenges by expanding the spatial receptive field of the backbone, either through large-kernel convolution or dilated convolution.
However, the former typically introduces considerable background noise, while the latter risks generating overly sparse feature representations.
In this paper, we introduce the Poly Kernel Inception Network (PKINet) to handle the above challenges.
PKINet employs multi-scale convolution kernels without dilation to extract object features of varying scales and capture local context.
In addition a Context Anchor Attention (CAA) module is introduced in parallel to capture long-range contextual information.
These two components work jointly to advance the performance of PKINet on four challenging remote sensing object detection benchmarks, namely DOTA-v1.0, DOTA-v1.5, HRSC2016, and DIOR-R…
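An inception-style sketch of the multi-kernel idea described above (the kernel sizes, depth-wise branches, and fusion layer are illustrative assumptions, not the PKINet release): parallel depth-wise convolutions with increasing kernel sizes and no dilation capture context at several scales, and a 1x1 convolution fuses the branches.

```python
# Illustrative only: parallel depth-wise convolutions with different kernel sizes.
import torch
import torch.nn as nn

class PolyKernelBlock(nn.Module):
    def __init__(self, channels, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )
        self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, kernel_size=1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

block = PolyKernelBlock(64)
y = block(torch.randn(1, 64, 128, 128))   # same spatial size, multi-scale context
```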
Learned Representation-Guided Diffusion Models for Large-Image Generation
Paper analysis: http://www.studyai.com/xueshu/paper/detail/e5d225703e
Abstract
To synthesize high-fidelity samples, diffusion models typically require auxiliary data to guide the generation process.
However, it is impractical to procure the painstaking patch-level annotation effort required in specialized domains like histopathology and satellite imagery; it is often performed by domain experts and involves hundreds of millions of patches.
Modern-day self-supervised learning (SSL) representations encode rich semantic and visual information.
In this paper, we posit that such representations are expressive enough to act as proxies for fine-grained human labels.
We introduce a novel approach that trains diffusion models conditioned on embeddings from SSL.
Our diffusion models successfully project these features back to high-quality histopathology and remote sensing images.
In addition, we construct larger images by assembling spatially consistent patches inferred from SSL embeddings, preserving long-range dependencies.
Augmenting real data by generating variations of real images improves downstream classifier accuracy for patch-level and larger image-scale classification tasks.
Our models are effective even on datasets not encountered during training, demonstrating their robustness and generalizability.
Generating images from learned embeddings is agnostic to the source of the embeddings.
The SSL embeddings used to generate a large image can either be extracted from a reference image or sampled from an auxiliary model conditioned on any related modality (e.g., class labels, text, genomic data).
As proof of concept, we introduce the text-to-large-image synthesis paradigm, where we successfully synthesize large pathology and satellite images from text descriptions…
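A minimal sketch of conditioning a diffusion denoiser on a frozen SSL embedding, as the paragraph above describes; the wrapper class, the projection layer, and the way the conditioning vector is passed to the denoiser are assumptions, not the authors' implementation.

```python
# Illustrative only: a frozen SSL embedding acts as the conditioning signal.
import torch
import torch.nn as nn

class SSLConditionedDenoiser(nn.Module):
    def __init__(self, denoiser, ssl_encoder, cond_in=768, cond_out=256):
        super().__init__()
        self.denoiser = denoiser              # any conditional noise-prediction net
        self.ssl_encoder = ssl_encoder.eval()
        for p in self.ssl_encoder.parameters():
            p.requires_grad = False           # the SSL encoder stays frozen
        self.proj = nn.Linear(cond_in, cond_out)

    def forward(self, x_t, t, reference_image):
        with torch.no_grad():
            emb = self.ssl_encoder(reference_image)   # (B, cond_in) proxy "label"
        return self.denoiser(x_t, t, self.proj(emb))
```

At sampling time the embedding could instead come from an auxiliary model conditioned on another modality, which is where a text-to-large-image pipeline would plug in.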
3D Building Reconstruction from Monocular Remote Sensing Images with Multi-level Supervisions
Paper analysis: http://www.studyai.com/xueshu/paper/detail/f05c71e0a5
Abstract
3D building reconstruction from monocular remote sensing images is an important and challenging research problem that has received increasing attention in recent years, owing to its low cost of data acquisition and availability for large-scale applications.
However, existing methods rely on expensive 3D-annotated samples for fully-supervised training, restricting their application to large-scale cross-city scenarios.
In this work, we propose MLS-BRN, a multi-level supervised building reconstruction network that can flexibly utilize training samples with different annotation levels to achieve better reconstruction results in an end-to-end manner.
To alleviate the demand for full 3D supervision, we design two new modules, the Pseudo Building Bbox Calculator and the Roof-Offset guided Footprint Extractor, as well as new tasks and training strategies for different types of samples.
Experimental results on several public and new datasets demonstrate that our proposed MLS-BRN achieves competitive performance using much fewer 3D-annotated samples and significantly improves the footprint extraction and 3D reconstruction performance compared with the current state of the art.
The code and datasets of this work will be released at https://github.com/opendatalab/MLS-BRN.git…