Query:
Scholar name: Zhang Jing
Abstract :
Oriented object detection (OOD) in remote sensing images (RSIs) is of increasing interest. Since RSIs often contain many oriented objects, it is valuable and challenging to discover the geometric invariance of geospatial objects to improve the model's perception of rotation angle and scale. In this paper, we propose a twin-tower detector (T²Det) for OOD in RSIs. Specifically, T²Det overcomes the challenges posed by the angles and scales of oriented objects by developing a self-supervised (SS) branch that exploits geometric invariance based on the main branch. Then, we design a twin-tower (T²) loss function to enhance the network's ability to perceive the geometric invariance of geospatial objects, where a coarse loss function and a fine loss function are introduced for both branches to optimize the model from coarse to fine. In addition, a T² loss function optimization strategy based on global or refinement modes is developed to achieve a trade-off between the main branch and the SS branch. On three benchmark datasets, including VEDAI, HRSC2016, and NUAA-SIRST, our T²Det achieves competitive performance of 85.15% and 90.66% mAP, and 99.28 Pd, respectively, without unnecessary extra features.
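The idea of a self-supervised branch exploiting geometric invariance can be illustrated with a rotation-consistency check: the features of a rotated image should equal the rotated features of the original image. The sketch below is a minimal illustration under that assumption, not the paper's actual T² loss; the `rotate90` and `consistency_loss` names and the identity "network" in the toy check are illustrative.

```python
import numpy as np

def rotate90(feat, k=1):
    """Rotate a (C, H, W) feature map by k*90 degrees in the spatial plane."""
    return np.rot90(feat, k=k, axes=(1, 2))

def consistency_loss(feat_of_rotated, feat_of_original, k=1):
    """Mean squared discrepancy between the features of a rotated input and
    the rotated features of the original input. A rotation-equivariant
    backbone drives this toward zero."""
    aligned = rotate90(feat_of_original, k=k)
    return float(np.mean((feat_of_rotated - aligned) ** 2))

# Toy check: if the "network" is the identity map, features are exactly
# equivariant and the consistency loss vanishes.
x = np.random.rand(8, 16, 16)
feat_orig = x                   # features of the original image
feat_rot = rotate90(x, k=1)     # features of the rotated image
loss = consistency_loss(feat_rot, feat_orig, k=1)
```

In a real two-branch setup, the loss would instead be computed between the outputs of the main branch and the SS branch and minimized jointly with the detection objective.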
Keyword :
geometric invariance; twin-tower detector; remote sensing images; oriented object detection; self-supervised learning
Cite:
GB/T 7714 | Wang, Liuqian, Zhang, Jing, Li, Jiafeng, et al. T2Det: twin-tower detector with geometric invariance for oriented object detection [J]. REMOTE SENSING LETTERS, 2025, 16(5): 494-505.
MLA | Wang, Liuqian, et al. "T2Det: twin-tower detector with geometric invariance for oriented object detection." REMOTE SENSING LETTERS 16.5 (2025): 494-505.
APA | Wang, Liuqian, Zhang, Jing, Li, Jiafeng, & Zhuo, Li. T2Det: twin-tower detector with geometric invariance for oriented object detection. REMOTE SENSING LETTERS, 2025, 16(5), 494-505.
Abstract :
Domain adaptation object detection (DAOD) uses the labeled data of one scene (i.e., the source domain) and the unlabeled data of another unfamiliar scene (i.e., the target domain) to train the cross-domain object detector. Most existing methods align the overall distribution of features by adversarial adaptive methods. Despite their success, these methods are primarily designed for two-stage detectors that are challenging to deploy, resulting in limited practical applications. In addition, owing to the instability of adversarial domain discriminator training, inducing the detector is difficult using only an adversarial adaptive strategy to extract instance-level domain-invariant features to align the overall distribution. To address these issues, we propose a new cross-domain object detection framework based on the You Only Look Once (YOLO) series of algorithms named Disentanglement Representation YOLO (DRY). The developed method achieves feature disentanglement in the channel dimension and spatial dimensions through domain-invariant feature disentanglement (DIFD) and instance-level feature disentanglement (ILFD) modules, respectively, prompting the detector to extract domain-invariant features. Experiments demonstrate that our model outperforms existing methods. It achieved an average accuracy value of 42.7 on the Cityscapes to FoggyCityscapes benchmark and significantly outperformed all other methods on human and car objects. The average accuracy values of 49.0 and 49.5 achieved on the SIM10K to Cityscapes and KITTI to Cityscapes scenarios, respectively, are superior to those of existing methods. Extensive experimental results on various datasets verify that the proposed DRY method is effective and widely applicable. The code is available at https://github.com/BJUTsipl/DRY.
Keyword :
Object detection; Cross-domain detection; Unsupervised domain adaptation; Disentangled representation learning
Cite:
GB/T 7714 | Li, Jiafeng, Zhi, Mengxun, Zheng, Yongyu, et al. Coarse-to-fine domain adaptation object detection with feature disentanglement [J]. INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2025.
MLA | Li, Jiafeng, et al. "Coarse-to-fine domain adaptation object detection with feature disentanglement." INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS (2025).
APA | Li, Jiafeng, Zhi, Mengxun, Zheng, Yongyu, Zhuo, Li, & Zhang, Jing. Coarse-to-fine domain adaptation object detection with feature disentanglement. INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2025.
Abstract :
The rampant use of forgery techniques poses a significant threat to the security of celebrities' identities. Although current deepfake detection methods have shown effectiveness when dealing with specific public face forgery datasets, their reliability diminishes when applied to open data. Moreover, these methods are susceptible to re-compression and mainly rely on pixel-level abnormalities in forgery faces. In this study, we present a novel approach to detecting face forgery by leveraging individual speaking patterns of facial expressions and head movements. Our method utilizes potential motion patterns and inter-frame variations to effectively differentiate between fake and real videos. We propose an end-to-end dual-branch detection network, named the spatial-temporal transformer (STT), which aims to safeguard the identity of the person-of-interest (POI) from deepfaking. The STT incorporates the spatial transformer (ST) to establish the connection between facial expressions and head movements, while the temporal transformer (TT) exploits inconsistencies in facial attribute changes. Additionally, we introduce a central compression loss to enhance the detection performance. Extensive experiments are conducted to evaluate the effectiveness of the STT, and the results demonstrate its superiority over other SOTA methods in detecting forgery videos involving POIs. Furthermore, our network exhibits resilience to pixel-level re-compression perturbations, making it a robust solution in the face of evolving forgery techniques.
Keyword :
Central compression loss; Speaking pattern; Person-of-interest; Spatial-temporal transformer; Deepfake video detection; Eye gaze
Cite:
GB/T 7714 | Lu, Dingyu, Liu, Zihou, Zhang, Dongming, et al. Spatial-temporal transformer network for protecting person-of-interest from deepfaking [J]. MULTIMEDIA SYSTEMS, 2025, 31(1).
MLA | Lu, Dingyu, et al. "Spatial-temporal transformer network for protecting person-of-interest from deepfaking." MULTIMEDIA SYSTEMS 31.1 (2025).
APA | Lu, Dingyu, Liu, Zihou, Zhang, Dongming, Zhang, Jing, & Jin, Guoqing. Spatial-temporal transformer network for protecting person-of-interest from deepfaking. MULTIMEDIA SYSTEMS, 2025, 31(1).
Abstract :
Group activity recognition can remarkably improve the understanding of video content by analyzing human behaviors and activities in videos. We propose a random walk graph convolutional network (RWGCN) for group activity recognition. (1) Considering the limitation of the convolutional structure on the visual information of group activities, a position feature extraction module is used to compensate for the loss of visual information. (2) A graph convolutional network (GCN) with distance-adaptive edge relations is constructed using individuals as graph nodes to identify the intrinsic relationships among the individuals in group activities. (3) A Lévy flight random walk mechanism is introduced into the GCN to obtain information from different nodes and integrate the previous position information to recognize group activity. Extensive experiments on the publicly available CAD and CAE datasets and the self-built BJUT-GAD dataset show that our RWGCN achieves MPCA of 95.49%, 94.82%, and 96.02%, respectively, making it competitive with other group activity recognition methods.
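The Lévy flight random walk in step (3) can be sketched in isolation: hop counts are drawn from a heavy-tailed power-law distribution, so the walk mixes many short local moves with occasional long jumps across the graph, gathering information from both nearby and distant nodes. This is an illustrative sketch of the general mechanism, not the paper's implementation; the power-law sampler and function names are assumptions.

```python
import numpy as np

def levy_step(alpha=1.5, rng=None):
    """Draw a heavy-tailed hop count by inverse-transform sampling of a
    power law p(l) ~ l^(-alpha): mostly 1-hop moves, occasional long jumps."""
    if rng is None:
        rng = np.random.default_rng()
    u = rng.random()
    return max(1, int(u ** (-1.0 / alpha)))

def levy_flight_walk(adj, start, steps, alpha=1.5, seed=0):
    """Random walk on a graph (0/1 adjacency matrix) whose per-move hop
    count follows a Levy-like distribution, mixing local and long-range
    neighborhood context."""
    rng = np.random.default_rng(seed)
    node, visited = start, [start]
    for _ in range(steps):
        for _ in range(levy_step(alpha, rng)):
            neighbors = np.flatnonzero(adj[node])
            if neighbors.size == 0:
                break
            node = int(rng.choice(neighbors))
        visited.append(node)
    return visited

# Toy 4-node cycle graph
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]])
path = levy_flight_walk(adj, start=0, steps=5)
```

In a GCN, the nodes visited by such walks would determine which individuals' features are aggregated at each layer.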
Keyword :
Random walk; Graph convolutional network; Group activity recognition; Lévy flight; Position information
Cite:
GB/T 7714 | Kang, Junpeng, Zhang, Jing, Chen, Lin, et al. RWGCN: Random walk graph convolutional network for group activity recognition [J]. APPLIED INTELLIGENCE, 2025, 55(6).
MLA | Kang, Junpeng, et al. "RWGCN: Random walk graph convolutional network for group activity recognition." APPLIED INTELLIGENCE 55.6 (2025).
APA | Kang, Junpeng, Zhang, Jing, Chen, Lin, Zhang, Hui, & Zhuo, Li. RWGCN: Random walk graph convolutional network for group activity recognition. APPLIED INTELLIGENCE, 2025, 55(6).
Abstract :
Standardized regulation of livestreaming is an important element of cyberspace governance. Temporal action localization (TAL) can localize the occurrence of specific actions to better understand human activities. Due to the short duration and inconspicuous boundaries of human-specific actions, it is very cumbersome to obtain sufficient labeled data for training in untrimmed livestreaming. The point-supervised approach requires only a single-frame annotation for each action instance and can effectively balance cost and performance. Therefore, we propose a memory knowledge propagation network (MKP-Net) for point-supervised temporal action localization in livestreaming, in which (1) a plug-and-play memory module is introduced to model prototype features of foreground actions and background knowledge using point-level annotations, (2) a memory knowledge propagation mechanism is used to generate discriminative feature representations in a multi-instance learning pipeline, and (3) localization completeness learning is performed by designing a dual optimization loss for refining and localizing temporal actions. Experimental results show that our method achieves state-of-the-art performance of 61.4% and 49.1% on the THUMOS14 and self-built BJUT-PTAL datasets, respectively, with an inference speed of 711 FPS.
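The multi-instance learning pipeline mentioned in (2) typically aggregates per-snippet class scores into a video-level score by pooling the top-scoring snippets per class. The sketch below shows that standard top-k pooling step under stated assumptions; it is a generic illustration, not MKP-Net's specific architecture, and the `topk_mil_score` name and `k_ratio` parameter are hypothetical.

```python
import numpy as np

def topk_mil_score(snippet_scores, k_ratio=0.125):
    """Aggregate per-snippet class scores of shape (T, C) into a video-level
    score of shape (C,) by averaging the k highest-scoring snippets per
    class -- the usual multi-instance pooling in weakly/point-supervised
    temporal action localization."""
    T, C = snippet_scores.shape
    k = max(1, int(T * k_ratio))
    topk = np.sort(snippet_scores, axis=0)[-k:]  # k largest scores per class
    return topk.mean(axis=0)

# Toy example: a 16-snippet video with a short class-1 action at snippets 4-5.
scores = np.zeros((16, 3))
scores[4:6, 1] = 0.9
video_level = topk_mil_score(scores, k_ratio=0.125)
```

Because only the top snippets contribute, a short action still produces a strong video-level score for its class, which is what lets point-level labels supervise the whole video.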
Keyword :
Memory knowledge propagation; Point-supervised; Livestreaming; Dual optimization loss; Temporal action localization
Cite:
GB/T 7714 | Chen, Lin, Zhang, Jing, Zhang, Yian, et al. MKP-Net: Memory knowledge propagation network for point-supervised temporal action localization in livestreaming [J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 248.
MLA | Chen, Lin, et al. "MKP-Net: Memory knowledge propagation network for point-supervised temporal action localization in livestreaming." COMPUTER VISION AND IMAGE UNDERSTANDING 248 (2024).
APA | Chen, Lin, Zhang, Jing, Zhang, Yian, Kang, Junpeng, & Zhuo, Li. MKP-Net: Memory knowledge propagation network for point-supervised temporal action localization in livestreaming. COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 248.
Abstract :
Since the large-scale annotation of streamer actions is expensive, training with generic action data is a practical approach. Nevertheless, the spatiotemporal differences between generic actions and streamer actions decrease the recognition accuracy. Domain adaptation utilizes labeled data from both the source domain and target domain to mitigate the performance degradation of target domain data, but it relies on (1) the feature distribution of each category satisfying the clustering assumption and (2) the distributions of features of the same category in different domains having minimal discrepancy. Considering that streamer action recognition in live video does not meet the above assumptions, we propose a domain adaptation method with optimized feature distribution for streamer action recognition in live video. The method generates diverse features for each sample through a style transfer module and then uses the proposed metric learning loss to constrain the features to a similar feature space that satisfies the above assumptions. The experimental results show that our method achieves an accuracy of 86.35%, exceeding the SOTA by 4.71%, with an inference speed of 1500 FPS, making it capable of performing streamer action recognition in live video.
Keyword :
Optimized feature distribution; Action recognition; Live video; Domain adaptation; Streamer
Cite:
GB/T 7714 | He, Chen, Zhang, Jing, Chen, Lin, et al. Domain adaptation with optimized feature distribution for streamer action recognition in live video [J]. INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2024, 16(1): 107-125.
MLA | He, Chen, et al. "Domain adaptation with optimized feature distribution for streamer action recognition in live video." INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS 16.1 (2024): 107-125.
APA | He, Chen, Zhang, Jing, Chen, Lin, Zhang, Hui, & Zhuo, Li. Domain adaptation with optimized feature distribution for streamer action recognition in live video. INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2024, 16(1), 107-125.
Abstract :
Livestreaming platforms attract many active streamers and daily users, and their public opinion power poses a major challenge to network regulation. Video scene understanding can promote the efficiency and quality of network regulation, and video instance segmentation is a fundamental task for scene understanding. Due to the presence of small, dense instances and fast-changing scenes in livestreaming scenarios, we propose a Gaussian prior tri-cascaded Transformer, Gp3Former, for video instance segmentation. First, the Mask2Former-VIS encoder is used to enhance the representation of video features at different scales for small instance segmentation. Then, a tri-cascaded Transformer decoder is designed to adapt to the fast-changing scenes in livestreaming, which can extract global, balanced, and local instance features while sacrificing as little scene information as possible. Finally, to cope with the dense instances in livestreaming, a Gaussian prior is imposed during instance association and segmentation to learn the Gaussian distribution of a series of cross-frame instances. The experimental results show that, with an inference efficiency of 19.6 FPS, the proposed method reaches 50.6% AP and 50.0% AR on YouTube-VIS 2019, and 82.9% AP and 82.3% AR on the self-built BJUT-LSD, respectively, which is effective and superior for video instance segmentation in livestreaming scenarios.
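A Gaussian prior over cross-frame instance association can be illustrated simply: score each candidate match by an isotropic Gaussian on the displacement of instance centers between frames, so nearby candidates are strongly preferred. This is a minimal sketch of the general idea, assuming a greedy matcher and hypothetical names (`gaussian_association_score`, `sigma`); it is not Gp3Former's learned distribution.

```python
import numpy as np

def gaussian_association_score(center_prev, center_curr, sigma=10.0):
    """Match score under an isotropic Gaussian motion prior: close to 1 for
    nearby centers, decaying smoothly with squared displacement."""
    d2 = np.sum((np.asarray(center_curr, float)
                 - np.asarray(center_prev, float)) ** 2)
    return float(np.exp(-d2 / (2.0 * sigma ** 2)))

def associate(prev_centers, curr_centers, sigma=10.0):
    """Greedy association: each previous-frame instance picks the
    current-frame candidate with the highest Gaussian prior score."""
    return [int(np.argmax([gaussian_association_score(p, c, sigma)
                           for c in curr_centers]))
            for p in prev_centers]

# Two instances swap list order between frames; the prior re-associates them.
prev = [(10, 10), (100, 50)]
curr = [(98, 52), (12, 11)]
matches = associate(prev, curr)
```

With dense instances, such a prior suppresses identity switches between visually similar neighbors because distant candidates receive near-zero scores.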
Keyword :
Livestreaming; tri-cascaded; video scenarios; Gaussian prior; video instance segmentation
Cite:
GB/T 7714 | Li, Wensheng, Zhang, Jing, Zhuo, Li. Gp3Former: Gaussian Prior Tri-Cascaded Transformer for Video Instance Segmentation in Livestreaming Scenarios [J]. IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2024.
MLA | Li, Wensheng, et al. "Gp3Former: Gaussian Prior Tri-Cascaded Transformer for Video Instance Segmentation in Livestreaming Scenarios." IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE (2024).
APA | Li, Wensheng, Zhang, Jing, & Zhuo, Li. Gp3Former: Gaussian Prior Tri-Cascaded Transformer for Video Instance Segmentation in Livestreaming Scenarios. IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2024.
Abstract :
The detection of unknown objects is a challenging task in computer vision because, although real-world detection object categories are diverse, existing object-detection training sets cover a limited number of object categories. Most existing approaches use two-stage networks to improve a model's ability to characterize objects of unknown classes, which leads to slow inference. To address this issue, we propose a single-stage unknown object detection method based on the contrastive language-image pre-training (CLIP) model and pseudo-labelling, called CLIP-YOLO. First, a visual language embedding alignment method is introduced, and a channel-grouped enhanced coordinate attention module is embedded into a YOLO-series detection head and feature-enhancing component to improve the model's ability to characterize and detect unknown category objects. Second, pseudo-labelling generation is optimized based on the CLIP model to expand the diversity of the training set and enhance coverage of unknown object categories. We validated this method on four challenging datasets: MSCOCO, ILSVRC, Visual Genome, and PASCAL VOC. The results show that our method achieves higher accuracy and faster inference, yielding better unknown object detection performance. The source code is available at https://github.com/BJUTsipl/CLIP-YOLO.
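CLIP-style pseudo-labelling can be sketched as follows: embed each candidate region and each class name in a shared space, assign the most similar class by cosine similarity, and keep only assignments above a confidence threshold. This is a generic illustration under assumed names (`generate_pseudo_labels`, threshold `tau`) with toy 2-D embeddings, not the paper's optimized pipeline or the real CLIP API.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def generate_pseudo_labels(region_embeds, text_embeds, class_names, tau=0.8):
    """Assign each region the class whose text embedding is most similar,
    keeping only matches whose similarity reaches the threshold tau;
    low-confidence regions stay unlabeled (None)."""
    labels = []
    for r in region_embeds:
        sims = [cosine_sim(r, t) for t in text_embeds]
        best = int(np.argmax(sims))
        labels.append(class_names[best] if sims[best] >= tau else None)
    return labels

# Toy embeddings: region 0 aligns with "dog"; region 1 matches nothing well.
text_embeds = np.array([[1.0, 0.0], [0.0, 1.0]])
regions = np.array([[0.95, 0.05], [0.6, 0.62]])
labels = generate_pseudo_labels(regions, text_embeds, ["dog", "cat"], tau=0.9)
```

The kept pseudo-labels then augment the training set, which is how the method extends coverage to categories absent from the original annotations.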
Keyword :
Single-stage; Pseudo-labeling; Zero-shot detection; CLIP
Cite:
GB/T 7714 | Li, Jiafeng, Sun, Shengyao, Zhang, Kang, et al. Single-stage zero-shot object detection network based on CLIP and pseudo-labeling [J]. INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2024.
MLA | Li, Jiafeng, et al. "Single-stage zero-shot object detection network based on CLIP and pseudo-labeling." INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS (2024).
APA | Li, Jiafeng, Sun, Shengyao, Zhang, Kang, Zhang, Jing, & Zhuo, Li. Single-stage zero-shot object detection network based on CLIP and pseudo-labeling. INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2024.
Abstract :
High-resolution remote sensing images (HR-RSIs) have a strong dependency between geospatial objects and background. Considering the complex spatial structure and multiscale objects in HR-RSIs, fully mining spatial information directly determines the quality of semantic segmentation. In this paper, we focus on a Spatial-specific Transformer with involution for semantic segmentation of HR-RSIs. First, we integrate a spatial-specific involution branch with a self-attention branch to form a Spatial-specific Transformer backbone that produces multilevel features with global and spatial information without additional parameters. Then, we introduce multiscale feature representation with large window attention into the Swin Transformer to capture multiscale contextual information. Finally, we add a geospatial feature supplement branch in the semantic segmentation decoder to mitigate the loss of semantic information caused by down-sampling multiscale features of geospatial objects. Experimental results demonstrate that our method achieves a competitive semantic segmentation performance of 87.61% and 80.08% mIoU on the Potsdam and Vaihingen datasets, respectively.
Cite:
GB/T 7714 | Wu, Xinjia, Zhang, Jing, Li, Wensheng, et al. Spatial-specific Transformer with involution for semantic segmentation of high-resolution remote sensing images [J]. INTERNATIONAL JOURNAL OF REMOTE SENSING, 2023, 44(4): 1280-1307.
MLA | Wu, Xinjia, et al. "Spatial-specific Transformer with involution for semantic segmentation of high-resolution remote sensing images." INTERNATIONAL JOURNAL OF REMOTE SENSING 44.4 (2023): 1280-1307.
APA | Wu, Xinjia, Zhang, Jing, Li, Wensheng, Li, Jiafeng, Zhuo, Li, & Zhang, Jie. Spatial-specific Transformer with involution for semantic segmentation of high-resolution remote sensing images. INTERNATIONAL JOURNAL OF REMOTE SENSING, 2023, 44(4), 1280-1307.
Abstract :
As an emerging field of network content production, live video has long been in a vacuum zone of cyberspace governance. Streamer action recognition is conducive to the supervision of live video content. In view of the diversity and imbalance of streamer actions, it is attractive to introduce few-shot learning to realize streamer action recognition. Therefore, a streamer action recognition method in live video based on a meta-learning paradigm and CosAttn is proposed, including: (1) the backbone network is improved by pretraining on training set samples similar to the streamer actions to be recognized; (2) video-level features are extracted by an R(2+1)D-18 backbone and global average pooling in the meta-learning paradigm; (3) the streamer action is recognized by calculating cosine similarity after sending the video-level features to CosAttn to generate a streamer action category prototype. Experimental results on several real-world action recognition datasets demonstrate the effectiveness of our method.
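The prototype-and-cosine-similarity step in (3) follows the standard few-shot recipe: average each class's support features into a prototype, then classify a query by its most cosine-similar prototype. The sketch below shows that generic recipe with plain mean prototypes; CosAttn's attention-based prototype generation is not reproduced here, and the `prototype_classify` name is an assumption.

```python
import numpy as np

def prototype_classify(support_feats, support_labels, query_feat):
    """Few-shot classification: build one prototype per class as the mean of
    its support features, L2-normalize, and return the class whose prototype
    has the highest cosine similarity with the query feature."""
    classes = np.unique(support_labels)
    protos = np.stack([support_feats[support_labels == c].mean(axis=0)
                       for c in classes])
    protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    return int(classes[np.argmax(protos @ q)])

# Toy 2-way, 2-shot episode with 2-D video-level features.
support = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
pred = prototype_classify(support, labels, np.array([0.8, 0.2]))
```

Because only a prototype per class is needed, new streamer action categories can be recognized from a handful of labeled clips without retraining the backbone.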
Keyword :
Prototypes; meta-learning paradigm; CosAttn; Optimization; Streaming media; Training; Feature extraction; Testing; Live video; streamer action recognition; Task analysis; few-shot learning
Cite:
GB/T 7714 | He, Chen, Zhang, Jing, Yao, Jiacheng, et al. Meta-Learning Paradigm and CosAttn for Streamer Action Recognition in Live Video [J]. IEEE SIGNAL PROCESSING LETTERS, 2022, 29: 1097-1101.
MLA | He, Chen, et al. "Meta-Learning Paradigm and CosAttn for Streamer Action Recognition in Live Video." IEEE SIGNAL PROCESSING LETTERS 29 (2022): 1097-1101.
APA | He, Chen, Zhang, Jing, Yao, Jiacheng, Zhuo, Li, & Tian, Qi. Meta-Learning Paradigm and CosAttn for Streamer Action Recognition in Live Video. IEEE SIGNAL PROCESSING LETTERS, 2022, 29, 1097-1101.