Query:
Scholar name: Zhang Jing
Abstract :
Domain adaptation object detection (DAOD) uses the labeled data of one scene (i.e., the source domain) and the unlabeled data of another, unfamiliar scene (i.e., the target domain) to train a cross-domain object detector. Most existing methods align the overall feature distribution through adversarial adaptation. Despite their success, these methods are primarily designed for two-stage detectors that are challenging to deploy, which limits their practical application. In addition, owing to the instability of adversarial domain-discriminator training, it is difficult to induce the detector to extract instance-level domain-invariant features and align the overall distribution using an adversarial adaptive strategy alone. To address these issues, we propose a new cross-domain object detection framework based on the You Only Look Once (YOLO) series of algorithms, named Disentanglement Representation YOLO (DRY). The method achieves feature disentanglement in the channel and spatial dimensions through domain-invariant feature disentanglement (DIFD) and instance-level feature disentanglement (ILFD) modules, respectively, prompting the detector to extract domain-invariant features. Experiments demonstrate that our model outperforms existing methods. It achieves an average accuracy of 42.7 on the Cityscapes to FoggyCityscapes benchmark and significantly outperforms all other methods on human and car objects. The average accuracy values of 49.0 and 49.5 achieved on the SIM10K to Cityscapes and KITTI to Cityscapes scenarios, respectively, are also superior to those of existing methods. Extensive experimental results on various datasets verify that the proposed DRY method is effective and widely applicable. The code is available at https://github.com/BJUTsipl/DRY.
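For readers unfamiliar with the adversarial adaptive strategy that the abstract builds on and critiques, the sketch below shows a standard gradient-reversal domain discriminator used for feature alignment in many DAOD methods. It is a generic baseline, not DRY's DIFD/ILFD modules; all layer sizes and names are illustrative assumptions.

```python
# Sketch of a gradient-reversal domain discriminator (generic adversarial
# baseline, not DRY's DIFD/ILFD modules); names and sizes are illustrative.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradients push the backbone toward domain-invariant features
        # while the discriminator learns to tell source from target.
        return -ctx.lambd * grad_output, None

class DomainDiscriminator(nn.Module):
    def __init__(self, in_dim=256, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(inplace=True), nn.Linear(256, 1))

    def forward(self, feat):                      # feat: (N, in_dim) pooled features
        rev = GradReverse.apply(feat, self.lambd)
        return self.net(rev)                      # domain logits (source vs. target)
```

A binary cross-entropy loss on these logits, summed with the detection loss, gives the usual adversarial alignment objective whose training instability the abstract refers to.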
Keyword :
Object detection; Cross-domain detection; Unsupervised domain adaptation; Disentangled representation learning
Cite:
GB/T 7714: Li, Jiafeng, Zhi, Mengxun, Zheng, Yongyu, et al. Coarse-to-fine domain adaptation object detection with feature disentanglement [J]. INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2025.
MLA: Li, Jiafeng, et al. "Coarse-to-fine domain adaptation object detection with feature disentanglement." INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS (2025).
APA: Li, Jiafeng, Zhi, Mengxun, Zheng, Yongyu, Zhuo, Li, Zhang, Jing. Coarse-to-fine domain adaptation object detection with feature disentanglement. INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2025.
Abstract :
Oriented object detection (OOD) in remote sensing images (RSIs) is of increasing interest. Since RSIs often contain many oriented objects, it is valuable and challenging to discover the geometric invariance of geospatial objects to improve the model's perception of rotation angle and scale. In this paper, we propose a twin-tower detector (T2Det) for OOD in RSIs. Specifically, T2Det overcomes the challenges posed by the angles and scales of oriented objects by developing a self-supervised (SS) branch that exploits geometric invariance on top of the main branch. We then design a twin-tower (T2) loss function to enhance the network's ability to perceive the geometric invariance of geospatial objects, where a coarse loss function and a fine loss function are introduced for the two branches to optimize the model from coarse to fine. In addition, a T2 loss optimization strategy based on global or refinement modes is developed to achieve a trade-off between the main branch and the SS branch. On three benchmark datasets, VEDAI, HRSC2016, and NUAA-SIRST, our T2Det achieves competitive performance of 85.15% and 90.66% mAP and 99.28 P_d, respectively, without unnecessary extra features.
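As a rough illustration of how a self-supervised branch can exploit geometric invariance, the sketch below penalizes feature discrepancy between an image and a rotated copy. It is a generic rotation-consistency objective with assumed names, not the T2 loss defined in the paper.

```python
# Generic rotation-consistency objective (assumed names; not the paper's T2 loss).
import torch
import torch.nn.functional as F

def rotation_consistency_loss(backbone, images, k=1):
    """images: (N, 3, H, W); backbone returns (N, C) embeddings."""
    feats = backbone(images)
    rotated = torch.rot90(images, k, dims=(2, 3))   # rotate by k * 90 degrees
    feats_rot = backbone(rotated)
    # 0 when features are rotation-invariant, larger otherwise.
    return 1.0 - F.cosine_similarity(feats, feats_rot, dim=1).mean()
```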
Keyword :
geometric invariance; twin-tower detector; Remote sensing images; oriented object detection; self-supervised learning
Cite:
GB/T 7714: Wang, Liuqian, Zhang, Jing, Li, Jiafeng, et al. T2Det: twin-tower detector with geometric invariance for oriented object detection [J]. REMOTE SENSING LETTERS, 2025, 16(5): 494-505.
MLA: Wang, Liuqian, et al. "T2Det: twin-tower detector with geometric invariance for oriented object detection." REMOTE SENSING LETTERS 16.5 (2025): 494-505.
APA: Wang, Liuqian, Zhang, Jing, Li, Jiafeng, Zhuo, Li. T2Det: twin-tower detector with geometric invariance for oriented object detection. REMOTE SENSING LETTERS, 2025, 16(5), 494-505.
Abstract :
Graph convolutional networks (graph models for short) are crucial for understanding model decisions through mathematical white-box interpretation, which can radically improve the performance and credibility of downstream artificial intelligence applications. To address the limitations of existing interpretability work on over-smoothing and over-squashing, we propose an explainable graph model based on nonlinear catastrophe theory and apply it to group activity recognition to validate the usefulness of interpretability. (1) We introduce catastrophe theory to explore the internal processes of graph models and construct explainable dynamical equations of the graph convolutional network; (2) because graph node features that lose uniqueness lead to over-smoothing, which reduces the discriminative power of the graph model, we propose a mathematical method to predict over-smoothing; (3) to counter over-squashing, in which node feature values are excessively compressed, we design a channel expansion unit that extends the transmission paths of graph nodes and alleviates over-squashing in the graph structure. Finally, we apply our model to group activity recognition tasks to capture complex interactions within groups. We obtain competitive results on five publicly available graph-structure datasets (Actor, Chameleon, Texas, Cornell, Cora) and our self-built group activity dataset. Our model can effectively capture node- and graph-level features with stronger generalization capabilities. For complex and diverse real-world group activity data, our model offers intuitive graph-level explanations for group activity analysis. Through the analysis of over-smoothing and over-squashing, our method extends new theoretical approaches in explainable artificial intelligence.
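A common way to make over-smoothing measurable, which is the phenomenon this paper predicts with catastrophe theory, is to track how similar node features become after each layer. The sketch below is such a generic diagnostic, not the paper's dynamical-equation analysis; the function name is an assumption.

```python
# Generic over-smoothing diagnostic: mean pairwise cosine distance of node
# features after a layer (not the paper's catastrophe-theory predictor).
import torch
import torch.nn.functional as F

def mean_pairwise_distance(node_feats):
    """node_feats: (N, C) node embeddings produced by one GCN layer."""
    x = F.normalize(node_feats, dim=1)
    sim = x @ x.t()                                  # (N, N) cosine similarities
    n = x.size(0)
    off_diag = sim.sum() - sim.diag().sum()
    return 1.0 - off_diag / (n * (n - 1))            # approaches 0 as features collapse
```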
Keyword :
Explainable; Group activity recognition; Graph convolutional network; Over-smoothing; Catastrophe theory
Cite:
GB/T 7714: Kang, Junpeng, Zhang, Jing, Chen, Lin, et al. Explainable graph convolutional network based on catastrophe theory and its application to group activity recognition [J]. ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2025, 150.
MLA: Kang, Junpeng, et al. "Explainable graph convolutional network based on catastrophe theory and its application to group activity recognition." ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE 150 (2025).
APA: Kang, Junpeng, Zhang, Jing, Chen, Lin, Zhang, Hui, Zhuo, Li. Explainable graph convolutional network based on catastrophe theory and its application to group activity recognition. ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2025, 150.
Abstract :
The rampant use of forgery techniques poses a significant threat to the security of celebrities' identities. Although current deepfake detection methods have shown effectiveness on specific public face forgery datasets, their reliability diminishes when applied to open data. Moreover, these methods are susceptible to re-compression and rely mainly on pixel-level abnormalities in forged faces. In this study, we present a novel approach to detecting face forgery by leveraging individual speaking patterns of facial expressions and head movements. Our method utilizes potential motion patterns and inter-frame variations to effectively differentiate between fake and real videos. We propose an end-to-end dual-branch detection network, named the spatial-temporal transformer (STT), which aims to safeguard the identity of the person-of-interest (POI) from deepfaking. The STT incorporates a spatial transformer (ST) to establish the connection between facial expressions and head movements, while a temporal transformer (TT) exploits inconsistencies in facial attribute changes. Additionally, we introduce a central compression loss to enhance detection performance. Extensive experiments demonstrate the effectiveness of the STT and its superiority over other state-of-the-art (SOTA) methods in detecting forged videos involving POIs. Furthermore, our network exhibits resilience to pixel-level re-compression perturbations, making it a robust solution in the face of evolving forgery techniques.
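The exact form of the central compression loss is not given in the abstract; the sketch below shows one plausible center-loss-style reading, in which features of real videos are compressed toward a learned center so fakes fall outside the compact cluster. All names, dimensions, and the label convention are assumptions.

```python
# One plausible reading of a "central compression" objective (assumption):
# real-video features are pulled toward a learned center; labels: 1 = real, 0 = fake.
import torch
import torch.nn as nn

class CentralCompressionLoss(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.center = nn.Parameter(torch.zeros(feat_dim))

    def forward(self, feats, labels):
        real = feats[labels == 1]                    # (M, feat_dim)
        if real.numel() == 0:
            return feats.new_zeros(())
        return ((real - self.center) ** 2).sum(dim=1).mean()
```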
Keyword :
Central compression loss; Speaking pattern; Person-of-interest; Spatial-temporal transformer; Deepfake video detection; Eye gaze
Cite:
GB/T 7714: Lu, Dingyu, Liu, Zihou, Zhang, Dongming, et al. Spatial-temporal transformer network for protecting person-of-interest from deepfaking [J]. MULTIMEDIA SYSTEMS, 2025, 31(1).
MLA: Lu, Dingyu, et al. "Spatial-temporal transformer network for protecting person-of-interest from deepfaking." MULTIMEDIA SYSTEMS 31.1 (2025).
APA: Lu, Dingyu, Liu, Zihou, Zhang, Dongming, Zhang, Jing, Jin, Guoqing. Spatial-temporal transformer network for protecting person-of-interest from deepfaking. MULTIMEDIA SYSTEMS, 2025, 31(1).
Abstract :
Group activity recognition can remarkably improve the understanding of video content by analyzing human behaviors and activities in videos. We propose a random walk graph convolutional network (RWGCN) for group activity recognition. (1) Considering the limitations of the convolutional structure in capturing the visual information of group activities, a position feature extraction module is used to compensate for the loss of visual information. (2) A graph convolutional network (GCN) with distance-adaptive edge relations is constructed, using individuals as graph nodes, to identify the intrinsic relationships among the individuals in group activities. (3) A Lévy flight random walk mechanism is introduced into the GCN to obtain information from different nodes and integrate the previous position information to recognize group activity. Extensive experiments on the publicly available CAD and CAE datasets and the self-built BJUT-GAD dataset show that our RWGCN achieves MPCA of 95.49%, 94.82%, and 96.02%, respectively, making it more competitive than other group activity recognition methods.
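To illustrate the Lévy flight idea in a graph setting, the sketch below samples a heavy-tailed hop count and propagates node features accordingly, so occasional long jumps complement ordinary one-hop aggregation. The distribution choice and parameters are assumptions, not the paper's formulation.

```python
# Lévy-flight-style propagation (parameters are assumptions): a heavy-tailed
# hop count occasionally makes long jumps instead of only 1-hop updates.
import torch

def levy_flight_propagate(x, adj_norm, alpha=1.5, max_hops=5):
    """x: (N, C) node features; adj_norm: (N, N) normalized adjacency."""
    hops = int(torch.distributions.Pareto(1.0, alpha).sample().clamp(max=max_hops).item())
    out = x
    for _ in range(hops):
        out = adj_norm @ out                         # one hop of neighbourhood aggregation
    return out
```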
Keyword :
Random walk; Graph convolutional network; Group activity recognition; Levy flight; Position information
Cite:
GB/T 7714: Kang, Junpeng, Zhang, Jing, Chen, Lin, et al. RWGCN: Random walk graph convolutional network for group activity recognition [J]. APPLIED INTELLIGENCE, 2025, 55(6).
MLA: Kang, Junpeng, et al. "RWGCN: Random walk graph convolutional network for group activity recognition." APPLIED INTELLIGENCE 55.6 (2025).
APA: Kang, Junpeng, Zhang, Jing, Chen, Lin, Zhang, Hui, Zhuo, Li. RWGCN: Random walk graph convolutional network for group activity recognition. APPLIED INTELLIGENCE, 2025, 55(6).
Abstract :
Standardized regulation of livestreaming is an important element of cyberspace governance. Temporal action localization (TAL) localizes the occurrence of specific actions to better understand human activities. Because of the short duration and inconspicuous boundaries of human-specific actions, obtaining sufficient labeled training data from untrimmed livestreaming is very cumbersome. The point-supervised approach requires only a single-frame annotation for each action instance and can effectively balance cost and performance. We therefore propose a memory knowledge propagation network (MKP-Net) for point-supervised temporal action localization in livestreaming, in which (1) a plug-and-play memory module is introduced to model prototype features of foreground actions and background knowledge using point-level annotations, (2) a memory knowledge propagation mechanism is used to generate discriminative feature representations in a multi-instance learning pipeline, and (3) localization completeness learning is performed by designing a dual optimization loss for refining and localizing temporal actions. Experimental results show that our method achieves state-of-the-art results of 61.4% and 49.1% on the THUMOS14 and self-built BJUT-PTAL datasets, respectively, with an inference speed of 711 FPS.
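A minimal sketch of a prototype memory of the kind the abstract describes: snippet features are scored against stored foreground/background prototypes, and the resulting scores can feed a multi-instance learning head. The prototype count and feature size are placeholder assumptions.

```python
# Minimal prototype memory (sizes are placeholders): snippet features are
# scored against learnable action/background prototypes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeMemory(nn.Module):
    def __init__(self, num_prototypes=21, feat_dim=2048):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, feat_dim))

    def forward(self, snippet_feats):
        """snippet_feats: (T, feat_dim) features of one untrimmed video."""
        sim = F.normalize(snippet_feats, dim=1) @ F.normalize(self.prototypes, dim=1).t()
        return sim                                   # (T, num_prototypes) similarity scores
```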
Keyword :
Memory knowledge propagation; Point-supervised; Livestreaming; Dual optimization loss; Temporal action localization
Cite:
GB/T 7714: Chen, Lin, Zhang, Jing, Zhang, Yian, et al. MKP-Net: Memory knowledge propagation network for point-supervised temporal action localization in livestreaming [J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 248.
MLA: Chen, Lin, et al. "MKP-Net: Memory knowledge propagation network for point-supervised temporal action localization in livestreaming." COMPUTER VISION AND IMAGE UNDERSTANDING 248 (2024).
APA: Chen, Lin, Zhang, Jing, Zhang, Yian, Kang, Junpeng, Zhuo, Li. MKP-Net: Memory knowledge propagation network for point-supervised temporal action localization in livestreaming. COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 248.
Abstract :
Livestreaming platforms attract many active streamers and daily users, and their public opinion power poses a major challenge to network regulation. Video scene understanding can improve the efficiency and quality of network regulation, and video instance segmentation is a fundamental task for scene understanding. Given the presence of small, dense instances and fast-changing scenes in livestreaming scenarios, we propose Gp3Former, a Gaussian prior tri-cascaded Transformer for video instance segmentation. First, the Mask2Former-VIS encoder is used to enhance the representation of video features at different scales for small-instance segmentation. Then, a tri-cascaded Transformer decoder is designed to adapt to the fast-changing scenes in livestreaming; it extracts global, balanced, and local instance features while sacrificing as little scene information as possible. Finally, to cope with the dense instances in livestreaming, a Gaussian prior is imposed during instance association and segmentation to learn the Gaussian distribution of a series of cross-frame instances. The experimental results show that, with an inference efficiency of 19.6 FPS, the proposed method reaches 50.6% AP and 50.0% AR on YouTube-VIS 2019 and 82.9% AP and 82.3% AR on the self-built BJUT-LSD, demonstrating effective and superior video instance segmentation in livestreaming scenarios.
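The sketch below shows one way a Gaussian prior can be imposed during cross-frame instance association: appearance similarity is re-weighted by a Gaussian of the center displacement between frames, reflecting that the same instance rarely moves far between adjacent frames. The exact formulation in Gp3Former may differ; sigma and the normalization are assumptions.

```python
# Gaussian re-weighting of cross-frame appearance similarity (sigma is an assumption).
import torch

def gaussian_weighted_association(sim, centers_t, centers_t1, sigma=0.1):
    """sim: (M, N) appearance similarity; centers_*: (M, 2) and (N, 2) normalized box centers."""
    dist2 = ((centers_t[:, None, :] - centers_t1[None, :, :]) ** 2).sum(-1)   # (M, N)
    prior = torch.exp(-dist2 / (2 * sigma ** 2))     # near 1 for small displacement
    return sim * prior                               # feed into Hungarian/greedy matching
```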
Keyword :
Livestreaming; tri-cascaded; video scenarios; Gaussian prior; video instance segmentation
Cite:
GB/T 7714: Li, Wensheng, Zhang, Jing, Zhuo, Li. Gp3Former: Gaussian Prior Tri-Cascaded Transformer for Video Instance Segmentation in Livestreaming Scenarios [J]. IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2024, 9(1): 770-784.
MLA: Li, Wensheng, et al. "Gp3Former: Gaussian Prior Tri-Cascaded Transformer for Video Instance Segmentation in Livestreaming Scenarios." IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE 9.1 (2024): 770-784.
APA: Li, Wensheng, Zhang, Jing, Zhuo, Li. Gp3Former: Gaussian Prior Tri-Cascaded Transformer for Video Instance Segmentation in Livestreaming Scenarios. IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2024, 9(1), 770-784.
Abstract :
Since large-scale annotation of streamer actions is expensive, training with generic action data is a practical approach. Nevertheless, the spatiotemporal differences between generic actions and streamer actions decrease recognition accuracy. Domain adaptation utilizes labeled data from both the source domain and the target domain to mitigate the performance degradation on target-domain data, but it relies on (1) the feature distribution of each category satisfying the clustering assumption and (2) the distributions of features of the same category in different domains having minimal discrepancy. Considering that streamer action recognition in live video does not meet these assumptions, we propose a domain adaptation method with optimized feature distribution for streamer action recognition in live video. The method generates diverse features for each sample through a style transfer module and then uses the proposed metric learning loss to constrain the features within a similar feature space so that the above assumptions are satisfied. The experimental results show that our method achieves an accuracy of 86.35%, exceeding the SOTA by 4.71%, and an inference speed of 1500 FPS, making it capable of performing streamer action recognition in live video.
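As an illustration of the metric-learning constraint described above, the sketch below pulls together features of the same action class (across domains and style-transferred variants) and pushes apart different classes. It is a generic contrastive form with an assumed margin, not the paper's exact loss.

```python
# Generic class-contrastive constraint (assumed margin), not the paper's exact loss.
import torch
import torch.nn.functional as F

def class_contrastive_loss(feats, labels, margin=1.0):
    """feats: (N, C) embeddings of source, target, and stylized samples; labels: (N,)."""
    d = torch.cdist(feats, feats)                    # (N, N) pairwise distances
    same = labels[:, None] == labels[None, :]
    eye = torch.eye(len(labels), dtype=torch.bool, device=feats.device)
    pos = d[same & ~eye].mean() if (same & ~eye).any() else d.new_zeros(())
    neg = F.relu(margin - d[~same]).mean() if (~same).any() else d.new_zeros(())
    return pos + neg                                 # pull same class together, push others apart
```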
Keyword :
Optimized feature distribution; Action recognition; Live video; Domain adaptation; Streamer
Cite:
GB/T 7714: He, Chen, Zhang, Jing, Chen, Lin, et al. Domain adaptation with optimized feature distribution for streamer action recognition in live video [J]. INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2024, 16(1): 107-125.
MLA: He, Chen, et al. "Domain adaptation with optimized feature distribution for streamer action recognition in live video." INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS 16.1 (2024): 107-125.
APA: He, Chen, Zhang, Jing, Chen, Lin, Zhang, Hui, Zhuo, Li. Domain adaptation with optimized feature distribution for streamer action recognition in live video. INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2024, 16(1), 107-125.
Abstract :
The detection of unknown objects is a challenging task in computer vision because, although real-world object categories are diverse, existing object-detection training sets cover a limited number of categories. Most existing approaches use two-stage networks to improve a model's ability to characterize objects of unknown classes, which leads to slow inference. To address this issue, we propose a single-stage unknown object detection method based on the contrastive language-image pre-training (CLIP) model and pseudo-labeling, called CLIP-YOLO. First, a visual-language embedding alignment method is introduced, and a channel-grouped enhanced coordinate attention module is embedded into the YOLO-series detection head and feature-enhancing component to improve the model's ability to characterize and detect unknown category objects. Second, pseudo-label generation is optimized based on the CLIP model to expand the diversity of the training set and enhance coverage of unknown object categories. We validated this method on four challenging datasets: MSCOCO, ILSVRC, Visual Genome, and PASCAL VOC. The results show that our method achieves higher accuracy and faster speed, yielding better unknown object detection performance. The source code is available at https://github.com/BJUTsipl/CLIP-YOLO.
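CLIP-based pseudo-labeling can be prototyped with the public openai/CLIP package roughly as below: each candidate region crop is scored against class-name prompts, and confident matches become pseudo-labels for detector training. The prompt template and threshold are assumptions, and the paper's actual pipeline may differ.

```python
# Prototype of CLIP-based pseudo-labeling with the public openai/CLIP package;
# the prompt template and threshold are assumptions.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def pseudo_label(crop: Image.Image, class_names, threshold=0.5):
    image = preprocess(crop).unsqueeze(0).to(device)
    text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    with torch.no_grad():
        logits_per_image, _ = model(image, text)     # (1, num_classes)
        probs = logits_per_image.softmax(dim=-1)[0]
    score, idx = probs.max(dim=0)
    # Keep only confident matches as pseudo-labels for training.
    return (class_names[idx], score.item()) if score >= threshold else (None, score.item())
```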
Keyword :
Single-stage; Pseudo-labeling; Zero-shot detection; CLIP
Cite:
GB/T 7714: Li, Jiafeng, Sun, Shengyao, Zhang, Kang, et al. Single-stage zero-shot object detection network based on CLIP and pseudo-labeling [J]. INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2024, 16(2): 1055-1070.
MLA: Li, Jiafeng, et al. "Single-stage zero-shot object detection network based on CLIP and pseudo-labeling." INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS 16.2 (2024): 1055-1070.
APA: Li, Jiafeng, Sun, Shengyao, Zhang, Kang, Zhang, Jing, Zhuo, Li. Single-stage zero-shot object detection network based on CLIP and pseudo-labeling. INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2024, 16(2), 1055-1070.
Abstract :
High-resolution remote sensing images (HR-RSIs) exhibit a strong dependency between geospatial objects and the background. Given the complex spatial structure and multiscale objects in HR-RSIs, how fully spatial information is mined directly determines the quality of semantic segmentation. In this paper, we focus on a Spatial-specific Transformer with involution for semantic segmentation of HR-RSIs. First, we integrate a spatial-specific involution branch with a self-attention branch to form a Spatial-specific Transformer backbone that produces multilevel features with global and spatial information without additional parameters. Then, we introduce multiscale feature representation with large window attention into the Swin Transformer to capture multiscale contextual information. Finally, we add a geospatial feature supplement branch to the semantic segmentation decoder to mitigate the loss of semantic information caused by down-sampling the multiscale features of geospatial objects. Experimental results demonstrate that our method achieves competitive semantic segmentation performance of 87.61% and 80.08% mIoU on the Potsdam and Vaihingen datasets, respectively.
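For reference, the involution operator that the spatial-specific branch builds on (Li et al., CVPR 2021) can be sketched as below: a per-position kernel is generated from the local feature and applied to the unfolded neighborhood, sharing weights across channels within each group. The hyper-parameters are illustrative, and the coupling with the self-attention branch is not shown.

```python
# Minimal involution layer (stride 1), after Li et al., CVPR 2021; hyper-parameters
# are illustrative and the coupling with self-attention is not shown.
import torch
import torch.nn as nn

class Involution(nn.Module):
    def __init__(self, channels, kernel_size=3, groups=1, reduction=4):
        super().__init__()
        self.k, self.groups, self.channels = kernel_size, groups, channels
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.BatchNorm2d(channels // reduction),
            nn.ReLU(inplace=True))
        self.span = nn.Conv2d(channels // reduction, kernel_size * kernel_size * groups, 1)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        # Kernels are generated per spatial position from the local feature.
        weight = self.span(self.reduce(x))                               # (B, K*K*G, H, W)
        weight = weight.view(b, self.groups, self.k * self.k, h, w).unsqueeze(2)
        # Neighborhoods are unfolded and weighted by the position-specific kernels.
        out = self.unfold(x).view(b, self.groups, c // self.groups, self.k * self.k, h, w)
        return (weight * out).sum(dim=3).view(b, c, h, w)
```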
Cite:
GB/T 7714: Wu, Xinjia, Zhang, Jing, Li, Wensheng, et al. Spatial-specific Transformer with involution for semantic segmentation of high-resolution remote sensing images [J]. INTERNATIONAL JOURNAL OF REMOTE SENSING, 2023, 44(4): 1280-1307.
MLA: Wu, Xinjia, et al. "Spatial-specific Transformer with involution for semantic segmentation of high-resolution remote sensing images." INTERNATIONAL JOURNAL OF REMOTE SENSING 44.4 (2023): 1280-1307.
APA: Wu, Xinjia, Zhang, Jing, Li, Wensheng, Li, Jiafeng, Zhuo, Li, Zhang, Jie. Spatial-specific Transformer with involution for semantic segmentation of high-resolution remote sensing images. INTERNATIONAL JOURNAL OF REMOTE SENSING, 2023, 44(4), 1280-1307.