Query:
Scholar name: Zhang Jing (张菁)
Abstract :
The detection of unknown objects is a challenging task in computer vision because, although real-world object categories are diverse, existing object-detection training sets cover only a limited number of them. Most existing approaches use two-stage networks to improve the model's ability to characterize objects of unknown classes, which leads to slow inference. To address this issue, we propose a single-stage unknown object detection method based on the contrastive language-image pre-training (CLIP) model and pseudo-labelling, called CLIP-YOLO. First, a visual-language embedding alignment method is introduced, and a channel-grouped enhanced coordinate attention module is embedded into the YOLO-series detection head and feature-enhancing components to improve the model's ability to characterize and detect unknown-category objects. Second, pseudo-label generation is optimized based on the CLIP model to expand the diversity of the training set and enhance its coverage of unknown object categories. We validated this method on four challenging datasets: MSCOCO, ILSVRC, Visual Genome, and PASCAL VOC. The results show that our method achieves higher accuracy and faster speed, yielding better unknown object detection performance. The source code is available at https://github.com/BJUTsipl/CLIP-YOLO.
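The visual-language alignment step can be illustrated with a small sketch: region (or grid-cell) embeddings are scored against CLIP text embeddings of class prompts by cosine similarity, so unseen classes only require new prompts. This is a hedged simplification, not the CLIP-YOLO code; the function name and tensor shapes below are assumptions.

```python
import torch
import torch.nn.functional as F

def align_and_classify(region_feats: torch.Tensor,
                       text_embeds: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """Score each detected region against class-name text embeddings.

    region_feats: (N, D) visual embeddings projected into the shared space.
    text_embeds:  (C, D) CLIP text embeddings of class prompts
                  (e.g. "a photo of a {class}"), including unseen classes.
    Returns (N, C) classification logits.
    """
    v = F.normalize(region_feats, dim=-1)   # unit-normalize visual embeddings
    t = F.normalize(text_embeds, dim=-1)    # unit-normalize text embeddings
    return v @ t.t() / temperature          # cosine similarity scaled by temperature

# Toy usage: 5 regions, 100 class prompts (seen + unseen), 512-d shared space.
logits = align_and_classify(torch.randn(5, 512), torch.randn(100, 512))
print(logits.shape)  # torch.Size([5, 100])
```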
Keyword :
Single-stage; Pseudo-labeling; Zero-shot detection; CLIP
Cite:
GB/T 7714 | Li, Jiafeng , Sun, Shengyao , Zhang, Kang et al. Single-stage zero-shot object detection network based on CLIP and pseudo-labeling [J]. | INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS , 2024 . |
MLA | Li, Jiafeng et al. "Single-stage zero-shot object detection network based on CLIP and pseudo-labeling" . | INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS (2024) . |
APA | Li, Jiafeng , Sun, Shengyao , Zhang, Kang , Zhang, Jing , Zhuo, Li . Single-stage zero-shot object detection network based on CLIP and pseudo-labeling . | INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS , 2024 . |
Abstract :
Livestreaming platforms attract many active streamers and daily users, and their influence on public opinion poses a major challenge to network regulation. Video scene understanding can improve the efficiency and quality of network regulation, and video instance segmentation is a fundamental task for scene understanding. Due to the presence of small, dense instances and fast-changing scenes in livestreaming scenarios, we propose Gp3Former, a Gaussian prior tri-cascaded Transformer for video instance segmentation. First, the Mask2Former-VIS encoder is used to enhance the representation of video features at different scales for small-instance segmentation. Then, a tri-cascaded Transformer decoder is designed to adapt to the fast-changing scenes in livestreaming, which can extract global, balanced, and local instance features while sacrificing as little scene information as possible. Finally, to cope with the dense instances in livestreaming, a Gaussian prior is imposed during instance association and segmentation to learn the Gaussian distribution of a series of cross-frame instances. The experimental results show that, with an inference efficiency of 19.6 FPS, the proposed method reaches 50.6% AP and 50.0% AR on YouTube-VIS 2019, and 82.9% AP and 82.3% AR on the self-built BJUT-LSD, which is effective and superior for video instance segmentation in livestreaming scenarios.
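The Gaussian prior on cross-frame instance association can be sketched as follows: an appearance-similarity matrix is down-weighted by a Gaussian of the displacement between instance centers, so implausibly large jumps are suppressed. This is an illustrative simplification, not the Gp3Former implementation; all names are hypothetical.

```python
import torch

def gaussian_prior_association(prev_centers, curr_centers, appearance_sim, sigma=0.1):
    """prev_centers, curr_centers: (M, 2) and (N, 2) normalized (x, y) instance centers.
    appearance_sim: (M, N) appearance similarity between tracked and current instances.
    Returns (M, N) association scores down-weighted by a Gaussian motion prior."""
    d2 = torch.cdist(prev_centers, curr_centers) ** 2   # squared center displacements
    prior = torch.exp(-d2 / (2 * sigma ** 2))            # Gaussian prior on displacement
    return appearance_sim * prior

scores = gaussian_prior_association(torch.rand(3, 2), torch.rand(4, 2), torch.rand(3, 4))
print(scores.shape)  # torch.Size([3, 4])
```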
Keyword :
Livestreaming; tri-cascaded; video scenarios; Gaussian prior; video instance segmentation
Cite:
GB/T 7714 | Li, Wensheng , Zhang, Jing , Zhuo, Li . Gp3Former: Gaussian Prior Tri-Cascaded Transformer for Video Instance Segmentation in Livestreaming Scenarios [J]. | IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE , 2024 . |
MLA | Li, Wensheng et al. "Gp3Former: Gaussian Prior Tri-Cascaded Transformer for Video Instance Segmentation in Livestreaming Scenarios" . | IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE (2024) . |
APA | Li, Wensheng , Zhang, Jing , Zhuo, Li . Gp3Former: Gaussian Prior Tri-Cascaded Transformer for Video Instance Segmentation in Livestreaming Scenarios . | IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE , 2024 . |
Abstract :
Standardized regulation of livestreaming is an important element of cyberspace governance. Temporal action localization (TAL) localizes the occurrence of specific actions to better understand human activities. Due to the short duration and inconspicuous boundaries of human-specific actions, it is very cumbersome to obtain sufficient labeled data for training on untrimmed livestreaming. The point-supervised approach requires only a single-frame annotation for each action instance and can effectively balance cost and performance. Therefore, we propose a memory knowledge propagation network (MKP-Net) for point-supervised temporal action localization in livestreaming, including: (1) a plug-and-play memory module that models prototype features of foreground actions and background knowledge using point-level annotations; (2) a memory knowledge propagation mechanism that generates discriminative feature representations in a multi-instance learning pipeline; and (3) localization completeness learning performed with a dual optimization loss designed for refining and localizing temporal actions. Experimental results show that our method achieves state-of-the-art results of 61.4% and 49.1% on the THUMOS14 and self-built BJUT-PTAL datasets, respectively, with an inference speed of 711 FPS.
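A rough sketch of the prototype-memory idea: class prototypes are EMA-updated from features at point-annotated frames, and snippet-level class scores are obtained by similarity to those prototypes for the multi-instance learning pipeline. This is an assumption-laden toy version, not the MKP-Net module.

```python
import torch
import torch.nn.functional as F

class PrototypeMemory:
    """Toy memory of per-class prototypes, EMA-updated from features at
    point-annotated snippets (hypothetical simplification)."""

    def __init__(self, num_classes: int, dim: int, momentum: float = 0.9):
        self.protos = torch.zeros(num_classes, dim)
        self.momentum = momentum

    def update(self, feat: torch.Tensor, cls: int):
        # EMA update of the class prototype with a point-labelled snippet feature.
        self.protos[cls] = self.momentum * self.protos[cls] + (1 - self.momentum) * feat

    def propagate(self, snippet_feats: torch.Tensor) -> torch.Tensor:
        # Class-activation-like scores: cosine similarity of every snippet to each prototype.
        return F.normalize(snippet_feats, dim=-1) @ F.normalize(self.protos, dim=-1).t()

mem = PrototypeMemory(num_classes=20, dim=256)
mem.update(torch.randn(256), cls=3)
cas = mem.propagate(torch.randn(100, 256))   # (T=100 snippets, 20 classes)
print(cas.shape)
```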
Keyword :
Memory knowledge propagation; Point-supervised; Livestreaming; Dual optimization loss; Temporal action localization
Cite:
GB/T 7714 | Chen, Lin , Zhang, Jing , Zhang, Yian et al. MKP-Net: Memory knowledge propagation network for point-supervised temporal action localization in livestreaming [J]. | COMPUTER VISION AND IMAGE UNDERSTANDING , 2024 , 248 . |
MLA | Chen, Lin et al. "MKP-Net: Memory knowledge propagation network for point-supervised temporal action localization in livestreaming" . | COMPUTER VISION AND IMAGE UNDERSTANDING 248 (2024) . |
APA | Chen, Lin , Zhang, Jing , Zhang, Yian , Kang, Junpeng , Zhuo, Li . MKP-Net: Memory knowledge propagation network for point-supervised temporal action localization in livestreaming . | COMPUTER VISION AND IMAGE UNDERSTANDING , 2024 , 248 . |
Abstract :
Since large-scale annotation of streamer actions is expensive, training with generic action data is a practical alternative. Nevertheless, the spatiotemporal differences between generic actions and streamer actions decrease recognition accuracy. Domain adaptation utilizes labeled data from both the source and target domains to mitigate the performance degradation on target-domain data, but it relies on (1) the feature distribution of each category satisfying the clustering assumption and (2) the distributions of same-category features in different domains having minimal discrepancy. Considering that streamer action recognition in live video does not meet the above assumptions, we propose a domain adaptation method with optimized feature distribution for streamer action recognition in live video. The method generates diverse features for each sample through a style transfer module and then uses the proposed metric learning loss to constrain the features to a similar feature space so that the above assumptions are satisfied. The experimental results show that our method achieves an accuracy of 86.35%, exceeding the SOTA by 4.71%, with an inference speed of 1500 FPS, making it capable of performing streamer action recognition in live video.
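The metric-learning constraint can be illustrated with a simple cross-domain contrastive loss that pulls same-class source/target features together and pushes different-class pairs apart by a margin. The exact loss in the paper differs; this sketch and its names are assumptions.

```python
import torch
import torch.nn.functional as F

def cross_domain_metric_loss(src_feats, src_labels, tgt_feats, tgt_labels, margin=0.5):
    """Pull same-class source/target features together and push different-class
    pairs at least `margin` apart (illustrative, not the paper's definition)."""
    s = F.normalize(src_feats, dim=-1)
    t = F.normalize(tgt_feats, dim=-1)
    dist = torch.cdist(s, t)                                           # (Ns, Nt) pairwise distances
    same = (src_labels.unsqueeze(1) == tgt_labels.unsqueeze(0)).float()
    pull = (same * dist ** 2).sum() / same.sum().clamp(min=1)
    push = ((1 - same) * F.relu(margin - dist) ** 2).sum() / (1 - same).sum().clamp(min=1)
    return pull + push

loss = cross_domain_metric_loss(torch.randn(8, 128), torch.randint(0, 5, (8,)),
                                torch.randn(6, 128), torch.randint(0, 5, (6,)))
print(loss.item())
```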
Keyword :
Optimized feature distribution; Action recognition; Live video; Domain adaptation; Streamer
Cite:
GB/T 7714 | He, Chen , Zhang, Jing , Chen, Lin et al. Domain adaptation with optimized feature distribution for streamer action recognition in live video [J]. | INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS , 2024 . |
MLA | He, Chen et al. "Domain adaptation with optimized feature distribution for streamer action recognition in live video" . | INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS (2024) . |
APA | He, Chen , Zhang, Jing , Chen, Lin , Zhang, Hui , Zhuo, Li . Domain adaptation with optimized feature distribution for streamer action recognition in live video . | INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS , 2024 . |
Abstract :
High-resolution remote sensing images (HR-RSIs) have a strong dependency between geospatial objects and background. Considering the complex spatial structure and multiscale objects in HR-RSIs, how to fully mine spatial information directly determines the quality of semantic segmentation. In this paper, we focus on the Spatial-specific Transformer with involution for semantic segmentation of HR-RSIs. First, we integrate the spatial-specific involution branch with self-attention branch to form a Spatial-specific Transformer backbone to produce multilevel features with global and spatial information without additional parameters. Then, we introduce multiscale feature representation with large window attention into Swin Transformer to capture multiscale contextual information. Finally, we add a geospatial feature supplement branch in the semantic segmentation decoder to mitigate the loss of semantic information caused by down-sampling multiscale features of geospatial objects. Experimental results demonstrate that our method can achieve a competitive semantic segmentation performance of 87.61% and 80.08% mIoU on Potsdam and Vaihingen datasets, respectively.
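For readers unfamiliar with involution, the sketch below shows a minimal involution layer: a position-specific kernel is generated from the local feature and shared across channel groups, giving spatial specificity without the parameter cost of a large convolution. It follows the generic involution formulation, not this paper's exact branch.

```python
import torch
import torch.nn as nn

class Involution2d(nn.Module):
    """Minimal involution layer: the kernel is generated per position and shared
    across channel groups, in the spirit of Li et al., "Involution" (CVPR 2021)."""

    def __init__(self, channels: int, kernel_size: int = 3, groups: int = 4, reduction: int = 4):
        super().__init__()
        self.k, self.g, self.c = kernel_size, groups, channels
        self.reduce = nn.Conv2d(channels, channels // reduction, 1)
        self.span = nn.Conv2d(channels // reduction, groups * kernel_size ** 2, 1)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        kernel = self.span(self.reduce(x))                       # (B, G*K*K, H, W) position-wise kernels
        kernel = kernel.view(b, self.g, self.k ** 2, h, w)
        patches = self.unfold(x).view(b, self.g, c // self.g, self.k ** 2, h, w)
        out = (kernel.unsqueeze(2) * patches).sum(dim=3)          # weight each pixel's neighbourhood
        return out.view(b, c, h, w)

y = Involution2d(64)(torch.randn(2, 64, 32, 32))
print(y.shape)  # torch.Size([2, 64, 32, 32])
```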
Cite:
GB/T 7714 | Wu, Xinjia , Zhang, Jing , Li, Wensheng et al. Spatial-specific Transformer with involution for semantic segmentation of high-resolution remote sensing images [J]. | INTERNATIONAL JOURNAL OF REMOTE SENSING , 2023 , 44 (4) : 1280-1307 . |
MLA | Wu, Xinjia et al. "Spatial-specific Transformer with involution for semantic segmentation of high-resolution remote sensing images" . | INTERNATIONAL JOURNAL OF REMOTE SENSING 44 . 4 (2023) : 1280-1307 . |
APA | Wu, Xinjia , Zhang, Jing , Li, Wensheng , Li, Jiafeng , Zhuo, Li , Zhang, Jie . Spatial-specific Transformer with involution for semantic segmentation of high-resolution remote sensing images . | INTERNATIONAL JOURNAL OF REMOTE SENSING , 2023 , 44 (4) , 1280-1307 . |
Abstract :
As one of the most prevalent forms of we-media, short video has grown exponentially and gradually fallen into a disaster area of infringement. Video fingerprint extraction technology is conducive to the intelligent identification of short videos. In view of various tampering attacks, a short video fingerprint extraction method from audio-visual fingerprint fusion to multi-index hashing is proposed, including: (1) the shot-level fingerprint of a short video is extracted by audio-visual fingerprint fusion after analyzing consistency to eliminate uncertainty at the decision-making layer, in which the visual fingerprint is generated by an R(2+1)D network, and the audio fingerprint is obtained by extracting audio features with masked audio spectral keypoints (MASK) and a convolutional recurrent neural network (CRNN); (2) the shot-level fingerprints are assembled into the data-level fingerprint of the short video by constructing a data-shot-key frame relationship model of the data structure; (3) the short video fingerprint is matched by measuring the weighted Hamming distance over a multi-index hash of the data-level fingerprint. Five experiments are conducted on the CC_Web_Video and Moments_in_Time_Raw_v2 datasets, and the results show that our method can effectively improve the overall performance of short video fingerprinting.
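The multi-index hashing and weighted-Hamming matching step can be sketched as follows: each binary fingerprint is split into substrings that index hash tables, candidates are gathered by exact substring matches, and survivors are ranked by weighted Hamming distance. This is a generic illustration; parameters and names are assumptions.

```python
import numpy as np
from collections import defaultdict

def build_multi_index(fingerprints: np.ndarray, num_tables: int = 4):
    """fingerprints: (N, B) binary codes. Split each code into `num_tables`
    substrings and index fingerprint ids by each substring (multi-index hashing)."""
    tables = [defaultdict(list) for _ in range(num_tables)]
    for t, chunk in enumerate(np.array_split(fingerprints, num_tables, axis=1)):
        for i, sub in enumerate(chunk):
            tables[t][sub.tobytes()].append(i)
    return tables

def query(q: np.ndarray, fingerprints: np.ndarray, tables, weights: np.ndarray):
    """Collect candidates whose substring matches in any table, then rank them
    by weighted Hamming distance to the full query code."""
    candidates = set()
    for t, sub in enumerate(np.array_split(q, len(tables))):
        candidates.update(tables[t].get(sub.tobytes(), []))
    scored = [(i, float(np.sum(weights * (fingerprints[i] != q)))) for i in candidates]
    return sorted(scored, key=lambda s: s[1])

db = np.random.randint(0, 2, (1000, 64), dtype=np.uint8)
tables = build_multi_index(db)
print(query(db[7], db, tables, np.ones(64))[:3])   # the exact match ranks first
```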
Keyword :
Multi-index hashing; Fingerprint extraction; Data-shot-key frame; Short video; Audio-visual
Cite:
GB/T 7714 | Zhang, Shuying , Zhang, Jing , Wang, Yizhou et al. Short video fingerprint extraction: from audio-visual fingerprint fusion to multi-index hashing [J]. | MULTIMEDIA SYSTEMS , 2022 , 29 (3) : 981-1000 . |
MLA | Zhang, Shuying et al. "Short video fingerprint extraction: from audio-visual fingerprint fusion to multi-index hashing" . | MULTIMEDIA SYSTEMS 29 . 3 (2022) : 981-1000 . |
APA | Zhang, Shuying , Zhang, Jing , Wang, Yizhou , Zhuo, Li . Short video fingerprint extraction: from audio-visual fingerprint fusion to multi-index hashing . | MULTIMEDIA SYSTEMS , 2022 , 29 (3) , 981-1000 . |
Abstract :
As an emerging field of network content production, live video has long been in a vacuum zone of cyberspace governance. Streamer action recognition is conducive to the supervision of live video content. In view of the diversity and imbalance of streamer actions, it is attractive to introduce few-shot learning to realize streamer action recognition. Therefore, a streamer action recognition method in live video based on a meta-learning paradigm and CosAttn is proposed, including: (1) training-set samples similar to the streamer actions to be recognized are used for pretraining to improve the backbone network; (2) video-level features are extracted by an R(2+1)D-18 backbone and global average pooling in the meta-learning paradigm; (3) the streamer action is recognized by computing cosine similarity after feeding the video-level features to CosAttn to generate a streamer action category prototype. Experimental results on several real-world action recognition datasets demonstrate the effectiveness of our method.
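The prototype-plus-cosine-similarity classification at the heart of the method can be sketched in a few lines: support video-level features are averaged into class prototypes and query clips are scored by scaled cosine similarity. CosAttn itself is omitted; this is a simplified stand-in, not the paper's module.

```python
import torch
import torch.nn.functional as F

def cosine_prototype_classify(support_feats, support_labels, query_feats, num_classes, scale=10.0):
    """Average support video-level features into class prototypes, then score
    query clips by scaled cosine similarity (few-shot episodic setting)."""
    protos = torch.stack([support_feats[support_labels == c].mean(0) for c in range(num_classes)])
    return scale * F.normalize(query_feats, dim=-1) @ F.normalize(protos, dim=-1).t()

support = torch.randn(10, 512)                        # 5-way 2-shot support set
labels = torch.arange(5).repeat_interleave(2)
logits = cosine_prototype_classify(support, labels, torch.randn(3, 512), num_classes=5)
print(logits.argmax(dim=-1))                          # predicted class per query clip
```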
Keyword :
Prototypes; meta-learning paradigm; CosAttn; Optimization; Streaming media; Training; Feature extraction; Testing; Live video; streamer action recognition; Task analysis; few-shot learning
Cite:
GB/T 7714 | He, Chen , Zhang, Jing , Yao, Jiacheng et al. Meta-Learning Paradigm and CosAttn for Streamer Action Recognition in Live Video [J]. | IEEE SIGNAL PROCESSING LETTERS , 2022 , 29 : 1097-1101 . |
MLA | He, Chen et al. "Meta-Learning Paradigm and CosAttn for Streamer Action Recognition in Live Video" . | IEEE SIGNAL PROCESSING LETTERS 29 (2022) : 1097-1101 . |
APA | He, Chen , Zhang, Jing , Yao, Jiacheng , Zhuo, Li , Tian, Qi . Meta-Learning Paradigm and CosAttn for Streamer Action Recognition in Live Video . | IEEE SIGNAL PROCESSING LETTERS , 2022 , 29 , 1097-1101 . |
Abstract :
With the advent of the we-media era, live video is being sought after by more and more web users. Effectively identifying and supervising streamer activities in live video is of great significance for promoting the high-quality development of the live video industry. A streamer activity can be characterized by the temporal composition of a series of actions. To improve the accuracy of streamer temporal action detection, it is a promising path to utilize temporal action localization and a co-attention mechanism to overcome the problem of blurred action boundaries. Therefore, a streamer temporal action detection method by co-attention boundary matching in live video is proposed. (1) The global spatiotemporal features and action template features of live video are extracted by a two-stream convolutional network and an action spatiotemporal attention network, respectively. (2) Probability sequences are generated from the global spatiotemporal features through temporal action evaluation, and boundary matching confidence maps are produced by confidence evaluation of the global spatiotemporal features and action template features under the co-attention mechanism. (3) The streamer temporal actions are detected based on the action proposals generated from the probability sequences and boundary matching maps. We establish a real-world streamer action dataset, BJUT-SAD, and conduct extensive experiments to verify that our method can boost the accuracy of streamer temporal action detection in live video. In particular, our temporal action proposal generation and streamer action detection results are competitive with prior methods, demonstrating the effectiveness of our method.
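The boundary-matching step can be illustrated by combining per-snippet start/end probabilities with a (duration, start) confidence map to score candidate proposals. The co-attention machinery is omitted; the function below is an assumption-level simplification.

```python
import torch

def boundary_matching_proposals(start_prob, end_prob, conf_map, top_k=5):
    """start_prob, end_prob: (T,) per-snippet boundary probabilities.
    conf_map: (D, T) confidence for proposals starting at snippet t with duration d+1.
    Returns the top_k (start, end, score) candidate temporal proposals."""
    T = start_prob.numel()
    proposals = []
    for d in range(conf_map.shape[0]):
        for s in range(T - d - 1):
            e = s + d + 1
            score = start_prob[s] * end_prob[e] * conf_map[d, s]   # boundary x inside confidence
            proposals.append((s, e, score.item()))
    return sorted(proposals, key=lambda p: -p[2])[:top_k]

T, D = 100, 16
print(boundary_matching_proposals(torch.rand(T), torch.rand(T), torch.rand(D, T)))
```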
Keyword :
Boundary matching; Co-attention; Streamer; Live video; Temporal action detection
Cite:
GB/T 7714 | Li, Chenhao , He, Chen , Zhang, Hui et al. Streamer temporal action detection in live video by co-attention boundary matching [J]. | INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS , 2022 , 13 (10) : 3071-3088 . |
MLA | Li, Chenhao et al. "Streamer temporal action detection in live video by co-attention boundary matching" . | INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS 13 . 10 (2022) : 3071-3088 . |
APA | Li, Chenhao , He, Chen , Zhang, Hui , Yao, Jiacheng , Zhang, Jing , Zhuo, Li . Streamer temporal action detection in live video by co-attention boundary matching . | INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS , 2022 , 13 (10) , 3071-3088 . |
Abstract :
The present invention provides a method for removing compression artifacts from live-streaming video. The method includes: obtaining a compressed video; feeding the compressed video into a compression-artifact removal model to obtain a high-quality restored video corresponding to the compressed video; wherein the compression-artifact removal model uses a recurrent neural network (RNN) and dilated convolution to restore the compressed video, whose compression bitrate is unknown and which contains compression artifacts, so as to generate the high-quality restored video. The invention can restore compressed video with a single network model under unknown compression bitrates, thereby providing high-quality live-streaming video.
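A toy sketch of the recurrent, dilated-convolution restoration idea described in the abstract: a hidden state carried across frames plus dilated convolutions with a growing receptive field produce a residual correction per frame. This is purely illustrative and is not the patented network.

```python
import torch
import torch.nn as nn

class RecurrentArtifactRemover(nn.Module):
    """Toy frame-recurrent restoration block: dilated convolutions enlarge the
    receptive field over blocky compression artifacts, and a hidden state carried
    across frames exploits temporal redundancy (illustrative only)."""

    def __init__(self, channels: int = 32):
        super().__init__()
        self.channels = channels
        self.body = nn.Sequential(
            nn.Conv2d(3 + channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=4, dilation=4), nn.ReLU(inplace=True),
        )
        self.to_rgb = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, frames):                                   # frames: (B, T, 3, H, W)
        b, t, _, h, w = frames.shape
        hidden = frames.new_zeros(b, self.channels, h, w)
        outputs = []
        for i in range(t):
            hidden = self.body(torch.cat([frames[:, i], hidden], dim=1))
            outputs.append(frames[:, i] + self.to_rgb(hidden))   # residual restoration per frame
        return torch.stack(outputs, dim=1)

restored = RecurrentArtifactRemover()(torch.rand(1, 4, 3, 64, 64))
print(restored.shape)  # torch.Size([1, 4, 3, 64, 64])
```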
Cite:
GB/T 7714 | 李嘉锋 , 高宇麒 , 张菁 et al. Method and apparatus for removing compression artifacts from live-streaming video (一种网络直播视频的压缩伪影去除方法及装置) : CN202110649651.3[P]. | 2021-06-10 . |
MLA | 李嘉锋 et al. "Method and apparatus for removing compression artifacts from live-streaming video" : CN202110649651.3. | 2021-06-10 . |
APA | 李嘉锋 , 高宇麒 , 张菁 , 郜征 , 徐晗 . Method and apparatus for removing compression artifacts from live-streaming video : CN202110649651.3. | 2021-06-10 . |
Abstract :
Live video hosted by streamers is being sought after by more and more Internet users. A few streamers insert inappropriate actions into otherwise normal live video content for profit and popularity, which greatly harms the network environment. To effectively regulate streamer behavior in live video, a streamer action recognition method with spatial-temporal attention and deep dictionary learning is proposed in this paper. First, deep features with spatial context are extracted by a spatial attention network to focus on the streamer's action region after sampling frames from the live video. Then, the deep features of the video are fused by assigning weights with a temporal attention network to learn the frame attention of an action. Finally, deep dictionary learning is used to sparsely represent the deep features to further recognize streamer actions. Four experiments are conducted on a real-world dataset, and the competitive results demonstrate that our method can improve the accuracy and speed of streamer action recognition in live video.
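The deep dictionary learning step amounts to sparse-coding deep features over a dictionary; a generic ISTA solver is sketched below as a stand-in (the paper's dictionary learning procedure is not reproduced; names and settings are assumptions).

```python
import torch

def ista_sparse_code(features, dictionary, lam=0.1, steps=50):
    """Sparse-code deep features over a dictionary with ISTA.
    features: (N, D), dictionary: (K, D) with roughly unit-norm atoms.
    Returns sparse codes (N, K) minimizing ||c @ D - x||^2 + lam * ||c||_1."""
    L = (torch.linalg.matrix_norm(dictionary, ord=2) ** 2).item()  # Lipschitz constant of the gradient
    codes = torch.zeros(features.shape[0], dictionary.shape[0])
    for _ in range(steps):
        resid = codes @ dictionary - features
        grad = resid @ dictionary.t()
        codes = torch.nn.functional.softshrink(codes - grad / L, lam / L)  # gradient step + soft threshold
    return codes

D = torch.nn.functional.normalize(torch.randn(128, 512), dim=1)   # 128 atoms, 512-d deep features
codes = ista_sparse_code(torch.randn(10, 512), D)
print(codes.shape, (codes != 0).float().mean().item())            # shape and sparsity of the codes
```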
Keyword :
Streamer; Action recognition; Live video; Spatial-temporal attention; Deep dictionary learning
Cite:
GB/T 7714 | Li, Chenhao , Zhang, Jing , Yao, Jiacheng . Streamer action recognition in live video with spatial-temporal attention and deep dictionary learning [J]. | NEUROCOMPUTING , 2021 , 453 : 383-392 . |
MLA | Li, Chenhao et al. "Streamer action recognition in live video with spatial-temporal attention and deep dictionary learning" . | NEUROCOMPUTING 453 (2021) : 383-392 . |
APA | Li, Chenhao , Zhang, Jing , Yao, Jiacheng . Streamer action recognition in live video with spatial-temporal attention and deep dictionary learning . | NEUROCOMPUTING , 2021 , 453 , 383-392 . |