Human motion recognition based on a key-frame two-stream convolutional network

Citation: ZHANG Congcong, HE Ning. Human motion recognition based on a key-frame two-stream convolutional network [J]. Journal of Nanjing Institute of Meteorology, 2019, 11(6): 716-721.
Authors: ZHANG Congcong, HE Ning
Institution: Robotics College, Beijing Union University, Beijing 100101; Smart City College, Beijing Union University, Beijing 100101
Funding: National Natural Science Foundation of China (61872042, 61572077); Joint Key Project of the Beijing Natural Science Foundation and the Beijing Municipal Education Commission (KZ201911417048)
Received: 7 October 2019

Abstract: To address the high information redundancy and low accuracy of human motion recognition in video sequences, a human motion recognition method based on a key-frame two-stream convolutional network is proposed. The network framework consists of three modules: feature extraction, key-frame extraction, and spatial-temporal feature fusion. First, a single RGB frame from the spatial stream and a stack of multi-frame optical-flow images from the temporal stream are fed into a VGG16 model to extract deep features of the video. Second, key frames are extracted: the importance of each video frame is continuously predicted, frames carrying sufficient information are pooled and used to train the network, and redundant frames are discarded. Finally, the Softmax outputs of the two models are combined by weighted fusion, yielding a multi-model human motion recognizer that performs key-frame processing of the video and makes full use of the spatial-temporal information of the action. Experimental results on the public UCF-101 dataset show that, compared with mainstream human motion recognition methods, the proposed method achieves a higher recognition rate while reducing network complexity.
Keywords: key frame; two-stream network; action recognition; feature extraction; feature fusion
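The two computational steps the abstract names — selecting key frames by predicted per-frame importance, and weighted fusion of the two streams' Softmax outputs — can be sketched as below. This is a minimal NumPy illustration, not the authors' implementation: the importance scores, the keep ratio, and the fusion weight are hypothetical stand-ins (in the paper the scores and class logits come from the trained VGG16 streams).

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over class logits."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def select_key_frames(importance, keep_ratio=0.25):
    """Keep the most informative frames; discard redundant ones.

    `importance` holds one predicted score per frame. In the paper these
    scores come from a learned predictor; here they are given directly.
    """
    n_keep = max(1, int(len(importance) * keep_ratio))
    idx = np.argsort(importance)[::-1][:n_keep]  # highest scores first
    return np.sort(idx)                          # restore temporal order

def fuse_two_streams(spatial_logits, temporal_logits, w_spatial=0.4):
    """Weighted fusion of the spatial and temporal Softmax outputs."""
    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    return w_spatial * p_spatial + (1.0 - w_spatial) * p_temporal

# Toy example: 8 frames, 101 action classes (as in UCF-101).
rng = np.random.default_rng(0)
importance = rng.random(8)
key_frames = select_key_frames(importance, keep_ratio=0.25)

spatial = rng.normal(size=101)   # logits from the RGB (spatial) stream
temporal = rng.normal(size=101)  # logits from the optical-flow stream
probs = fuse_two_streams(spatial, temporal)
predicted_class = int(np.argmax(probs))
```

The fusion weight 0.4 is an illustrative value; in practice the spatial/temporal weighting is chosen empirically, and the temporal stream is often weighted more heavily.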