Thesis Information

Title (Chinese):

Research on Dynamic Formation and Path Planning of UAVs with Reinforcement Learning

Name:

Tang Heng

Student ID:

 20131213274    

Confidentiality Level:

Public

Language:

Chinese

Discipline Code:

 081102    

Discipline:

Engineering - Control Science and Engineering - Detection Technology and Automatic Equipment

Student Type:

Master's

Degree:

Master of Engineering

University:

Xidian University

School:

School of Aerospace Science and Technology

Major:

Control Science and Engineering

Research Direction:

Reinforcement learning; UAV path planning

First Supervisor:

Sun Wei

First Supervisor's Affiliation:

Xidian University

Completion Date:

2023-06-16

Defense Date:

2023-05-23

Title (English):

Research on Dynamic Formation and Path Planning of UAVs with Reinforcement Learning

Keywords (Chinese):

Dynamic environment; reinforcement learning; reward function; UAV; dynamic formation; path planning

Keywords (English):

dynamic environment; reinforcement learning; reward function; UAV; dynamic formation; path planning

Abstract (Chinese):

Since the beginning of the 21st century, the application scenarios of UAVs have expanded steadily, and a single UAV can hardly meet increasingly complex mission requirements. Researchers have therefore turned to UAV formations, in which multiple UAVs complement one another's functions and combine their capabilities, giving them a distinct advantage. For both a single UAV and a UAV formation, path planning is an important research direction. Single-UAV path planning focuses on the safety and efficiency of the path, while formation path planning must additionally address formation control. Traditional path planning and formation control methods cannot respond promptly to sudden threats in unknown environments, and they struggle to satisfy formation structural stability and formation adjustment autonomy at the same time. Reinforcement learning is particularly well suited to agent decision-making: it does not rely on prior knowledge of the environment and can handle sudden threats in dynamic environments. The reward function in reinforcement learning is crucial for improving the convergence speed and robustness of the algorithm. This thesis studies UAV path planning in a variety of scenarios and designs and improves reward functions suited to different scenarios, aiming to enhance the intelligent decision-making and cooperative control capabilities of UAVs. The main contributions are as follows:

1. For single-UAV path planning in unknown dynamic environments, this thesis proposes an intelligent UAV decision-making scheme based on the TD3-IADRF algorithm. First, for the obstacle-free environment, a sparse reward function that terminates the current training episode and a guiding reward function that accelerates convergence are designed. Then the obstacle-avoidance requirements of a UAV in a compound-obstacle environment are analyzed in depth: the relative positions of the UAV, the obstacles, and the destination are discussed case by case to derive the UAV's optimal flight angle, and the flight-angle reward function is improved on this basis. Meanwhile, because a UAV may temporarily move away from the destination while avoiding obstacles, the distance-to-destination reward function is also improved, and the TD3-IADRF algorithm is proposed on this basis. Finally, comparative experiments show that, in the compound-obstacle environment, the improved reward functions raise the success rate of the algorithm by 25.7% and the average converged reward by 78.2%.

2. For UAV formation path planning in unknown dynamic environments, this thesis proposes an intelligent formation decision-making scheme based on the MATD3-IDFRF algorithm. First, the sparse reward function is extended to the obstacle-free environment. Then the dynamic formation problem at the core of formation path planning is analyzed in depth: the formation should fly with a stable structure while fine-tuning its shape according to the surrounding environment. In essence, the distance between every pair of UAVs should remain relatively stable while being slightly adjusted in response to the external environment. Accordingly, a reward function based on the optimal and current spacing between every pair of UAVs is designed, a dynamic formation reward function is proposed on this basis, and it is combined with the MATD3 algorithm to form the MATD3-IDFRF algorithm. Finally, comparative experiments show that, in the compound-obstacle environment, the dynamic formation reward function raises the success rate of the algorithm by 6.8%, raises the average converged reward by 2.3%, and reduces the formation deformation rate by 97%.

Abstract (English):

Since the beginning of the 21st century, the application scenarios of UAVs have gradually expanded, and it is difficult for a single UAV to meet increasingly complex mission requirements. Scholars have therefore turned their attention to UAV formations, in which multiple UAVs can achieve complementary functions and combined capabilities, giving them a unique advantage. Whether for a single UAV or a UAV formation, path planning is an important research direction. Single-UAV path planning focuses on the safety and efficiency of the path, while the path planning of a UAV formation also needs to address formation control. Traditional path planning and formation control methods cannot respond to unexpected threats in unknown environments in a timely manner, and they can hardly satisfy both formation structure stability and formation adjustment autonomy. Reinforcement learning is particularly well suited to agent decision-making: it does not depend on prior knowledge of the environment and can cope with sudden threats in dynamic environments. The reward function of reinforcement learning is crucial for improving the convergence speed and robustness of the algorithm. This thesis conducts an in-depth study of UAV path planning in various scenarios, designing and improving reward functions suited to different scenarios in order to enhance the intelligent decision-making and cooperative control capabilities of UAVs. The main contributions are as follows:

 

1. For the single-UAV path planning problem in unknown dynamic environments, this thesis proposes an intelligent UAV decision-making scheme based on the TD3-IADRF algorithm. First, for the obstacle-free environment, a sparse reward function that ends the current training episode and a guiding reward function that accelerates convergence are designed. Then the obstacle-avoidance requirements of the UAV in a compound-obstacle environment are analyzed in depth: the relative positions of the UAV, the obstacles, and the destination are classified and discussed to determine the UAV's optimal flight angle, and the flight-angle reward function is improved on this basis. Meanwhile, because the UAV may temporarily move away from the destination while avoiding obstacles, the distance-to-destination reward function is also improved, and the TD3-IADRF algorithm is proposed on this basis. Finally, comparative experiments show that, in the compound-obstacle environment, the improved reward functions raise the success rate of the algorithm by 25.7% and the average converged reward by 78.2%.
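As an illustration of this kind of reward shaping, the sketch below combines a sparse terminal reward with a dense guiding term for a single UAV in a 2D plane. It is a minimal Python example under assumed names and parameters (shaped_reward, the 100-point terminal bonus and penalty, the 0.5 back-off weight, and the two radii are all illustrative), not the exact TD3-IADRF reward functions of the thesis.

import numpy as np

def shaped_reward(pos, goal, obstacles, prev_dist, goal_radius=1.0, safe_radius=2.0):
    # Sparse terminal terms: end the episode with a large bonus at the
    # destination or a large penalty on collision with any obstacle.
    dist = float(np.linalg.norm(goal - pos))
    if dist < goal_radius:
        return 100.0, True
    if any(np.linalg.norm(obs - pos) < safe_radius for obs in obstacles):
        return -100.0, True
    # Dense guiding term: reward progress toward the destination, and only
    # mildly penalize steps where obstacle avoidance briefly moves the UAV
    # away from it, mirroring the idea behind the improved distance reward.
    progress = prev_dist - dist
    step_reward = progress if progress >= 0.0 else 0.5 * progress
    return step_reward, False

# Hypothetical usage for one time step.
pos, goal = np.array([3.0, 4.0]), np.array([10.0, 10.0])
obstacles = [np.array([5.0, 5.0])]
reward, done = shaped_reward(pos, goal, obstacles, prev_dist=9.5)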

 

2. For the UAV formation path planning problem in unknown dynamic environments, this thesis proposes an intelligent decision-making scheme for UAV formations based on the MATD3-IDFRF algorithm. First, the sparse reward function is extended to the obstacle-free environment. Then the dynamic formation problem at the core of formation path planning is analyzed in depth: the formation should fly with a stable structure while fine-tuning its shape according to the surrounding environment. In essence, the spacing between every pair of UAVs should remain relatively stable while being slightly adjusted in response to the external environment. Accordingly, a reward function based on the optimal and current spacing between every pair of UAVs is designed, a dynamic formation reward function is proposed on this basis, and it is combined with the MATD3 algorithm to form the MATD3-IDFRF algorithm. Finally, comparative experiments show that, in the compound-obstacle environment, the dynamic formation reward function raises the success rate of the algorithm by 6.8%, raises the average converged reward by 2.3%, and reduces the formation deformation rate by 97%.
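The pairwise-spacing idea behind the dynamic formation reward can be sketched in the same way. The Python snippet below penalizes deviations of every pairwise UAV distance from its desired value while tolerating small, temporary adjustments; the function name formation_reward, the quadratic penalty, and the tolerance parameter are assumptions for illustration, not the exact MATD3-IDFRF reward.

import numpy as np
from itertools import combinations

def formation_reward(positions, desired_spacing, tolerance=0.5, scale=1.0):
    # Keep the distance between every pair of UAVs close to its desired value,
    # but leave a tolerance band so the formation can flex around obstacles.
    reward = 0.0
    for i, j in combinations(range(len(positions)), 2):
        current = float(np.linalg.norm(positions[i] - positions[j]))
        deviation = abs(current - desired_spacing[(i, j)])
        if deviation > tolerance:
            reward -= scale * (deviation - tolerance) ** 2
    return reward

# Hypothetical usage: three UAVs near an equilateral triangle with side 5.
positions = [np.array([0.0, 0.0]), np.array([5.0, 0.0]), np.array([2.5, 4.33])]
desired = {(0, 1): 5.0, (0, 2): 5.0, (1, 2): 5.0}
reward = formation_reward(positions, desired)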

CLC Number:

V27

Call Number:

57038

Open Access Date:

2023-12-23
