Team introduction
Computer architecture, Mobile storage and AI storage
Our research group currently has six doctoral students and fourteen postgraduate students.
We warmly welcome students who share our research interests and enjoy computer technology to join the team, and we welcome colleagues to get in touch and discuss technology with us.
Supervisor introduction
Zhipeng TAN (谭支鹏), Ph.D., is a professor and doctoral supervisor, and Assistant Director of the Laboratory of Information Storage and Optical Display Function at the Wuhan National Research Center for Optoelectronics. He received his doctorate in Computer System Architecture from Huazhong University of Science and Technology. His research interests include computer system architecture, big data storage and management, and mobile storage. He has carried out long-term, systematic research in two fields, computer software and theory, and information storage technology within computer system architecture, has achieved a number of research results, and has strong software development skills.
As a core project member, he completed the 863 Program major special project "Research and Development of a Domestic General-Purpose Database Management System" (concluded, with good economic returns), the National 973 Program project "Organization Models and Core Technologies of Information Storage for the Next-Generation Internet" (concluded), and the National 973 Program project "Storage Theory and Key Technologies for Complex Application Environments" (concluded). As principal investigator, he has led about 30 projects, including key national defense pre-research projects, national defense pre-research fund projects, electronics fund projects, and cooperation projects with major enterprises such as IBM, Huawei, Honor, OPPO, CSIC, and Hikvision, with total cooperation funding of more than ten million yuan. As a main member, he also took part in the industrialization of storage backup and disk array projects, which achieved good economic returns.
His current research focuses on distributed parallel file systems, mobile storage, and big data storage management. Working on system architecture, data organization, and the coding and performance optimization of distributed storage systems, he aims to provide users with stable, reliable, secure, and efficient storage services.
He has published more than 50 research papers in leading journals and conferences, including TCAD, FGCS, JPDC, ICPP, ICCD, DATE, and ICA3PP. He has applied for 30 invention patents, more than 20 of which have been granted. He has won two first prizes of the Hubei Province Science and Technology Progress Award, one second prize of the National Defense Science and Technology Progress Award, and one second prize of the Chinese Institute of Electronics Science and Technology Progress Award. Graduate and undergraduate students under his supervision have won the Hubei Province Excellent Dissertation Award on multiple occasions.
Main research subjects:
1. In the era of big data, data centers face diverse storage requirements, complex data types, and rapidly evolving storage devices, all of which place higher demands on the construction of distributed storage systems and system software. We have obtained the following results on improving storage system performance:
(1) We proposed a method that adapts the criteria used to identify critical data. The method predicts and fits big data I/O access latency and builds a knowledge base for critical data identification, which addresses the low accuracy of existing identification approaches. The accurately identified critical data is then placed on high-performance storage devices, which markedly improves the data access performance of the storage system. Compared with two state-of-the-art approaches, heat-based identification and cost-based identification, the method reduces I/O response time by 10.3% to 45.6% and by 16.3% to 25.1%, respectively.
(2) We proposed a multi-queue I/O scheduling algorithm for large-scale storage systems. It builds multiple priority queues and schedules synchronous I/Os of the same priority across those queues, which shortens the interval between synchronous I/O operations. It can also adjust application priorities dynamically. Together these reduce the average response time of large-scale storage systems. Compared with existing methods, the average completion time falls by more than 30% in a scenario with 64 concurrent write applications and by more than 50% in a scenario with 64 concurrent read applications. The larger the system and the more concurrent processes it runs, the greater the reduction.
(3) We proposed a dynamic data distribution strategy based on load prediction. The strategy takes into account the CPU load, memory usage, disk capacity, I/O load, network bandwidth, and other factors of the storage system, and applies machine learning to predict load and distribute data dynamically. It makes efficient use of the service capacity of both high-performance and low-performance devices, so that the storage system can sustain stable, high-performance storage service. Compared with HDFS, the strategy increases system throughput by up to 70.6% and reduces response time by up to 30.4% across different scenarios.
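The multi-priority-queue idea in item (2) can be illustrated with a minimal sketch. It is not the group's algorithm: the `IORequest` and `MultiQueueScheduler` names, the fixed number of levels, and the strict highest-priority-first dispatch rule are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class IORequest:
    # Hypothetical request record: issuing application and byte offset.
    app: str
    offset: int


class MultiQueueScheduler:
    """Illustrative dispatcher with one FIFO queue per priority level."""

    def __init__(self, levels=3):
        self.queues = [deque() for _ in range(levels)]
        self.priority = {}  # app name -> level, 0 is the highest priority

    def set_priority(self, app, level):
        # Priorities can be changed at any time; the new level applies
        # to requests submitted after the change.
        self.priority[app] = max(0, min(level, len(self.queues) - 1))

    def submit(self, req):
        # Applications without an explicit priority go to the lowest level.
        level = self.priority.get(req.app, len(self.queues) - 1)
        self.queues[level].append(req)

    def dispatch(self):
        # Serve the highest-priority non-empty queue, FIFO within a level.
        for queue in self.queues:
            if queue:
                return queue.popleft()
        return None


scheduler = MultiQueueScheduler(levels=3)
scheduler.set_priority("db", 0)
scheduler.submit(IORequest("backup", 0))
scheduler.submit(IORequest("db", 4096))
print(scheduler.dispatch())  # the "db" request is served before "backup"
```

A strict-priority rule like this can starve low-priority applications, which is one reason a real scheduler would adjust priorities dynamically, as the description above notes.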
2. In the era of big data, rapidly growing data volumes lead to ever larger storage systems, and the probability of failure in a distributed storage system rises as the number of disks increases. To avoid data loss caused by storage device failures, we combine AI techniques with the storage system and have obtained the following results on improving storage system reliability:
(1) We proposed, for the first time, a disk failure prediction method based on multidimensional attributes. In addition to the traditional disk SMART information, the method models features that correlate strongly with failures, such as system events, firmware versions, and blue-screen logs. This significantly improves the recall and accuracy of disk failure prediction and lowers the false positive rate. Compared with existing methods and the open-source system Ceph-prophetstor, the method improves recall by 9% to 188.87% and accuracy by 12% to 35.55%, with a false positive rate below 0.17%.
(2) We proposed a method for managing the health status of disks in a storage system. The characteristics of a disk change over time, most commonly its access speed, and we describe this as the disk being in different health states. The method collects I/O status information and disk I/O latency at the block device layer in real time, quantitatively analyzes the life cycle of the disk, and divides it into three states: healthy, sub-healthy, and failed. A Markov model is then built to monitor changes in disk health status and manage it in real time.
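As a rough illustration of the three-state Markov model in item (2), the sketch below estimates a transition matrix from an observed sequence of health states and propagates a state distribution forward. It is a sketch under stated assumptions, not the group's method: the function names, the choice to estimate probabilities by simple counting, and the rule that an unobserved state stays where it is are all hypothetical.

```python
STATES = ["healthy", "sub-healthy", "failed"]


def estimate_transitions(sequence):
    """Estimate a row-stochastic transition matrix by counting transitions."""
    index = {state: i for i, state in enumerate(STATES)}
    n = len(STATES)
    counts = [[0] * n for _ in range(n)]
    for current, following in zip(sequence, sequence[1:]):
        counts[index[current]][index[following]] += 1
    matrix = []
    for i, row in enumerate(counts):
        total = sum(row)
        if total == 0:
            # Assumption: a state with no observed exits stays where it is.
            matrix.append([1.0 if j == i else 0.0 for j in range(n)])
        else:
            matrix.append([count / total for count in row])
    return matrix


def step(distribution, matrix):
    """Advance a probability distribution over STATES by one transition."""
    n = len(distribution)
    return [sum(distribution[i] * matrix[i][j] for i in range(n))
            for j in range(n)]


def failure_probability(start, matrix, steps):
    """Probability of being in the failed state after `steps` transitions."""
    distribution = [1.0 if state == start else 0.0 for state in STATES]
    for _ in range(steps):
        distribution = step(distribution, matrix)
    return distribution[STATES.index("failed")]


# Hypothetical per-interval health labels for one disk.
observed = ["healthy", "healthy", "sub-healthy",
            "healthy", "sub-healthy", "failed"]
model = estimate_transitions(observed)
print(failure_probability("healthy", model, steps=2))
```

In practice the state labels would come from thresholds on the collected I/O latency data, and the matrix would be estimated from many disks rather than one short sequence.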
3. Terminal devices such as mobile phones now widely use flash memory as their storage medium. Flash memory offers high performance and low power consumption, but it also has drawbacks such as a limited number of erase cycles and write amplification caused by out-of-place updates, so its performance degrades over time. To address the noticeable lag that mobile phones develop after a period of use, we proposed optimization methods for flash management that improve the performance of terminal devices:
(1) We proposed an adaptive method for space reservation and fragmentation management. To address storage fragmentation and garbage collection on mobile devices, the method applies AI algorithms to select adaptively which files receive reserved space, and adjusts the amount of reserved space dynamically so that an optimal size is reserved. This noticeably delays the lag that mobile devices develop after long-term use. Compared with existing methods, performance improves by 94.28%.
(2) Because warm data makes up a large share of the data on mobile devices, we proposed a fine-grained method for managing warm data and segment cleaning. The method uses the K-means algorithm to dynamically identify the hotness of warm data in the mobile storage system and implements multi-log delayed writes based on fine-grained warm data. This reduces the overhead and energy cost of segment cleaning, keeps mobile devices in a healthy state for longer, and slows their performance degradation. Compared with existing methods, the number of blocks migrated during segment cleaning is reduced by 64.77% to 99.99%, system bandwidth increases by 4.78% to 145.97% at 90% space utilization, and CPU utilization is 27.19% lower on average.
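To make the K-means step in item (2) concrete, here is a minimal one-dimensional sketch that groups blocks into cold, warm, and hot classes. It is only an illustration under assumptions: the use of a per-block update count as the single hotness feature, the quantile-based initialization, and the function names are hypothetical and are not taken from the group's work.

```python
def kmeans_1d(values, k=3, iterations=20):
    """Cluster scalar hotness values into k groups; returns sorted centroids.

    Requires k >= 2 and at least one value.
    """
    ordered = sorted(values)
    # Deterministic initialization: evenly spaced picks from the sorted data.
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for value in values:
            nearest = min(range(k), key=lambda c: abs(value - centroids[c]))
            clusters[nearest].append(value)
        # Recompute each centroid as its cluster mean; keep empty ones as-is.
        updated = [sum(cluster) / len(cluster) if cluster else centroids[i]
                   for i, cluster in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


def classify(value, centroids, labels=("cold", "warm", "hot")):
    """Label a value by its nearest centroid (centroids sorted ascending)."""
    nearest = min(range(len(centroids)),
                  key=lambda c: abs(value - centroids[c]))
    return labels[nearest]


# Hypothetical update counts per block over an observation window.
update_counts = [1, 2, 1, 3, 40, 45, 50, 95, 100, 98]
centroids = kmeans_1d(update_counts, k=3)
print(centroids, classify(42, centroids))
```

Blocks with the same label could then be written to the same log, so that a segment tends to hold data that is invalidated at around the same time, which is the usual motivation for hotness-aware multi-log writes.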