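Content Sifting Storage Mechanism for Cross-Modal Image and Text Data Based on Semantic Similarity
Cite this article:Liu Yu, Guo Chan, Feng Shuyao, Zhou Ke, Xiao Zhili. Content Sifting Storage Mechanism for Cross-Modal Image and Text Data Based on Semantic Similarity[J]. Journal of Computer Research and Development, 2021, 58(2): 338-355.
Authors:Liu Yu  Guo Chan  Feng Shuyao  Zhou Ke  Xiao Zhili
Affiliations:Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan 430074; Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan 430074; Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan 430074; Technology and Engineering Group, Shenzhen Tencent Computer Systems Co., Ltd., Shenzhen, Guangdong 518054
Funding:Young Scientists Fund of the National Natural Science Foundation of China; Science Fund for Creative Research Groups of the National Natural Science Foundation of China
Abstract:With the explosive growth of multimedia data, cloud data has become large in scale, with multiple modalities mixed and coexisting. Traditional storage systems that serve data analysis face the challenge of excessively long read latency because they lack semantic management of data. For two modalities, image and text, a cross-modal image and text content sifting storage (CITCSS) mechanism is proposed on top of a traditional storage system. It provides a large-scale online similarity-based content sifting service and, at the storage-system level, relieves the read-bandwidth pressure that arises when data analysis must read all data out of storage. The mechanism is divided into an off-line stage and an on-line stage. In the off-line stage, a self-supervised generative adversarial Hash method is introduced, which the system uses to generate semantic metadata. The metadata is then injected into an independent metadata space. Finally, because the Hamming distance between similarity Hash codes can measure semantic distance, a Hash metadata graph is built with the Neo4j graph database, and a mapping between Hash codes and storage paths is established in the semantic graph. In the on-line stage, the user sends an image or text related to the analysis, and the storage system first converts the data into a Hash code. It then searches the Hash metadata graph for similar nodes within the sifting radius, locates the underlying storage paths of the similar files, and returns the sifted data. Experimental results show that, compared with traditional semantic storage systems, CITCSS reduces read latency by 99.07% to 99.77% in relative terms while achieving a recall rate above 98%.

Keywords:semantic management  Hash code metadata  metadata graph  storage mechanism  read bandwidth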

Content Sifting Storage Mechanism for Cross-Modal Image and Text Data Based on Semantic Similarity
Liu Yu,Guo Chan,Feng Shuyao,Zhou Ke,Xiao Zhili.Content Sifting Storage Mechanism for Cross-Modal Image and Text Data Based on Semantic Similarity[J].Journal of Computer Research and Development,2021,58(2):338-355.
Authors:Liu Yu  Guo Chan  Feng Shuyao  Zhou Ke  Xiao Zhili
Affiliation:(Wuhan National Laboratory for Optoelectronics,Huazhong University of Science and Technology,Wuhan 430074;Technology and Engineering Group,Tencent Inc.,Shenzhen,Guangdong 518054)
Abstract:With the explosive growth of multimedia data, data in the cloud has become large-scale and heterogeneous. Conventional storage systems that serve data analysis face the challenge of long read latency because they lack semantic management of data. To solve this problem, a cross-modal image and text content sifting storage (CITCSS) mechanism is proposed, which saves read bandwidth by reading only relevant data. The mechanism consists of an off-line stage and an on-line stage. In the off-line stage, the system first uses a self-supervised adversarial Hash learning algorithm to learn Hash codes for the stored data, so that semantically similar items receive similar codes. These Hash codes are then connected according to their Hamming distances and managed as metadata. In the implementation, Neo4j is used to construct the semantic Hash code graph, and storage paths are inserted into node properties to accelerate reading. In the on-line stage, the mechanism first maps the image or text that represents the analysis requirement into a Hash code and sends it to the semantic Hash code graph. The relevant data is then found within the sifting radius on the graph and returned to the user. With this mechanism, storage systems can perceive and manage semantic information and thus provide better service for data analysis. Experimental results on public cross-modal datasets show that, compared with conventional semantic storage systems, CITCSS reduces read latency by 99.07% to 99.77% while maintaining a recall rate above 98%.
Keywords:semantic management  Hash code metadata  metadata graph  storage mechanism  read bandwidth
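
The on-line sifting step described in the abstract (map a query to a Hash code, then return the storage paths of all stored items whose codes lie within a sifting radius in Hamming distance) can be illustrated with a minimal sketch. This is not the authors' implementation: it replaces the Neo4j graph with a plain in-memory dictionary and a linear scan, and the class name `HashMetadataIndex`, the 8-bit codes, and the file paths are all hypothetical values chosen for illustration.

```python
from collections import defaultdict


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two Hash codes differ."""
    return bin(a ^ b).count("1")


class HashMetadataIndex:
    """Toy stand-in for the Hash metadata graph (hypothetical, not Neo4j).

    Maps each Hash code to the storage paths of the files that share it.
    """

    def __init__(self) -> None:
        self._paths = defaultdict(list)

    def add(self, code: int, path: str) -> None:
        # Off-line stage: record the code -> storage path mapping.
        self._paths[code].append(path)

    def sift(self, query: int, radius: int) -> list:
        # On-line stage: collect paths whose code is within the radius.
        hits = []
        for code, paths in self._paths.items():
            distance = hamming(query, code)
            if distance <= radius:
                hits.extend((distance, path) for path in paths)
        # Closest codes first; ties are broken by path name.
        return [path for _, path in sorted(hits)]


index = HashMetadataIndex()
index.add(0b10110010, "/data/img/cat_001.jpg")      # assumed example path
index.add(0b10110011, "/data/txt/cat_caption.txt")  # assumed example path
index.add(0b01001101, "/data/img/car_042.jpg")      # assumed example path

result = index.sift(0b10110110, radius=2)
print(result)
```

Only the files returned by `sift` would need to be read from storage, which is the source of the read-bandwidth saving the abstract claims. A graph-based index, as used in the paper, avoids scanning every code by following edges between nearby codes.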
This article is indexed in databases including VIP and Wanfang Data.