Improved structural repairing algorithm based on regular expression
Author:
Affiliation:

1. School of Electronics and Information Engineering, Liaoning Technical University, Huludao 125105, China; 2. China Petroleum Liaohe Equipment Company, Panjin 124010, China

CLC number:

TP312.2; TN911.72

Fund project:

Supported by the Natural Science Foundation of Liaoning Province (2015020098) and the Doctoral Start-up Foundation of Liaoning Technical University (20151147)

    Abstract:

    To address the cleaning of structured data, an improved algorithm is proposed on the basis of the regular-expression-based structural repair (RSR) algorithm, drawing on the way the edit distance between strings is computed. First, the edges that violate the partial order relation are extracted from the edge set of the nondeterministic finite automaton. A priority queue is then used to revise the edit distance only for these extracted edges, while the remaining edges, which satisfy the partial order relation, are computed directly by a recurrence instead of the more expensive priority queue. Tests and analysis show that the improved RSR algorithm has a clear advantage in time complexity, and that its improvement over the original algorithm is significant and stable.
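The abstract describes computing an edit distance between a string and an automaton built from a regular expression. The following Python sketch illustrates only that general idea and is not the authors' RSR or improved RSR procedure: it runs a shortest-path search with a priority queue over pairs (input position, automaton state), so it uses the queue for every edge and does not implement the partial-order split. The function name `edit_distance_to_nfa`, the edge-list encoding, unit edit costs, integer state labels, and the absence of epsilon transitions are all assumptions made for illustration.

```python
import heapq


def edit_distance_to_nfa(s, edges, start, accepting):
    """Fewest single-character edits that turn s into a string the NFA accepts.

    edges: list of (src_state, symbol, dst_state) with integer states and
    no epsilon transitions (an assumption of this sketch).
    Returns None if the NFA accepts no string at all.
    """
    # Adjacency list: state -> [(symbol, next_state), ...]
    out = {}
    for src, sym, dst in edges:
        out.setdefault(src, []).append((sym, dst))

    n = len(s)
    inf = float("inf")
    dist = {(0, start): 0}       # best known cost per (position, state)
    heap = [(0, 0, start)]       # entries are (cost, position, state)

    while heap:
        d, i, q = heapq.heappop(heap)
        if d > dist.get((i, q), inf):
            continue             # stale queue entry
        if i == n and q in accepting:
            return d             # first goal popped is optimal (Dijkstra)

        steps = []
        if i < n:
            steps.append((d + 1, i + 1, q))          # delete s[i]
        for sym, q2 in out.get(q, []):
            steps.append((d + 1, i, q2))             # insert sym
            if i < n:
                cost = 0 if sym == s[i] else 1       # match or substitute
                steps.append((d + cost, i + 1, q2))

        for nd, ni, nq in steps:
            if nd < dist.get((ni, nq), inf):
                dist[(ni, nq)] = nd
                heapq.heappush(heap, (nd, ni, nq))
    return None


# Hypothetical automaton for the regular expression ab*c.
EDGES = [(0, "a", 1), (1, "b", 1), (1, "c", 2)]
ACCEPTING = {2}

if __name__ == "__main__":
    for text in ["abbc", "abxc", "bc", ""]:
        print(repr(text), edit_distance_to_nfa(text, EDGES, 0, ACCEPTING))
```

In this toy automaton the self-loop on state 1 is the kind of edge that prevents a simple left-to-right recurrence over states. The abstract's improvement, as described, would reserve the priority queue for such order-violating edges and handle the others by recurrence; how the partial order is defined in detail is not stated here.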

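Cite this article

CHEN Wanzhi, SONG Jian, WANG Dejian, WANG Xing. Improved structural repairing algorithm based on regular expression[J]. Journal of Electronic Measurement and Instrumentation, 2017, 31(12): 2036-2041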

History
  • Published online: 2018-01-24