American Journal of Information Science and Technology

| Peer-Reviewed |

Training Data Improvement by Automatic Generation of Semantic Networks for Bias Mitigation

Received: Feb. 20, 2022    Accepted: Mar. 16, 2022    Published: Mar. 29, 2022

Abstract

Bias detection has grown considerably in importance with the increasing application of AI. Although syntactic bias is well explored with statistical techniques, semantic bias remains a challenge; one example is Google’s face recognition, which excludes people of color. Detecting semantic bias requires human expertise, e.g., when applying the root-out-bias method. We propose to further automate this laborious method with Training Data Improvement for Bias Mitigation (TDIBM). The concept is to automatically construct a Semantic Network (SN) from the domain description of the training data. First, nouns are extracted from the description for the SN. Second, synonyms and semantically similar nouns are looked up, e.g., in dictionaries, and added to the SN. As a result, the SN contains nouns that enhance the given domain with previously unknown knowledge. The SN can then be used, e.g., with the root-out-bias method, to check whether the training sample is biased. If the training sample is biased, the corresponding nouns from the SN can be added to the training sample set to mitigate the bias. TDIBM is evaluated in two ways. First, it is applied to the description of the COMPAS system, a case management and decision support tool used by U.S. courts to assess the likelihood of a defendant becoming a recidivist. Second, it is applied to an autonomous driving domain to investigate accidents involving a Tesla car. Here, TDIBM detected many new features, including one that resolves ambiguous scene interpretations for autonomous vehicles.
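
The two construction steps and the subsequent check described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the noun lexicon and the similar-noun table are hypothetical stand-ins for a part-of-speech tagger and a dictionary lookup, and the function names are assumptions chosen for illustration.

```python
import re

# Hypothetical stand-ins for a part-of-speech tagger and a dictionary lookup;
# both tables are invented for illustration and are not taken from the paper.
NOUN_LEXICON = {"defendant", "court", "age", "offense"}
SIMILAR_NOUNS = {
    "defendant": {"accused", "offender"},
    "offense": {"crime", "felony"},
    "court": {"tribunal"},
}


def extract_nouns(description, lexicon=NOUN_LEXICON):
    """Step 1: keep the words of the domain description that are known nouns."""
    words = re.findall(r"[a-z]+", description.lower())
    return {word for word in words if word in lexicon}


def build_semantic_network(nouns, similar=SIMILAR_NOUNS):
    """Step 2: link each extracted noun to its synonyms and similar nouns."""
    return {noun: set(similar.get(noun, ())) for noun in nouns}


def uncovered_nouns(network, training_features):
    """Return the network nodes that the training sample does not mention."""
    nodes = set(network)
    for neighbours in network.values():
        nodes |= neighbours
    return nodes - set(training_features)


description = "The court assesses the age and prior offense of a defendant."
network = build_semantic_network(extract_nouns(description))
gaps = uncovered_nouns(network, {"age", "offense", "defendant"})
print(sorted(gaps))
```

In this toy setting, the nouns returned by `uncovered_nouns` are candidates that a human reviewer, or a procedure such as the root-out-bias method, could consider adding to the training sample set.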

DOI 10.11648/j.ajist.20220601.11
Published in American Journal of Information Science and Technology ( Volume 6, Issue 1, March 2022 )
Page(s) 1-7
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2024. Published by Science Publishing Group

Keywords

Semantic Bias Detection, Bias Mitigation, Semantic Networks, Semantic Similar Words, AI, Bias, Bias Detection, Training Sample

Cite This Article
  • APA Style

    Roman Englert, Jörg Muschiol. (2022). Training Data Improvement by Automatic Generation of Semantic Networks for Bias Mitigation. American Journal of Information Science and Technology, 6(1), 1-7. https://doi.org/10.11648/j.ajist.20220601.11


    ACS Style

    Roman Englert; Jörg Muschiol. Training Data Improvement by Automatic Generation of Semantic Networks for Bias Mitigation. Am. J. Inf. Sci. Technol. 2022, 6(1), 1-7. doi: 10.11648/j.ajist.20220601.11


    AMA Style

    Roman Englert, Jörg Muschiol. Training Data Improvement by Automatic Generation of Semantic Networks for Bias Mitigation. Am J Inf Sci Technol. 2022;6(1):1-7. doi: 10.11648/j.ajist.20220601.11


  • @article{10.11648/j.ajist.20220601.11,
      author = {Roman Englert and Jörg Muschiol},
      title = {Training Data Improvement by Automatic Generation of Semantic Networks for Bias Mitigation},
      journal = {American Journal of Information Science and Technology},
      volume = {6},
      number = {1},
      pages = {1-7},
      doi = {10.11648/j.ajist.20220601.11},
      url = {https://doi.org/10.11648/j.ajist.20220601.11},
      eprint = {https://download.sciencepg.com/pdf/10.11648.j.ajist.20220601.11},
     year = {2022}
    }
    


  • TY  - JOUR
    T1  - Training Data Improvement by Automatic Generation of Semantic Networks for Bias Mitigation
    AU  - Roman Englert
    AU  - Jörg Muschiol
    Y1  - 2022/03/29
    PY  - 2022
    N1  - https://doi.org/10.11648/j.ajist.20220601.11
    DO  - 10.11648/j.ajist.20220601.11
    T2  - American Journal of Information Science and Technology
    JF  - American Journal of Information Science and Technology
    JO  - American Journal of Information Science and Technology
    SP  - 1
    EP  - 7
    PB  - Science Publishing Group
    SN  - 2640-0588
    UR  - https://doi.org/10.11648/j.ajist.20220601.11
    VL  - 6
    IS  - 1
    ER  - 


Author Information
  • Computer Science, FOM University of Applied Sciences, Essen, Germany; New Media and Information Systems, Faculty III, Siegen University, Siegen, Germany

  • Computer Science, FOM University of Applied Sciences, Essen, Germany
