# Natural Language Toolkit: Word Sense Disambiguation Algorithms
#
# Authors: Liling Tan <alvations@gmail.com>,
#          Dmitrijs Milajevs <dimazest@gmail.com>
#
# Copyright (C) 2001-2019 NLTK Project
# URL: <http://nltk.org/>
# For license information, see LICENSE.TXT

from nltk.corpus import wordnet


def lesk(context_sentence, ambiguous_word, pos=None, synsets=None):
    """Return a synset for an ambiguous word in a context.

    :param iter context_sentence: The context sentence where the ambiguous word
         occurs, passed as an iterable of words.
    :param str ambiguous_word: The ambiguous word that requires WSD.
    :param str pos: A specified Part-of-Speech (POS).
    :param iter synsets: Possible synsets of the ambiguous word.
    :return: ``lesk_sense`` The Synset() object with the highest signature overlaps.

    This function is an implementation of the original Lesk algorithm (1986) [1].

    Usage example::

        >>> lesk(['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.'], 'bank', 'n')
        Synset('savings_bank.n.02')

    [1] Lesk, Michael. "Automatic sense disambiguation using machine
    readable dictionaries: how to tell a pine cone from an ice cream
    cone." Proceedings of the 5th Annual International Conference on
    Systems Documentation. ACM, 1986.
    http://dl.acm.org/citation.cfm?id=318728
    """

    # Duplicate context tokens count once; matching is exact and case-sensitive.
    context = set(context_sentence)
    if synsets is None:
        synsets = wordnet.synsets(ambiguous_word)

    # Optionally keep only the candidate senses with the requested POS tag.
    if pos:
        synsets = [ss for ss in synsets if str(ss.pos()) == pos]

    if not synsets:
        return None

    # Score each sense by how many context tokens appear in its
    # whitespace-split definition, then return the highest-scoring sense.
    _, sense = max(
        (len(context.intersection(ss.definition().split())), ss) for ss in synsets
    )

    return sense
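

# Illustrative sketch only, not part of NLTK: the same overlap scoring that
# lesk() applies, run on a small hand-written gloss table so that it works
# without the WordNet corpus. TOY_GLOSSES, toy_lesk, example_context and
# example_sense are hypothetical names introduced for this example, and the
# glosses are made up rather than taken from WordNet.
TOY_GLOSSES = {
    "bank.finance": "an institution that accepts money on deposit",
    "bank.river": "sloping land beside a body of water",
}


def toy_lesk(context_sentence, glosses):
    """Return the key of the gloss sharing the most tokens with the context."""
    context = set(context_sentence)
    if not glosses:
        return None
    # Score = number of context tokens found in the whitespace-split gloss.
    # max() compares (score, key) tuples, so ties fall back to the key order.
    _, sense = max(
        (len(context.intersection(definition.split())), key)
        for key, definition in glosses.items()
    )
    return sense


# "money" and "deposit" both occur in the finance gloss and no context token
# occurs in the river gloss, so the finance sense is selected here.
example_context = ["I", "went", "to", "the", "bank", "to", "deposit", "money", "."]
example_sense = toy_lesk(example_context, TOY_GLOSSES)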