.. Copyright (C) 2001-2018 NLTK Project
.. For license information, see LICENSE.TXT

======================
Information Extraction
======================

Information Extraction standardly consists of three subtasks:

#. Named Entity Recognition

#. Relation Extraction

#. Template Filling

Named Entities
~~~~~~~~~~~~~~

The IEER corpus is marked up for a variety of Named Entities. A `Named
Entity`:dt: (more strictly, a Named Entity mention) is a name of an
entity belonging to a specified class. For example, the Named Entity
classes in IEER include PERSON, LOCATION, ORGANIZATION, DATE and so
on. Within NLTK, Named Entities are represented as subtrees within a
chunk structure: the class name is treated as the node label, while the
entity mention itself appears as the leaves of the subtree. This is
illustrated below, where we show an extract of the chunk
representation of document NYT_19980315.064:

    >>> from nltk.corpus import ieer
    >>> docs = ieer.parsed_docs('NYT_19980315')
    >>> tree = docs[1].text
    >>> print(tree) # doctest: +ELLIPSIS
    (DOCUMENT
    ...
      ``It's
      a
      chance
      to
      think
      about
      first-level
      questions,''
      said
      Ms.
      (PERSON Cohn)
      ,
      a
      partner
      in
      the
      (ORGANIZATION McGlashan & Sarrail)
      firm
      in
      (LOCATION San Mateo)
      ,
      (LOCATION Calif.)
      ...)

Thus, the Named Entity mentions in this example are *Cohn*, *McGlashan &
Sarrail*, *San Mateo* and *Calif.*.

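The printed chunk structure can also be read mechanically. The following is a
minimal sketch, not part of NLTK, that pulls ``(class, mention)`` pairs out of
the printed form with a regular expression. The abbreviated sample string and
the helper name ``ne_mentions`` are made up for illustration; on a real
``Tree`` object one would normally walk the subtrees instead, for instance with
``Tree.subtrees()``.

```python
import re

# Abbreviated, hand-copied version of the printed chunk structure.
printed = """(DOCUMENT said Ms. (PERSON Cohn) , a partner in the
  (ORGANIZATION McGlashan & Sarrail) firm in (LOCATION San Mateo) ,
  (LOCATION Calif.))"""

# An innermost bracket pair holds a class label followed by the leaves.
NE = re.compile(r"\(([A-Z]+)\s+([^()]+)\)")

def ne_mentions(text):
    """Return (class label, mention) pairs found in a printed chunk tree."""
    return [(label, " ".join(leaves.split())) for label, leaves in NE.findall(text)]

mentions = ne_mentions(printed)
for label, mention in mentions:
    print(label, "->", mention)
```

The outer ``DOCUMENT`` bracket is skipped because its content itself contains
brackets, so only the Named Entity subtrees are reported.
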
The CoNLL2002 Dutch and Spanish data is treated similarly, although in
this case, the strings are also POS tagged.

    >>> from nltk.corpus import conll2002
    >>> for doc in conll2002.chunked_sents('ned.train')[27]:
    ...     print(doc)
    (u'Het', u'Art')
    (ORG Hof/N van/Prep Cassatie/N)
    (u'verbrak', u'V')
    (u'het', u'Art')
    (u'arrest', u'N')
    (u'zodat', u'Conj')
    (u'het', u'Pron')
    (u'moest', u'V')
    (u'worden', u'V')
    (u'overgedaan', u'V')
    (u'door', u'Prep')
    (u'het', u'Art')
    (u'hof', u'N')
    (u'van', u'Prep')
    (u'beroep', u'N')
    (u'van', u'Prep')
    (LOC Antwerpen/N)
    (u'.', u'Punc')

Relation Extraction
~~~~~~~~~~~~~~~~~~~

Relation Extraction standardly consists of identifying specified
relations between Named Entities. For example, assuming that we can
recognize ORGANIZATIONs and LOCATIONs in text, we might also want to
recognize pairs *(o, l)* of these kinds of entities such that *o* is
located in *l*.

The `sem.relextract` module provides some tools to help carry out a
simple version of this task. The `tree2semi_rel()` function splits a chunk
document into a list of two-member lists, each of which consists of a
(possibly empty) list of words followed by a `Tree` (i.e., a Named Entity):

    >>> from nltk.sem import relextract
    >>> pairs = relextract.tree2semi_rel(tree)
    >>> for s, tree in pairs[18:22]:
    ...     print('("...%s", %s)' % (" ".join(s[-5:]),tree))
    ("...about first-level questions,'' said Ms.", (PERSON Cohn))
    ("..., a partner in the", (ORGANIZATION McGlashan & Sarrail))
    ("...firm in", (LOCATION San Mateo))
    ("...,", (LOCATION Calif.))

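To make the shape of these pairs concrete, here is a simplified plain-Python
stand-in for the splitting step. It is a sketch of the idea only, not NLTK's
implementation: the chunk document is modelled as a flat list whose items are
either word strings or hypothetical ``(label, words)`` tuples standing in for
Named Entity subtrees, and ``split_semi_rels`` is an invented name.

```python
def split_semi_rels(items):
    """Group a flat chunk sequence into [words_before_entity, entity] pairs.

    Plain strings are ordinary words; (label, words) tuples stand in for
    Named Entity subtrees. Words after the last entity are dropped here,
    which is a simplifying assumption of this sketch.
    """
    pairs = []
    words = []
    for item in items:
        if isinstance(item, tuple):      # reached an entity: close the pair
            pairs.append([words, item])
            words = []
        else:                            # ordinary word: keep collecting
            words.append(item)
    return pairs

doc = ["said", "Ms.", ("PERSON", ["Cohn"]), ",", "a", "partner", "in", "the",
       ("ORGANIZATION", ["McGlashan", "&", "Sarrail"]), "firm", "in",
       ("LOCATION", ["San", "Mateo"]), ",", ("LOCATION", ["Calif."])]

semi_rels = split_semi_rels(doc)
for words, (label, mention) in semi_rels:
    print(repr(" ".join(words)), label, " ".join(mention))
```

Each pair keeps the words that precede an entity, so the text between two
neighbouring entities is always stored with the second of them.
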
The function `semi_rel2reldict()` processes triples of these pairs, i.e.,
triples of the form ``((string1, Tree1), (string2, Tree2), (string3,
Tree3))``, and outputs a dictionary (a `reldict`) in which ``Tree1`` is
the subject of the relation, ``string2`` is the filler
and ``Tree2`` is the object of the relation. ``string1`` and ``string3`` are
stored as left and right context respectively.

    >>> reldicts = relextract.semi_rel2reldict(pairs)
    >>> for k, v in sorted(reldicts[0].items()):
    ...     print(k, '=>', v) # doctest: +ELLIPSIS
    filler => of messages to their own ``Cyberia'' ...
    lcon => transactions.'' Each week, they post
    objclass => ORGANIZATION
    objsym => white_house
    objtext => White House
    rcon => for access to its planned
    subjclass => CARDINAL
    subjsym => hundreds
    subjtext => hundreds
    untagged_filler => of messages to their own ``Cyberia'' ...

The next example shows some of the values for two `reldict`\ s
corresponding to the ``'NYT_19980315'`` text extract shown earlier.

    >>> for r in reldicts[18:20]:
    ...     print('=' * 20)
    ...     print(r['subjtext'])
    ...     print(r['filler'])
    ...     print(r['objtext'])
    ====================
    Cohn
    , a partner in the
    McGlashan & Sarrail
    ====================
    McGlashan & Sarrail
    firm in
    San Mateo

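The windowing behind these dictionaries can be sketched in plain Python. This
is an illustration of the idea rather than NLTK's code, written to agree with
the output above, where subject and object are neighbouring entities. Entities
are hypothetical ``(label, words)`` tuples, ``build_reldicts`` is an invented
name, only some of the keys are filled in, and the five-word context size is an
assumption chosen for the example.

```python
def build_reldicts(pairs, window=5):
    """Slide over consecutive pairs and describe each adjacent entity pair."""
    reldicts = []
    for i in range(len(pairs) - 2):
        (left_words, subj), (filler, obj), (right_words, _) = pairs[i:i + 3]
        reldicts.append({
            "lcon": " ".join(left_words[-window:]),   # words before the subject
            "subjclass": subj[0],
            "subjtext": " ".join(subj[1]),
            "filler": " ".join(filler),               # words between the entities
            "objclass": obj[0],
            "objtext": " ".join(obj[1]),
            "rcon": " ".join(right_words[:window]),   # words after the object
        })
    return reldicts

pairs = [
    (["said", "Ms."], ("PERSON", ["Cohn"])),
    ([",", "a", "partner", "in", "the"], ("ORGANIZATION", ["McGlashan", "&", "Sarrail"])),
    (["firm", "in"], ("LOCATION", ["San", "Mateo"])),
    ([","], ("LOCATION", ["Calif."])),
]

sketch_reldicts = build_reldicts(pairs)
for r in sketch_reldicts:
    print(r["subjtext"], "|", r["filler"], "|", r["objtext"])
```

Four pairs yield two dictionaries, because each one needs a third pair to
supply the right context.
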
The function `extract_rels()` allows us to filter the `reldict`\ s
according to the classes of the subject and object named entities. In
addition, we can specify that the filler text has to match a given
regular expression, as illustrated in the next example. Here, we are
looking for pairs of entities in the IN relation, where IN has
signature <ORG, LOC>.

    >>> import re
    >>> IN = re.compile(r'.*\bin\b(?!\b.+ing\b)')
    >>> for fileid in ieer.fileids():
    ...     for doc in ieer.parsed_docs(fileid):
    ...         for rel in relextract.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern = IN):
    ...             print(relextract.rtuple(rel)) # doctest: +ELLIPSIS
    [ORG: 'Christian Democrats'] ', the leading political forces in' [LOC: 'Italy']
    [ORG: 'AP'] ') _ Lebanese guerrillas attacked Israeli forces in southern' [LOC: 'Lebanon']
    [ORG: 'Security Council'] 'adopted Resolution 425. Huge yellow banners hung across intersections in' [LOC: 'Beirut']
    [ORG: 'U.N.'] 'failures in' [LOC: 'Africa']
    [ORG: 'U.N.'] 'peacekeeping operation in' [LOC: 'Somalia']
    [ORG: 'U.N.'] 'partners on a more effective role in' [LOC: 'Africa']
    [ORG: 'AP'] ') _ A bomb exploded in a mosque in central' [LOC: 'San`a']
    [ORG: 'Krasnoye Sormovo'] 'shipyard in the Soviet city of' [LOC: 'Gorky']
    [ORG: 'Kelab Golf Darul Ridzuan'] 'in' [LOC: 'Perak']
    [ORG: 'U.N.'] 'peacekeeping operation in' [LOC: 'Somalia']
    [ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
    [ORG: 'McGlashan & Sarrail'] 'firm in' [LOC: 'San Mateo']
    [ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
    [ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
    [ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
    [ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
    ...

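The effect of the ``IN`` pattern can be checked on its own with the standard
``re`` module. The sketch below assumes the pattern is applied from the start
of the filler text, as ``re.match`` does. The first two fillers are taken from
the output above, while the last two are made-up strings added to show what is
rejected.

```python
import re

IN = re.compile(r'.*\bin\b(?!\b.+ing\b)')

fillers = [
    "firm in",                                   # plain locative "in"
    ", the research group in",
    "has been successful in supervising the",    # "in" followed by an "-ing" word
    "of messages to their own",                  # no word "in" at all
]

results = [bool(IN.match(filler)) for filler in fillers]
for ok, filler in zip(results, fillers):
    print(ok, repr(filler))
```

The negative lookahead ``(?!\b.+ing\b)`` rejects an "in" that is followed
later in the filler by a word ending in "ing", which screens out phrases
such as "successful in supervising" where "in" does not introduce a location.
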
The next example illustrates a case where the pattern is a disjunction
of roles that a PERSON can occupy in an ORGANIZATION.

    >>> roles = """
    ... (.*(
    ... analyst|
    ... chair(wo)?man|
    ... commissioner|
    ... counsel|
    ... director|
    ... economist|
    ... editor|
    ... executive|
    ... foreman|
    ... governor|
    ... head|
    ... lawyer|
    ... leader|
    ... librarian).*)|
    ... manager|
    ... partner|
    ... president|
    ... producer|
    ... professor|
    ... researcher|
    ... spokes(wo)?man|
    ... writer|
    ... ,\sof\sthe?\s* # "X, of (the) Y"
    ... """
    >>> ROLES = re.compile(roles, re.VERBOSE)
    >>> for fileid in ieer.fileids():
    ...     for doc in ieer.parsed_docs(fileid):
    ...         for rel in relextract.extract_rels('PER', 'ORG', doc, corpus='ieer', pattern=ROLES):
    ...             print(relextract.rtuple(rel)) # doctest: +ELLIPSIS
    [PER: 'Kivutha Kibwana'] ', of the' [ORG: 'National Convention Assembly']
    [PER: 'Boban Boskovic'] ', chief executive of the' [ORG: 'Plastika']
    [PER: 'Annan'] ', the first sub-Saharan African to head the' [ORG: 'United Nations']
    [PER: 'Kiriyenko'] 'became a foreman at the' [ORG: 'Krasnoye Sormovo']
    [PER: 'Annan'] ', the first sub-Saharan African to head the' [ORG: 'United Nations']
    [PER: 'Mike Godwin'] ', chief counsel for the' [ORG: 'Electronic Frontier Foundation']
    ...

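The role pattern relies on ``re.VERBOSE``, which ignores whitespace and
``#`` comments inside the pattern so that a long disjunction can be laid out
one alternative per line. The following sketch uses a shortened, hypothetical
role list (not the full pattern above) to show the mechanics. The sample
fillers are partly taken from the output above and partly invented.

```python
import re

# A shortened, hypothetical version of the role list.
roles = r"""
    .*(                 # anything, then one of the role words
        director|
        executive|
        partner|
        spokes(wo)?man  # "spokesman" or "spokeswoman"
    ).*
    |
    ,\sof\sthe?\s*      # "X, of (the) Y"; spaces must be written as \s
"""
ROLES = re.compile(roles, re.VERBOSE)

samples = [", chief executive of the", ", a partner in the", ", of the", "met with"]
matches = [bool(ROLES.match(s)) for s in samples]
for ok, s in zip(matches, samples):
    print(ok, repr(s))
```

Because literal spaces are discarded in verbose mode, any space that must
appear in the filler has to be written explicitly, here as ``\s``.
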
In the case of the CoNLL2002 data, we can include POS tags in the
query pattern. This example also illustrates how the output can be
presented as something that looks more like a clause in a logical language.

    >>> de = """
    ... .*
    ... (
    ... de/SP|
    ... del/SP
    ... )
    ... """
    >>> DE = re.compile(de, re.VERBOSE)
    >>> rels = [rel for doc in conll2002.chunked_sents('esp.train')
    ...         for rel in relextract.extract_rels('ORG', 'LOC', doc, corpus='conll2002', pattern = DE)]
    >>> for r in rels[:10]:
    ...     print(relextract.clause(r, relsym='DE')) # doctest: +NORMALIZE_WHITESPACE
    DE(u'tribunal_supremo', u'victoria')
    DE(u'museo_de_arte', u'alcorc\xf3n')
    DE(u'museo_de_bellas_artes', u'a_coru\xf1a')
    DE(u'siria', u'l\xedbano')
    DE(u'uni\xf3n_europea', u'pek\xedn')
    DE(u'ej\xe9rcito', u'rogberi')
    DE(u'juzgado_de_instrucci\xf3n_n\xfamero_1', u'san_sebasti\xe1n')
    DE(u'psoe', u'villanueva_de_la_serena')
    DE(u'ej\xe9rcito', u'l\xedbano')
    DE(u'juzgado_de_lo_penal_n\xfamero_2', u'ceuta')
    >>> vnv = """
    ... (
    ... is/V|
    ... was/V|
    ... werd/V|
    ... wordt/V
    ... )
    ... .*
    ... van/Prep
    ... """
    >>> VAN = re.compile(vnv, re.VERBOSE)
    >>> for doc in conll2002.chunked_sents('ned.train'):
    ...     for r in relextract.extract_rels('PER', 'ORG', doc, corpus='conll2002', pattern=VAN):
    ...         print(relextract.clause(r, relsym="VAN"))
    VAN(u"cornet_d'elzius", u'buitenlandse_handel')
    VAN(u'johan_rottiers', u'kardinaal_van_roey_instituut')
    VAN(u'annie_lennox', u'eurythmics')
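
The two ideas in this last example, matching against ``word/TAG`` text and
printing a relation as a clause, can be sketched without the corpus. This is
a rough approximation under stated assumptions, not NLTK's implementation:
the ``word/TAG`` filler format is inferred from the patterns above, the helper
names ``tagged_filler``, ``to_symbol`` and ``clause`` are invented here, and
the input words are made up.

```python
import re

DE = re.compile(r"""
    .*
    (
    de/SP|
    del/SP
    )
    """, re.VERBOSE)

def tagged_filler(tokens):
    """Join (word, tag) pairs into the 'word/TAG' text a pattern is matched against."""
    return " ".join("%s/%s" % (word, tag) for word, tag in tokens)

def to_symbol(words):
    """Lower-case the words and join them with underscores."""
    return "_".join(words).lower()

def clause(relsym, subj_words, obj_words):
    """Format a relation as RELSYM('subject', 'object')."""
    return "%s(%r, %r)" % (relsym, to_symbol(subj_words), to_symbol(obj_words))

filler = tagged_filler([("de", "SP")])
if DE.match(filler):
    print(clause("DE", ["Tribunal", "Supremo"], ["Victoria"]))
```

Including the tag in the pattern means that only a token tagged as a
preposition satisfies the query, whatever else the same word form might be
tagged as elsewhere.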