Project List:

1. Hydra: Named Entity Resolution in Python

2. Interplanetary Field Enhancements: Automated Identification of Events and Search for the Source of the Event




University of Colorado Boulder Honors Thesis – Hail Hydra: Named Entity Resolution, Extraction, and Linking of Lexically Similar Names

“‘Words, words, words’ (Hamlet 2.2 18). Characters and ideas in text are represented by names. A casual reader would have no trouble understanding that a passing reference to Mr. Holmes, Mr. Sherlock Holmes, Sherlock Holmes, and Holmes all trace back to the world’s most famous detective. Names are often shortened or rearranged with common abbreviations or elaborate titles. Each version of a character’s name can be understood as a single head on a multi-headed hydra, all tracing back to the same body. Raw text analysis requires more literary context about how English is structured and how words in a sentence interact to generate the most accurate named entities possible. Many intelligent-dependency parsers and natural language processing systems study text without accounting for how dynamic language can be. This thesis considers the entire body of a piece of literature to identify and relate entities within the same text, regardless of the fluid nature of the exact reference to an entity in literature. Once an entity has been identified, lexically similar names, which refer to the same character, can be linked together to form a global named entity that represents all forms of the named entity referenced in the text. By utilizing raw text as opposed to a labeled corpus, this thesis will generate named entities from the text.”

Examples of the Global Named Entities Found:

Captain Rollo Bickersteth of the Coldstream (My Man Jeeves)

Wicked Witch of the West (Wonderful Wizard of Oz)

Sydney Cecil Vivian Montmorency (Little Princess)

Superior of the Academy of the Presentation of the Blessed Virgin (Love in the Time of Cholera)

General Manager of the River Company of the Caribbean (Love in the Time of Cholera)

Eleventh Edition of the Newspeak Dictionary (1984)
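
As a rough illustration of how lexically similar names might be linked into one global named entity, here is a minimal Python sketch. It is not the thesis implementation: the `TITLES` set, the `name_tokens` and `link_names` helpers, and the token-subset rule are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical list of honorifics to ignore when comparing names.
TITLES = {"mr", "mrs", "miss", "ms", "dr", "sir", "lady", "captain"}


def name_tokens(name):
    """Lowercase a name, strip periods, and drop honorifics."""
    words = name.lower().replace(".", "").split()
    return frozenset(w for w in words if w not in TITLES)


def link_names(names):
    """Attach each name to the first longer name whose tokens contain its tokens."""
    # Longest names first, so the fullest form becomes the global entity.
    ordered = sorted(set(names), key=lambda n: (-len(n), n))
    groups = defaultdict(list)
    for name in ordered:
        tokens = name_tokens(name)
        for head in groups:
            if tokens and tokens <= name_tokens(head):
                groups[head].append(name)
                break
        else:
            # No existing entity covers this name, so it starts a new one.
            groups[name].append(name)
    return dict(groups)


mentions = ["Mr. Holmes", "Mr. Sherlock Holmes", "Sherlock Holmes", "Holmes", "Dr. Watson"]
linked = link_names(mentions)
print(linked)
```

In this toy run all four Holmes variants collapse under "Mr. Sherlock Holmes", while "Dr. Watson" stays separate. A subset rule this simple would wrongly merge two characters who share a surname, which is one reason a real system needs more context.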

Networks of Interactions

Once the text had been tagged and the global named entities identified, I used this script to generate a network of interactions showing how characters and ideas in the text interact, including both the frequency and the sentiment of each interaction.
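
A minimal sketch of how such a network could be built, assuming that an "interaction" means two entities appearing in the same sentence and that sentiment is a sum over a small word list. The `POLARITY` lexicon and the `interaction_network` function are hypothetical and stand in for whatever the actual script uses.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy sentiment lexicon; a real script would use a fuller resource.
POLARITY = {"kind": 1, "good": 1, "helped": 1, "cruel": -1, "afraid": -1, "angry": -1}


def interaction_network(sentences, entities):
    """Count co-occurrences of entities per sentence and sum a crude polarity."""
    frequency = Counter()
    sentiment = Counter()
    for sentence in sentences:
        lowered = sentence.lower()
        present = sorted(e for e in entities if e.lower() in lowered)
        words = lowered.replace(".", " ").replace(",", " ").split()
        score = sum(POLARITY.get(w, 0) for w in words)
        # Every pair of entities in the sentence is one edge of the network.
        for pair in combinations(present, 2):
            frequency[pair] += 1
            sentiment[pair] += score
    return frequency, sentiment


sentences = [
    "Dorothy helped the Scarecrow down from his pole.",
    "The Wicked Witch was angry at Dorothy.",
    "Dorothy and the Scarecrow met the Tin Woodman.",
]
entities = ["Dorothy", "Scarecrow", "Tin Woodman", "Wicked Witch"]
frequency, sentiment = interaction_network(sentences, entities)
for pair, count in frequency.most_common():
    print(pair, count, sentiment[pair])
```

The two counters map directly onto a weighted graph: edge weight from `frequency`, edge colour or sign from `sentiment`.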

The Wonderful Wizard of Oz (Baum)

More networks of interactions for specific texts can be found here

Tagging text for Pronouns and Proper nouns

One morning, when <Gregor Samsa>_n0 woke from troubled dreams, <he>_p0 found <himself>_p1 transformed in <his>_p2 bed into a horrible vermin. <He>_p3 lay on <his>_p4 armour-like back, and if <he>_p7 lifted <his>_p5 head a little <he>_p8 could see <his>_p5 brown belly, slightly domed and divided by arches into stiff sections. The bedding was hardly able to cover <it>_p9 and seemed ready to slide off any moment. <His>_p10 many legs, pitifully thin compared with the size of the rest of <him>_p11, waved about helplessly as <he>_p12 looked.

<Scarlett O'Hara>_n0 was not beautiful, but men seldom realized <it>_p0 when caught by <her>_p1 charm as the <Tarleton>_n1 twins were. In <her>_p2 face were too sharply blended the delicate features of <her>_p3 mother, a <Coast>_n2 aristocrat of French descent, and the heavy ones of <her>_p3 florid Irish father.
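
The tag format above can be reproduced with a short regular-expression sketch. This assumes the proper nouns are already known and passed in as a list; the `PRONOUNS` set and the `tag_text` function are illustrative names, not the original tagger, which has to find the names itself.

```python
import re

PRONOUNS = {"he", "him", "his", "himself", "she", "her", "hers", "herself",
            "it", "its", "itself", "they", "them", "their", "themselves"}


def tag_text(text, names):
    """Wrap known names as <name>_nK and pronouns as <pronoun>_pK."""
    counters = {"n": 0, "p": 0}
    # Longest alternatives first so "Gregor Samsa" wins over "Gregor".
    ordered_names = sorted(names, key=len, reverse=True)
    # "(?!)" never matches; it keeps the pattern valid when no names are given.
    name_part = "|".join(re.escape(n) for n in ordered_names) or r"(?!)"
    pronoun_part = "|".join(sorted(PRONOUNS, key=len, reverse=True))
    pattern = re.compile(
        rf"\b(?P<name>{name_part})\b|\b(?P<pronoun>{pronoun_part})\b",
        re.IGNORECASE,
    )

    def label(match):
        kind = "n" if match.group("name") else "p"
        tag = f"<{match.group(0)}>_{kind}{counters[kind]}"
        counters[kind] += 1
        return tag

    return pattern.sub(label, text)


sentence = ("One morning, when Gregor Samsa woke from troubled dreams, he found "
            "himself transformed in his bed into a horrible vermin.")
tagged = tag_text(sentence, ["Gregor Samsa"])
print(tagged)
```

Unlike the examples above, this sketch numbers tags strictly in order of appearance.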

Gender Name Classifier (Decision Trees)

To better identify features of interactions, I trained a decision tree to determine the gender of a name. A later step verifies the predicted gender against gendered honorifics such as ‘Mr’ or ‘Queen’.

The name ‘Atticus’ is most likely Male

Odds: Female (0.215384615385), Male (0.784615384615)

The name ‘Ishamel’ is most likely Male

Odds: Female (0.4), Male (0.6)
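
The honorific verification step might look something like the sketch below. It takes the odds produced by a classifier as given and does not reproduce the decision tree itself. The title lists and the `verify_gender` function are assumptions for illustration, and the odds passed for "Miss Scarlett" are made up.

```python
# Hypothetical honorific lists; the original lists are not given in the text.
MALE_TITLES = {"mr", "mister", "sir", "king", "lord", "uncle"}
FEMALE_TITLES = {"mrs", "miss", "ms", "lady", "queen", "madame", "aunt"}


def verify_gender(full_name, odds):
    """Return (label, odds), letting a gendered honorific override the odds."""
    words = [w.strip(".").lower() for w in full_name.split()]
    if any(w in MALE_TITLES for w in words):
        return "Male", {"Female": 0.0, "Male": 1.0}
    if any(w in FEMALE_TITLES for w in words):
        return "Female", {"Female": 1.0, "Male": 0.0}
    # No honorific present: fall back to the classifier's most likely label.
    label = max(odds, key=odds.get)
    return label, odds


print(verify_gender("Atticus", {"Female": 0.215384615385, "Male": 0.784615384615}))
print(verify_gender("Miss Scarlett", {"Female": 0.4, "Male": 0.6}))
```

Treating an honorific as decisive is a design choice of this sketch; a softer version could instead shift the odds without fixing them at 0 and 1.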

Identify Main Characters

By combining this information, I was able to generate a simple summary of the text and determine the main character, or the character the text focuses on. This distinction is particularly important for first-person texts like Sherlock Holmes, where the story is narrated by Dr. Watson but is focused on the titular character.

The Wonderful Wizard of Oz (Baum)


CHARACTER OF INTEREST: [(‘Dorothy’, 345)]

ADDITIONAL TOP CHARACTERS OF INTEREST: [(‘Wise Scarecrow’, 224), (‘Tin Woodman’, 180), (‘Cowardly Lion’, 176), (‘Wonderful City of Oz’, 159), (‘Wicked Witch of the West’, 126)]

Sherlock Holmes (Doyle)


CHARACTER OF INTEREST: [(‘Mister Sherlock Holmes’, 453)]

ADDITIONAL TOP CHARACTERS OF INTEREST: [(‘Dr Watson’, 80), (‘City of London’, 51), (‘Mr Lestrade of Scotland Yard’, 48), (‘Mr John Turner’, 40), (‘Mr James Windibank’, 38)]

More examples can be found here
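
One simple way to produce a ranking like the ones above is to count mentions after mapping each alias to its global named entity. The sketch below assumes that mapping already exists; `characters_of_interest`, `alias_to_entity`, and the sample mentions are hypothetical, and the real script may weigh more than raw counts.

```python
from collections import Counter


def characters_of_interest(mentions, alias_to_entity, top_n=5):
    """Rank global entities by how often any of their aliases is mentioned."""
    counts = Counter(alias_to_entity.get(m, m) for m in mentions)
    ranked = counts.most_common(top_n + 1)
    # The first entry is the main character; the rest are the runners-up.
    return ranked[:1], ranked[1:]


alias_to_entity = {
    "Holmes": "Mister Sherlock Holmes",
    "Sherlock Holmes": "Mister Sherlock Holmes",
    "Watson": "Dr Watson",
}
mentions = ["Holmes", "Watson", "Sherlock Holmes", "Holmes", "Lestrade", "Watson", "Holmes"]
main, others = characters_of_interest(mentions, alias_to_entity)
print("CHARACTER OF INTEREST:", main)
print("ADDITIONAL TOP CHARACTERS OF INTEREST:", others)
```

Counting mentions rather than speaking turns is what lets a narrated character such as Holmes outrank the narrator.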

Differences in Sentiment between Female and Male Characters during the course of a story

With the gender and characters identified, I automated the generation of graphs for any novel in which sentiment differed to a statistically significant degree between male and female characters. From my list of novels, this left two texts, Princess of Mars and The Scarlet Letter; in both, the female characters were statistically more likely to be associated with negative words.
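
The text does not say which significance test was used. As one possible approach, the sketch below applies a two-sided permutation test to per-mention polarity scores; the `permutation_p_value` function, the sample scores, and the 0.05 threshold are all assumptions for illustration.

```python
import random
from statistics import mean


def permutation_p_value(female_scores, male_scores, trials=5000, seed=0):
    """Two-sided permutation test on the difference of mean polarity."""
    rng = random.Random(seed)
    observed = abs(mean(female_scores) - mean(male_scores))
    pooled = list(female_scores) + list(male_scores)
    n_female = len(female_scores)
    extreme = 0
    for _ in range(trials):
        # Reassign gender labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_female]) - mean(pooled[n_female:]))
        if diff >= observed:
            extreme += 1
    return extreme / trials


# Made-up polarity scores, one per character mention.
female = [-0.4, -0.3, -0.5, -0.2, -0.6, -0.35, -0.45, -0.5]
male = [0.2, 0.1, 0.3, 0.25, 0.15, 0.05, 0.3, 0.2]
p_value = permutation_p_value(female, male)
print("significant" if p_value < 0.05 else "not significant", p_value)
```

A permutation test makes no assumption about how polarity scores are distributed, which suits short, skewed samples; a t-test would be a reasonable alternative.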

As an example, I have included the polarity for Princess of Mars below.

Script available with examples and instructions on Github

Interplanetary Field Enhancements (IFEs)

Automating the Search for IFEs in ACE Magnetometer Data

This project looks for Interplanetary Field Enhancements (IFEs) (Russell et al. 1985a) in ACE magnetometer data (GSE coordinates). IFEs were first identified in association with the passage of an asteroid within the Venusian orbit and are believed to result from charged dust interacting with the flowing solar wind. The evolution and geoeffectiveness of IFEs are still areas of active research, so fast and objective identification of IFEs at 1 AU is important.

Selection Criteria (Lai et al. 2017):

1. Total Magnetic field enhancement > 25% (relative to ambient |B|)

2. Duration of enhancement > 10 minutes

3. Current sheet is present at or within the peak of |B|

From the original ACE magnetometer data, I constructed a Python script to automatically identify and generate graphs of potential events based on these criteria. This greatly sped up the search for potential events, which had previously been done by hand. The script can search through a year’s worth of data automatically and return all potential events with generated graphs and timestamps.
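
The three selection criteria can be sketched as a single scan over the field vectors. This is not the script itself: it assumes evenly spaced samples, takes the window median as the ambient |B|, and uses a sign reversal in any field component as a crude stand-in for a current sheet. The names `find_enhancements` and `has_reversal` are hypothetical.

```python
import math
from statistics import median


def has_reversal(vectors):
    """Crude current-sheet proxy: some component changes sign between samples."""
    return any(
        a * b < 0
        for prev, curr in zip(vectors, vectors[1:])
        for a, b in zip(prev, curr)
    )


def find_enhancements(field, cadence_min=1.0, threshold=1.25, min_duration_min=10.0):
    """Return (start, end) sample indices (end exclusive) of candidate events."""
    magnitude = [math.sqrt(bx * bx + by * by + bz * bz) for bx, by, bz in field]
    ambient = median(magnitude)  # assumption: window median as ambient |B|
    events, start = [], None
    for i, b in enumerate(magnitude + [0.0]):  # sentinel closes a trailing run
        if b > threshold * ambient:
            if start is None:
                start = i
        elif start is not None:
            long_enough = (i - start) * cadence_min > min_duration_min
            if long_enough and has_reversal(field[start:i]):
                events.append((start, i))
            start = None
    return events


quiet = [(3.0, 4.0, 0.0)] * 30                            # |B| = 5
event = [(6.0, 8.0, 0.0)] * 6 + [(6.0, -8.0, 0.0)] * 6    # |B| = 10, By reverses
spike = [(6.0, 8.0, 0.0)] * 5                             # too short to qualify
field = quiet + event + quiet + spike + quiet
print(find_enhancements(field))
```

With one-minute samples, only the 12-minute enhancement containing the By reversal is reported; the 5-minute spike fails the duration criterion.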

Example of an identified event:


Script available with examples and instructions on Github

Identifying the Possible Dust Source Correlated with an IFE

Interplanetary Field Enhancements (IFEs) were first discovered within the Venusian orbit and were believed to be generated by charged dust mass-loading the interplanetary magnetic field, with the asteroid 2201 Oljato as the originally proposed source (Russell 1987). However, the dust source hypothesis for IFEs remains controversial. This program attempts to correlate IFEs measured near Earth in the solar wind with small bodies that could be a source of dust.

A strong candidate dust source is a small body whose orbital inclination is close to the XY plane (GSE) of the spacecraft and which is in the region around the time the IFEs were found (i.e. has a small phase difference) for multiple orbits. Because of the large gyroradius of the charged dust, the cloud will travel approximately radially from the source.

The program first identifies possible dust sources for each event and then compares subsequent periods of the small bodies identified to find the percentage of the time IFEs are again seen.

Identification is broken into the following steps, which the Python script uses to identify candidate dust sources:

1. Find all small bodies nearby in phase and inclination

2. Find orbital period

3. Check all subsequent periods for IFEs

4. Compare how often subsequent periods produce an IFE
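
The four steps above can be sketched as follows. The `SmallBody` fields, the inclination and phase cut-offs, and the five-day matching tolerance are assumptions chosen for illustration; the orbital period uses Kepler's third law for a heliocentric orbit, T [years] = a [AU] ^ 1.5.

```python
from dataclasses import dataclass


@dataclass
class SmallBody:
    name: str
    semi_major_axis_au: float
    inclination_deg: float
    phase_diff_deg: float  # hypothetical: angular offset from the spacecraft at event time


def candidates(bodies, max_inclination_deg=5.0, max_phase_deg=10.0):
    """Step 1: keep bodies close to the XY plane and close in phase."""
    return [b for b in bodies
            if abs(b.inclination_deg) <= max_inclination_deg
            and abs(b.phase_diff_deg) <= max_phase_deg]


def orbital_period_days(body):
    """Step 2: Kepler's third law, T[yr] = a[AU] ** 1.5."""
    return 365.25 * body.semi_major_axis_au ** 1.5


def recurrence_fraction(body, event_day, ife_days, last_day, tolerance_days=5.0):
    """Steps 3-4: fraction of later orbital returns that coincide with an IFE."""
    period = orbital_period_days(body)
    hits = returns = 0
    day = event_day + period
    while day <= last_day:
        returns += 1
        if any(abs(day - d) <= tolerance_days for d in ife_days):
            hits += 1
        day += period
    return hits / returns if returns else 0.0


# Made-up bodies and IFE times (days since the first event).
bodies = [SmallBody("body A", 1.0, 2.0, 3.0), SmallBody("body B", 2.5, 20.0, 1.0)]
ife_days = [0.0, 366.0, 1100.0]
for body in candidates(bodies):
    print(body.name, recurrence_fraction(body, 0.0, ife_days, last_day=1200.0))
```

In this toy run only "body A" passes the first filter, and two of its three later returns fall within five days of a recorded IFE.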

Subsequent statistical analysis will be performed over the entire orbit of promising dust source candidates to examine how the frequency of IFEs changes at different points in the body’s orbit (see Russell 1987).

Data was collected from the NASA JPL Asteroid team’s NeoWs (Near Earth Object Web Service) API; pre-processing and identification were done with a Python script I wrote.

Script available with examples and instructions on Github