Named Entity Recognition and Generating Keys for your Datasets

Recently I've been exploring the Text Analytics capabilities of Azure ML, in particular "Entity Recognition". Entity Recognition is a valuable Natural Language Processing technique that allows you to extract named entities (think Proper Nouns) from a body of text. Azure ML Named Entity Recognition goes a step further by allowing you to identify the type of entity. Currently three types of entities are supported: Name, Location and Organization (labelled "NAM", "LOC", and "ORG" respectively). Additionally, for each entity the Named Entity Module gives you the entity name, its location in the original text, and the length of the string.
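
To make the location and length fields concrete, here is a minimal sketch in plain R (outside Azure ML) of how they can point back at the entity inside the text. The sample sentence, the column names (Mention, Offset, Length, Type) and the zero-based offset are all assumptions made for illustration, not confirmed module output.

```r
# Hypothetical article text and one hypothetical entity row.
text <- "The team met in Seattle last week."
entity <- data.frame(Mention = "Seattle", Offset = 16, Length = 7, Type = "LOC",
                     stringsAsFactors = FALSE)

# Assumption: Offset counts from zero, while substr() counts from one.
start <- entity$Offset + 1
recovered <- substr(text, start, start + entity$Length - 1)

# Check that the offset and length really point at the mention.
print(recovered == entity$Mention)
```

If the offsets in your own data turn out to be one-based, drop the "+ 1" adjustment.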

My first thought was, "WOW, this is neat, now what can I use it for?" That's a good question indeed. Not being a Data Scientist or NLP Researcher, I wondered whether I could use those Named Entities to improve my classification results. I'm working on an experiment to do just that, and when it's completed I'll publish my results.

For the time being, I thought I'd share a simple technique that I found to label my dataset with identifiers. Why would I want to do that? Well, the Named Entity Module also gives us a handle to the article that each entity originated in. My plan is to join the entities back into the original dataset so I can contrast using the original text with using a list of entities. The problem is that my original dataset does not have article identifiers in it (at least not sufficiently unique identifiers that I can join back to the new Entity data I've extracted). I looked through the modules and didn't see anything that would meet my needs, so I quickly wrote this simple R script and added it to an "Execute R Script" module to generate the identifiers and append them to the dataset. Here's the R code to do that:

# Map 1-based optional input ports to variables
dataset1 <- maml.mapInputPort(1);

# Add a zero-based "Article" column to the dataset
# (seq_len() also behaves correctly when the dataset has no rows)
dataset1$Article <- seq_len(nrow(dataset1)) - 1;

# Select data.frame to be sent to the output Dataset port
maml.mapOutputPort("dataset1");

The script 1) gets the dataset, 2) appends a column with the row number (minus 1 since the articles are zero-indexed) and 3) maps that dataset to the output. Couldn't be simpler! Hope this quick spin around the block gives you a taste of what you can do with R in Azure ML. Till next time, Safe Travels!
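
P.S. For anyone who wants to try the join idea outside Azure ML, here is a rough sketch in plain R. The two toy data frames, their contents, and the entity column names (Article, Mention, Type) are hypothetical stand-ins; merge() and aggregate() are just one way to do the join and may not match how you would wire it up with modules in an experiment.

```r
# Hypothetical stand-in for the original dataset (no identifier column yet).
articles <- data.frame(
  Text = c("Contoso opened an office in Seattle.",
           "No entities here.",
           "Fabrikam moved to London."),
  stringsAsFactors = FALSE
)

# Same idea as the Execute R Script step: zero-based row identifiers.
articles$Article <- seq_len(nrow(articles)) - 1

# Hypothetical entity rows; the column names are assumptions.
entities <- data.frame(
  Article = c(0, 0, 2, 2),
  Mention = c("Contoso", "Seattle", "Fabrikam", "London"),
  Type    = c("ORG", "LOC", "ORG", "LOC"),
  stringsAsFactors = FALSE
)

# Collapse the mentions into one space-separated string per article.
per_article <- aggregate(Mention ~ Article, data = entities,
                         FUN = paste, collapse = " ")

# Left join so articles without any entities are kept (Mention becomes NA).
joined <- merge(articles, per_article, by = "Article", all.x = TRUE)
print(joined)
```

The left join (all.x = TRUE) is a deliberate choice here: an article with no recognized entities stays in the dataset rather than silently disappearing.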
