Two months have passed since the Intel® AI Lab announced and released the first version of NLP Architect, an open and flexible library for NLP research and the developer community. We have been busy working on new additions and we are happy to announce the release of version 0.2. This new version includes several new models, many improvements to the existing library, the first of several applications based on NLP Architect models, and a set of Jupyter* Notebook tutorials.
Here are the NLP Architect release 0.2 highlights:
The following new NLP models were added:
- Unsupervised Cross-lingual Embeddings - documentation
- Temporal Convolution Network (TCN) - documentation
- Supervised sentiment analysis - documentation
- Streamlined TensorFlow* and Keras* support across models, e.g., reading comprehension, the memory network for dialogue, the word chunker, and NP semantic segmentation
- A stronger baseline Word Chunker model based on “Deep Multi-task Learning with Low-level Tasks Supervised at Lower Layers” by Søgaard et al.
New library features:
- New services infrastructure and updated demos
- A noun phrase extraction pipeline and plug-in for spaCy*
- Jupyter Notebook tutorials that were shown in hands-on workshops at AIDC 2018
- A Publications page highlighting current and future publications from our team, including blog posts, conference proceedings, and audio/visual material
- New dataset loaders: Amazon* reviews, PTB subset, Wikitext-103, fastText embeddings, and Wikipedia dumps
In this version we put an emphasis on showcasing how NLP applications can be built using the models and building blocks available in NLP Architect, and how easily they integrate into your code. We include Term Set Expansion—a corpus-based application for expanding sets of terms into a more complete set with additional terms of the same semantic class.
Noun Phrase Annotator and Plug-in for spaCy
Noun phrase extraction from a given corpus is a common requirement in NLP use cases, including the preprocessing stage of the Term Set Expansion application. We developed a neural noun phrase extractor that uses a trained chunker (code) together with an additional set of heuristic rules to improve the completeness of the extracted noun phrases. To make noun phrase extraction easier to use, and to benefit from the powerful annotation abilities and usability of the NLP pipeline provided by spaCy, we included an annotator implementation that can be linked into spaCy’s pipeline.
Linking to spaCy could not be easier, as shown in the example below:
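Since the exact class and model names belong in the NLP Architect documentation, the self-contained sketch below uses stand-in names (`Doc`, `Pipeline`, `NounPhraseAnnotator` are toy implementations, not the library's actual API) to illustrate the spaCy pipeline-component pattern the annotator follows: a callable that receives a document, attaches annotations, and returns it.

```python
# Illustrative sketch of the pipeline-component pattern spaCy uses.
# The real NLP Architect annotator follows the same shape: a callable
# that receives a Doc, adds annotations, and returns the Doc.
# All class names here are stand-ins, not the actual library API.

class Doc:
    """Toy stand-in for a spaCy Doc with a metadata (user_data) space."""
    def __init__(self, text):
        self.text = text
        self.user_data = {}  # spaCy stores custom annotations here

class NounPhraseAnnotator:
    """Toy component: tags consecutive capitalized words as noun phrases."""
    name = "noun_phrase_annotator"

    def __call__(self, doc):
        tokens = doc.text.split()
        spans = [
            " ".join(tokens[i:i + 2])
            for i in range(len(tokens) - 1)
            if tokens[i][0].isupper() and tokens[i + 1][0].isupper()
        ]
        doc.user_data["noun_phrases"] = spans
        return doc

class Pipeline:
    """Toy stand-in for spaCy's nlp object with add_pipe()."""
    def __init__(self):
        self.components = []

    def add_pipe(self, component, last=True):
        self.components.append(component)

    def __call__(self, text):
        doc = Doc(text)
        for component in self.components:
            doc = component(doc)
        return doc

nlp = Pipeline()
nlp.add_pipe(NounPhraseAnnotator(), last=True)
doc = nlp("the Intel AI Lab released a new version")
print(doc.user_data["noun_phrases"])  # ['Intel AI', 'AI Lab']
```

With the real library, the same `add_pipe` call registers the trained annotator on a loaded spaCy pipeline; consult the NLP Architect documentation for the exact class and model paths.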
After running the above code, spaCy will run the noun phrase annotator for each given document and save the annotations in the document’s metadata space.
The following example shows how to get the noun phrase annotations using a simple pipeline:
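As a hedged, self-contained sketch of that retrieval step (`load_pipeline` and `get_noun_phrases` are illustrative stand-ins for the library's equivalents, and a plain dict stands in for a spaCy Doc):

```python
# End-to-end sketch: build a tiny pipeline, process a document, and read
# the stored annotations back. Function names are illustrative stand-ins
# for the NLP Architect / spaCy equivalents, not the actual API.

def annotate(doc):
    # Toy annotator: mark consecutive capitalized words as noun phrases.
    words = doc["text"].split()
    doc["noun_phrases"] = [
        " ".join(words[i:i + 2])
        for i in range(len(words) - 1)
        if words[i][0].isupper() and words[i + 1][0].isupper()
    ]
    return doc

def load_pipeline():
    """Assemble the processing steps into a single callable."""
    components = [annotate]
    def run(text):
        doc = {"text": text}
        for component in components:
            doc = component(doc)
        return doc
    return run

def get_noun_phrases(doc):
    """Read the noun phrase annotations from the document's metadata."""
    return doc["noun_phrases"]

nlp = load_pipeline()
doc = nlp("Term Set expansion uses Noun Phrases")
print(get_noun_phrases(doc))  # ['Term Set', 'Noun Phrases']
```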
E2E Term Set Expansion
As we discussed in a previous blog, building a stack of NLP components based on the latest DL technologies allows us to build foundations to tackle many applications. In this version, we demonstrate these advantages with the Term Set Expansion application.
Term set expansion is an algorithm that iteratively expands a seed set of terms into a more complete set of terms belonging to the same semantic class, based on the given corpus. For example, given a partial seed set of personal assistant application terms such as ‘Siri’ and ‘Cortana’, the expanded set is expected to include terms such as ‘Amazon Echo’ and ‘Google Now’. This capability is imperative in many real-world NLP use cases that need to extract terms related to the same semantic category, for example, building and expanding in-domain taxonomies, document similarity, chatbot configuration, information extraction, and text analytics.
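Conceptually, one simplified way such corpus-based expansion can work is to score every candidate term by its similarity to the seed terms in an embedding space learned from the corpus. The sketch below is only an illustration of that idea (the 2-d "embeddings" are made up, and the real NLP Architect algorithm is more sophisticated; see the demo paper for details):

```python
import math

# Conceptual sketch of corpus-based term set expansion: score each
# candidate term by its average cosine similarity to the seed terms
# and return the closest ones. The toy 2-d vectors below are invented
# for illustration; real systems learn term embeddings from the corpus.

EMBEDDINGS = {
    "Siri":        [0.90, 0.10],
    "Cortana":     [0.80, 0.20],
    "Amazon Echo": [0.85, 0.15],
    "Google Now":  [0.80, 0.25],
    "tomato":      [0.10, 0.90],  # unrelated term, should score low
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def expand(seed, k=2):
    """Return the k candidate terms most similar to the seed set."""
    candidates = [t for t in EMBEDDINGS if t not in seed]
    scored = [
        (sum(cosine(EMBEDDINGS[t], EMBEDDINGS[s]) for s in seed) / len(seed), t)
        for t in candidates
    ]
    return [term for _, term in sorted(scored, reverse=True)[:k]]

print(expand({"Siri", "Cortana"}))  # ['Amazon Echo', 'Google Now']
```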
This application enables users to easily select a seed set of terms, expand it, view the expanded set, validate it, re-expand the validated set, and store it, thus simplifying the extraction of domain-specific fine-grained semantic classes.
Here is an example of expansion results as presented by the UI. Seed terms are ‘signal processing’ and ‘computer vision’:
For more information on the algorithm, workflow and real use-case examples please refer to our demo paper. A concept video demo of the Term Set Expansion application is available here (some images were blurred for privacy reasons).
Easy to build and deploy NLP applications for the real world
Term Set Expansion is our first application that presents an end-to-end workflow, employs corpus-based learning, and requires no ML/NLP or coding expertise. We are going to continue our efforts to demonstrate how easy and simple it is to deploy NLP applications in the real world. In future versions, we are planning to provide more reproducible SOTA DL model implementations and add more end-to-end applications like Topic/Trend Extraction, Sentiment Analysis, and Event Extraction. We are also researching unsupervised and semi-supervised methods that introduce interpretable NLU models by utilizing the innate linguistic structure, targeting robust and domain-adaptive NLP applications.
Getting Started with NLP Architect
Developers can start by downloading the code from our GitHub repository and following the instructions for installing NLP Architect. Comprehensive documentation for all the core modules and end-to-end examples can be found here. We look forward to receiving feedback, feature requests, and pull request contributions from all users.
Credits go to our team of NLP researchers and developers at the Intel AI Lab: Peter Izsak, Anna Bethke, Daniel Korat, Amit Yaccobi, Andy Keller, Jonathan Mamou, Shira Guskin, Sharath Nittur Sridhar, Oren Pereg, Alon Eirew, Sapir Tsabari, Yael Green, Chinnikrishna Kothapalli.
Notices and Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
© Intel Corporation.