Taming Text: How to Find, Organize, and Manipulate It

By Grant S. Ingersoll

Summary

Taming Text, winner of the 2013 Jolt Awards for Productivity, is a hands-on, example-driven guide to working with unstructured text in the context of real-world applications. This book explores how to automatically organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. The book guides you through examples illustrating each of these topics, as well as the foundations upon which they are built.

About this Book

There is so much text in our lives, we are practically drowning in it. Fortunately, there are innovative tools and techniques for managing unstructured information that can throw the smart developer a much-needed lifeline. You'll find them in this book.

Taming Text is a practical, example-driven guide to working with text in real applications. This book introduces you to useful techniques like full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. You'll explore real use cases as you systematically absorb the foundations upon which they are built. Written in a clear and concise style, this book avoids jargon, explaining the subject in terms you can understand without a background in statistics or natural language processing. Examples are in Java, but the concepts can be applied in any language.

Written for Java developers, the book requires no prior knowledge of GWT.

Purchase of the print book comes with an offer of a free PDF, ePub, and Kindle eBook from Manning. Also available is all code from the book.

Winner of 2013 Jolt Awards: The Best Books, one of five notable books every serious programmer should read.

What's Inside

  • When to use text-taming techniques
  • Important open-source libraries like Solr and Mahout
  • How to build text-processing applications

About the Authors

Grant Ingersoll is an engineer, speaker, and trainer, a Lucene committer, and a cofounder of the Mahout machine-learning project. Thomas Morton is the primary developer of OpenNLP and Maximum Entropy. Drew Farris is a technology consultant, software developer, and contributor to Mahout, Lucene, and Solr.

"Takes the mystery out of very complex processes." From the Foreword by Liz Liddy, Dean, iSchool, Syracuse University

Table of Contents

  1. Getting started taming text
  2. Foundations of taming text
  3. Searching
  4. Fuzzy string matching
  5. Identifying people, places, and things
  6. Clustering text
  7. Classification, categorization, and tagging
  8. Building an example question answering system
  9. Untamed text: exploring the next frontier


Quick preview of Taming Text: How to Find, Organize, and Manipulate It PDF


The next sections start with basics like string manipulation and then continue on to more advanced items such as full sentence parsing. Generally speaking, the basics will be used every day, whereas more advanced tools like full language parsers may only be used in certain applications.

[Table fragment: suffix -ed, as in "seemed", past tense form; suffix -en, as in "taken", past participle form.]

2.2.1 String manipulation tools

Libraries for working with strings, character arrays, and other text representations form the basis of most text-based programs.

If both classifiers are high-precision, then combining them can help improve recall.

  • Use the output of the OpenNLP model for training data: This is a way to bootstrap the amount of training data available to use. It works best if combined with some human correction. The OpenNLP models are trained on newswire text, so best results will come from applying them to similar text.

Whatever the case for customizing OpenNLP, the contents of the next sections will help you understand how to undertake the process.
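The point about combining two high-precision classifiers can be illustrated with a minimal Java sketch. It assumes each tagger simply returns the entity strings it found in a passage; the class and method names are hypothetical and are not part of OpenNLP. Taking the union keeps every entity that either tagger found, so recall cannot drop, and if both taggers rarely produce false positives, precision should remain high.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper (not an OpenNLP class): merges the entities found by two taggers. */
final class CombinedTaggerSketch {

    private CombinedTaggerSketch() {}

    /**
     * Returns the union of both taggers' outputs, preserving first-seen order.
     * An entity reported by either tagger is kept; duplicates are collapsed.
     * Assumption: entities are compared as plain strings, so conflicting or
     * overlapping spans are not resolved here.
     */
    static Set<String> union(Collection<String> first, Collection<String> second) {
        Set<String> merged = new LinkedHashSet<>(first);
        merged.addAll(second);
        return merged;
    }
}
```

A real system would compare token spans and entity types instead of raw strings, and would need a rule for the cases where the two taggers disagree about where an entity begins or ends.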

For example, in figure 6.1, the headline “Vikings Begin Favre Era on the Road in Cleveland” shows there are 2,181 other similar stories to the main story. Though it’s not clear what clustering algorithms Google is using to implement this feature, Google’s documentation clearly states they’re using clustering (Google News 2011):

Our grouping technology takes into account many factors, such as titles, text, and publication time. We then use various clustering algorithms to identify stories we think are closely related.

Finally, we built applications that leverage these techniques and also leverage Solr as a platform to make building these applications easy. In the next chapter, we’ll move from comparing strings to one another to finding information inside of strings and documents.

4.5 Resources

Aoe, Jun-ichi. 1989. “An efficient digital search algorithm by using a double-array structure.” IEEE Transactions on Software Engineering, 15, no. 9:1066–1077.

Identifying people, places, and things

In this chapter

  • The basic concepts behind named-entity recognition
  • How to use OpenNLP to find named entities
  • OpenNLP performance considerations

People, places, and things (nouns) play a crucial role in language, conveying the sentence’s subject and often its object.

Org/wiki/Parsing.

Winkler, William E., and Thibaudeau, Yves. 1991. “An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 U.S. Decennial Census.” Statistical Research Report Series RR91/09, U.S. Bureau of the Census, Washington, D.C.

Searching

In this chapter

  • Understanding search theory and the basics of the vector space model
  • Setting up Apache Solr
  • Making content searchable
  • Creating queries for Solr
  • Understanding search performance

Search, as a feature of an application or as an end application in itself, needs little introduction.
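As a rough illustration of the vector space model mentioned in the chapter outline, the following Java sketch represents two texts as term-frequency vectors and scores them by cosine similarity. The class and method names are hypothetical, tokenization is a plain whitespace split, and the weights are raw term counts; this is an assumption-laden simplification and not how Solr computes relevance.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the vector space model using raw term counts. */
final class VectorSpaceSketch {

    private VectorSpaceSketch() {}

    /** Builds a term-frequency vector from lowercased, whitespace-separated tokens. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Cosine of the angle between two term vectors; 0.0 if either vector is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            Integer other = b.get(entry.getKey());
            if (other != null) {
                dot += (double) entry.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    /** Euclidean length of a term vector. */
    private static double norm(Map<String, Integer> vector) {
        double sumOfSquares = 0.0;
        for (int count : vector.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }
}
```

Scoring a query against a document then amounts to calling `cosine(termFrequencies(query), termFrequencies(document))`: texts that share no terms score 0, and texts with identical term distributions score close to 1.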

