entity-extractor

from eddiebe147/claude-settings


6 stars · 1 fork · Updated Jan 22, 2026
npx skills add https://github.com/eddiebe147/claude-settings --skill entity-extractor

SKILL.md

Entity Extractor

The Entity Extractor skill guides you through implementing named entity recognition (NER) systems that identify and classify entities in text. From people and organizations to domain-specific entities like products, medical terms, or financial instruments, this skill covers extraction approaches from simple pattern matching to advanced neural models.

Entity extraction is a foundational NLP task that powers applications from search engines to knowledge graphs. Getting it right requires understanding your domain, choosing appropriate techniques, and handling the inherent ambiguity in natural language.

Whether you need to extract standard entity types, define custom entities for your domain, or build relation extraction on top of entity recognition, this skill ensures your extraction pipeline is accurate and maintainable.

Core Workflows

Workflow 1: Choose Extraction Approach

  1. Define target entities:
    • Standard types: PERSON, ORG, LOCATION, DATE, MONEY
    • Domain-specific: PRODUCT, SYMPTOM, GENE, CONTRACT
    • Relations: connections between entities
  2. Assess available resources:
    • Labeled training data
    • Domain expertise
    • Compute constraints
  3. Select approach:
    | Approach | Training Data | Accuracy | Speed | Customization |
    | --- | --- | --- | --- | --- |
    | spaCy (pre-trained) | None | Good | Very fast | Limited |
    | Rule-based | None | Variable | Fast | High |
    | Fine-tuned BERT | 100s-1000s | Excellent | Medium | Full |
    | LLM (zero-shot) | None | Good | Slow | Prompt-based |
    | LLM (few-shot) | Few examples | Very good | Slow | Prompt-based |
  4. Plan implementation and evaluation
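When no labeled data exists and the target entities follow predictable surface patterns, the rule-based row in the table above can start as a handful of regexes. A minimal sketch (the patterns and entity types here are illustrative, not production-grade):

```python
import re

# Illustrative patterns only; real rule sets need far broader coverage.
PATTERNS = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def extract_rule_based(text):
    """Return entities in the same dict format as the spaCy pipeline below."""
    entities = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            entities.append({
                "text": m.group(),
                "type": label,
                "start": m.start(),
                "end": m.end(),
            })
    return sorted(entities, key=lambda e: e["start"])
```

Rule-based extraction shines for closed-vocabulary or strongly formatted entities (dates, amounts, IDs) and degrades quickly for open-ended ones like PERSON or ORG, which is where the pre-trained and fine-tuned rows earn their keep.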

Workflow 2: Implement Entity Extraction Pipeline

  1. Set up extraction:
    import spacy
    
    class EntityExtractor:
        def __init__(self, model="en_core_web_trf"):
            self.nlp = spacy.load(model)
    
        def extract(self, text):
            return self._entities_from_doc(self.nlp(text))
    
        def extract_batch(self, texts):
            # nlp.pipe batches documents internally, which is much
            # faster than calling extract() in a loop
            return [self._entities_from_doc(doc) for doc in self.nlp.pipe(texts)]
    
        def _entities_from_doc(self, doc):
            # Note: spaCy's NER does not expose per-entity confidence
            # scores by default, so none are included here
            return [
                {
                    "text": ent.text,
                    "type": ent.label_,
                    "start": ent.start_char,
                    "end": ent.end_char,
                }
                for ent in doc.ents
            ]
    
  2. Post-process entities:
    • Normalize variations (IBM vs I.B.M.)
    • Resolve abbreviations
    • Link to knowledge base
  3. Validate extraction quality
  4. Handle edge cases
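The normalization in step 2 can start as a simple alias table plus whitespace cleanup. The aliases below are examples; in practice the table is usually derived from a knowledge base rather than hand-maintained:

```python
import re

# Example alias table mapping surface variants to a canonical form.
# In a real system this would come from a knowledge base.
CANONICAL = {
    "ibm": "IBM",
    "i.b.m.": "IBM",
    "international business machines": "IBM",
    "nyt": "The New York Times",
}

def normalize_entity(surface):
    """Map a raw entity mention to its canonical form.

    Unknown mentions are returned cleaned but otherwise unchanged.
    """
    cleaned = re.sub(r"\s+", " ", surface).strip()
    return CANONICAL.get(cleaned.lower(), cleaned)
```

Case-insensitive lookup handles most casing variants; abbreviation resolution and true entity linking (mapping to knowledge-base IDs) build on the same idea but need context to disambiguate, e.g. "Apple" the company vs. the fruit.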

Workflow 3: Build Custom Entity Recognizer

  1. Prepare training data:
    # Format for spaCy training
    TRAIN_DATA = [
        ("Apple released the new iPhone today.", {
            "entities": [(0, 5, "ORG"), (23, 29, "PRODUCT")]
        }),
        ("Dr. Smith prescribed metformin for diabetes.", {
            "entities": [(0, 9, "PERSON"), (21, 30, "DRUG"), (35, 43, "CONDITION")]
        })
    ]
    
  2. Configure training:
    # config.cfg excerpt for a custom NER component.
    # Generate a complete, valid config with:
    #   python -m spacy init config config.cfg --pipeline ner
    [components.ner]
    factory = "ner"
    
    [components.ner.model]
    @architectures = "spacy.TransitionBasedParser.v2"
    state_type = "ner"
    
    [training.optimizer]
    @optimizers = "Adam.v1"
    learn_rate = 0.001
    
  3. Train model:
    python -m spacy train config.cfg --output ./models --paths.train ./train.spacy --paths.dev ./dev.spacy
    
  4. Evaluate on held-out data
  5. Iterate based on errors
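For step 4, the standard NER metric is span-level precision/recall/F1 against gold annotations. This sketch uses exact-match spans (start, end, and label must all agree); partial-credit schemes exist but are less common as a headline number:

```python
def ner_scores(gold, predicted):
    """Exact-match span scoring for NER.

    Both arguments are sets of (start, end, label) tuples
    accumulated over a held-out corpus.
    """
    tp = len(gold & predicted)  # spans that match exactly
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Exact-match scoring is harsh on boundary errors ("Dr. John Smith" vs. "John Smith" counts as both a false positive and a false negative), which is why the error analysis in step 5 should separate boundary mistakes from label mistakes and outright misses.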

Quick Reference

| Action | Command/Trigger |
| --- | --- |
| Extract entities | "Extract entities from [text]" |
| Choose NER model | "Best NER for [domain]" |
| Custom entities | "Train custom entity recognizer" |
| Evaluate NER | "Evaluate entity extraction quality" |
| Handle ambiguity | "Resolve ambiguous entities" |
| Entity linking | "Link entities to knowledge base" |

Best Practices

  • Start with Pre-trained: Don't train from scratch unnecessarily

    • spaCy, Hugging Face, and cloud APIs cover common entities
    • Test pre-trained models first
    • Fine-tune only when needed
  • Define Clear Guidelines: Entity boundaries are ambiguous

    • "Dr. John Smith" - one entity or two?
    • "New York Times" - ORG or GPE?
    • Create and follow consistent annotation guidelines
  • Handle Nested Entities: Some entities contain others

    • "Bank of America headquarters" (ORG inside LOCATION)
    • Decide on nesting strategy upfront
    • Some models support flat only; others handle nested
  • Normalize Extracted Entities: Raw text has variations

    • "IBM", "I.B.M.", "International Business Machines"
    • Canonicalize to standard form
    • Link to knowledge base IDs when possible
...
