entity-extractor

from eddiebe147/claude-settings


6 stars · 1 fork · Updated Jan 22, 2026
npx skills add https://github.com/eddiebe147/claude-settings --skill entity-extractor

SKILL.md

Entity Extractor

The Entity Extractor skill guides you through implementing named entity recognition (NER) systems that identify and classify entities in text. From people and organizations to domain-specific entities like products, medical terms, or financial instruments, this skill covers extraction approaches from simple pattern matching to advanced neural models.

Entity extraction is a foundational NLP task that powers applications from search engines to knowledge graphs. Getting it right requires understanding your domain, choosing appropriate techniques, and handling the inherent ambiguity in natural language.

Whether you need to extract standard entity types, define custom entities for your domain, or build relation extraction on top of entity recognition, this skill ensures your extraction pipeline is accurate and maintainable.

Core Workflows

Workflow 1: Choose Extraction Approach

  1. Define target entities:
    • Standard types: PERSON, ORG, LOCATION, DATE, MONEY
    • Domain-specific: PRODUCT, SYMPTOM, GENE, CONTRACT
    • Relations: connections between entities
  2. Assess available resources:
    • Labeled training data
    • Domain expertise
    • Compute constraints
  3. Select approach:
    | Approach | Training Data | Accuracy | Speed | Customization |
    | --- | --- | --- | --- | --- |
    | spaCy (pre-trained) | None | Good | Very fast | Limited |
    | Rule-based | None | Variable | Fast | High |
    | Fine-tuned BERT | 100s-1000s | Excellent | Medium | Full |
    | LLM (zero-shot) | None | Good | Slow | Prompt-based |
    | LLM (few-shot) | Few examples | Very good | Slow | Prompt-based |
  4. Plan implementation and evaluation
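When no labeled data exists and the target entities follow predictable surface patterns, the rule-based row in the table above can start as a handful of regexes. A minimal sketch (the patterns and entity types here are illustrative, not production-grade):

```python
import re

# Illustrative patterns only; real rule sets need far broader coverage.
PATTERNS = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def extract_rule_based(text):
    """Return entities in the same dict format as the spaCy pipeline below."""
    entities = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            entities.append({
                "text": m.group(),
                "type": label,
                "start": m.start(),
                "end": m.end(),
            })
    return sorted(entities, key=lambda e: e["start"])
```

Rule-based extraction shines for closed-vocabulary or strongly formatted entities (dates, amounts, IDs) and degrades quickly for open-ended ones like PERSON or ORG, which is where the pre-trained and fine-tuned rows earn their keep.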

Workflow 2: Implement Entity Extraction Pipeline

  1. Set up extraction:
    import spacy
    
    class EntityExtractor:
        def __init__(self, model="en_core_web_trf"):
            self.nlp = spacy.load(model)
    
        def extract(self, text):
            return self._entities_from_doc(self.nlp(text))
    
        def extract_batch(self, texts):
            # nlp.pipe batches documents internally, which is much
            # faster than calling extract() in a loop
            return [self._entities_from_doc(doc) for doc in self.nlp.pipe(texts)]
    
        def _entities_from_doc(self, doc):
            # Note: spaCy's NER does not expose per-entity confidence
            # scores by default, so none are included here
            return [
                {
                    "text": ent.text,
                    "type": ent.label_,
                    "start": ent.start_char,
                    "end": ent.end_char,
                }
                for ent in doc.ents
            ]
    
  2. Post-process entities:
    • Normalize variations (IBM vs I.B.M.)
    • Resolve abbreviations
    • Link to knowledge base
  3. Validate extraction quality
  4. Handle edge cases
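The normalization in step 2 can start as a simple alias table plus whitespace cleanup. The aliases below are examples; in practice the table is usually derived from a knowledge base rather than hand-maintained:

```python
import re

# Example alias table mapping surface variants to a canonical form.
# In a real system this would come from a knowledge base.
CANONICAL = {
    "ibm": "IBM",
    "i.b.m.": "IBM",
    "international business machines": "IBM",
    "nyt": "The New York Times",
}

def normalize_entity(surface):
    """Map a raw entity mention to its canonical form.

    Unknown mentions are returned cleaned but otherwise unchanged.
    """
    cleaned = re.sub(r"\s+", " ", surface).strip()
    return CANONICAL.get(cleaned.lower(), cleaned)
```

Case-insensitive lookup handles most casing variants; abbreviation resolution and true entity linking (mapping to knowledge-base IDs) build on the same idea but need context to disambiguate, e.g. "Apple" the company vs. the fruit.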

Workflow 3: Build Custom Entity Recognizer

  1. Prepare training data:
    # Format for spaCy training
    TRAIN_DATA = [
        ("Apple released the new iPhone today.", {
            "entities": [(0, 5, "ORG"), (23, 29, "PRODUCT")]
        }),
        ("Dr. Smith prescribed metformin for diabetes.", {
            "entities": [(0, 9, "PERSON"), (21, 30, "DRUG"), (35, 43, "CONDITION")]
        })
    ]
    
  2. Configure training:
    # config.cfg excerpt for a custom NER component.
    # Generate a complete, valid config with:
    #   python -m spacy init config config.cfg --pipeline ner
    [components.ner]
    factory = "ner"
    
    [components.ner.model]
    @architectures = "spacy.TransitionBasedParser.v2"
    state_type = "ner"
    
    [training.optimizer]
    @optimizers = "Adam.v1"
    learn_rate = 0.001
    
  3. Train model:
    python -m spacy train config.cfg --output ./models --paths.train ./train.spacy --paths.dev ./dev.spacy
    
  4. Evaluate on held-out data
  5. Iterate based on errors
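For step 4, the standard NER metric is span-level precision/recall/F1 against gold annotations. This sketch uses exact-match spans (start, end, and label must all agree); partial-credit schemes exist but are less common as a headline number:

```python
def ner_scores(gold, predicted):
    """Exact-match span scoring for NER.

    Both arguments are sets of (start, end, label) tuples
    accumulated over a held-out corpus.
    """
    tp = len(gold & predicted)  # spans that match exactly
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Exact-match scoring is harsh on boundary errors ("Dr. John Smith" vs. "John Smith" counts as both a false positive and a false negative), which is why the error analysis in step 5 should separate boundary mistakes from label mistakes and outright misses.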

Quick Reference

| Action | Command/Trigger |
| --- | --- |
| Extract entities | "Extract entities from [text]" |
| Choose NER model | "Best NER for [domain]" |
| Custom entities | "Train custom entity recognizer" |
| Evaluate NER | "Evaluate entity extraction quality" |
| Handle ambiguity | "Resolve ambiguous entities" |
| Entity linking | "Link entities to knowledge base" |

Best Practices

  • Start with Pre-trained: Don't train from scratch unnecessarily

    • spaCy, Hugging Face, and cloud APIs cover common entities
    • Test pre-trained models first
    • Fine-tune only when needed
  • Define Clear Guidelines: Entity boundaries are ambiguous

    • "Dr. John Smith" - one entity or two?
    • "New York Times" - ORG or GPE?
    • Create and follow consistent annotation guidelines
  • Handle Nested Entities: Some entities contain others

    • "Bank of America headquarters" (ORG inside LOCATION)
    • Decide on nesting strategy upfront
    • Some models support flat only; others handle nested
  • Normalize Extracted Entities: Raw text has variations

    • "IBM", "I.B.M.", "International Business Machines"
    • Canonicalize to standard form
    • Link to knowledge base IDs when possible
...
