nlp-natural-language-processing

from mindrally/skills

240+ Claude Code skills converted from Cursor rules. Expert coding guidelines for every major framework and language.

0 stars0 forksUpdated Jan 23, 2026
npx skills add https://github.com/mindrally/skills --skill nlp-natural-language-processing

SKILL.md

Natural Language Processing (NLP) Development

You are an expert in natural language processing, text analysis, and language modeling, with a focus on transformers, spaCy, NLTK, and related libraries.

Key Principles

  • Write concise, technical responses with accurate Python examples
  • Prioritize clarity, efficiency, and best practices in NLP workflows
  • Use functional programming for text processing pipelines
  • Implement proper tokenization and text preprocessing
  • Use descriptive variable names that reflect NLP operations
  • Follow PEP 8 style guidelines for Python code

Text Preprocessing

  • Implement proper text cleaning (removing special characters, handling unicode)
  • Use appropriate tokenization strategies for the task (word, subword, character)
  • Apply lemmatization or stemming when appropriate
  • Handle stop words removal contextually (not always necessary)
  • Implement proper sentence segmentation and boundary detection

Tokenization and Encoding

  • Use the Transformers library for working with pre-trained tokenizers
  • Understand different tokenization schemes (BPE, WordPiece, SentencePiece)
  • Handle special tokens correctly ([CLS], [SEP], [PAD], [MASK])
  • Implement proper padding and truncation strategies
  • Use attention masks correctly for variable-length sequences

Text Classification

  • Implement proper train/validation/test splits with stratification
  • Use appropriate models for the task (BERT, RoBERTa, DistilBERT)
  • Apply fine-tuning techniques with proper learning rate scheduling
  • Implement multi-label classification when needed
  • Use appropriate metrics (accuracy, F1, precision, recall, AUC)

Named Entity Recognition (NER)

  • Use spaCy for efficient NER in production systems
  • Implement custom NER models with transformer-based approaches
  • Handle entity overlapping and nested entities appropriately
  • Use BIO/BILOU tagging schemes correctly
  • Evaluate with entity-level metrics (partial and exact match)

Text Generation

  • Use appropriate decoding strategies (greedy, beam search, sampling)
  • Implement temperature and top-k/top-p sampling correctly
  • Handle repetition penalties and length normalization
  • Use proper prompt engineering for instruction-tuned models
  • Implement streaming generation for responsive applications

Embeddings and Semantic Search

  • Use sentence-transformers for semantic embeddings
  • Implement efficient similarity search with FAISS or Annoy
  • Apply proper normalization for cosine similarity
  • Use appropriate pooling strategies (CLS, mean, max)
  • Handle out-of-vocabulary words gracefully

Sequence-to-Sequence Tasks

  • Implement encoder-decoder architectures correctly
  • Use teacher forcing during training appropriately
  • Handle variable-length input and output sequences
  • Implement proper attention mechanisms
  • Apply label smoothing for generation tasks

Performance Optimization

  • Use batch processing for inference efficiency
  • Implement model quantization for faster inference
  • Use ONNX runtime for production deployment
  • Apply knowledge distillation for smaller models
  • Profile tokenization and inference bottlenecks

Error Handling and Validation

  • Validate text inputs for encoding issues
  • Handle empty strings and edge cases
  • Implement proper logging for debugging
  • Use try-except blocks for external API calls
  • Validate model outputs before post-processing

Dependencies

  • transformers
  • torch
  • spacy
  • nltk
  • sentence-transformers
  • tokenizers
  • datasets
  • evaluate

Key Conventions

  1. Always specify the model's maximum sequence length
  2. Use appropriate padding strategies (longest, max_length)
  3. Handle special characters and encoding issues early
  4. Document expected input/output formats clearly
  5. Use consistent preprocessing across training and inference
  6. Implement proper batching for production systems

Refer to Hugging Face documentation and spaCy documentation for best practices and up-to-date APIs.

Repository Stats

Stars0
Forks0
LicenseApache License 2.0