dataset-comparer
from dkyazzentwatwa/chatgpt-skills
My comprehensive, tested, and audited library of skills for ChatGPT.
7 stars · 0 forks · Updated Dec 17, 2025
npx skills add https://github.com/dkyazzentwatwa/chatgpt-skills --skill dataset-comparer
SKILL.md
Dataset Comparer
Compare two CSV/Excel datasets to identify differences, additions, deletions, and value changes.
Features
- Row Comparison: Find added, removed, and matching rows
- Value Changes: Detect changed values in matching rows
- Column Comparison: Identify schema differences
- Statistics: Summary of differences
- Diff Reports: HTML, CSV, and JSON output
- Flexible Matching: Compare by key columns or row position
Quick Start
from dataset_comparer import DatasetComparer
comparer = DatasetComparer()
comparer.load("old_data.csv", "new_data.csv")
# Compare by key column
diff = comparer.compare(key_columns=["id"])
# Counts are nested under "summary" (see Output Format)
print(f"Added rows: {diff['summary']['added_count']}")
print(f"Removed rows: {diff['summary']['removed_count']}")
print(f"Changed rows: {diff['summary']['changed_count']}")
# Generate report
comparer.generate_report("diff_report.html")
CLI Usage
# Basic comparison
python dataset_comparer.py --old old.csv --new new.csv
# Compare by key column
python dataset_comparer.py --old old.csv --new new.csv --key id
# Multiple key columns
python dataset_comparer.py --old old.csv --new new.csv --key id,date
# Generate HTML report
python dataset_comparer.py --old old.csv --new new.csv --key id --report diff.html
# Export differences to CSV
python dataset_comparer.py --old old.csv --new new.csv --key id --output diff.csv
# JSON output
python dataset_comparer.py --old old.csv --new new.csv --key id --json
# Ignore specific columns
python dataset_comparer.py --old old.csv --new new.csv --key id --ignore updated_at,modified_date
# Compare only specific columns
python dataset_comparer.py --old old.csv --new new.csv --key id --columns name,email,status
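The comma-separated flags above line up naturally with the list arguments of compare(). The following is a minimal sketch of how such flags could be parsed with argparse; it is not the skill's actual CLI code, and `split_csv` and `build_parser` are hypothetical names introduced for illustration.

```python
import argparse


def split_csv(value):
    """Turn 'id,date' into ['id', 'date'] (hypothetical helper)."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_parser():
    """Build a parser covering a subset of the flags shown above (assumed shape)."""
    parser = argparse.ArgumentParser(description="Compare two datasets")
    parser.add_argument("--old", required=True, help="path to the old dataset")
    parser.add_argument("--new", required=True, help="path to the new dataset")
    parser.add_argument("--key", type=split_csv, default=None)
    parser.add_argument("--ignore", type=split_csv, default=None)
    parser.add_argument("--columns", type=split_csv, default=None)
    return parser


args = build_parser().parse_args(
    ["--old", "old.csv", "--new", "new.csv", "--key", "id,date", "--ignore", "updated_at"]
)

# Keyword arguments in the shape compare() expects per the API reference
compare_kwargs = {
    "key_columns": args.key,
    "ignore_columns": args.ignore,
    "compare_columns": args.columns,
}
print(compare_kwargs)
```

Leaving `--key` unset yields `key_columns=None`, which matches the positional comparison default described under Comparison Methods.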
API Reference
DatasetComparer Class
class DatasetComparer:
    def __init__(self)
    # Data loading
    def load(self, old_path: str, new_path: str) -> 'DatasetComparer'
    def load_dataframes(self, old_df: pd.DataFrame,
                        new_df: pd.DataFrame) -> 'DatasetComparer'
    # Comparison
    def compare(self, key_columns: list = None,
                ignore_columns: list = None,
                compare_columns: list = None) -> dict
    # Detailed results
    def get_added_rows(self) -> pd.DataFrame
    def get_removed_rows(self) -> pd.DataFrame
    def get_changed_rows(self) -> pd.DataFrame
    def get_unchanged_rows(self) -> pd.DataFrame
    # Schema comparison
    def compare_schema(self) -> dict
    # Reports
    def generate_report(self, output: str, format: str = "html") -> str
    def to_dataframe(self) -> pd.DataFrame
    def summary(self) -> str
Comparison Methods
Key-Based Comparison
Compare rows by matching key columns (like primary keys):
diff = comparer.compare(key_columns=["customer_id"])
# Multiple keys for composite matching
diff = comparer.compare(key_columns=["order_id", "product_id"])
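To illustrate what key-based matching means, here is a minimal pandas sketch that classifies rows by a composite key using an outer merge. It is not the DatasetComparer implementation; `key_based_diff` is a hypothetical helper, and the sketch assumes key values are unique within each dataset.

```python
import pandas as pd


def key_based_diff(old_df, new_df, key_columns):
    """Split rows into added / removed / matched by key (hypothetical helper)."""
    merged = old_df.merge(
        new_df,
        on=key_columns,
        how="outer",
        suffixes=("_old", "_new"),
        indicator=True,  # adds a "_merge" column: left_only / right_only / both
    )
    added = merged[merged["_merge"] == "right_only"]
    removed = merged[merged["_merge"] == "left_only"]
    matched = merged[merged["_merge"] == "both"]
    return added, removed, matched


old = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [10, 11, 10], "qty": [1, 2, 3]})
new = pd.DataFrame({"order_id": [1, 2, 3], "product_id": [10, 10, 10], "qty": [1, 5, 7]})

added, removed, matched = key_based_diff(old, new, ["order_id", "product_id"])

# Among matched rows, a value change shows up as qty_old != qty_new
changed = matched[matched["qty_old"] != matched["qty_new"]]
print(len(added), len(removed), len(matched), len(changed))
```

Duplicate keys would multiply rows in the merge, so a real tool would likely validate key uniqueness first.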
Position-Based Comparison
Compare rows by their position (row number):
diff = comparer.compare() # No keys = positional comparison
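Positional comparison can be sketched as follows: rows are paired by index, and any surplus rows in one dataset count as added or removed. This is an assumption about reasonable behavior, not the skill's confirmed logic; `positional_diff` is a hypothetical helper.

```python
import pandas as pd


def positional_diff(old_df, new_df):
    """Compare rows by position over the shared columns (hypothetical helper)."""
    common = min(len(old_df), len(new_df))
    cols = [c for c in old_df.columns if c in new_df.columns]
    a = old_df.iloc[:common][cols].reset_index(drop=True)
    b = new_df.iloc[:common][cols].reset_index(drop=True)
    # A cell differs unless the values are equal or both are missing
    differs = ~((a == b) | (a.isna() & b.isna()))
    return {
        "changed_count": int(differs.any(axis=1).sum()),
        "added_count": max(len(new_df) - common, 0),
        "removed_count": max(len(old_df) - common, 0),
    }


old = pd.DataFrame({"name": ["a", "b", "c"], "score": [1, 2, 3]})
new = pd.DataFrame({"name": ["a", "x", "c", "d"], "score": [1, 2, 3, 4]})

result = positional_diff(old, new)
print(result)
```

Positional matching is sensitive to row order: inserting one row at the top of a file makes every later row appear changed, which is why key columns are usually preferable when a stable identifier exists.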
Output Format
Comparison Result
{
  "summary": {
    "old_rows": 1000,
    "new_rows": 1050,
    "added_count": 75,
    "removed_count": 25,
    "changed_count": 50,
    "unchanged_count": 925,
    "total_differences": 150
  },
  "schema_changes": {
    "added_columns": ["new_field"],
    "removed_columns": ["old_field"],
    "type_changes": [
      {"column": "amount", "old_type": "int64", "new_type": "float64"}
    ]
  },
  "key_columns": ["id"],
  "compared_columns": ["name", "email", "status"],
  "ignored_columns": ["updated_at"]
}
Changed Row Details
changes = comparer.get_changed_rows()
# Returns DataFrame with columns:
# _key: Key value(s) for the row
# _column: Column that changed
# _old_value: Original value
# _new_value: New value
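The long format above (one row per changed cell) can be produced along these lines. This is a sketch under assumptions, not the skill's code: `changed_cells` is a hypothetical helper, it handles a single key column only, and it assumes key values are unique.

```python
import pandas as pd


def changed_cells(old_df, new_df, key):
    """Return one record per changed cell for rows present in both datasets."""
    old_i = old_df.set_index(key)
    new_i = new_df.set_index(key)
    shared_keys = old_i.index.intersection(new_i.index)
    shared_cols = [c for c in old_i.columns if c in new_i.columns]
    records = []
    for k in shared_keys:
        for col in shared_cols:
            old_val, new_val = old_i.at[k, col], new_i.at[k, col]
            if pd.isna(old_val) and pd.isna(new_val):
                continue  # treat two missing values as unchanged
            if old_val != new_val:
                records.append(
                    {"_key": k, "_column": col, "_old_value": old_val, "_new_value": new_val}
                )
    return pd.DataFrame(records, columns=["_key", "_column", "_old_value", "_new_value"])


old = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"],
                    "status": ["active", "active", "inactive"]})
new = pd.DataFrame({"id": [1, 2, 4], "name": ["Ann", "Robert", "Dee"],
                    "status": ["active", "inactive", "new"]})

changes = changed_cells(old, new, "id")
print(changes)
```

The cell-by-cell loop is easy to read but slow on large data; a vectorized comparison would be the usual choice beyond small inputs.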
Schema Comparison
Compare column structure:
schema = comparer.compare_schema()
# Returns:
{
  "old_columns": ["id", "name", "old_field"],
  "new_columns": ["id", "name", "new_field"],
  "common_columns": ["id", "name"],
  "added_columns": ["new_field"],
  "removed_columns": ["old_field"],
  "type_changes": [
    {"column": "price", "old_type": "int64", "new_type": "float64"}
  ],
  "old_row_count": 1000,
  "new_row_count": 1050
}
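A result of this shape can be derived from the column lists and dtypes of two DataFrames. The sketch below shows one way to do so; `schema_diff` is a hypothetical helper, not the body of compare_schema().

```python
import pandas as pd


def schema_diff(old_df, new_df):
    """Compare column names and dtypes of two DataFrames (hypothetical helper)."""
    old_cols, new_cols = list(old_df.columns), list(new_df.columns)
    common = [c for c in old_cols if c in new_cols]
    type_changes = [
        {"column": c, "old_type": str(old_df[c].dtype), "new_type": str(new_df[c].dtype)}
        for c in common
        if old_df[c].dtype != new_df[c].dtype
    ]
    return {
        "old_columns": old_cols,
        "new_columns": new_cols,
        "common_columns": common,
        "added_columns": [c for c in new_cols if c not in old_cols],
        "removed_columns": [c for c in old_cols if c not in new_cols],
        "type_changes": type_changes,
        "old_row_count": len(old_df),
        "new_row_count": len(new_df),
    }


old = pd.DataFrame({"id": [1, 2], "price": [10, 20], "old_field": ["a", "b"]})
new = pd.DataFrame({"id": [1, 2, 3], "price": [10.5, 20.0, 30.0], "new_field": ["x", "y", "z"]})

schema = schema_diff(old, new)
print(schema["added_columns"], schema["removed_columns"], schema["type_changes"])
```

Note that an integer column read from CSV becomes float64 as soon as it contains a missing value, so a reported type change does not always mean the source schema changed.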
Filtering Options
Ignore Columns
Skip certain columns during comparison:
diff = comparer.compare(
    key_columns=["id"],
    ignore_columns=["updated_at", "modified_by", "timestamp"]
)
Compare Specific Columns
Only compare selected columns:
diff = comparer.compare(
    key_columns=["id"],
    compare_columns=["name", "email", "status"]  # Only these columns
)
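The two filters can be thought of as narrowing the set of columns that get compared. The sketch below shows one plausible resolution order; the precedence when both options are passed (ignore_columns applied on top of compare_columns) is an assumption, and `resolve_columns` is a hypothetical helper rather than part of the documented API.

```python
def resolve_columns(all_columns, key_columns=None, ignore_columns=None, compare_columns=None):
    """Decide which columns to compare (hypothetical helper, assumed precedence)."""
    keys = set(key_columns or [])
    ignored = set(ignore_columns or [])
    # Start from the explicit list if given, otherwise from every column
    candidates = compare_columns if compare_columns is not None else all_columns
    return [
        c for c in candidates
        if c in all_columns and c not in keys and c not in ignored
    ]


columns = ["id", "name", "email", "status", "updated_at"]

ignoring = resolve_columns(columns, key_columns=["id"], ignore_columns=["updated_at"])
selecting = resolve_columns(columns, key_columns=["id"], compare_columns=["name", "status"])
print(ignoring)
print(selecting)
```

Key columns are excluded from the compared set because matched rows have equal keys by construction.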
Report Formats
HTML Report
comparer.generate_report("diff_report.html", format="html")
Features:
- Summary statistics
- Interactive tables
- Color-coded changes (gree
...