📦 data-anonymizer

v1.0.0

Anonymize sensitive data in databases, files, and APIs for testing and compliance. Detect PII (names, emails, SSNs, addresses, phone numbers), apply anonymiz...

Runtime dependencies

No special dependencies

Install command

Official: npx clawhub@latest install data-anonymizer
Mirror: npx clawhub@latest install data-anonymizer --registry https://cn.longxiaskill.com

Skill documentation

Data Anonymizer

Anonymize production data for safe use in testing, development, and analysis. Detect PII automatically, apply appropriate anonymization strategies (masking, hashing, synthetic replacement, generalization), and generate realistic fake data that preserves data relationships and statistical properties.

Use when: "anonymize data", "mask PII", "create test data from production", "GDPR compliance", "data masking", "remove personal data", "sanitize database", "fake data generation", or when preparing production data for non-production use.

Commands

  • detect: Find PII in Data Sources
Step 1: Scan for PII Patterns
# Scan files for common PII patterns (ripgrep skips binary files by default)
rg -n '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b' 2>/dev/null | head -20
echo "--- Emails found above ---"

rg -n '\b\d{3}[-.]?\d{2}[-.]?\d{4}\b' 2>/dev/null | head -20
echo "--- SSN-like patterns above ---"

rg -n '\b\d{3}[-.]?\d{3}[-.]?\d{4}\b' 2>/dev/null | head -20
echo "--- Phone numbers above ---"

rg -n '\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b' 2>/dev/null | head -20
echo "--- Credit card-like patterns above ---"
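The same scan can be sketched in Python for cases where the text is already in memory, for example rows read from a database. This is a minimal sketch, not part of the skill itself: the names `PII_PATTERNS`, `luhn_valid`, and `scan_text` are hypothetical, and the Luhn checksum filter is an added assumption to cut down on false positives among 16-digit matches.

```python
import re

# Patterns adapted from the ripgrep expressions above (hypothetical names).
PII_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "ssn": re.compile(r"\b\d{3}[-.]?\d{2}[-.]?\d{4}\b"),
    "card": re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"),
}


def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return bool(digits) and total % 10 == 0


def scan_text(text):
    """Return (line_number, kind, value) for each PII-like match in `text`."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in PII_PATTERNS.items():
            for match in pattern.finditer(line):
                value = match.group()
                # Assumption: drop 16-digit runs that fail the card checksum.
                if kind == "card" and not luhn_valid(value):
                    continue
                findings.append((lineno, kind, value))
    return findings


sample = (
    "contact: jane.doe@example.org\n"
    "ssn 123-45-6789\n"
    "card 4111 1111 1111 1111\n"
    "order 1234 5678 9012 3456"
)
for finding in scan_text(sample):
    print(finding)
```

The last sample line looks like a card number but fails the checksum, so it is not reported. Regex matches are only candidates; the SSN and phone patterns in particular will also match ordinary numeric IDs.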

Step 2: Scan Database Schema
# Find columns likely containing PII (by name pattern)
python3 -c "
pii_column_patterns = [
    'email', 'phone', 'address', 'street', 'city', 'zip', 'postal',
    'ssn', 'social_security', 'tax_id', 'national_id',
    'first_name', 'last_name', 'full_name', 'name',
    'birth', 'dob', 'date_of_birth', 'age',
    'credit_card', 'card_number', 'cvv', 'expiry',
    'ip_address', 'ip', 'user_agent',
    'password', 'secret', 'token', 'api_key',
    'latitude', 'longitude', 'lat', 'lng', 'geo',
    'photo', 'avatar', 'image_url',
    'salary', 'income', 'bank_account', 'iban', 'routing',
]

# Print the patterns, then a ripgrep command to run against SQL dumps or migration files
for pattern in pii_column_patterns:
    print(f'  - {pattern}')
print('\\nUse these patterns to grep your database schema:')
print('rg -i \"(' + '|'.join(pii_column_patterns[:5]) + ')\" migrations/ schema.sql')
"

Step 3: Classify Sensitivity
Level | Data Types | Strategy
Critical | SSN, credit card, passwords, API keys | Delete or hash (irreversible)
High | Email, phone, full name, address | Synthetic replacement
Medium | Date of birth, IP address, location | Generalization (year only, /24 subnet)
Low | Age range, city, job title | Keep or slight perturbation
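One way to apply this classification to a schema is a keyword lookup on column names. The sketch below is illustrative only: `KEYWORD_LEVEL`, `STRATEGY`, and `classify_column` are hypothetical names, and the keyword list is an assumption drawn from the sensitivity levels above rather than anything the skill ships.

```python
import re

# Hypothetical keyword-to-level mapping based on the sensitivity levels above.
KEYWORD_LEVEL = {
    "ssn": "critical", "credit_card": "critical", "card_number": "critical",
    "password": "critical", "api_key": "critical",
    "email": "high", "phone": "high", "full_name": "high", "address": "high",
    "date_of_birth": "medium", "dob": "medium", "ip_address": "medium",
    "ip": "medium", "latitude": "medium", "longitude": "medium",
    "age": "low", "city": "low", "job_title": "low",
}

STRATEGY = {
    "critical": "delete or hash (irreversible)",
    "high": "synthetic replacement",
    "medium": "generalization",
    "low": "keep or slight perturbation",
}


def classify_column(column_name):
    """Return (level, strategy) for a column, matching whole name segments."""
    name = column_name.lower()
    # Longest keywords first, so "ip_address" wins over "address".
    for keyword in sorted(KEYWORD_LEVEL, key=len, reverse=True):
        # Underscore-delimited match: "ip" must not match inside "zip_code".
        if re.search(rf"(^|_){re.escape(keyword)}(_|$)", name):
            level = KEYWORD_LEVEL[keyword]
            return level, STRATEGY[level]
    return "unclassified", "review manually"


for column in ["user_email", "ip_address", "SSN", "zip_code", "image_url"]:
    print(column, classify_column(column))
```

Matching on underscore-delimited segments avoids the usual substring traps, such as `ip` inside `zip_code` or `age` inside `image_url`. Columns that match nothing are flagged for manual review rather than assumed safe.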

  • anonymize: Apply Anonymization
Strategy 1: Synthetic Replacement (recommended for test data)
# Generate realistic fake data preserving format and relationships
import hashlib

from faker import Faker

def _stable_seed(value):
    """Seed derived from SHA-256; built-in hash() is salted per process, so it is not repeatable"""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:4], "big")

def anonymize_email(email):
    """Consistent fake email: same input always produces same output"""
    h = hashlib.sha256(email.encode()).hexdigest()[:8]
    domain = email.split('@')[1] if '@' in email else 'example.com'
    return f"user_{h}@test-{domain}"

def anonymize_name(name):
    """Replace with consistent fake name"""
    fake = Faker()
    fake.seed_instance(_stable_seed(name))
    return fake.name()

def anonymize_phone(phone):
    """Keep format, replace digits"""
    h = hashlib.sha256(phone.encode()).hexdigest()
    digits = str(int(h, 16))  # decimal digits derived from the hash
    result = ''
    d = 0
    for c in phone:
        if c.isdigit():
            result += digits[d % len(digits)]
            d += 1
        else:
            result += c
    return result

def anonymize_address(address):
    """Replace with consistent fake address (Faker's default locale, not the original region)"""
    fake = Faker()
    fake.seed_instance(_stable_seed(address))
    return fake.address()
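The reason for deterministic replacement is that the same real value maps to the same fake value in every table, so joins keep working. The sketch below illustrates this with a simplified email pseudonymizer. `pseudonymize_email`, `totals_by_email`, and the sample rows are hypothetical, and the fixed `example.com` domain is a simplification.

```python
import hashlib


def pseudonymize_email(email):
    """Deterministic replacement: the same input always yields the same output."""
    digest = hashlib.sha256(email.lower().encode()).hexdigest()[:8]
    return f"user_{digest}@example.com"


# Hypothetical rows from two tables that join on the email column.
users = [
    {"email": "alice@corp.example", "plan": "pro"},
    {"email": "bob@corp.example", "plan": "free"},
]
orders = [
    {"email": "alice@corp.example", "total": 40},
    {"email": "alice@corp.example", "total": 15},
    {"email": "bob@corp.example", "total": 7},
]

anon_users = [{**u, "email": pseudonymize_email(u["email"])} for u in users]
anon_orders = [{**o, "email": pseudonymize_email(o["email"])} for o in orders]


def totals_by_email(user_rows, order_rows):
    """Join orders to users on email and return the sorted per-user totals."""
    totals = {u["email"]: 0 for u in user_rows}
    for order in order_rows:
        totals[order["email"]] += order["total"]
    return sorted(totals.values())


# The join produces the same aggregates before and after anonymization.
print(totals_by_email(users, orders))
print(totals_by_email(anon_users, anon_orders))
```

An unsalted hash of an email address can be reversed by hashing a list of candidate addresses. If that matters for the data set, a keyed hash (for example `hmac` with a secret kept outside the test environment) is a common hardening step.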

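Strategy 2: Masking (quick, for logs/exports)
def mask_email(email):
    # Keep the first two characters of the local part and the domain
    parts = email.split('@')
    return f"{parts[0][:2]}***@{parts[1]}" if '@' in email else '***'

def mask_phone(phone):
    # Keep the first three and last two characters
    return phone[:3] + '****' + phone[-2:]

def mask_ssn(ssn):
    # Keep only the last four digits
    return '***-**-' + ssn[-4:]

def mask_card(card):
    # Keep only the last four digits
    return '****-****-****-' + card[-4:]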

Strategy 3: SQL-Level Anonymization
-- PostgreSQL anonymization script
UPDATE users SET
  email = 'user_' || md5(email) || '@example.com',
  first_name = 'User',
  last_name = 'Test_' || substring(md5(last_name) from 1 for 6),
  phone = '+1' || lpad(abs(hashtext(phone))::text, 10, '0'),
  address_line1 = floor(random() * 9999)::text || ' Test Street',
  city = 'Testville',
  zip_code = lpad(abs(hashtext(zip_code))::text, 5, '0'),
  date_of_birth = date_of_birth - (random() * 365)::int * interval '1 day',
  ssn = NULL
WHERE true;

-- Verify no real data remains
SELECT email FROM users WHERE email NOT LIKE '%@example.com' LIMIT 5;
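The sensitivity classification assigns date of birth, IP address, and location to generalization (year only, /24 subnet), which the three strategies above do not show in code. The following is a minimal sketch of that idea using only the standard library. The function names are hypothetical, and the /48 prefix for IPv6 and the one-decimal coordinate rounding are assumptions, not values from the skill.

```python
import ipaddress
from datetime import date


def generalize_birth_date(dob):
    """Keep only the year of a date of birth."""
    return dob.year


def generalize_ip(ip):
    """Truncate an address to its enclosing network (/24 for IPv4)."""
    addr = ipaddress.ip_address(ip)
    # Assumption: /48 for IPv6; only the IPv4 /24 comes from the document.
    prefix = 24 if addr.version == 4 else 48
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network)


def generalize_location(lat, lng, decimals=1):
    """Round coordinates; one decimal place is roughly 11 km of latitude."""
    return round(lat, decimals), round(lng, decimals)


print(generalize_birth_date(date(1988, 7, 14)))
print(generalize_ip("203.0.113.77"))
print(generalize_location(48.8584, 2.2945))
```

Generalization keeps coarse analytical value (cohort by birth year, traffic by subnet) while dropping the detail that identifies an individual. How coarse is enough depends on the data set; a rare birth year in a small table can still single someone out.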

  • verify: Verify Anonymization

After anonymization, verify:

- No real email addresses remain (check against known patterns)
- No real phone numbers (valid format but not real numbers)
- Statistical properties preserved (age distribution, geographic spread)
- Referential integrity maintained (FK relationships intact)
- Uniqueness constraints respected (no dupli
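Some of these checks can be automated. The sketch below covers three of them (leaked emails, uniqueness, and referential integrity) over in-memory rows. `verify_anonymization` and the row layout (`id`, `email`, `user_id`) are hypothetical, and the statistical comparison is left out.

```python
def verify_anonymization(original_users, anon_users, anon_orders):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []

    # 1. No original email address survives in the anonymized data.
    original_emails = {u["email"].lower() for u in original_users}
    leaked = [u for u in anon_users if u["email"].lower() in original_emails]
    if leaked:
        problems.append(f"{len(leaked)} original email(s) still present")

    # 2. Uniqueness: anonymization must not collapse two users into one email.
    emails = [u["email"] for u in anon_users]
    if len(set(emails)) != len(emails):
        problems.append("duplicate emails after anonymization")

    # 3. Referential integrity: every order still points at an existing user.
    user_ids = {u["id"] for u in anon_users}
    orphans = [o for o in anon_orders if o["user_id"] not in user_ids]
    if orphans:
        problems.append(f"{len(orphans)} order(s) reference a missing user")

    return problems


original = [
    {"id": 1, "email": "alice@corp.example"},
    {"id": 2, "email": "bob@corp.example"},
]
anonymized = [
    {"id": 1, "email": "user_a1@example.com"},
    {"id": 2, "email": "user_b2@example.com"},
]
orders = [{"user_id": 1, "total": 40}, {"user_id": 2, "total": 7}]

print(verify_anonymization(original, anonymized, orders))
```

Returning a list of problems instead of raising on the first failure lets one run report every issue at once, which is convenient when the check runs in CI before a data set is released to a test environment.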

Data source: ClawHub · Chinese localization: 龙虾技能库