📦 data-anonymizer
v1.0.0
Anonymize sensitive data in databases, files, and APIs for testing and compliance. Detect PII (names, emails, SSNs, addresses, phone numbers), apply anonymiz...
Data Anonymizer
Anonymize production data for safe use in testing, development, and analytics. Detect PII automatically, apply appropriate anonymization strategies (masking, hashing, synthetic replacement, generalization), and generate realistic fake data that preserves data relationships and statistical properties.
Use when: "anonymize data", "mask PII", "create test data from production", "GDPR compliance", "data masking", "remove personal data", "sanitize database", "fake data generation", or when preparing production data for non-production use.
Commands
- detect: Find PII in Data Sources
# Scan files for common PII patterns (ripgrep skips binary files by default)
rg -n '\b\d{3}[-.]?\d{2}[-.]?\d{4}\b' 2>/dev/null | head -20
echo "--- SSN-like patterns above ---"

rg -n '\b\d{3}[-.]?\d{3}[-.]?\d{4}\b' 2>/dev/null | head -20
echo "--- Phone numbers above ---"

rg -n '\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b' 2>/dev/null | head -20
echo "--- Credit card-like patterns above ---"
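The same three patterns can also be applied from Python when ripgrep is not available or when the text comes from an API response rather than a file. The following is a minimal sketch, not part of the skill itself; `PII_PATTERNS` and `scan_text` are hypothetical names, and the sample string is made up.

```python
import re

# Same patterns as the rg commands above, compiled once.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}[-.]?\d{2}[-.]?\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"),
    "credit_card": re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"),
}


def scan_text(text):
    """Return (line_number, kind, match) tuples for every pattern hit."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in PII_PATTERNS.items():
            for match in pattern.finditer(line):
                hits.append((lineno, kind, match.group()))
    return hits


# Made-up sample data, one PII-like value per line.
sample = "id=7 ssn=123-45-6789\ncall 555-867-5309\ncard 4111 1111 1111 1111"
for hit in scan_text(sample):
    print(hit)
```

These patterns are deliberately loose: any 9-digit or 10-digit number will match, so treat hits as candidates for review rather than confirmed PII.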
Step 2: Scan Database Schema

# Find columns likely containing PII (by name pattern)
python3 -c "
pii_column_patterns = [
    'email', 'phone', 'address', 'street', 'city', 'zip', 'postal',
    'ssn', 'social_security', 'tax_id', 'national_id',
    'first_name', 'last_name', 'full_name', 'name',
    'birth', 'dob', 'date_of_birth', 'age',
    'credit_card', 'card_number', 'cvv', 'expiry',
    'ip_address', 'ip', 'user_agent',
    'password', 'secret', 'token', 'api_key',
    'latitude', 'longitude', 'lat', 'lng', 'geo',
    'photo', 'avatar', 'image_url',
    'salary', 'income', 'bank_account', 'iban', 'routing',
]

# List the patterns, then print a ripgrep command to run against a SQL dump or migration files
for pattern in pii_column_patterns:
    print(f'  - {pattern}')
print('\nUse these patterns to grep your database schema:')
print('rg -i \"(' + '|'.join(pii_column_patterns[:5]) + ')\" migrations/ schema.sql')
"
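Short patterns such as `ip` or `name` produce false positives when matched as plain substrings (`ip` also occurs in `shipping_method` and `description`). One way to reduce that noise, sketched below under the assumption that columns use snake_case names, is to match patterns only on underscore-delimited boundaries. `PII_NAME_PATTERNS` and `likely_pii_columns` are hypothetical names, and the column list is made up.

```python
# Subset of the column-name patterns listed above.
PII_NAME_PATTERNS = ["email", "phone", "address", "ssn", "first_name", "last_name", "dob", "ip"]


def likely_pii_columns(columns, patterns=PII_NAME_PATTERNS):
    """Return columns whose snake_case name contains a pattern on token boundaries."""
    flagged = []
    for column in columns:
        # Pad with underscores so a pattern at the start or end still has boundaries.
        padded = f"_{column.lower()}_"
        if any(f"_{p}_" in padded for p in patterns):
            flagged.append(column)
    return flagged


columns = ["id", "email", "billing_address", "shipping_method",
           "description", "client_ip", "first_name"]
print(likely_pii_columns(columns))
# → ['email', 'billing_address', 'client_ip', 'first_name']
```

Name matching only nominates columns; free-text fields such as `notes` can still contain PII and need a content scan.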
Step 3: Classify Sensitivity

| Level | Data Types | Strategy |
|---|---|---|
| Critical | SSN, credit card, passwords, API keys | Delete or hash (irreversible) |
| High | Email, phone, full name, address | Synthetic replacement |
| Medium | Date of birth, IP address, location | Generalization (year only, /24 subnet) |
| Low | Age range, city, job title | Keep or slight perturbation |
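Step 3 names generalization (year only, /24 subnet) for medium-sensitivity fields. A minimal sketch of what that could look like follows; the function names are hypothetical, and the /48 prefix for IPv6 and the one-decimal rounding for coordinates are assumptions rather than values from this skill.

```python
import ipaddress
from datetime import date


def generalize_birth_date(dob):
    """Keep only the year of a date of birth."""
    return dob.year


def generalize_ip(ip):
    """Zero the host bits: /24 for IPv4, /48 for IPv6 (the IPv6 width is an assumption)."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    # strict=False lets ip_network accept an address with host bits set.
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(network)


def generalize_coordinate(value, decimals=1):
    """Round a latitude or longitude; one decimal place of latitude is roughly 11 km."""
    return round(value, decimals)


print(generalize_birth_date(date(1988, 7, 14)))  # → 1988
print(generalize_ip("203.0.113.77"))             # → 203.0.113.0/24
print(generalize_coordinate(52.52))              # → 52.5
```

Generalization keeps aggregate statistics usable (age cohorts, regional traffic) while removing the exact value that identifies one person.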
- anonymize: Apply Anonymization
import hashlib

from faker import Faker  # third-party: pip install faker


def _stable_seed(value):
    """Derive a 32-bit seed that is stable across runs (built-in hash() is salted per process)."""
    return int(hashlib.sha256(value.encode()).hexdigest(), 16) % (2**32)


def anonymize_email(email):
    """Consistent fake email: the same input always produces the same output."""
    h = hashlib.sha256(email.encode()).hexdigest()[:8]
    domain = email.split('@')[1] if '@' in email else 'example.com'
    return f"user_{h}@test-{domain}"


def anonymize_name(name):
    """Replace with a consistent fake name."""
    fake = Faker()
    fake.seed_instance(_stable_seed(name))
    return fake.name()


def anonymize_phone(phone):
    """Keep the format, replace the digits."""
    h = hashlib.sha256(phone.encode()).hexdigest()
    digits = [c for c in h if c.isdigit()]
    result = ''
    d = 0
    for c in phone:
        if c.isdigit():
            result += digits[d % len(digits)]
            d += 1
        else:
            result += c
    return result


def anonymize_address(address):
    """Replace with a consistent fake address (Faker's default locale, not the original region)."""
    fake = Faker()
    fake.seed_instance(_stable_seed(address))
    return fake.address()
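The helpers above hash with plain, unkeyed SHA-256. Anyone holding a list of candidate emails can hash them the same way and match the tokens back to real people. One possible mitigation, sketched here and not part of the original skill, is a keyed HMAC. `PSEUDONYM_KEY`, `pseudonym`, and `anonymize_email_keyed` are hypothetical names, and lowercasing the input before hashing is an assumption about how values should be normalized.

```python
import hashlib
import hmac

# Hypothetical secret; in practice load it from a secret store and keep it
# out of the anonymized environment.
PSEUDONYM_KEY = b"replace-with-a-random-secret"


def pseudonym(value, key=PSEUDONYM_KEY, length=8):
    """Deterministic keyed token: the same value and key always give the same token."""
    digest = hmac.new(key, value.lower().encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def anonymize_email_keyed(email):
    """Replace an email with a keyed token on a placeholder domain."""
    return f"user_{pseudonym(email)}@example.com"


# Made-up rows from two tables that reference the same person.
users = [{"id": 1, "email": "Ana@corp.example"}]
orders = [{"order_id": 10, "customer_email": "ana@corp.example"}]

anon_users = [{**u, "email": anonymize_email_keyed(u["email"])} for u in users]
anon_orders = [
    {**o, "customer_email": anonymize_email_keyed(o["customer_email"])} for o in orders
]

# The join key still matches after anonymization.
print(anon_users[0]["email"] == anon_orders[0]["customer_email"])  # → True
```

Because the mapping is deterministic, joins across tables keep working; because it is keyed, the tokens cannot be recomputed without the secret.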
Strategy 2: Masking (quick, for logs/exports)

def mask_email(email):
    parts = email.split('@')
    return f"{parts[0][:2]}***@{parts[1]}" if '@' in email else '***'


def mask_phone(phone):
    return phone[:3] + '****' + phone[-2:]


def mask_ssn(ssn):
    return '***-**-' + ssn[-4:]


def mask_card(card):
    return '****-****-****-' + card[-4:]
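The mask functions above take one value at a time. For log files the sensitive values are embedded in free text, so a regex substitution is a natural fit. The sketch below is an illustration only: `EMAIL_RE`, `SSN_RE`, and `mask_log_line` are hypothetical names, the email pattern is a simplification, and the mask shapes are illustrative choices.

```python
import re

# Simplified patterns; real-world email syntax is broader than this.
EMAIL_RE = re.compile(r"\b([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-(\d{4})\b")


def mask_log_line(line):
    """Mask emails and dashed SSNs in a single log line."""
    # Keep the first two characters of the local part and the full domain.
    line = EMAIL_RE.sub(lambda m: f"{m.group(1)[:2]}***@{m.group(2)}", line)
    # Keep only the last four digits of the SSN.
    line = SSN_RE.sub(lambda m: f"***-**-{m.group(1)}", line)
    return line


print(mask_log_line("login ok user=jane.doe@corp.example ssn=123-45-6789"))
# → login ok user=ja***@corp.example ssn=***-**-6789
```

Masking keeps part of the original value visible, so it suits logs and support exports but not datasets that must be irreversibly anonymized.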
Strategy 3: SQL-Level Anonymization

-- PostgreSQL anonymization script
UPDATE users SET
    email = 'user_' || md5(email) || '@example.com',
    first_name = 'User',
    last_name = 'Test_' || substring(md5(last_name) from 1 for 6),
    phone = '+1' || lpad(abs(hashtext(phone))::text, 10, '0'),
    address_line1 = floor(random() * 9999)::text || ' Test Street',
    city = 'Testville',
    zip_code = lpad(abs(hashtext(zip_code))::text, 5, '0'),
    date_of_birth = date_of_birth - (random() * 365)::int * interval '1 day',
    ssn = NULL
WHERE true;

-- Verify no real data remains
SELECT email FROM users WHERE email NOT LIKE '%@example.com' LIMIT 5;
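The verification query can be wrapped in a small check that a pipeline runs after the anonymization script. The sketch below uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in so that it runs anywhere; against PostgreSQL the same query would go through a PostgreSQL driver instead. `find_unanonymized_emails` is a hypothetical name and the rows are made up.

```python
import sqlite3


def find_unanonymized_emails(conn, limit=5):
    """Return up to `limit` emails that do not use the placeholder domain."""
    cur = conn.execute(
        "SELECT email FROM users WHERE email NOT LIKE '%@example.com' LIMIT ?",
        (limit,),
    )
    return [row[0] for row in cur.fetchall()]


# In-memory stand-in for the anonymized database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("user_ab12@example.com",), ("real.person@corp.example",)],
)

leaks = find_unanonymized_emails(conn)
print(leaks)  # → ['real.person@corp.example']
```

An empty list means every non-NULL email uses the placeholder domain; a non-empty list is a reason to stop before the data leaves production.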
- verify: Verify Anonymization
After anonymization, verify:
- No real email addresses remain (check against known patterns)
- No real phone numbers (valid format but not real numbers)
- Statistical properties preserved (age distribution, geographic spread)
- Referential integrity maintained (FK relationships intact)
- Uniqueness constraints respected (no dupli