Raw data is always messy. Inconsistent formatting, duplicates, stale records, and garbage inputs break every workflow downstream. Learn to clean data at scale using Bitscale.
DE-3 · Data & Enrichment · 100 XP · ~18 min
Garbage in, garbage out. It's the oldest rule in data work, and the most consistently ignored.

Data comes in with inconsistent capitalization, broken company names, duplicate records, titles that haven't been held in three years, and phone numbers in 12 different formats. Every one of these problems compounds downstream: bad personalization, failed validation, wrong routing, broken integrations.

This module covers the cleaning operations that should run on every dataset before enrichment begins.
Clean this person's name: {{raw_name}}

Rules:
- Proper case for first and last name (capitalize first letter only)
- Remove titles, suffixes, and parenthetical notes from the name field
- Remove extra spaces
- Return format: {"first_name": "...", "last_name": "..."}

If the name appears to be a company name or email address, return {"first_name": "REVIEW", "last_name": "REVIEW"}
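The same rules can also run as a deterministic pre-pass before the AI column, so the model only sees the genuinely ambiguous cases. A minimal sketch (the function name and the title/suffix list are illustrative, not part of Bitscale):

```python
import re

# Illustrative list; extend with the titles and suffixes you see in your data.
TITLES_SUFFIXES = {"mr", "mrs", "ms", "dr", "prof", "jr", "sr", "ii", "iii", "phd", "mba"}

def clean_name(raw_name: str) -> dict:
    # Emails in the name field are flagged for review, per the prompt's rule.
    if "@" in raw_name:
        return {"first_name": "REVIEW", "last_name": "REVIEW"}
    # Drop parenthetical notes, then split on whitespace/commas.
    name = re.sub(r"\(.*?\)", " ", raw_name)
    parts = [p for p in re.split(r"[\s,]+", name) if p]
    # Strip titles and suffixes, proper-case what remains.
    parts = [p for p in parts if p.strip(".").lower() not in TITLES_SUFFIXES]
    if len(parts) < 2:
        return {"first_name": "REVIEW", "last_name": "REVIEW"}
    return {"first_name": parts[0].capitalize(), "last_name": parts[-1].capitalize()}
```

Anything that survives this pass clean never needs an LLM call, which keeps credit usage down on large grids.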
Standardize this job title: {{raw_title}}

Rules:
- Proper case
- Expand common abbreviations: VP → Vice President, Dir → Director, Mgr → Manager
- Remove symbols and special characters
- Keep the core title; remove department suffixes if overly long

Return ONLY the cleaned title.
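Abbreviation expansion is a lookup, so it can be sketched without a model at all (the map below is a small illustrative sample; a production version would carry far more entries):

```python
import re

# Illustrative abbreviation map; extend with what you see in your own data.
ABBREVIATIONS = {"vp": "Vice President", "svp": "Senior Vice President",
                 "dir": "Director", "mgr": "Manager", "eng": "Engineering"}

def standardize_title(raw_title: str) -> str:
    # Strip symbols, then expand known abbreviations word by word.
    words = re.sub(r"[^A-Za-z0-9\s/&-]", " ", raw_title).split()
    expanded = [ABBREVIATIONS.get(w.lower(), w.title()) for w in words]
    return " ".join(expanded)
```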
Normalize this company name: {{raw_company}}

Rules:
- Remove legal suffixes if redundant (Inc, Corp, Ltd at the end) UNLESS the name is very short and removing it would cause confusion
- Proper case
- Remove trailing/leading spaces
- Expand obvious abbreviations only if unambiguous

Return ONLY the normalized company name.
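The suffix rule can be approximated deterministically. A sketch, assuming a "very short name" threshold of under four characters (that cutoff is a guess you should tune against your data):

```python
LEGAL_SUFFIXES = {"inc", "corp", "ltd", "llc", "gmbh"}

def normalize_company(raw_company: str) -> str:
    words = raw_company.strip().split()
    # Drop a trailing legal suffix only when what remains can stand alone.
    if len(words) > 1 and words[-1].strip(".,").lower() in LEGAL_SUFFIXES:
        remaining = words[:-1]
        if len(" ".join(remaining)) >= 4:  # assumed threshold, tune as needed
            words = remaining
    # Proper case, but leave all-caps acronyms (e.g., IBM) untouched.
    return " ".join(w if w.isupper() else w.title() for w in words).rstrip(",")
```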
Validate the format of this email address: {{raw_email}}

Check for:
- Valid format (has @, valid TLD, no spaces)
- Common typos (double dots, missing TLD, space in address)

Return: {"valid": true/false, "cleaned_email": "corrected version or original", "issue": "description of problem or null"}
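Format validation is the one cleaning step that rarely needs a model. A regex pre-pass in the spirit of the prompt above (the regex is a simplified sketch, not a full RFC 5322 validator):

```python
import re

# Simplified check: something@something.tld, no spaces, no second @.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[A-Za-z]{2,}$")

def validate_email(raw_email: str) -> dict:
    email = raw_email.strip().lower().replace(" ", "")
    email = re.sub(r"\.{2,}", ".", email)  # collapse double dots, a common typo
    if EMAIL_RE.match(email):
        issue = None if email == raw_email else "corrected formatting"
        return {"valid": True, "cleaned_email": email, "issue": issue}
    return {"valid": False, "cleaned_email": raw_email, "issue": "invalid format"}
```

Note this only validates format; whether the mailbox actually exists requires a verification service, not a regex.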
Given this contact record:
Name: {{first_name}} {{last_name}}
Email: {{email}}
Company: {{company_name}}
Title: {{job_title}}

And this existing record:
Name: {{existing_first}} {{existing_last}}
Email: {{existing_email}}
Company: {{existing_company}}
Title: {{existing_title}}

Are these the same person? Consider: same name + same company = very likely duplicate; same email = definitive duplicate.

Return: {"is_duplicate": true/false, "confidence": "high/medium/low", "reason": "brief explanation"}
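The matching logic in that prompt is rule-based at its core, so the exact-match tiers can be expressed directly (a sketch with illustrative dict keys; only the fuzzy middle ground truly needs a model):

```python
def is_duplicate(record: dict, existing: dict) -> dict:
    # Same email is definitive, per the prompt's rule.
    if record["email"].lower() == existing["email"].lower():
        return {"is_duplicate": True, "confidence": "high", "reason": "identical email"}
    # Same name + same company is very likely a duplicate.
    same_name = (record["first_name"].lower() == existing["first_name"].lower()
                 and record["last_name"].lower() == existing["last_name"].lower())
    if same_name and record["company"].lower() == existing["company"].lower():
        return {"is_duplicate": True, "confidence": "medium",
                "reason": "same name and company"}
    return {"is_duplicate": False, "confidence": "low", "reason": "no strong match"}
```

Exact matching misses near-duplicates ("Jon" vs "Jonathan", "Acme" vs "Acme Inc"), which is exactly where the AI column earns its keep.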
A clean record with stale data is worse than a messy record with current data. Job titles change every 18–24 months on average in SaaS. If your list was built 6 months ago, 15–25% of titles may be wrong.

Freshness check column:
Based on this contact's LinkedIn URL {{linkedin_url}}, estimate data freshness:
(Use web research if available; otherwise estimate based on the following.)

Current data in our record:
- Title: {{job_title}}
- Company: {{company_name}}

Questions to assess:
1. Is this company still active?
2. Is this title realistic for this company's size?
3. Any signals this contact has left (e.g., if you can check LinkedIn)?

Return: {"likely_current": true/false, "confidence": "high/medium/low", "notes": "brief reason"}
For lists older than 3 months, run a freshness check on your highest-value segments before enrichment.
After the cleaning layer, add a composite quality score for each row:
Score this contact record's data quality based on:
- email_valid: {{email_valid}} (true/false)
- name_normalized: {{name_clean}} (cleaned/REVIEW)
- title_complete: {{title}} (not empty/empty)
- company_normalized: {{company}} (not empty/empty)
- is_duplicate: {{is_duplicate}} (true/false)
- freshness_likely: {{likely_current}} (true/false)

Rules:
- Start at 100
- Deduct 30 if email_valid is false
- Deduct 20 if is_duplicate is true
- Deduct 15 if name contains REVIEW
- Deduct 15 if title is empty
- Deduct 10 if freshness_likely is false
- Deduct 5 if company is empty

Return ONLY the final score (0-100).
Use this score to route records:
80–100: proceed to enrichment
60–79: clean before enrichment
Below 60: manual review or discard
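Because the scoring rules are fixed deductions, the score and routing can also run as a plain formula instead of an AI column. A sketch (dict keys are illustrative; they mirror the prompt's inputs):

```python
def quality_score(row: dict) -> int:
    # Start at 100 and apply the fixed deductions from the scoring rules above.
    score = 100
    if not row["email_valid"]:
        score -= 30
    if row["is_duplicate"]:
        score -= 20
    if "REVIEW" in row["name_clean"]:
        score -= 15
    if not row["title"]:
        score -= 15
    if not row["likely_current"]:
        score -= 10
    if not row["company"]:
        score -= 5
    return score

def route(score: int) -> str:
    # Routing thresholds from the tiers above.
    if score >= 80:
        return "enrich"
    if score >= 60:
        return "clean_first"
    return "manual_review"
```

A deterministic formula here is preferable to an LLM call: the score is auditable, reproducible, and free.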
Quick Check: What are the six most common data quality problems? What order should you run cleaning operations? What quality score threshold should trigger manual review?
Download a sample messy dataset (we'll provide one via the challenge form, or use your own) and build the full cleaning pipeline in Bitscale.

Requirements:
All 5 cleaning columns implemented
Duplicate detection column
Data quality score column
Record the starting vs. ending quality score distribution
A short paragraph on the most common data quality issues you found
Submit DE-3 Challenge →
Share your grid + quality score distribution (before/after). +100 XP on approval.
Next: DE-4 — Company Intelligence →
With clean contact data in hand, DE-4 adds the company context layer — firmographics, growth signals, and competitive intelligence.