When building enterprise applications, few features seem as simple yet prove as treacherous as "Bulk Import via CSV."
It starts innocently:
- Read a file
- Loop through rows
- Insert into database
But reality quickly intrudes:
- What happens when row 4,999 of 5,000 fails?
- Do you rollback everything? Continue processing?
- If you retry the file, do you end up with 4,999 duplicates?
This post explores industry-standard strategies for streaming file data into a database, focusing on the transition from a naive implementation to a robust, idempotent system.
The Problem: "Silent Failures" and Partial State
Imagine a system processing event occurrence data. The initial implementation might look like this:
```javascript
async function processFile(rows) {
  for (const row of rows) {
    try {
      await createOccurrence(row);
    } catch (error) {
      // Swallows the error and keeps going
      console.error("Row failed", error);
    }
  }
  // Runs even if some (or all) rows failed
  await markFileStatus('COMPLETED');
}
```
What's wrong here?
- ❌ Continues processing after failures
- ❌ Marks file as "Completed" even if 50% of rows failed
- ❌ No way to identify which rows succeeded or failed
- ❌ Re-uploading creates duplicates of successful rows
This approach leads to data drift: The system reports "Success," but the database is missing records. The operations team has no visibility into failures, and retrying the file risks creating duplicates.
Strategy 1: The "All-or-Nothing" Transaction (ACID)
For critical financial data or strictly relational data, partial success is unacceptable.
How It Works
Wrap the entire batch in a database transaction. If one row fails, the transaction aborts and rolls back to the state before the file was touched.
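In a MongoDB/Mongoose setup like the one used later in this post, that might look roughly like the sketch below. It is a minimal illustration, not a drop-in implementation: it assumes a replica-set deployment (required for MongoDB transactions), and `processFileAtomically` plus the `OccurrenceModel` and `markFileStatus` helpers are hypothetical names.

```javascript
const mongoose = require('mongoose');

// Minimal sketch: run the whole batch inside one transaction so either
// every row is persisted or none are.
async function processFileAtomically(rows) {
  const session = await mongoose.startSession();
  try {
    await session.withTransaction(async () => {
      for (const row of rows) {
        // Passing the session makes each insert part of the transaction
        await OccurrenceModel.create([row], { session });
      }
    });
    await markFileStatus('COMPLETED'); // only reached if everything committed
  } catch (error) {
    await markFileStatus('FAILED');    // the rollback already discarded all rows
    throw error;
  } finally {
    session.endSession();
  }
}
```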
Pros
- Zero Data Corruption: Database is always in a valid state
- Simple Error Handling: File status is binary (Success or Fail)
Cons
- Performance: Long-running transactions lock resources
- Scale Limits: MongoDB transactions have practical size limits (historically capped at 16MB of oplog per transaction)
- Unfeasible for Large Files: Processing 50,000 rows in one transaction often fails
Strategy 2: "Fail-Fast" with Idempotency (The Balanced Approach)
This is the preferred approach for operational data like occurrences, logs, or inventory updates.
The Concept
- Process rows sequentially
- If any row fails, stop immediately and mark the file as FAILED
- Crucially: ensure the processing logic is idempotent
What is Idempotency?
Idempotency means performing an operation multiple times has the same result as performing it once.
Non-Idempotent:

```sql
INSERT INTO occurrences (id) VALUES (auto_increment())
```

→ Running twice creates two records ❌

Idempotent:

```sql
INSERT INTO occurrences (event_id) VALUES ('EVT-123') ON CONFLICT DO NOTHING
```

→ Running twice creates one record ✅
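Since the examples in this post use MongoDB, here is a rough equivalent of the idempotent form, expressed as an upsert keyed on a stable identifier. It assumes a unique index on `eventId`; the `importedAt` field is purely illustrative.

```javascript
// Sketch: an upsert keyed on the business identifier. Running it twice
// still leaves exactly one document for EVT-123.
await OccurrenceModel.updateOne(
  { eventId: 'EVT-123' },                       // stable business key
  { $setOnInsert: { importedAt: new Date() } }, // written only when the doc is created
  { upsert: true }
);
```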
Implementing the Solution
Here's how to refactor your file consumer to handle this robustly.
1. Use Stable Identifiers
Move away from generating new IDs on every run. Instead, use stable data from the CSV (like eventId or externalRef) to identify records.
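One way to back this up at the database level is a unique index on the stable key, so a duplicate insert fails loudly instead of silently creating a second record. A sketch, assuming Mongoose; fields beyond `eventId` and `externalRef` are hypothetical:

```javascript
const mongoose = require('mongoose');

// Sketch: enforce the stable identifier with a unique index so duplicates
// become hard errors instead of silent data drift.
const occurrenceSchema = new mongoose.Schema({
  eventId:     { type: String, required: true, unique: true },
  externalRef: { type: String, index: true, sparse: true },
  payload:     { type: mongoose.Schema.Types.Mixed }, // illustrative catch-all
});

const OccurrenceModel = mongoose.model('Occurrence', occurrenceSchema);
```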
2. Implement "Skip-Existing" Logic
Instead of complex "upsert" logic, simply check if the work is already done.
```javascript
for (const row of allRecords) {
  // Look up the row by its stable business keys
  const existing = await OccurrenceModel.findOne({
    $or: [
      { eventId: row.eventId },
      { externalRef: row.externalRef }
    ]
  });

  if (existing) {
    logger.info(`Skipping existing occurrence: ${row.eventId}`);
    continue;
  }

  try {
    await createNewOccurrence(row);
  } catch (err) {
    logger.error(`Row failed: ${err.message}`);
    throw err; // fail fast: abort the whole file on the first error
  }
}
```
Key Points:
- ✅ Check for duplicates using stable identifiers
- ✅ Skip existing records (idempotency)
- ✅ Fail fast on errors
- ✅ Safe to retry the entire file
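Wrapped in the surrounding consumer, the loop above can then drive the file status explicitly. A sketch, assuming `processRows` wraps the skip-existing loop shown above and `markFileStatus` is the same hypothetical helper from the first example:

```javascript
// Sketch: fail fast, record the outcome, and rely on the skip-existing
// loop to make a full re-upload safe.
async function importFile(rows) {
  try {
    await processRows(rows);           // the skip-existing loop shown above
    await markFileStatus('COMPLETED');
  } catch (err) {
    await markFileStatus('FAILED');    // stop immediately and surface the failure
    logger.error(`Import aborted: ${err.message}`);
    throw err;
  }
}
```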
The Operational Workflow
With this architecture, the workflow becomes clear and safe:
Step-by-Step Example
- Upload: User uploads a file with 5,000 rows
- Failure: Row 101 has invalid data. The system stops immediately
  - Rows 1-100 are saved
  - File status: FAILED
- Fix: User corrects the data in row 101
- Retry: User uploads the corrected file
  - Rows 1-100: System detects existing records (by eventId) and skips them
  - Row 101: System processes it successfully
  - Rows 102-5,000: System processes them
- Success: File status: COMPLETED
Why This Works
- No duplicates: Idempotency ensures safe retries
- Clear failures: Operations team knows exactly what failed
- Partial progress preserved: No need to reprocess successful rows
- Simple recovery: Fix the issue and re-upload
Summary
When designing data ingestion systems, avoid the temptation to just "catch and log" errors to keep the loop moving.
Key Principles
- Fail loudly: Don't hide data issues
- Make it repeatable: Ensure your code handles re-running the same data without side effects
- Use stable keys: Rely on business keys (Event IDs, External Refs, SKUs) rather than auto-generated IDs for duplicate detection
By adopting these patterns, you transform a fragile import feature into a resilient system that your operations team can trust.
Quick Reference
| Approach | Use Case | Pros | Cons |
|---|---|---|---|
| ACID Transactions | Financial data, small batches | Zero corruption, simple | Performance issues, scale limits |
| Fail-Fast + Idempotency | Operational data, large files | Scalable, safe retries | Requires stable identifiers |