Building a Code Analysis Tool: How the Pipeline Works

Learn how code analysis tools work under the hood. From file detection to AI-powered reviews, see how each phase processes your code locally, remotely, or in CI/CD.

Performing comprehensive code analysis requires careful orchestration of multiple phases, whether you're running locally on your machine, remotely on a server, or as part of a CI/CD pipeline. Each phase plays a critical role in ensuring code quality, security, and maintainability. Let's explore how a modern workflow manager handles this complexity across different execution environments.

The 5 Phases of Code Analysis

A robust code analysis workflow consists of 5 distinct phases, each with its own time allocation and responsibilities:

| Phase | Time Allocation | Purpose |
|-------|-----------------|---------|
| Prepare | 2% | Environment setup and file detection |
| Agent (AI) | 25% | AI-powered code review |
| Code Analysis | 30% | Static analysis and security scanning |
| Upload Results | 2% | Publishing analysis artifacts |
| Cleanup Artifacts | 70% | Resource cleanup and finalization |
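
To make the structure concrete, here is a minimal TypeScript sketch of how such a pipeline could be modeled. The phase names and time shares come from the table above; the `Phase` interface and `runPipeline` function are illustrative, not the tool's actual API.

```typescript
// Hypothetical model of the five-phase pipeline described above.
interface Phase {
  name: string;
  timeBudget: number; // rough share of wall-clock time, per the table
  run: () => Promise<void>;
}

const phases: Phase[] = [
  { name: "Prepare",           timeBudget: 0.02, run: async () => { /* env setup, file detection */ } },
  { name: "Agent (AI)",        timeBudget: 0.25, run: async () => { /* AI-powered review */ } },
  { name: "Code Analysis",     timeBudget: 0.30, run: async () => { /* static analysis */ } },
  { name: "Upload Results",    timeBudget: 0.02, run: async () => { /* publish artifacts */ } },
  { name: "Cleanup Artifacts", timeBudget: 0.70, run: async () => { /* delete artifacts */ } },
];

// Phases run strictly in order: each depends on the previous one's output.
async function runPipeline(): Promise<void> {
  for (const phase of phases) {
    console.log(`Starting ${phase.name}`);
    await phase.run();
  }
}
```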

Code Analysis Workflow

Phase 1: Prepare

The preparation phase sets up the entire analysis pipeline. Despite taking only 2% of workflow time, it determines which tools run, how work parallelizes, and what files get scanned.

| Step | Action | Purpose |
|------|--------|---------|
| Environment Init | Detect OS, shell, available tools | Adapt to execution context |
| Dependency Check | Verify analyzers, scanners, AI models | Ensure required tools available |
| Language Detection | Map file extensions to languages | Activate correct analyzers |
| File Discovery | Scan with include/exclude patterns | Build analysis file list |
| Smart Filtering | Remove binary, oversized, deleted files | Optimize analysis scope |
| Plan Construction | Calculate parallelism, estimate duration | Enable efficient execution |
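
A minimal TypeScript sketch of the discovery and filtering steps, assuming a simple extension-to-language map and a size cap (all names here are illustrative, not the tool's real identifiers):

```typescript
import { readdirSync, statSync } from "node:fs";
import { extname, join } from "node:path";

// Illustrative extension-to-language map (Language Detection step).
const LANGUAGES: Record<string, string> = {
  ".ts": "typescript",
  ".tsx": "typescript",
  ".js": "javascript",
  ".py": "python",
};

const MAX_FILE_SIZE = 1_000_000; // bytes; oversized files are skipped (Smart Filtering)

// File Discovery + Smart Filtering: walk the tree, keep only
// analyzable source files under the size limit.
function discoverFiles(dir: string): string[] {
  const results: string[] = [];
  for (const entry of readdirSync(dir, { withFileTypes: true })) {
    const path = join(dir, entry.name);
    if (entry.isDirectory()) {
      if (entry.name === "node_modules" || entry.name === ".git") continue; // exclude pattern
      results.push(...discoverFiles(path));
    } else if (LANGUAGES[extname(entry.name)] && statSync(path).size <= MAX_FILE_SIZE) {
      results.push(path);
    }
  }
  return results;
}
```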

Phase 2: Agent (AI)

The AI Agent phase performs autonomous code review using Large Language Models. Taking 25% of workflow time, it acts like a human reviewer that understands context, explores the codebase, and applies your team's specific standards.

| Step | Action | Output File |
|------|--------|-------------|
| Setup & Checkout | Clone repository, configure environment | - |
| Download Tools | Fetch autofind CLI and ccrcli | - |
| Load Context | Read custom instructions, build metadata | repo-custom-instructions.json |
| Retrieve Changes | Get PR diff from API | diff-file.diff |
| Agentic Review | Claude Sonnet 4.5 explores codebase | results-agent.json |
| Security Audit | GPT-5 scans for vulnerabilities | results-security-detector.json |
| Tag & Upload | Label results by source, create artifacts | results-agent.zip |

Unlike static analyzers that follow rigid rules, the AI agent actively investigates. It reads your custom instructions file to understand project-specific patterns (e.g., "No comments," "Use SafeAwait," "Decorator architecture"). Then it creates a review plan, uses tools to explore related files, validates changes against your standards, and performs a dedicated security pass. The output is tagged JSON artifacts ready for review comments.
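
The artifact schema isn't published, but given the essentials the pipeline tracks (file, line, message, plus the source tag added in the Tag & Upload step), one entry in results-agent.json might look roughly like this sketch:

```typescript
// Hypothetical shape of one finding in results-agent.json.
// Field names are assumptions based on the file/line/message
// essentials mentioned later in the post.
interface ReviewFinding {
  source: "agent" | "security-detector" | "codeql"; // tag added in Tag & Upload
  file: string;
  line: number;
  message: string;
  severity?: "info" | "warning" | "error";
}

const example: ReviewFinding = {
  source: "agent",
  file: "src/payments/refund.ts",
  line: 42,
  message: "Uses raw await; project convention requires SafeAwait.",
  severity: "warning",
};
```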

Running Locally with Open Source Models: For complete data privacy, use local models like Llama 3, Mistral/Mixtral, Qwen, or DeepSeek. Host them with Ollama, LM Studio, or LocalAI on your GPU, then point your agent at the local endpoint (for Ollama, the default is localhost:11434) instead of external APIs. This ensures zero data leaves your machine.
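
As a sketch, pointing the agent at a local model could be a single HTTP call against Ollama's chat endpoint; the model name and prompt below are placeholders:

```typescript
// Minimal sketch: send a review prompt to a locally hosted model via
// Ollama's chat API (default port 11434). No data leaves the machine.
async function reviewLocally(diff: string): Promise<string> {
  const response = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3", // any locally pulled model: mistral, qwen, deepseek, ...
      stream: false,
      messages: [
        { role: "system", content: "You are a strict code reviewer." },
        { role: "user", content: `Review this diff:\n${diff}` },
      ],
    }),
  });
  const data = await response.json();
  return data.message.content; // Ollama returns { message: { role, content }, ... }
}
```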

Phase 3: Code Analysis

The Code Analysis phase runs static analysis to find security vulnerabilities and logic errors. Taking 30% of workflow time, it uses CodeQL to scan the codebase with specific rules while filtering out noise.

| Step | Action | Output File |
|------|--------|-------------|
| Setup & Checkout | Clone repository at PR ref | - |
| Initialize CodeQL | Download CLI, configure queries | - |
| Run Analysis | Execute specific query rules | results.sarif |
| Convert SARIF | Transform to custom JSON format | codeql-javascript-typescript-ccr.json |
| Upload Artifact | Store results for downstream use | results-codeql-javascript-typescript.zip |

CodeQL runs targeted checks like js/unreachable-statement (dead code), js/useless-comparison-test (always true/false), js/inconsistent-loop-direction (infinite loops), and js/use-of-returnless-function (void return usage). The configuration disables default queries and applies filters to exclude noisy warnings.
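
To make these concrete, here is a small TypeScript snippet containing exactly the kinds of issues those queries flag; the function itself is purely illustrative:

```typescript
function processItems(items: number[]): number {
  const count = items.length;

  // js/useless-comparison-test: a length is never negative,
  // so this condition is always false.
  if (count < 0) {
    return -1;
  }

  // js/inconsistent-loop-direction: the counter moves away from
  // the bound, so the loop never terminates.
  for (let i = 0; i < count; i--) {
    console.log(items[i]);
  }

  return count;
  // js/unreachable-statement: dead code after return.
  console.log("done");
}
```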

Results flow through two formats. First, CodeQL outputs SARIF (Static Analysis Results Interchange Format), an industry-standard JSON format consumed by security platforms and code quality tools. A custom script then converts it to CCR JSON, a simplified format that strips away SARIF's complexity and keeps only the essentials (file, line number, message) for AI consumption. The final artifact contains both formats.
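
The SARIF spec nests each finding several levels deep; a minimal sketch of the conversion, assuming SARIF 2.1.0's standard result layout (the CCR field names here are assumptions):

```typescript
// Convert SARIF (CodeQL's output) into a simplified "CCR"-style JSON:
// one flat entry per finding with just file, line, message, and rule.
interface CcrFinding {
  file: string;
  line: number;
  message: string;
  ruleId: string;
}

function sarifToCcr(sarif: any): CcrFinding[] {
  const findings: CcrFinding[] = [];
  for (const run of sarif.runs ?? []) {
    for (const result of run.results ?? []) {
      const location = result.locations?.[0]?.physicalLocation;
      findings.push({
        file: location?.artifactLocation?.uri ?? "unknown",
        line: location?.region?.startLine ?? 0,
        message: result.message?.text ?? "",
        ruleId: result.ruleId ?? "",
      });
    }
  }
  return findings;
}
```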

Phase 4: Upload Results

The Upload Results phase aggregates findings from all analyzers and posts them to the PR. Taking only 2% of workflow time, it handles deduplication and delivers consolidated feedback.

| Step | Action | Output |
|------|--------|--------|
| Initialize Secrets | Parse payload, set secure variables | Configured environment |
| Download Tools | Fetch ccrcli for API communication | CLI ready |
| Prepare Payload | Clean input data for callback | payload.json |
| Download Artifacts | Retrieve agent and CodeQL results | All findings collected |
| Run Deduplication | Compare and merge duplicate findings | Unique comments only |
| Execute Callback | Send results to PR via ccrcli | Comments posted |

The phase downloads artifacts from both the AI Agent (results-agent) and CodeQL (results-codeql-*), then uses the autofind tool to detect and remove duplicate findings. If the AI already flagged an issue that CodeQL also found, only one comment appears. Finally, ccrcli sends the deduplicated results back to the Pull Request as review comments.
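
The post doesn't specify autofind's matching logic, but a minimal deduplication pass might key findings by file and line and keep the first occurrence, as in this sketch:

```typescript
interface Finding {
  source: string; // "agent", "security-detector", "codeql", ...
  file: string;
  line: number;
  message: string;
}

// Naive deduplication sketch: treat findings on the same file and line
// as duplicates and keep only the first one. The real autofind tool
// presumably uses smarter matching (e.g., message similarity).
function deduplicate(findings: Finding[]): Finding[] {
  const seen = new Set<string>();
  return findings.filter((f) => {
    const key = `${f.file}:${f.line}`;
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```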

Phase 5: Cleanup Artifacts

The Cleanup phase removes temporary artifacts to free storage. Taking 70% of workflow time (mostly waiting), it uses the platform's API to delete all artifacts generated during the analysis.

| Step | Action | Artifacts Deleted |
|------|--------|-------------------|
| List Artifacts | Query platform API for run artifacts | - |
| Delete Loop | Iterate and delete each artifact | results-agent.zip, results-security-detector.zip, results-codeql-javascript-typescript.zip |
| Confirm Cleanup | Log completion status | All temporary files removed |

The cleanup script uses the platform's API to list all artifacts for the workflow run, pipes the IDs through xargs, and deletes each one via REST API calls. This prevents storage bloat from accumulating analysis results across hundreds of PRs. The phase runs last and takes the longest due to API rate limits and sequential deletion, but requires minimal compute resources.
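
Assuming the platform is GitHub Actions (whose REST API exposes per-run artifact listing and deletion), the same list-and-delete loop could be sketched in TypeScript like this:

```typescript
// Sketch: delete every artifact of a workflow run via GitHub's REST API.
// repo ("owner/name"), runId, and token would come from the workflow environment.
async function cleanupArtifacts(repo: string, runId: number, token: string): Promise<void> {
  const headers = {
    Authorization: `Bearer ${token}`,
    Accept: "application/vnd.github+json",
  };

  // List all artifacts produced by this run.
  const listUrl = `https://api.github.com/repos/${repo}/actions/runs/${runId}/artifacts`;
  const { artifacts } = await (await fetch(listUrl, { headers })).json();

  // Delete sequentially; API rate limits make this the slow part.
  for (const artifact of artifacts) {
    await fetch(`https://api.github.com/repos/${repo}/actions/artifacts/${artifact.id}`, {
      method: "DELETE",
      headers,
    });
    console.log(`Deleted ${artifact.name}`);
  }
}
```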

Running Anywhere

The beauty of this architecture is its flexibility. The same workflow can execute:

Locally on your machine:

```bash
code-analyzer analyze --mode=local --path=./src
```

On a remote server:

```bash
code-analyzer analyze --mode=remote --endpoint=https://api.example.com
```

In your CI/CD pipeline:

```yaml
- name: Run Code Analysis
  run: code-analyzer analyze --mode=cicd --pr=${{ github.event.number }}
```

The preparation phase adapts to each context, discovering files, detecting languages, and configuring analyzers appropriately.
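
As a sketch, that context detection could be as simple as checking standard CI environment variables before falling back to local mode; GITHUB_ACTIONS and CI are real variables set by CI providers, while ANALYZER_ENDPOINT is a hypothetical stand-in:

```typescript
type Mode = "local" | "remote" | "cicd";

// Sketch of context detection in the Prepare phase. GITHUB_ACTIONS and CI
// are standard CI environment variables; ANALYZER_ENDPOINT is hypothetical.
function detectMode(): Mode {
  if (process.env.GITHUB_ACTIONS === "true" || process.env.CI === "true") {
    return "cicd";
  }
  if (process.env.ANALYZER_ENDPOINT) {
    return "remote";
  }
  return "local";
}
```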


Published on 12/2/2025 by Claudio Teixeira