Building a Code Analysis Tool: How the Pipeline Works

Learn how code analysis tools work under the hood. From file detection to AI-powered reviews, see how each phase processes your code locally, remotely, or in CI/CD.

Performing comprehensive code analysis requires careful orchestration of multiple phases, whether you're running locally on your machine, remotely on a server, or as part of a CI/CD pipeline. Each phase plays a critical role in ensuring code quality, security, and maintainability. Let's explore how a modern workflow manager handles this complexity across different execution environments.

The 5 Phases of Code Analysis

A robust code analysis workflow consists of 5 distinct phases, each with its own time allocation and responsibilities:

| Phase | Time Allocation | Purpose |
|-------|-----------------|---------|
| Prepare | 2% | Environment setup and file detection |
| Agent (AI) | 25% | AI-powered code review |
| Code Analysis | 30% | Static analysis and security scanning |
| Upload Results | 2% | Publishing analysis artifacts |
| Cleanup Artifacts | 70% | Resource cleanup and finalization |
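
To make the structure concrete, here is a minimal TypeScript sketch of how such a pipeline could be modeled. The phase names and time shares come from the table above; the `Phase` interface and `runPipeline` function are illustrative, not the tool's actual API.

```typescript
// Hypothetical model of the five-phase pipeline described above.
interface Phase {
  name: string;
  timeBudget: number; // rough share of wall-clock time, per the table
  run: () => Promise<void>;
}

const phases: Phase[] = [
  { name: "Prepare",           timeBudget: 0.02, run: async () => { /* env setup, file detection */ } },
  { name: "Agent (AI)",        timeBudget: 0.25, run: async () => { /* AI-powered review */ } },
  { name: "Code Analysis",     timeBudget: 0.30, run: async () => { /* static analysis */ } },
  { name: "Upload Results",    timeBudget: 0.02, run: async () => { /* publish artifacts */ } },
  { name: "Cleanup Artifacts", timeBudget: 0.70, run: async () => { /* delete artifacts */ } },
];

// Phases run strictly in order: each depends on the previous one's output.
async function runPipeline(): Promise<void> {
  for (const phase of phases) {
    console.log(`Starting ${phase.name}`);
    await phase.run();
  }
}
```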

Code Analysis Workflow

Phase 1: Prepare

The preparation phase sets up the entire analysis pipeline. Despite taking only 2% of workflow time, it determines which tools run, how work parallelizes, and what files get scanned.

| Step | Action | Purpose |
|------|--------|---------|
| Environment Init | Detect OS, shell, available tools | Adapt to execution context |
| Dependency Check | Verify analyzers, scanners, AI models | Ensure required tools available |
| Language Detection | Map file extensions to languages | Activate correct analyzers |
| File Discovery | Scan with include/exclude patterns | Build analysis file list |
| Smart Filtering | Remove binary, oversized, deleted files | Optimize analysis scope |
| Plan Construction | Calculate parallelism, estimate duration | Enable efficient execution |
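
A minimal TypeScript sketch of the discovery and filtering steps, assuming a simple extension-to-language map and a size cap (all names here are illustrative, not the tool's real identifiers):

```typescript
import { readdirSync, statSync } from "node:fs";
import { extname, join } from "node:path";

// Illustrative extension-to-language map (Language Detection step).
const LANGUAGES: Record<string, string> = {
  ".ts": "typescript",
  ".tsx": "typescript",
  ".js": "javascript",
  ".py": "python",
};

const MAX_FILE_SIZE = 1_000_000; // bytes; oversized files are skipped (Smart Filtering)

// File Discovery + Smart Filtering: walk the tree, keep only
// analyzable source files under the size limit.
function discoverFiles(dir: string): string[] {
  const results: string[] = [];
  for (const entry of readdirSync(dir, { withFileTypes: true })) {
    const path = join(dir, entry.name);
    if (entry.isDirectory()) {
      if (entry.name === "node_modules" || entry.name === ".git") continue; // exclude pattern
      results.push(...discoverFiles(path));
    } else if (LANGUAGES[extname(entry.name)] && statSync(path).size <= MAX_FILE_SIZE) {
      results.push(path);
    }
  }
  return results;
}
```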

Phase 2: Agent (AI)

The AI Agent phase performs autonomous code review using Large Language Models. Taking 25% of workflow time, it acts like a human reviewer that understands context, explores the codebase, and applies your team's specific standards.

| Step | Action | Output File |
|------|--------|-------------|
| Setup & Checkout | Clone repository, configure environment | - |
| Download Tools | Fetch autofind CLI and ccrcli | - |
| Load Context | Read custom instructions, build metadata | repo-custom-instructions.json |
| Retrieve Changes | Get PR diff from API | diff-file.diff |
| Agentic Review | Claude Sonnet 4.5 explores codebase | results-agent.json |
| Security Audit | GPT-5 scans for vulnerabilities | results-security-detector.json |
| Tag & Upload | Label results by source, create artifacts | results-agent.zip |

Unlike static analyzers that follow rigid rules, the AI agent actively investigates. It reads your custom instructions file to understand project-specific patterns (e.g., "No comments," "Use SafeAwait," "Decorator architecture"). Then it creates a review plan, uses tools to explore related files, validates changes against your standards, and performs a dedicated security pass. The output is tagged JSON artifacts ready for review comments.
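
The artifact schema isn't published, but given the essentials the pipeline tracks (file, line, message, plus the source tag added in the Tag & Upload step), one entry in results-agent.json might look roughly like this sketch:

```typescript
// Hypothetical shape of one finding in results-agent.json.
// Field names are assumptions based on the file/line/message
// essentials mentioned later in the post.
interface ReviewFinding {
  source: "agent" | "security-detector" | "codeql"; // tag added in Tag & Upload
  file: string;
  line: number;
  message: string;
  severity?: "info" | "warning" | "error";
}

const example: ReviewFinding = {
  source: "agent",
  file: "src/payments/refund.ts",
  line: 42,
  message: "Uses raw await; project convention requires SafeAwait.",
  severity: "warning",
};
```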

Running Locally with Open Source Models: For complete data privacy, use local models like Llama 3, Mistral/Mixtral, Qwen, or DeepSeek. Host them with Ollama, LM Studio, or LocalAI on your GPU, then point your agent at the local endpoint (for Ollama, the default is localhost:11434) instead of external APIs. This ensures zero data leaves your machine.
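
As a sketch, pointing the agent at a local model could be a single HTTP call against Ollama's chat endpoint; the model name and prompt below are placeholders:

```typescript
// Minimal sketch: send a review prompt to a locally hosted model via
// Ollama's chat API (default port 11434). No data leaves the machine.
async function reviewLocally(diff: string): Promise<string> {
  const response = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3", // any locally pulled model: mistral, qwen, deepseek, ...
      stream: false,
      messages: [
        { role: "system", content: "You are a strict code reviewer." },
        { role: "user", content: `Review this diff:\n${diff}` },
      ],
    }),
  });
  const data = await response.json();
  return data.message.content; // Ollama returns { message: { role, content }, ... }
}
```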

Phase 3: Code Analysis

The Code Analysis phase runs static analysis to find security vulnerabilities and logic errors. Taking 30% of workflow time, it uses CodeQL to scan the codebase with specific rules while filtering out noise.

| Step | Action | Output File |
|------|--------|-------------|
| Setup & Checkout | Clone repository at PR ref | - |
| Initialize CodeQL | Download CLI, configure queries | - |
| Run Analysis | Execute specific query rules | results.sarif |
| Convert SARIF | Transform to custom JSON format | codeql-javascript-typescript-ccr.json |
| Upload Artifact | Store results for downstream use | results-codeql-javascript-typescript.zip |

CodeQL runs targeted checks like js/unreachable-statement (dead code), js/useless-comparison-test (always true/false), js/inconsistent-loop-direction (infinite loops), and js/use-of-returnless-function (void return usage). The configuration disables default queries and applies filters to exclude noisy warnings.
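
To make these concrete, here is a small TypeScript snippet containing exactly the kinds of issues those queries flag; the function itself is purely illustrative:

```typescript
function processItems(items: number[]): number {
  const count = items.length;

  // js/useless-comparison-test: a length is never negative,
  // so this condition is always false.
  if (count < 0) {
    return -1;
  }

  // js/inconsistent-loop-direction: the counter moves away from
  // the bound, so the loop never terminates.
  for (let i = 0; i < count; i--) {
    console.log(items[i]);
  }

  return count;
  // js/unreachable-statement: dead code after return.
  console.log("done");
}
```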

Results flow through two formats. First, CodeQL outputs SARIF (Static Analysis Results Interchange Format), an industry-standard JSON format consumed by security platforms and code quality tools. A custom script then converts it to CCR JSON, a simplified format that strips away SARIF's complexity and keeps only the essentials (file, line number, message) for AI consumption. The final artifact contains both formats.
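
The SARIF spec nests each finding several levels deep; a minimal sketch of the conversion, assuming SARIF 2.1.0's standard result layout (the CCR field names here are assumptions):

```typescript
// Convert SARIF (CodeQL's output) into a simplified "CCR"-style JSON:
// one flat entry per finding with just file, line, message, and rule.
interface CcrFinding {
  file: string;
  line: number;
  message: string;
  ruleId: string;
}

function sarifToCcr(sarif: any): CcrFinding[] {
  const findings: CcrFinding[] = [];
  for (const run of sarif.runs ?? []) {
    for (const result of run.results ?? []) {
      const location = result.locations?.[0]?.physicalLocation;
      findings.push({
        file: location?.artifactLocation?.uri ?? "unknown",
        line: location?.region?.startLine ?? 0,
        message: result.message?.text ?? "",
        ruleId: result.ruleId ?? "",
      });
    }
  }
  return findings;
}
```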

Phase 4: Upload Results

The Upload Results phase aggregates findings from all analyzers and posts them to the PR. Taking only 2% of workflow time, it handles deduplication and delivers consolidated feedback.

| Step | Action | Output |
|------|--------|--------|
| Initialize Secrets | Parse payload, set secure variables | Configured environment |
| Download Tools | Fetch ccrcli for API communication | CLI ready |
| Prepare Payload | Clean input data for callback | payload.json |
| Download Artifacts | Retrieve agent and CodeQL results | All findings collected |
| Run Deduplication | Compare and merge duplicate findings | Unique comments only |
| Execute Callback | Send results to PR via ccrcli | Comments posted |

The phase downloads artifacts from both the AI Agent (results-agent) and CodeQL (results-codeql-*), then uses the autofind tool to detect and remove duplicate findings. If the AI already flagged an issue that CodeQL also found, only one comment appears. Finally, ccrcli sends the deduplicated results back to the Pull Request as review comments.
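
The post doesn't specify autofind's matching logic, but a minimal deduplication pass might key findings by file and line and keep the first occurrence, as in this sketch:

```typescript
interface Finding {
  source: string; // "agent", "security-detector", "codeql", ...
  file: string;
  line: number;
  message: string;
}

// Naive deduplication sketch: treat findings on the same file and line
// as duplicates and keep only the first one. The real autofind tool
// presumably uses smarter matching (e.g., message similarity).
function deduplicate(findings: Finding[]): Finding[] {
  const seen = new Set<string>();
  return findings.filter((f) => {
    const key = `${f.file}:${f.line}`;
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```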

Phase 5: Cleanup Artifacts

The Cleanup phase removes temporary artifacts to free storage. Taking 70% of workflow time (mostly waiting), it uses the platform's API to delete all artifacts generated during the analysis.

| Step | Action | Artifacts Deleted |
|------|--------|-------------------|
| List Artifacts | Query platform API for run artifacts | - |
| Delete Loop | Iterate and delete each artifact | results-agent.zip, results-security-detector.zip, results-codeql-javascript-typescript.zip |
| Confirm Cleanup | Log completion status | All temporary files removed |

The cleanup script uses the platform's API to list all artifacts for the workflow run, pipes the IDs through xargs, and deletes each one via REST API calls. This prevents storage bloat from accumulating analysis results across hundreds of PRs. The phase runs last and takes the longest due to API rate limits and sequential deletion, but requires minimal compute resources.
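
Assuming the platform is GitHub Actions (whose REST API exposes per-run artifact listing and deletion), the same list-and-delete loop could be sketched in TypeScript like this:

```typescript
// Sketch: delete every artifact of a workflow run via GitHub's REST API.
// repo ("owner/name"), runId, and token would come from the workflow environment.
async function cleanupArtifacts(repo: string, runId: number, token: string): Promise<void> {
  const headers = {
    Authorization: `Bearer ${token}`,
    Accept: "application/vnd.github+json",
  };

  // List all artifacts produced by this run.
  const listUrl = `https://api.github.com/repos/${repo}/actions/runs/${runId}/artifacts`;
  const { artifacts } = await (await fetch(listUrl, { headers })).json();

  // Delete sequentially; API rate limits make this the slow part.
  for (const artifact of artifacts) {
    await fetch(`https://api.github.com/repos/${repo}/actions/artifacts/${artifact.id}`, {
      method: "DELETE",
      headers,
    });
    console.log(`Deleted ${artifact.name}`);
  }
}
```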

Running Anywhere

The beauty of this architecture is its flexibility. The same workflow can execute:

Locally on your machine:

```bash
code-analyzer analyze --mode=local --path=./src
```

On a remote server:

```bash
code-analyzer analyze --mode=remote --endpoint=https://api.example.com
```

In your CI/CD pipeline:

```yaml
- name: Run Code Analysis
  run: code-analyzer analyze --mode=cicd --pr=${{ github.event.number }}
```

The preparation phase adapts to each context, discovering files, detecting languages, and configuring analyzers appropriately.
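
As a sketch, that context detection could be as simple as checking standard CI environment variables before falling back to local mode; GITHUB_ACTIONS and CI are real variables set by CI providers, while ANALYZER_ENDPOINT is a hypothetical stand-in:

```typescript
type Mode = "local" | "remote" | "cicd";

// Sketch of context detection in the Prepare phase. GITHUB_ACTIONS and CI
// are standard CI environment variables; ANALYZER_ENDPOINT is hypothetical.
function detectMode(): Mode {
  if (process.env.GITHUB_ACTIONS === "true" || process.env.CI === "true") {
    return "cicd";
  }
  if (process.env.ANALYZER_ENDPOINT) {
    return "remote";
  }
  return "local";
}
```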


Published on 12/2/2025 by Claudio Teixeira