# Why Google stores billions of lines of code in a single repository

## Meta Info

Presented in [Communications of the ACM 2016](https://doi.org/10.1145/2854146).

Authors: Rachel Potvin, Josh Levenberg (*Google*)

## Understanding the paper

### TL;DR

* Google chose to stick with a single central (monolithic) repository because of its advantages for the company's development workflow.
* The monolithic model of source code management is *not for everyone*, e.g., organizations where large parts of the codebase are *private* or *hidden* between groups.

### Key systems

* **Piper**: The distributed source-code repository
  * Implemented on top of standard Google infrastructure (originally Bigtable, now Spanner)
  * Relies on the Paxos algorithm to guarantee consistency across replicas
* **CitC** (Clients in the Cloud): The workspace client
  * With *a cloud-based storage backend* and *a Linux-only FUSE file system*
* **Critique**: The code-review tool
* **Tricorder**: Static analysis system
  * Code quality, test coverage, and test results
* **Rosie**: Large-scale cleanups and code changes
  1. Create a large patch; find-and-replace
  2. Split the large patch into smaller patches; test them independently; send for code review; commit them automatically once they pass tests and a code review
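
The splitting step in Rosie's workflow can be illustrated with a small sketch. This is not Rosie's implementation: it assumes the large patch is split along top-level project directories, and the function name `split_patch_by_project`, the `depth` parameter, and the file paths are all hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def split_patch_by_project(changed_files, depth=1):
    """Group the files of one large patch into per-project sub-patches.

    `depth` is a hypothetical knob: how many leading path components
    identify a "project" directory.
    """
    groups = defaultdict(list)
    for path in changed_files:
        parts = PurePosixPath(path).parts
        project = "/".join(parts[:depth])
        groups[project].append(path)
    # Sort for deterministic output; each value is one smaller patch.
    return {project: sorted(files) for project, files in sorted(groups.items())}


# Hypothetical file list touched by one find-and-replace change.
large_patch = [
    "search/indexer/parse.cc",
    "search/frontend/render.cc",
    "maps/tiles/encode.cc",
    "ads/bidding/auction.cc",
    "maps/routing/graph.cc",
]

sub_patches = split_patch_by_project(large_patch)
for project, files in sub_patches.items():
    print(project, files)
```

Each resulting group could then be tested, reviewed, and committed independently, which is the point of step 2.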

### Statistics

* Google’s monolithic software repository is used by 95% of its software developers worldwide.
* The Google codebase includes
  * approximately *1 billion files*
  * a history of *35 million commits*
  * 86TB of data (excluding release branches)
* Over 99% of files stored in Piper are visible to all full-time Google engineers.
* Over 80% of Piper users today use CitC.

### Advantages of a monolithic codebase

* Unified versioning → a single source of truth
* Code sharing and reuse
* Simplified dependency management
  * Avoids the *diamond dependency problem*
* Atomic changes
* Large-scale refactoring
* Collaboration across teams
* Flexible team boundaries and code ownership
* Code visibility and clear tree structure → implicit team namespacing
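
To make the diamond dependency problem concrete, here is a minimal sketch. It assumes a toy model in which each package pins the versions of its direct dependencies; the package names, version strings, and the helper `find_version_conflicts` are illustrative and do not come from the paper.

```python
def find_version_conflicts(requirements):
    """Return dependencies that are required at more than one version.

    `requirements` maps a package name to {dependency: required_version}.
    The result maps each conflicting dependency to
    {version: [packages requiring that version]}.
    """
    wanted = {}
    for package, deps in requirements.items():
        for dep, version in deps.items():
            wanted.setdefault(dep, {}).setdefault(version, []).append(package)
    return {dep: versions for dep, versions in wanted.items() if len(versions) > 1}


# Separately versioned repositories: A depends on B and C,
# and B and C pin different versions of D (the "diamond").
multi_repo = {
    "A": {"B": "1.0", "C": "1.0"},
    "B": {"D": "1.0"},
    "C": {"D": "2.0"},
}

# Single repository: every package builds against the one version at head.
monorepo = {
    "A": {"B": "head", "C": "head"},
    "B": {"D": "head"},
    "C": {"D": "head"},
}

conflicts = find_version_conflicts(multi_repo)
print(conflicts)
print(find_version_conflicts(monorepo))
```

In the first graph, A cannot be built without choosing between two versions of D. With unified versioning there is only one version of D to build against, so the conflict cannot arise.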

### Costs and trade-offs

* Tooling investments for both development and execution
  * Code-indexing system
  * Automated test infrastructure
  * Build infrastructure
  * Code search and browsing tools
* Codebase complexity
  * Unnecessary dependencies → binary size bloat
* Efforts invested in code health
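
The link between unnecessary dependencies and binary bloat can be sketched as follows. This toy example assumes a build pulls in every transitive dependency of a target; the dependency graph, the target names, and the helper `transitive_deps` are hypothetical and are not Google's tooling.

```python
def transitive_deps(graph, target):
    """Return every dependency reachable from `target` in `graph`.

    `graph` maps a target name to the list of its direct dependencies.
    """
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


# Hypothetical build graph: "app" declares "imaging" but never uses it.
graph = {
    "app": ["logging", "imaging"],
    "logging": ["strings"],
    "imaging": ["codecs", "math"],
    "codecs": ["compression"],
}

with_unused = transitive_deps(graph, "app")

# Same graph with the unnecessary direct dependency removed.
trimmed = dict(graph, app=["logging"])
without_unused = transitive_deps(trimmed, "app")

print(sorted(with_unused - without_unused))
```

One unneeded direct dependency drags in its whole subtree, which is why easy code reuse needs to be paired with dependency hygiene.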

### Alternatives

* Git (distributed version control systems)
  * A team at Google is focused on supporting Git, which is used by *Google’s Android and Chrome teams* outside the main Google repository.
  * Important for these teams due to *external partner and open source collaborations*.
  * The Git community strongly suggests and prefers developers have *more and smaller repositories*.
    * A `git clone` copies all content to one's local machine, which is impractical for a very large repository.
* Mercurial
  * An experimental effort to explore whether Mercurial can scale to a codebase of Google's size


