Inside GitHub’s Code Universe: What Does 873 GB of Source Code Really Mean?

September 08, 2025

✅ 1. What Does “873 GB of Source Code” Mean?

The figure refers to the total disk space taken up by the source code files in one specific dataset, the CodeParrot GitHub Code dataset. That means:

When all the code files are downloaded, extracted, and stored, their combined size (on disk) is about 873 gigabytes (GB).
This does not include things like videos, binary files, or large compiled executables—only source code files (like .py, .js, .cpp, .java, etc.).

📁 2. Where Does This 873 GB Come From?

This number comes from the CodeParrot dataset, a project by Hugging Face that:

Crawled public GitHub repositories
Collected source code files in 32 programming languages (60 file extensions)
Filtered out non-code files like images or compiled binaries
Focused on files that were public and licensed appropriately
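
The filtering step in the list above can be pictured as a simple extension check. The sketch below is illustrative only: the SOURCE_EXTENSIONS set and the is_source_file helper are assumptions made for this example, not the CodeParrot project's actual filter.

```python
from pathlib import PurePosixPath

# Hypothetical allow-list for illustration; a real dataset would use a longer one.
SOURCE_EXTENSIONS = {".py", ".js", ".cpp", ".java", ".html"}

def is_source_file(path: str) -> bool:
    """Return True if the path ends in one of the allowed source extensions."""
    return PurePosixPath(path).suffix.lower() in SOURCE_EXTENSIONS

paths = ["src/app.py", "assets/logo.png", "lib/Main.java", "bin/tool.exe"]
kept = [p for p in paths if is_source_file(p)]
print(kept)
```

Here the image and the executable are dropped and the two source files are kept.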

⚠️ Important:

This is only a subset of GitHub — mainly public repositories as of a specific snapshot in time (early 2021).
It does not include:

Private repositories
Deleted or archived projects
GitHub metadata (issues, pull requests, wikis, etc.)

So the 873 GB is just a snapshot, and the real size of all code on GitHub (as of August 2025) would be significantly larger.

🧾 3. What Is in Those 873 GB?

According to the CodeParrot project:

Language	Approximate Size	File Count
Python	52 GB	~7.2 million
JavaScript	92 GB	~11.4 million
HTML	69 GB	~10.3 million
Java	65 GB	~9.6 million
C++	34 GB	~4.3 million
JSON/XML/YAML (combined)	100+ GB	Millions
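
One quick sanity check on the table is to divide each language's size by its file count. The sketch below uses only the approximate figures from the rows above and assumes decimal units (1 GB = 1,000,000 KB); the average_file_kb helper is a name made up for this example.

```python
# Figures copied from the table above: (size in GB, files in millions).
table = {
    "Python": (52, 7.2),
    "JavaScript": (92, 11.4),
    "HTML": (69, 10.3),
    "Java": (65, 9.6),
    "C++": (34, 4.3),
}

def average_file_kb(size_gb: float, files_millions: float) -> float:
    """Average file size in KB, using 1 GB = 1,000,000 KB."""
    return (size_gb * 1_000_000) / (files_millions * 1_000_000)

averages = {lang: average_file_kb(gb, m) for lang, (gb, m) in table.items()}
for lang, kb in averages.items():
    print(f"{lang}: ~{kb:.1f} KB per file")
```

With these figures every language lands at roughly 7 to 8 KB per file, which is in line with the ~8 KB average assumed in section 4.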

👉 These sizes include:

Actual code
Comments
Whitespace
Sometimes small configuration or metadata files

These totals also include duplicate files (for example, forks or copied codebases) unless they were filtered out.
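
To see why duplicates matter for the total, here is a minimal sketch of exact-match deduplication by content hash. It is one common approach, not necessarily what the dataset's authors did; the deduplicate function and the sample files are hypothetical.

```python
import hashlib

def deduplicate(files: dict[str, bytes]) -> dict[str, bytes]:
    """Keep the first path seen for each distinct file content."""
    seen: set[str] = set()
    unique: dict[str, bytes] = {}
    for path, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique[path] = content
    return unique

# Made-up sample: a fork carries a byte-for-byte copy of utils.py.
files = {
    "original/utils.py": b"def add(a, b):\n    return a + b\n",
    "fork/utils.py": b"def add(a, b):\n    return a + b\n",
    "other/main.py": b"print('hello')\n",
}
unique = deduplicate(files)
total = sum(len(c) for c in files.values())
kept = sum(len(c) for c in unique.values())
print(f"{len(files)} files, {total} bytes -> {len(unique)} files, {kept} bytes")
```

Exact hashing only catches identical copies. Files that differ by a single character count as distinct, so near-duplicates need a fuzzier method.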

🧮 4. How Does 873 GB Translate to Lines of Code?

There is no fixed conversion from bytes to lines, but you can make a rough estimate:

A simple Python file may be ~2 KB and contain ~40–100 lines.
A large project (like a JavaScript framework) could be thousands of files totaling tens of megabytes.
Assuming an average file size of ~8 KB: 873 GB (873,000,000 KB) / 8 KB = ~109 million files

📌 Estimated lines of code?
Assuming ~50–100 lines per file:
→ roughly 5.5 to 10.9 billion lines of code in the dataset
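
The same back-of-the-envelope estimate can be written as a short script, which makes it easy to try other assumptions. The estimate_lines function is a hypothetical helper; the 8 KB average and the 50 to 100 lines per file are the assumptions stated above, not measured values.

```python
def estimate_lines(total_gb: float, avg_file_kb: float,
                   lines_per_file: tuple[int, int]) -> tuple[float, float, float]:
    """Return (file count, low line estimate, high line estimate)."""
    total_kb = total_gb * 1_000_000  # decimal units: 1 GB = 1,000,000 KB
    files = total_kb / avg_file_kb
    low, high = lines_per_file
    return files, files * low, files * high

files, low, high = estimate_lines(873, 8, (50, 100))
print(f"~{files / 1e6:.0f} million files")
print(f"~{low / 1e9:.2f} to {high / 1e9:.2f} billion lines")
```

Changing the average file size or the lines-per-file range shifts the result proportionally, so the final number is only as good as those two guesses.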

💾 5. How Big Is 873 GB in Real Terms?

To give you a sense of how large this is:

Storage Device	Size Comparison
USB drive	Would need 7 x 128 GB drives (873 / 128 ≈ 6.8)
External HDD	Would fit on one standard 1 TB drive
Cloud storage	Roughly $10–20/month on AWS S3, depending on storage class and region
Printout (paper)	Would require on the order of 100–200 million pages at ~50 lines per page
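
The storage comparisons can be rechecked with a few lines of arithmetic. In the sketch below, drives_needed and monthly_cost are hypothetical helpers, and the two per-GB rates are assumed example values rather than quoted prices, so check the provider's current pricing before relying on them.

```python
import math

DATASET_GB = 873  # figure quoted in this post

def drives_needed(dataset_gb: float, drive_gb: float) -> int:
    """Smallest whole number of drives that can hold the dataset."""
    return math.ceil(dataset_gb / drive_gb)

def monthly_cost(dataset_gb: float, usd_per_gb_month: float) -> float:
    """Flat storage cost, ignoring request and transfer fees."""
    return dataset_gb * usd_per_gb_month

usb_drives = drives_needed(DATASET_GB, 128)
hdd_drives = drives_needed(DATASET_GB, 1000)
print(f"128 GB USB drives: {usb_drives}")
print(f"1 TB hard drives: {hdd_drives}")

# Assumed example rates in USD per GB-month; not taken from any price list.
for rate in (0.0125, 0.023):
    print(f"at ${rate}/GB-month: ${monthly_cost(DATASET_GB, rate):.2f}/month")
```

The drive counts use nominal decimal capacities; usable space on a formatted drive is somewhat lower.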

🧠 6. Why Does This Matter?

Understanding the size of code on GitHub gives insight into:

The massive scale of open-source software
The amount of data needed to train AI code models (like GitHub Copilot, ChatGPT, etc.)
The diversity of programming languages and use cases on GitHub
The challenges of storing, analyzing, and indexing this volume of code

🚀 Final Thoughts

Key Point	Explanation
873 GB	Approx. total size of source code files collected from GitHub (subset)
Public repositories only	Snapshot used for machine learning/code research
Billions of lines of code	Likely range of total lines in the dataset
Vast language diversity	Includes Python, JavaScript, C++, Java, etc.
Used in AI training	Datasets like this are used to train the code models behind tools such as GitHub Copilot and ChatGPT

Sameer Naik