Git History Rewrite at Scale: Removing 100MB+ Files Safely

Introduction

Large files inside Git repositories are a silent problem. They increase clone times, inflate repository size, and on platforms like Bitbucket Cloud can block pushes entirely once a file exceeds 100MB.

During a migration exercise, we encountered multiple repositories containing large binary files embedded directly in Git history. Some were intentionally added during testing; others were legacy artifacts. Regardless of origin, the impact was the same: repository growth, push failures, and migration risk.

We needed a scalable, production-safe solution to:

  • Identify files larger than 100MB
  • Preserve those files safely
  • Remove them from Git history
  • Maintain traceability
  • Avoid Git LFS
  • Process multiple repositories in batch

This article explains the approach, implementation, and verification process.

The Problem with Large Files in Git

Git is optimized for source code, not large binaries. When a file larger than 100MB is committed:

  • It becomes part of Git object history.
  • Even if later deleted, the blob remains in history.
  • Every clone downloads that blob.
  • Bitbucket Cloud blocks pushes containing files ≥100MB.
  • Repository size increases permanently unless history is rewritten.

Deleting the file in a new commit is not enough. The blob must be removed from the entire commit graph.


Requirements

We defined clear technical requirements:

  1. Scan multiple repositories under a parent directory.
  2. Detect files larger than 100MB in:
    • Working directory
    • Full Git history
  3. Generate a detailed CSV audit report.
  4. Back up repositories before modification.
  5. Archive large blobs to S3 before removal.
  6. Rewrite Git history safely.
  7. Force push cleaned repositories.
  8. Verify that no large blobs remain.

Architecture Overview

The cleanup workflow followed this structure: discover repositories → scan Git history for large blobs → generate a CSV audit report → create local backups → archive blobs to S3 → rewrite history → force push → verify.

Implementation Strategy

1. Repository Discovery

All repositories were discovered under a specified parent directory by locating .git folders. This allowed batch processing without hardcoding repository names.
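A minimal discovery pass can be sketched as follows; the parent directory path is a placeholder, not the one used in the actual migration:

```shell
#!/usr/bin/env bash
# Discover every Git repository under a parent directory by locating .git folders.
# -prune stops find from descending into the .git directories themselves.
PARENT_DIR="${1:-/srv/repos}"   # hypothetical parent path

find "$PARENT_DIR" -type d -name .git -prune -print | while read -r gitdir; do
    repo="$(dirname "$gitdir")"
    echo "Found repository: $repo"
done
```

Each discovered repository path can then be fed into the scanning and cleanup steps below.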

2. Scanning Git History

To detect large blobs in history, we relied on Git’s object database:

git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)'

We filtered for blobs larger than 100MB (104857600 bytes). This approach ensures that even deleted historical files are detected.
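The filtering step above can be expressed as a single pipeline, here extended with a sort so the largest offenders appear first:

```shell
# List every blob >= 100MB (104857600 bytes) anywhere in history, largest first.
# Output columns: size-in-bytes, blob hash, path.
git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| awk '$1 == "blob" && $3 >= 104857600 { print $3, $2, $4 }' \
| sort -rn
```

Because `git rev-list --objects --all` walks every reachable commit, this catches blobs that were deleted long ago but still live in the object database.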

3. CSV Report Generation

For traceability, a consolidated CSV report was generated containing:

  • Repository name
  • File path
  • File size (bytes + human-readable)
  • Blob hash
  • Commit hash
  • Target S3 path

This report served as:

  • A dry-run validation artifact
  • An audit record
  • A mapping between Git history and S3 storage
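A sketch of the report generation, assuming the repository name, bucket name, and CSV column layout shown here (the real report used equivalent fields). The commit hash requires a separate lookup per blob, e.g. `git log --all --find-object=<blob-hash>`, so it is left blank in this sketch:

```shell
# Emit one CSV row per oversized blob (column names are illustrative).
repo="my-repo"        # hypothetical repository name
bucket="my-archive"   # hypothetical S3 bucket

echo "repo,path,size_bytes,blob_hash,commit_hash,s3_path" > large_files_report.csv
git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| awk -v repo="$repo" -v bucket="$bucket" '
    $1 == "blob" && $3 >= 104857600 {
        printf "%s,%s,%s,%s,,s3://%s/%s/%s\n", repo, $4, $3, $2, bucket, repo, $4
    }' >> large_files_report.csv
```

Running this per repository and concatenating the outputs yields the consolidated audit report.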

4. S3 Archival Before Deletion

Before rewriting history, large blobs were extracted using:

git cat-file -p <blob-hash>

They were uploaded to S3 using a structured path:

s3://<bucket>/<repo-name>/<commit-hash>/<file-name>

This ensured:

  • No data loss
  • Full traceability
  • Commit-level mapping
  • Easy retrieval if required

5. Safe History Rewrite

We used git-filter-repo, which is the modern, recommended alternative to git filter-branch.

For each repository:

  • Paths to remove were collected from the CSV report.
  • git-filter-repo --invert-paths was used to remove those paths from all commits.
  • A full backup tarball was created before execution.
  • A confirmation prompt prevented accidental execution.

This resulted in a new, clean commit graph with large blobs removed.
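The per-repository rewrite can be sketched as follows, assuming git-filter-repo is installed and the paths file has been extracted from the CSV report (file and directory names here are illustrative):

```shell
# Back up the repository, then strip the recorded paths from every commit.
repo_dir="my-repo"   # hypothetical repository directory
tar -czf "${repo_dir}-backup.tar.gz" "$repo_dir"

cd "$repo_dir"
# paths-to-remove.txt lists one path per line, taken from the CSV report.
# --invert-paths keeps everything EXCEPT the listed paths; --force is needed
# because the clone is not fresh.
git filter-repo --invert-paths --paths-from-file ../paths-to-remove.txt --force
```

Note that git-filter-repo rewrites every affected commit, so all downstream hashes change.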

6. Force Push and Remote Restoration

Since history was rewritten:

  • All commit hashes changed.
  • A force push was required.
  • Team members were instructed to re-clone repositories.

Remote URLs were preserved or restored automatically to ensure push continuity.
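One detail worth calling out: git-filter-repo deliberately removes the `origin` remote to prevent an accidental push of rewritten history. A restore-and-push sketch, with a hypothetical remote URL:

```shell
# Restore the remote (filter-repo removes it), then force-push branches and tags.
remote_url="git@bitbucket.org:team/my-repo.git"   # hypothetical URL

git remote add origin "$remote_url" 2>/dev/null || git remote set-url origin "$remote_url"
git push --force --all origin
git push --force --tags origin
```

After this, any existing clone holds stale history, which is why re-cloning is the safest instruction for the team.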

Verification Method

Cleanup was verified using a direct scan of Git objects:

git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize)' \
| awk '$1 == "blob" && $3 >= 104857600'

If the command returned no output, it confirmed:

  • No blob ≥100MB remained
  • History rewrite was successful
  • Repository was safe to push and clone

This verification step is critical and should never be skipped.

Results

  • 10 repositories processed
  • ~3GB of large blobs identified
  • All large files archived to S3
  • Git history rewritten safely
  • Force push completed
  • No remaining blobs ≥100MB in any repository
  • Repositories ready for clean migration

Lessons Learned

  1. Deleting files in a commit does not remove them from history.
  2. Always run a dry-run before destructive operations.
  3. Always create a full backup before rewriting history.
  4. Always verify using Git’s object database.
  5. Separate source code storage from binary artifact storage.
  6. Avoid large binary commits unless using Git LFS intentionally.

Best Practices for Production

  • Keep repositories focused on source code.
  • Use external storage (S3, artifact repositories) for large binaries.
  • Automate detection of large files in CI pipelines.
  • Add pre-commit or pre-receive hooks to block oversized files.
  • Regularly audit repository object sizes.
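A hook along these lines can enforce the size limit at commit time; this is a minimal sketch of a `pre-commit` hook, with the threshold adjustable to taste:

```shell
#!/usr/bin/env bash
# .git/hooks/pre-commit — reject any staged file >= 100MB.
LIMIT=104857600

fail=0
while read -r file; do
    # :$file resolves to the staged (index) version of the file.
    size=$(git cat-file -s "$(git rev-parse ":$file")")
    if [ "$size" -ge "$LIMIT" ]; then
        echo "ERROR: $file is ${size} bytes (limit: ${LIMIT})." >&2
        fail=1
    fi
done < <(git diff --cached --name-only --diff-filter=AM)
exit $fail
```

A server-side pre-receive hook offers stronger guarantees, since client-side hooks can be bypassed, but the check logic is the same.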

Conclusion

Rewriting Git history at scale is a high-impact, high-risk operation if not handled properly. However, with a structured approach, proper backups, archival strategy, and verification, it becomes a controlled and repeatable process.

By combining Git object analysis, S3 archival, and git-filter-repo, we successfully removed large files from multiple repositories without data loss and without relying on Git LFS.

This approach provides a scalable blueprint for teams facing similar migration or repository health challenges.
