Why does cloning from VSTS return old unreferenced objects?

UPDATE (2017-08-09):

We rolled out commit reachability bitmap indexes to VSTS and removed the clone cheat mentioned below. Cloning will no longer download unreachable objects! We don't have true object-level git gc on the server yet, but clone sizes will be smaller now.

TFS on-prem will get these changes in v.Next (not in any TFS 2017 updates, but the next major release). As Brian Harry mentioned, we should have a release candidate for v.Next in a few weeks.

We'll probably expand on this in future blog posts, but unlike core Git, we use Roaring bitmaps instead of EWAH bitmaps.  Daniel Lemire has some great blog posts and publications on bitmap indexes that we greatly enjoyed and benefited from.

Original Post: 

Note: "core Git" refers to the official base Git implementation, as opposed to Visual Studio, GitHub, or VSTS, which may have non-standard implementations or behavior.

A customer asked:

We removed some unwanted binaries from our repo on visualstudio.com by following the instructions at https://help.github.com/articles/remove-sensitive-data/. We force-pushed to master and deleted all our other branches.

After running git gc locally, our local repo is now 5 MB, but git clone from visualstudio.com still returns 100MB. The old unreferenced blobs are still being sent down by the server.

How do we git gc (or some equivalent) on the server as well?

There are two issues here:

  1. There is no equivalent to git gc on VSTS yet.

    Our server preserves the history of every ref/branch update to Git repos, including deleted branches. This is analogous to the "reflog" in core Git. On VSTS, we expose the reflog via the REST API and the Branch Updates (i.e. pushes) tab in Web Access. As in core Git, objects in the reflog are still considered to be referenced and will not be deleted by git gc. Core Git can eventually expire old reflog entries via git reflog expire or git gc, but VSTS does not have that functionality yet.

  2. Large fetches are expensive for the server to calculate, so we cheat a little.

    Large fetches (and clones) have historically been very expensive in both core Git and VSTS due to the "counting objects" phase. http://githubengineering.com/counting-objects/ has a nice explanation of the problem, as well as how core Git and GitHub have (cleverly) improved the performance with bitmap indexes.

    Unfortunately, VSTS does not have that perf fix yet. Instead, it cheats a bit and blindly streams back every object that exists on the server if the client has nothing and asks for all branches and tags (e.g. for git clone). This is generally reasonable, until a user decides to dereference most of the objects in their repo to save space!
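
To see the reflog behavior from item 1 in core Git itself, here is a minimal local sketch. It assumes a throwaway repo in a temp directory; the branch name feature, the file big.bin, and the demo identity are made up for illustration. It only shows core Git and says nothing about how VSTS stores its reflog.

```shell
# Sketch: a commit that only the reflog references survives `git gc`
# until the reflog entry itself is expired.
demo=$(mktemp -d)                          # throwaway demo repo (hypothetical)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/master    # pin the branch name for the demo
git config user.name "Demo"
git config user.email "demo@example.com"

git commit -q --allow-empty -m "base"

# Add the "unwanted binary" on a side branch, then delete that branch.
git checkout -q -b feature
echo "pretend this is a large binary" > big.bin
git add big.bin
git commit -q -m "add unwanted binary"
blob=$(git rev-parse HEAD:big.bin)
git checkout -q master
git branch -q -D feature

# No branch points at the blob now, but HEAD's reflog still does.
git gc -q --prune=now
if git cat-file -e "$blob" 2>/dev/null; then after_gc=present; else after_gc=gone; fi

# Expire the reflog, then gc again: the blob is now truly unreachable.
git reflog expire --expire=now --all
git gc -q --prune=now
if git cat-file -e "$blob" 2>/dev/null; then after_expire=present; else after_expire=gone; fi

echo "after gc: $after_gc, after reflog expire + gc: $after_expire"
```

On a recent core Git this should report the blob as present after the first gc and gone after the second. The second step is the part VSTS has no equivalent for yet.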

I suspect that the customer would not have minded the lack of gc in his scenario if we only sent reachable objects during clone.
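
The difference between "every object on the server" and "every reachable object" can be illustrated locally with core Git. This is only a sketch under assumptions: the demo repo, the file main.txt, and the orphaned blob are invented stand-ins, and it does not reproduce what the VSTS server actually does internally.

```shell
# Sketch: count objects reachable from refs vs. all objects in the store.
demo=$(mktemp -d)                          # throwaway demo repo (hypothetical)
cd "$demo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

echo "source" > main.txt
git add main.txt
git commit -q -m "keep this"

# Write a blob that no commit references, standing in for a removed binary.
echo "pretend this is a removed binary" | git hash-object -w --stdin >/dev/null

# What a reachability-aware clone needs: objects reachable from all refs.
reachable=$(git rev-list --objects --all | wc -l | tr -d ' ')

# What a "send everything" clone streams: every object in the object store.
everything=$(git cat-file --batch-all-objects --batch-check | wc -l | tr -d ' ')

echo "reachable: $reachable, stored: $everything"
```

Here the reachable set is one commit, one tree, and one blob, while the store also holds the orphaned blob. In the customer's repo that gap was roughly 5 MB versus 100 MB.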

Until these issues are fixed for VSTS, what workarounds are there?

  • Delete the repo from the server (EDIT: or rename it) and re-push it.

    This works, but is sub-optimal. In the new repo, you won't be able to see old pull request details, branch update history, or any links from other areas like builds or work items.

  • Trick the server by not cloning everything at once:

    mkdir newRepo
    cd newRepo
    git init
    # <repo-url> is a placeholder for your repo's clone URL
    git remote add origin <repo-url>
    # fetch one branch first
    git fetch origin master
    # fetch everything else
    git fetch origin
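
To check whether the two-step fetch actually came down smaller than a plain clone, one option is to compare the size of each repo's object store. The helper below is a hypothetical convenience (the name repo_objects_kib is not a Git command); it just sums the size and size-pack fields that git count-objects -v reports, both in KiB.

```shell
# Print the on-disk size of a repo's object store in KiB (loose + packed).
# $1 is the path to the repo's working directory.
repo_objects_kib() {
    git -C "$1" count-objects -v |
        awk '/^size:|^size-pack:/ { kib += $2 } END { print kib + 0 }'
}
```

For example, run repo_objects_kib newRepo from the parent directory and compare the result against the same call on a regular clone of the repo.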