Why does cloning from VSTS return old unreferenced objects?
We rolled out commit reachability bitmap indexes to VSTS and removed the clone cheat mentioned below. Cloning will no longer download unreachable objects! We still don't have true object-level git gc on the server yet, but clone sizes will be smaller now.
TFS on-prem will get these changes in v.Next (not in any TFS 2017 updates, but the next major release). As Brian Harry mentioned, we should have a release candidate for v.Next in a few weeks.
We'll probably expand on this in future blog posts, but unlike core Git, we use Roaring bitmaps instead of EWAH bitmaps. Daniel Lemire has some great blog posts and publications on bitmap indexes that we greatly enjoyed and benefited from.
Note: "core Git" refers to the official base Git implementation, as opposed to Visual Studio, GitHub, or VSTS, which may involve non-standard implementations or behavior.
A customer asked:
We removed some unwanted binaries from our repo on visualstudio.com by following the instructions at https://help.github.com/articles/remove-sensitive-data/. We force-pushed to master and deleted all our other branches.
After running git gc locally, our local repo is now 5 MB, but git clone from visualstudio.com still returns 100 MB. The old unreferenced blobs are still being sent down by the server.
How do we run git gc (or some equivalent) on the server as well?
There are two issues here:
There is no equivalent to git gc on VSTS yet.
Our server preserves the history of every ref/branch update to Git repos, including deleted branches. This is analogous to the "reflog" in core Git. On VSTS, we expose the reflog via the REST API and the Branch Updates (i.e. pushes) tab in Web Access. As in core Git, objects in the reflog are still considered to be referenced and will not be deleted by git gc. Core Git can eventually prune old reflog entries via git gc, but VSTS does not have that functionality yet.
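The reflog behavior described above can be reproduced locally with core Git. The following is a minimal sketch, not a description of how VSTS works internally; the throwaway repo, the branch name big-files, and the file big.bin are all hypothetical. It commits a blob on a side branch, deletes the branch, and shows that git gc keeps the blob until the reflog entries are expired.

```shell
#!/bin/sh
set -e

# Throwaway repo; every name below is hypothetical.
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"

echo "hello" > README
git add README
git commit -q -m "initial"
main=$(git symbolic-ref --short HEAD)   # master or main, depending on config

# Commit a blob on a side branch, then delete that branch.
git checkout -q -b big-files
echo "stand-in for a large binary" > big.bin
git add big.bin
git commit -q -m "add big.bin"
blob=$(git rev-parse HEAD:big.bin)
git checkout -q "$main"
git branch -q -D big-files

# HEAD's reflog still mentions the deleted commit, so gc keeps the blob.
git gc -q --prune=now
if git cat-file -e "$blob" 2>/dev/null; then kept_by_reflog=yes; else kept_by_reflog=no; fi

# Expire every reflog entry, then gc again: the blob can now be pruned.
git reflog expire --expire=now --all
git gc -q --prune=now
if git cat-file -e "$blob" 2>/dev/null; then pruned=no; else pruned=yes; fi

echo "kept_by_reflog=$kept_by_reflog pruned=$pruned"
```

Expiring reflogs this aggressively throws away the local safety net for recovering lost commits, so it is only appropriate once you are sure the old history is no longer needed.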
Large fetches are expensive for the server to calculate, so we cheat a little.
Large fetches (and clones) have historically been very expensive in both core Git and VSTS due to the "counting objects" phase. http://githubengineering.com/counting-objects/ has a nice explanation of the problem, as well as how core Git and GitHub have (cleverly) improved the perf w/ bitmap indexes.
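For reference, core Git can write a reachability bitmap alongside a pack file when repacking. The sketch below shows this in a throwaway local repo (the repo and its contents are hypothetical); it says nothing about how VSTS stores or builds its own bitmaps.

```shell
#!/bin/sh
set -e

# Throwaway repo with a single commit; names are hypothetical.
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"
echo "hello" > README
git add README
git commit -q -m "initial"

# Repack everything into one pack and write a reachability bitmap next to it.
git repack -a -d -q --write-bitmap-index

# Count the .bitmap files that now sit beside the pack.
bitmaps=$(ls .git/objects/pack/*.bitmap | wc -l | tr -d ' ')
echo "bitmap files: $bitmaps"
```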
Unfortunately, VSTS does not have that perf fix yet. Instead, it cheats a bit and blindly streams back every object that exists on the server if the client has nothing and asks for all branches and tags (e.g. for git clone). This is generally reasonable, until a user decides to dereference most of the objects in their repo to save space!
I suspect that the customer would not have minded the lack of gc in his scenario if we only sent reachable objects during clone.
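The gap between "every object that exists" and "every object reachable from current refs" can be measured locally with core Git. This is a rough sketch in a throwaway repo with hypothetical names (big-files, big.bin); it only illustrates the two counts and is not the server's actual logic.

```shell
#!/bin/sh
set -e

# Throwaway repo; every name below is hypothetical.
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"

echo "hello" > README
git add README
git commit -q -m "initial"
main=$(git symbolic-ref --short HEAD)

# Add a commit on a side branch, then delete the branch so that
# no ref points at its commit, tree, or blob any more.
git checkout -q -b big-files
echo "stand-in for a large binary" > big.bin
git add big.bin
git commit -q -m "add big.bin"
git checkout -q "$main"
git branch -q -D big-files

# Objects found by walking from all current refs (reflogs are not consulted).
reachable=$(git rev-list --objects --all | wc -l | tr -d ' ')

# Every object in the object store, reachable or not.
total=$(git cat-file --batch-all-objects --batch-check | wc -l | tr -d ' ')

echo "reachable=$reachable total=$total"
```

In this sketch the difference between the two numbers corresponds to the objects a "send everything" clone would transfer unnecessarily.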
Until these issues are fixed for VSTS, what workarounds are there?
Delete the repo from the server (EDIT: or rename it) and re-push it.
This works, but is sub-optimal. In the new repo, you won't be able to see old pull request details, branch update history, or any links from other areas like builds or work items.
Trick the server by not cloning everything at once:
    mkdir newRepo
    cd newRepo
    git init
    git remote add origin <repo-url>
    # fetch one branch first
    git fetch origin master
    # fetch everything else
    git fetch origin
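To rehearse the two-step fetch before trying it against a real server, here is a self-contained sketch that uses a local bare repo as a stand-in for the remote. The names server.git, seed, and feature are hypothetical. Because core Git only sends reachable objects anyway, this does not reproduce the size difference; it only confirms that the sequence of commands brings down every branch and shows one way to check the resulting size.

```shell
#!/bin/sh
set -e

work=$(mktemp -d)
cd "$work"

# Stand-in "server": a local bare repo with two branches (hypothetical names).
git init -q --bare server.git
git init -q seed
cd seed
git config user.name "Demo"
git config user.email "demo@example.com"
echo "hello" > README
git add README
git commit -q -m "initial"
main=$(git symbolic-ref --short HEAD)   # master or main, depending on config
git checkout -q -b feature
echo "more" > feature.txt
git add feature.txt
git commit -q -m "feature work"
git push -q "$work/server.git" "$main" feature
cd "$work"

# The workaround: fetch one branch first, then everything else.
mkdir newRepo
cd newRepo
git init -q .
git remote add origin "$work/server.git"
git fetch -q origin "$main"
git fetch -q origin

# Confirm both branches arrived, then report the on-disk size.
if git rev-parse -q --verify "refs/remotes/origin/$main" >/dev/null &&
   git rev-parse -q --verify refs/remotes/origin/feature >/dev/null; then
  fetched_all=yes
else
  fetched_all=no
fi
echo "fetched_all=$fetched_all"
git count-objects -vH
```

Against a real remote, comparing the size-pack value from git count-objects -vH with the size of a plain git clone is a quick way to tell whether the workaround helped.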