I'm writing this post mostly to my future self, not any specific project or piece of code I've seen other people write. That's not to say that I don't think this is something that probably applies to many projects. Sometimes it's easy to lose sight of what we're doing, and it's good to be reminded.
So to start at the beginning: I've been working on supergit, a Rust
library to parse git repositories. It's built on top of libgit2
(and the git2
rust bindings), and aims to create a more Rustic
interface and type fascade for git repositories. It also aims to
solve issues such as: rename detection, path-history, and subtree
management. I'm writing this library for octopus, which will
eventually host my monorepo.
In supergit
the main workflow is around iterating things, seeing as
git is an acyclical graph, and iterators are a decent way to view this
datastructure. But git graphs can get pretty big. I wanted the
iterator to be configurable in a way that allows someone to write a
tool that searches a whole repository history, while also making it
possible to step through a history 20 commits at a time (to implement
history pagination on a website, for example).
Looking at the current API, this is how you would implement the
latter, for a main
branch:
use supergit::Repository;
fn main() {
let path = ... // get your repository path somehow
let repo = Repository::open(path).unwrap();
let main = repo.get_branch("main").unwrap();
let iter = main.get(20);
iter.for_each(|c| {
println!("{}: {}", c.commit().id_str(), c.commit().summary());
});
}
That's easy enough, right? But wait, why am I calling .commit()
on
c
. Isn't it already a commit? Well...sort of. In supergit
, this
type is a BranchCommit
, because this is where things get
complicated.
Sort of like a tree, but not really
In git, rarely is a branch just a history of single commits. Maybe this is how some people think about their history, but it certainly has never been the case for any of the repositories that I work on. Basically the second you have more than one contributor, it's very common for a history to have merge-commits in it.
So how do we deal with that in an iterator? The design I chose was to
wrap a Commit
object in another type, which can convey this state.
BranchCommit
is an enum and has three variants: Commit
(maybe I
should rename that to Simple
or something?), Merge
, and Octopus
(if you don't know what an octopus merge is, don't worry about it.
Most people don't and they're very rare and weird).
What Merge
and Octopus
contain are new Branch
handles (the type
returned by get_branch()
), meaning that for every split it's now up
to the user to decide whether they want to continue first-parent
(i.e. only ever follow the main branch line, ignoring the history of
merged branches), or if they want to enumerate the histories as well.
Most importantly: for every branch merge, you get to re-decide what
your iterator strategy should be: infinate, limited by number, or
limited up to a certain commit-hash.
So far so good I thought, this is an okay enough interface for me to work with. But this is where some problems appeared.
File histories (and git internals)
(a slight de-tour through git - feel free to skip)
The main reason why I'm writing this more Rustic wrapper around
libgit2
is to make it easier to determine what the history of a file
has been. This is pretty simple to find out via the git CLI (git log
-- <your file here>
), but not something that libgit2
exposes,
because that's not how git stores data.
To git, all data is stored in a key-value store indexed by a SHA1
(soon to be SHA256 I think?) hash reference. That applies to files,
full file trees, and commits as well. Say we have a file acab.txt
,
we commit it and it gets the ID
da39a3ee5e6b4b0d3255bfef95601890afd80709
(the file ID, not the
commit ID!), but then we open it and write ACAB
in it, and commit
that again. Now the file ID is
99f069b8a0cbe4c9485a14fe50775d0f71deb4e7
. Both these files are
saved in the git object store, because after all you might want to go
back to the older version.
But here's the thing: from the actual commits we can get two things:
the file tree at the time of commit, and the commit parents. To
figure out what actually changed in the commit, you have to diff it
against it's parents, which is exactly what git show
does if you
give it a reference to a commit.
What this means is that if you want to have a library that grabs the history of a path, well you'll have to go through all commits, and check the tree for changes at that specific path. Furthermore, that won't actually let you know if a file has simply been renamed, only that it has changed. Further logic is required to figure out if the file is the same, but just has a different name.
And all of this is something that supergit
implements, behind a nice
Rustic API (I hope...).
Bloated abstractions
So I wrote a function that would, for a branch iterator, step along it
and check the history of a path, by diffing each commit with it's
parents, and tracking a path via the delta information in the diff.
But this is where I ran into problems. Because my iterator design
always chose the first-parent to step through. Other branches were
ignored, and because the function accepted an iterator and stepped it
internally, there was no way for my file_history()
function to
figure out the exact behaviour the user wanted.
My first instinct was to implement branching in the BranchIter
itself; allowing it to branch off, essentially pushing commits it
would have to get back to onto a stack, and resuming from a previous
position. That turned out to be a really bad idea.
It took me about an hour of banging my head against this abstraction before I realised that it wasn't meant to be. Sometimes systems are self-contained, and adding more functionality takes a considerable amount of effort, and begs the question, if it's really the right choice to make. Why add more functionality to an abstraction that works fine on it's own?
Instead, embrace composition, and add another layer on top, that can use the previous. You end up with a much more managable design, and data can flow from one layer to the next. Make sure that your interfaces are flexible enough to be re-used, but don't think that just because a component could technically be responsible for some work, that it really has to implement this work.
And that's it basically. Thanks for reading my ramblings about git and one of my side-projects. I hope I managed to make you think about the way you build systems a bit, and maybe next time you are in a situation similar to this one, don't be like me :)