That's interesting. I'm designing a Git-inspired global data storage system myself.
My views differ from your (firmly stated) vision, so I decided to design my own system (let's call it "Memory") rather than trying to change Git.
All information forms a global, world-scale graph of connected data blobs. (Well, a somewhat disjoint graph.)
Repositories are just local caches of the global sub-graph: bags of blobs with some "reference" metadata like branches, etc.
Less focus on files, folders, and commits. The main entity is the general data blob. Links represent connections between blobs (like "blob2 is based on blob1"), and properties add metadata to a blob (time, etc.).
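To make the blob/link/property model concrete, here is a minimal sketch of what a local "bag of blobs" might look like. All names (`put_blob`, the `"based-on"` relation, the property keys) are my own illustrative assumptions, not part of any real system:

```python
import hashlib

def blob_id(data: bytes) -> str:
    """A blob is identified purely by the hash of its content."""
    return hashlib.sha256(data).hexdigest()

blobs = {}       # hash -> raw bytes (the local "bag of blobs")
links = set()    # (source_hash, relation, target_hash)
props = {}       # (hash, key) -> value

def put_blob(data: bytes) -> str:
    h = blob_id(data)
    blobs[h] = data
    return h

v1 = put_blob(b"hello")
v2 = put_blob(b"hello, world")
links.add((v2, "based-on", v1))            # "blob2 is based on blob1"
props[(v2, "time")] = "2015-01-01T00:00Z"  # metadata attached as a property
```

Note that links and properties live outside the blobs themselves, keyed by hash, so the same blob can participate in many graphs without being rewritten.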
Git is based on a set of file-tree "snapshots" attached to the commit-graph spine. Apart from sharing unchanged parts, the only connection between the snapshots is via the commit graph.
In the Memory system, the main commit "spine" is not that important and is not the only link between data. When you change file data, several different links/relations are created: the link between blob versions, the link between file versions, the link between directory versions, etc. When committing these changes, a parallel set of commit parts is created (well, there is a single commit object plus links that connect it to every change link).
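The commit shape described above could be sketched roughly like this: one edit produces parallel change links at the blob, file, and directory levels, and a single commit object is then linked to each of them. Every identifier and relation name here is hypothetical:

```python
import hashlib, json

def h(obj) -> str:
    """Hash any JSON-serializable object deterministically."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

# Three parallel "change" links created by editing one file:
change_links = [
    ("blob-v2", "based-on", "blob-v1"),  # content level
    ("file-v2", "based-on", "file-v1"),  # file level
    ("dir-v2",  "based-on", "dir-v1"),   # directory level
]

# One commit object, plus a link from it to every change link:
commit = {"author": "alice", "message": "edit file"}
commit_id = h(commit)
commit_links = [(commit_id, "includes", h(list(cl))) for cl in change_links]
```

The point of the shape: the change links exist independently of the commit, so tools can follow a file's history directly without walking the commit spine at all.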
You can work in a subdirectory and checkout/commit/push only it. Two people pushing changes to two different subdirs don't conflict with each other. (Someone then may propagate their changes to the repo root.)
The repo split operation is basically a no-op.
Submodules/subtrees/externals are elementary.
Repo merges/grafts are very easy: just add the new links and add them to the "point of view" data.
You can base files on the files in any repo in the world, preserving the history without copying it.
Committed and pushed a gigabyte file to a widely used repo? Just delete it, commit, push, and then ask the maintainers to physically delete the blob file from the repo (cache). A repo is just a cache of a sub-graph of the global world graph, so dead links are normal.
I'm thinking beyond the files and filesystems.
Think about torrents, DHT and magnet links. Imagine filling the local cache not from a specific repo, but from "the Internet".
Now think about the Internet and hyperlinking. Imagine hyperlinks have corresponding "magnet links" to the content. Information mirroring. No more dead links (unless nobody is interested in mirroring the data). Linking to the data, not to the place.
That's my vision.
Apart from the opaque blobs, all structures are hierarchical data. You mentioned XML, but my first thought was Lisp's S-expressions and their serializations, like Canonical S-expressions (http://en.wikipedia.org/wiki/Canonical_S-expressions). It's interesting that torrents use a similar serialization system (http://en.wikipedia.org/wiki/Bencode) for the torrent metadata storage, to make the torrent hash stable.
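The "stable hash" property of bencode comes from it being canonical: dictionary keys are always emitted in sorted order, so the same data always produces the same bytes and therefore the same hash. A minimal encoder (my own sketch, covering the four bencode types) shows this:

```python
def bencode(x) -> bytes:
    """Canonical bencode serialization: ints, byte strings, lists, dicts."""
    if isinstance(x, int):
        return b"i%de" % x
    if isinstance(x, bytes):
        return b"%d:%s" % (len(x), x)
    if isinstance(x, list):
        return b"l" + b"".join(bencode(e) for e in x) + b"e"
    if isinstance(x, dict):  # keys must be bytes and are emitted sorted
        return b"d" + b"".join(
            bencode(k) + bencode(v) for k, v in sorted(x.items())) + b"e"
    raise TypeError(x)

# Key order in the source dict does not matter -- the output is identical:
a = bencode({b"name": b"x", b"length": 5})
b = bencode({b"length": 5, b"name": b"x"})
assert a == b == b"d6:lengthi5e4:name1:xe"
```

Canonical S-expressions get their stability the same way: exactly one valid encoding per value, so hashing the bytes is equivalent to hashing the structure.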
I want to use a hashing system that allows (inline) data to be interchangeable with its hash reference. So a tree (hierarchical data) is equal to the same tree where some subtree was replaced by its hash reference. You can physically deconstruct a tree into a set of nodes without changing the hash, and you can view the structural data as a set of parts as well as a whole.
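This interchangeability falls out naturally if an interior node is hashed over the hashes of its children (Merkle-tree style): detaching a subtree and keeping only its hash reference leaves every enclosing hash unchanged. A sketch, where the `Ref` wrapper and the `leaf:`/`tree:` domain tags are my own assumptions:

```python
import hashlib

class Ref:
    """An explicit hash reference standing in for a detached subtree."""
    def __init__(self, digest: str):
        self.digest = digest

def tree_hash(node) -> str:
    if isinstance(node, Ref):        # already a hash reference: use as-is
        return node.digest
    if isinstance(node, bytes):      # leaf: hash the raw content
        return hashlib.sha256(b"leaf:" + node).hexdigest()
    # interior node: hash the concatenation of the child hashes
    child = "".join(tree_hash(c) for c in node)
    return hashlib.sha256(("tree:" + child).encode()).hexdigest()

whole = [b"a", [b"b", b"c"]]
# Detach the [b"b", b"c"] subtree, keeping only a reference to its hash:
split = [b"a", Ref(tree_hash([b"b", b"c"]))]
assert tree_hash(whole) == tree_hash(split)
```

Because the inline subtree and its `Ref` hash identically, a store can ship a tree as one piece or as independently-addressable nodes, and any reader can verify either form against the same root hash.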