|
| 1 | +From: Junio C Hamano <junkio@cox.net> |
| 2 | +Subject: Re: Make "git clone" less of a deathly quiet experience |
| 3 | +Date: Sun, 12 Feb 2006 19:36:41 -0800 |
| 4 | +Message-ID: <7v4q3453qu.fsf@assigned-by-dhcp.cox.net> |
| 5 | +References: <Pine.LNX.4.64.0602102018250.3691@g5.osdl.org> |
| 6 | + <7vwtg2o37c.fsf@assigned-by-dhcp.cox.net> |
| 7 | + <Pine.LNX.4.64.0602110943170.3691@g5.osdl.org> |
| 8 | + <1139685031.4183.31.camel@evo.keithp.com> <43EEAEF3.7040202@op5.se> |
| 9 | + <1139717510.4183.34.camel@evo.keithp.com> |
| 10 | + <46a038f90602121806jfcaac41tb98b8b4cd4c07c23@mail.gmail.com> |
| 11 | +Content-Type: text/plain; charset=us-ascii |
| 12 | +Cc: Keith Packard <keithp@keithp.com>, Andreas Ericsson <ae@op5.se>, |
| 13 | + Linus Torvalds <torvalds@osdl.org>, |
| 14 | + Git Mailing List <git@vger.kernel.org>, |
| 15 | + Petr Baudis <pasky@suse.cz> |
| 16 | +Return-path: <git-owner@vger.kernel.org> |
| 17 | +In-Reply-To: <46a038f90602121806jfcaac41tb98b8b4cd4c07c23@mail.gmail.com> |
| 18 | + (Martin Langhoff's message of "Mon, 13 Feb 2006 15:06:42 +1300") |
| 19 | + |
| 20 | +Martin Langhoff <martin.langhoff@gmail.com> writes: |
| 21 | + |
| 22 | +> +1... there should be an easy-to-compute threshold trigger to say -- |
| 23 | +> hey, let's quit being smart and send this client the packs we got and |
| 24 | +> get it over with. Or perhaps a client flag so large projects can |
| 25 | +> recommend that users do their initial clone with --gimme-all-packs? |
| 26 | + |
| 27 | +What upload-pack does boils down to: |
| 28 | + |
| 29 | + * find out the latest of what the client has and what the client asked for. |
| 30 | + |
| 31 | + * run "rev-list --objects ^client ours" to make a list of |
| 32 | + objects the client needs. The actual command line has multiple |
| 33 | + "clients" to exclude what need not be sent, and |
| 34 | + multiple "ours" to include the refs asked for. When you are doing |
| 35 | + a full clone, ^client is empty and ours is essentially |
| 36 | + --all. |
| 37 | + |
| 38 | + * feed that output to "pack-objects --stdout" and send out |
| 39 | + the result. |
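As a rough illustration of the second step, here is a minimal Python sketch of how such a command line could be assembled. The function name `rev_list_args` and the ref names are made up for the example; the real upload-pack is written in C and works with object names negotiated with the client, so treat this only as a picture of the "^client ours" shape.

```python
def rev_list_args(ours, clients):
    """Assemble a rev-list command line (hypothetical helper).

    ours    -- what the client asked for (included)
    clients -- what the client already has (excluded with a leading ^)
    """
    args = ["git-rev-list", "--objects"]
    args += ["^" + c for c in clients]  # exclude what need not be sent
    args += list(ours)                  # include what was asked for
    return args


# Incremental fetch: two things asked for, two things already present.
incremental = rev_list_args(ours=["master", "next"],
                            clients=["v1.0.0", "v1.1.0"])

# Full clone: nothing to exclude, and "ours" is essentially --all.
full_clone = rev_list_args(ours=["--all"], clients=[])

print(" ".join(incremental))
print(" ".join(full_clone))
```

The output of either command line is what would then be piped to "pack-objects --stdout".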
| 40 | + |
| 41 | +If you run this command: |
| 42 | + |
| 43 | + $ git-rev-list --objects --all | |
| 44 | + git-pack-objects --stdout >/dev/null |
| 45 | + |
| 46 | +It reports its progress as it goes. The phases of operation are: |
| 47 | + |
| 48 | + Generating pack... |
| 49 | + Counting objects XXXX... |
| 50 | + Done counting XXXX objects. |
| 51 | + Packing XXXXX objects..... |
| 52 | + |
| 53 | +Phase (1). From the time it says "Generating pack..." up to |
| 54 | +"Done counting XXXX objects.", the time is spent by rev-list |
| 55 | +listing all the objects to be sent out. |
| 56 | + |
| 57 | +Phase (2). After that, it decides which object to delta |
| 58 | +against which other object, while twenty or so dots are |
| 59 | +printed after "Packing XXXXX objects." (see the #git irc log from a |
| 60 | +couple of days ago; Linus describes how pack building works). |
| 61 | + |
| 62 | +Phase (3). After the dots stop, the program becomes silent. |
| 63 | +That is where it actually does delta compression and writeout. |
| 64 | + |
| 65 | +You would notice that quite a lot of time is spent in all |
| 66 | +phases. |
| 67 | + |
| 68 | +There is an internal hook to create a full repository pack inside |
| 69 | +upload-pack (which is what runs on the other end when you run |
| 70 | +fetch-pack or clone-pack), but it works slightly differently |
| 71 | +from what you are suggesting, in that it still tries to do the |
| 72 | +"correct" thing. It still runs "rev-list --objects --all", so |
| 73 | +"dangling objects" are never sent out. |
| 74 | + |
| 75 | +We could cheat in all phases to speed things up, at the expense |
| 76 | +of ending up sending excess objects. So let's pretend we |
| 77 | +decided to treat everything in .git/objects/pack/pack-* (and |
| 78 | +the ones found in alternates as well) as having interesting |
| 79 | +objects for the cloner. |
| 80 | + |
| 81 | +(1) This part unfortunately cannot be totally eliminated. By |
| 82 | + assuming all packs are interesting, we could use the object |
| 83 | + names from the pack index, which is a lot cheaper than |
| 84 | + rev-list object traversal. We still need to run rev-list |
| 85 | + --objects --all --unpacked to pick up the loose objects, |
| 86 | + which we cannot find by looking at the pack index, to |
| 87 | + cover the rest. |
| 88 | + |
| 89 | + This however needs to be done in conjunction with the second |
| 90 | + phase change. pack-objects depends on the hint rev-list |
| 91 | + --objects output gives it to group the blobs and trees with |
| 92 | + the same pathnames together, and that greatly affects the |
| 93 | + packing efficiency. Unfortunately the pack index does not have |
| 94 | + that information -- it knows neither types nor pathnames. |
| 95 | + Type is relatively cheap to obtain but pathnames for blob |
| 96 | + objects are inherently unavailable. |
| 97 | + |
| 98 | +(2) This part can be mostly eliminated for already packed |
| 99 | + objects, because we have already decided to cheat by sending |
| 100 | + everything, so we can just reuse how objects are deltified |
| 101 | + in existing packs. It still needs to be done for loose |
| 102 | + objects we collected to fill the gap in (1). |
| 103 | + |
| 104 | +(3) This also can be sped up by reusing what is already in |
| 105 | + packs. The pack index records the starting (but not the end) |
| 106 | + offset of each object in the pack, so we can sort by offset to find |
| 107 | + out which part of the existing pack corresponds to what |
| 108 | + object, to reorder the objects in the final pack. This |
| 109 | + needs to be done somewhat carefully to preserve the locality |
| 110 | + of objects (again, see #git log). The deltifying and |
| 111 | + compressing for loose objects cannot be avoided. |
| 112 | + |
| 113 | + While we are writing things out in (3), we need to keep |
| 114 | + track of the running SHA1 sum of what we write out so that we |
| 115 | + can fill out the correct checksum at the end, but I am |
| 116 | + guessing that is relatively cheap compared to the |
| 117 | + deltification and compression cost we are currently paying |
| 118 | + in this phase. |
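To make the "sort by offset" idea in (3) concrete, here is a small Python sketch. It assumes we have already read (object name, start offset) pairs out of a pack index and that we know where the object data ends in the pack file (just before the trailing checksum). The function name, the shortened object names, and the numbers are invented for illustration; this is not how pack-objects is implemented.

```python
def object_spans(index_entries, data_end):
    """Turn start offsets into (name, start, end) byte ranges.

    index_entries -- iterable of (object_name, start_offset); the index
                     only records where each object starts
    data_end      -- offset where object data stops in the pack file
    """
    by_offset = sorted(index_entries, key=lambda entry: entry[1])
    spans = []
    for i, (name, start) in enumerate(by_offset):
        # An object ends where the next one (in pack order) begins;
        # the last object ends where the object data ends.
        if i + 1 < len(by_offset):
            end = by_offset[i + 1][1]
        else:
            end = data_end
        spans.append((name, start, end))
    return spans


# Entries as an index would list them: ordered by name, not by offset.
entries = [("1a2b", 412), ("5c6d", 12), ("9f0e", 150)]
spans = object_spans(entries, data_end=700)
for name, start, end in spans:
    print(name, start, end, end - start)
```

With the spans in hand, the bytes for each already-packed object could be copied out of the existing pack as they are, in whatever order the new pack wants them.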
| 119 | + |
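The running checksum mentioned in (3) can be pictured with the following Python sketch. It assumes the checksum at the end is a SHA1 over every byte written before it; the class name `ChecksummingWriter` and the sample chunks are hypothetical, and the real code is C, so this only shows the bookkeeping.

```python
import hashlib
import io


class ChecksummingWriter:
    """Write bytes to a stream while keeping a running SHA1 of them."""

    def __init__(self, out):
        self.out = out
        self.sha1 = hashlib.sha1()

    def write(self, data):
        # Update the running sum with exactly what goes out.
        self.sha1.update(data)
        self.out.write(data)

    def finish(self):
        # Append the 20-byte checksum of everything written so far.
        digest = self.sha1.digest()
        self.out.write(digest)
        return digest


buf = io.BytesIO()
writer = ChecksummingWriter(buf)
for chunk in (b"header", b"reused bytes from an old pack", b"new bytes"):
    writer.write(chunk)
trailer = writer.finish()
written = buf.getvalue()
print(trailer.hex())
```

Hashing is a single pass over bytes we are writing anyway, which fits the guess above that it is cheap next to deltification and compression.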
| 120 | +NB. In the #git log, Linus made it sound like I am clueless |
| 121 | +about how a pack is generated, but if you check commit 9d5ab96, |
| 122 | +the "recency of delta is inherited from base", one of the tricks |
| 123 | +that have a big performance impact, was done by me ;-). |
| 124 | + |
| 125 | + |