Tarballs – Why?

More and more I begin to wonder why we generate tarballs at all these days. Is it just because it’s easy – a function of “make distcheck”? There’s certainly value in the actual distcheck process to ensure you have a sane build, but why actually distribute the tarball? What’s the meaningful difference between a tarball and a git tag?

Now, I won’t even touch on the subject of how badly I want to throw autotools in the trash, but we’re so entrenched in its ways, and so comfortable with its quirks, that energy is better spent on actual improvements. So for now the distcheck process is here to stay. For now.

So I ask a very serious question that others have asked as well – why publish tarballs? Most users get their packages in binary form from their distribution. Most users who build from source, I would argue, are already using git, have git installed on their system, or can easily install it. Instructions for cloning, checking out the tag, and building with autogen/autoreconf/etc. can be provided easily and clearly.
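That flow can be sketched end-to-end. This is only a toy walkthrough – the repository path and tag name are made up, and a throwaway local repository stands in for the hosted one so the steps are concrete:

```shell
# Toy walkthrough: clone, check out a release tag, then build as usual
# (./autogen.sh && make && make install). The repository and tag names
# here are made up; a local stand-in repo replaces the real URL.
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Stand-in for the hosted repository, with one tagged release.
git init -q banshee-upstream
cd banshee-upstream
git config user.email you@example.com
git config user.name You
echo 'placeholder' > autogen.sh
git add autogen.sh
git commit -qm "release 1.7.3"
git tag 1.7.3
cd ..

# What a user would run (with the real URL in place of the local path):
git clone -q "file://$tmp/banshee-upstream" banshee
cd banshee
git checkout -q 1.7.3   # detached HEAD at the release tag
git describe --tags     # prints: 1.7.3
```

From there it’s the usual autogen/make/make install dance, exactly as with an unpacked tarball.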

I migrated Banshee to Linode and consequently from Apache to lighttpd about a month ago. The logs start on June 20, 2010:

    $ grep -E 'banshee-1.+\.tar\.(gz|bz2)' \
        download.banshee.fm.access.log | wc -l
    7066

So in one month, we’ve only had 7066 tarball downloads, and that accounts for any and all released versions of Banshee over the past 5 years. Certainly the bulk of those downloads would be version 1.6.1, since that was the newest available tarball over the last month. 284 of those downloads were version 1.7.3, released less than 24 hours ago. I could generate better statistics, but that’s not the point here. The point is that number is pretty small compared to the reach of the distributions.

I roughly estimate the average size of a Banshee tarball (bzip2) at 3MB. Eliminating tarballs would save us 20GB/mo in bandwidth – and that’s during a quiet time in development when the servers are less active (1.6.1 was released in May). We’ll be seeing a spike around 1.7.3, which I’ll be monitoring.

So, if we ditched tarballs, how would you be affected? Would you care?

Update: to clarify a few things, you would still build and install like normal. For instance:

$ ./autogen.sh --prefix=$HOME/local --disable-whatever \
    && make && make install

Packagers would however have an additional minor burden. If their package system (e.g. rpm) requires an archive (e.g. can’t build directly from git), then the packager would be responsible for creating an archive. They could either just archive the git clone directory, or actually run their own “make distcheck” from their clone. It would be up to the packager to best integrate the git clone into whatever system they are using.
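A minimal sketch of that packager side, with made-up directory and tag names: either archive the clone directory directly with tar, or let git emit a clean tree for the tag. A throwaway local repo stands in for upstream here so the commands actually run:

```shell
# Toy demonstration (names are made up): a packager clones, then rolls
# an archive of the tagged tree. git archive emits a clean tree with no
# .git directory inside, which is usually what a build system wants.
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Stand-in for the upstream repository: a tiny repo with one tagged release.
git init -q upstream
cd upstream
git config user.email you@example.com
git config user.name You
echo 'hello' > README
git add README
git commit -qm "initial"
git tag 1.7.3
cd ..

# What the packager would do: clone, then archive the tagged tree.
git clone -q upstream banshee
cd banshee
git archive --format=tar --prefix=banshee-1.7.3/ 1.7.3 \
    | bzip2 > ../banshee-1.7.3.tar.bz2
cd ..
tar tjf banshee-1.7.3.tar.bz2
```

The plain `tar cfj banshee.tar.bz2 banshee/` route works too; it just drags the `.git` directory along unless you strip it.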


37 Responses to Tarballs – Why?

  1. Rémi Cardona says:

    How would distro package Banshee then?

  2. mpiroc says:

    Tarballs are great when I’m working on a server that I don’t have admin rights on. I can just install with --prefix=$HOME, and I’m good to go. Granted, I would probably never do this for an app like Banshee, but tarballs have their uses.

  3. Crusher says:

    First: I am not going to defend the tarball here, since I like the git repositories more.

    But where does the bandwidth go when tarballs are eliminated? I would expect users to download the files from git instead, but that costs more bandwidth because it is uncompressed.

  4. @Rémi the packager would create their own tarball from the git checkout if necessary – this can be as simple as tar cfj banshee.tar.bz2 banshee-git-checkout; it would depend on the requirements of a given packaging system. For instance, we would currently need to do this with openSUSE.

    @mpiroc that functionality would not go away. You would clone the git repository, check out whatever release tag you desire, and run ./autogen.sh --prefix=$HOME; make; make install … same as usual

  5. @Crusher fair point, but we’re not paying directly for the git bandwidth. That’s absorbed, graciously, by the GNOME Foundation. We’re very thankful for that support from GNOME. The Banshee project itself pays for the web hosting. Granted, we could probably move our tarballs to the GNOME FTP site as well.

  6. bochecha says:

    “The packager would create their own tarball from the git checkout if necessary – this can be as simple as tar cfj banshee.tar.bz2 banshee-git-checkout; it would depend on the requirements of a given packaging system. For instance, we would currently need to do this with openSUSE.”

    We have a requirement in Fedora that the tarball included in a source RPM must be the exact same one as the one provided by upstream (i.e. sha1/md5sum must match).

    Can you guarantee that creating tarballs from the same git tag will always return an archive with the same hash?

    @bochecha probably not for the tarball itself, since the timestamps of the files would be different across clones, I would imagine – but the contents of the tarball should easily be verifiable.

    I would step back and ask why Fedora has this requirement. I would argue we all need to move forward. Tarballs and the arcane autotools system have been around for decades, and exist in their insane complexity to unify systems that no longer exist. We can do better.
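    For what it’s worth, git can likely sidestep the timestamp problem already: git archive writes file mtimes from the commit date, so for a given tag (and a given git version) the uncompressed tar stream should be byte-identical across runs and across clones. A toy check, with made-up names:

```shell
# Toy check: archive the same tag twice and compare hashes. git archive
# uses the commit date for file mtimes, so the tar stream does not vary
# with wall-clock time (repo and tag names here are made up).
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
git config user.email you@example.com
git config user.name You
echo 'content' > file
git add file
git commit -qm "release"
git tag v1.0

h1=$(git archive --format=tar v1.0 | sha1sum)
sleep 1                 # time passes; the hash should not change
h2=$(git archive --format=tar v1.0 | sha1sum)
[ "$h1" = "$h2" ] && echo "identical"
```

    (Compressing with bzip2 should preserve this, since bzip2 embeds no timestamps; gzip’s header can store one in some modes, so compressed hashes need more care.)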

  8. Expanding on what Rémi said, if you expect distributions to make tarballs, every distributor is going to have to roll his/her own tarballs (as I understand it, most of the source packages from binary distributions require this anyway). Your tarballs are saving distributors from having to replicate this effort.

    Having a single tarball also provides a single point of security (you can make arguments for and against this, but I think in general one would be for) – a single signed tarball (and checks from distributors) means fewer points for attackers to inject malicious code.

    Also, tarballs are useful for people who are not online 100% of the time or are behind unreliable connections. Git sucks at handling unreliable connections, and there are lots of parts of the world which don’t have reliable, always-on, bandwidth-unlimited connectivity.

    Finally, as a distribution mechanism, tarballs are *much* easier to mirror. If, for example, all of GNOME were to switch to git+tags for distribution, anybody who wanted to mirror GNOME releases would need to support a more complex git-based infrastructure instead of a simple HTTP/FTP server.

  9. David says:

    Security reasons – take the possibility of someone getting into the repository and modifying some of the source files – a tarball is a static view in time that, once a checksum was made, gives you an easily verifiable reference.

  10. abcd says:

    @Aaron Just to put those numbers in perspective, you missed all the mirrors of that tarball (and there usually are some, for instance Gentoo mirrors the source tarball for everything it packages). Because those downloads are not coming directly from your site, you aren’t counting them.

  11. Atri says:

    Would it not be okay to host tarballs at http://sourceforge.net/ ?

  12. Anon says:

    Not forcing version control tools on people reduces the number of tools required to try your program. Sometimes I don’t need the history of changes when I’m getting a piece of software to build. It also saves people having to grovel around your repos directory to work out what they need to build your program (you can pre-strip the tarball). Finally, it’s nice to know how big things are going to be before you get them.

    At the end of the day it comes down to who are you trying to make things easier for? Yourself (reasonable) or people who want to try/package your software (who can be unreasonable)?

  13. Joe Buck says:

    Tarballs are needed in a variety of situations: for running a program on a machine that is running an older distribution, or a less popular operating system/distro flavor, or, as others have said, on a machine where you don’t have root, or where you need multiple versions of a program to coexist (though this is vastly easier to achieve with development tools or other programs that take a command line and turn files into other files, than it is with Gnome software, because of the incompatible config file issue).

  14. Sankar says:

    This would be the ideal situation if the only version control system were git. There are numerous version control systems – cvs, svn, git, hg, etc.

    It would be a nightmare for packagers to accommodate them all in their build systems. Working with one format (tarballs) makes it easy for them. It just so happens that tarballs predate git, and so everyone wants to use them.

    Maybe a good idea would be to patch the OBS to be intelligent enough to understand VCS-protocol parameters for the Source tag (git://, svn://, etc.) while parsing .spec files. This would save SuSE packagers some time by not having to download the tarballs. But there is no universal way to do this at the moment that all build systems (.deb etc.) could make use of.

  15. Alexander Larsson says:

    Many modules have more demanding requirements for the build system that creates the tarball, for things like generating files that will be shipped with the tarball. Doing this on the developer’s side means it’s easier for ordinary people to build the app. It also means you’re guaranteed that everyone uses the same generated files, independent of what version of the extra tools the person compiling has installed.

  16. One nice thing (“advantage” might be saying too much) is that a `make distcheck`’d tarball has a pre-generated configure script and Makefile.in files; there is no need for auto* when building from the tarball, just make and the always-necessary compilers.

    This might not matter, though, as installing auto* is trivial on most distros. Is there any particular reason to avoid auto* for tarball users? I can’t think of a compelling reason, offhand…

  17. Calvin Walton says:

    One thing to consider is the (opinionated, highly visible, minority?) Gentoo users. Gentoo provides a very heavy set of mirror servers of tar files, but they do not mirror git(/cvs/svn/hg) repositories. (The end result: Every Gentoo user uses your tar file, but it doesn’t count in your download statistics). You really don’t want Gentoo users hitting your git server directly for installations.

    Gentoo developers have dealt with packages that don’t have proper tar releases before. They work around this by building their own private tar of a package, which is then distributed just to Gentoo users on the mirror system. This obviously puts a bit more work on the packager, and makes updates to the latest version slower. But it’s doable.

    With autotools, one thing to consider here is that the tarball created with make dist or distcheck includes a configure script and makefile built with the versions of autoconf and automake that are present on the maintainer’s box, and are thus well-tested. If you rely on every person who builds the package to re-run autogen.sh, that adds another place for annoying build bugs to pop in.

  18. Jorge Castro says:

    @Anon “At the end of the day it comes down to who are you trying to make things easier for? Yourself (reasonable) or people who want to try/package your software (who can be unreasonable)?”

    We have things like the OBS and Launchpad that give us the ability to make dailies, stable releases, and unstable releases – if anything this lowers the bar for getting people testing things.

  19. Dodji says:

    VCSes come and go. There are many different VCSes today, and I don’t really see why there won’t be new VCSes coming out in the future. So just assuming “git” is not necessarily a future-proof bet. As a disclaimer, I must say that I am a *huge* git fan myself, but there are unfortunately times when I think it’s appropriate to put my fanboy hat aside :-)

    I think using tarballs as the canonical way to distribute source code is a reasonable way to keep easy and sustainable access to our packages. By easy, I mean that downloading a tarball is something that is quite simple to do from any type of computer system you can think of today. By sustainable I mean there is a good chance that something released as a tarball will stay easily accessible for download tomorrow.

    Now with my fanboy hat on, if everything were in git, it would truly *rock*.

  20. Rick O says:

    The couchdbx project on github uses a shell script to clone the appropriate repos and build from there. I love it, and think it’s great. In fact, I liked it so much that I stole the technique to write a shell script-based installer for git-flow. I say ditch the tarballs.

  21. What is the number for the other kind of files?

  22. Simon says:

    @Calvin – not just Gentoo users, but source distros and non-distros in general. I’m an LFS user personally, and use tarballs exclusively.

    One observation – a tarball on an HTTP or FTP server can be accessed trivially with almost any tool you might name, browsers, command-line clients. Checking code out from Git or SVN is a great deal more difficult, requiring specific tools, which unlike a web browser, tend to be flaky when proxies are involved.

  23. jpobst says:

    It is obviously host specific, but a nice feature of GitHub is when you create a tag, it automagically creates a tarball of that tag and places it on your downloads page. However, for reasons I do not understand, several packagers let me know that this “wasn’t good enough” for them.

  24. Simon says:

    @jpobst – probably because those tarballs are just snapshots of the repository, not proper distribution packages. The difference is whether the autogen.sh script has been run in advance, since running the full autotools stack can require extra development packages – stuff like gnome-common or equivalent Xorg macros. It’s more work, and a bit more fragile.

  25. Matt says:

    I think you should be comparing number of tarball downloads versus number of git checkouts, rather than against number of end users. How do those numbers stack up?

  26. Roger says:

    I think you’re mostly all coming to this from the perspective of people who are very firmly enmeshed in open source development and community. I’d say that tarballs are friendlier for people at the fringes.

  27. antimonio says:

    @Arun Raghavan “Expanding on what Rémi said, if you expect distributions to make tarballs, every distributor is going to have to roll his/her own tarballs (as I understand it, most of the source packages from binary distributions require this anyway). Your tarballs are saving distributors from having to replicate this effort.”

    Maybe it would be interesting to create functionality in git to generate a tarball from a certain git tag? That would be the best of both worlds.

    “Having a single tarball is also provides a single point of security (you can make arguments for and against this, but I think in general one would be for) – a single signed tarball (and checks from distributors) means fewer points for attackers to inject malicious code.”

    With the enhancement request I mentioned above, all tarballs generated from a given git tag would be identical (same md5sum), so this wouldn’t be a problem.

    “Also, tarballs are useful for people who are not online 100% of the time or are behind unreliable connections. Git sucks at handling unreliable connections, and there are lots of parts of the world which don’t have reliable, always-on, bandwidth-unlimited connectivity.”

    Well, but nowadays these people who don’t have reliable connections are not likely to be packagers or anything, so they would probably get their packages from their distribution. Anyway, see below for my next rant:

    “Finally, as a distribution mechanism, tarballs are *much* easier to mirror. If, for example, all of GNOME were to switch to git+tags for distribution, anybody who wanted to mirror GNOME releases would need to support a more complex git-based infrastructure instead of a simple HTTP/FTP server.”

    Agreed. And let’s talk again about the feature I mentioned above. If done correctly, it wouldn’t make you download more than the tarball (no git log cruft at all) so this would be a win-win.

  28. Rémi Cardona says:

    Aaron, quick follow up :)

    Running make dist for GNOME packages usually requires more packages than a regular install: gtk-doc, autoconf, automake, ${VCS}, etc.

    Like Dodji said, tarballs are good for releases and setting things in stone. Once a release is done and the tarball is up, you don’t need to think about it anymore, and you shouldn’t modify it.

    If you’re really worried about bandwidth cost, why not host your tarballs on http://ftp.gnome.org or sourceforge? It’s their job to provide such services for the OSS communities.

    But do realize that a lot of distros do rely on tarballs for their own packages, for both convenience and security reasons.

    Cheers

  29. Edd says:

    Could gnome git not provide something like github’s tarball downloads feature? (It’s been there since 2008! http://github.com/blog/12-tarball-downloads)

    @jpobst: did the packagers say why this wasn’t good enough?

  30. Emilio Pozuelo Monfort says:

    Yes, I’ll miss them. Not everybody uses git. What do you expect – people to check out a git tag for one project, a mercurial one for another, and a bazaar, subversion, cvs, monotone, etc. one for other projects?

    Tarballs are the canonical way to distribute software. Please don’t stop releasing them. If you worry about bandwidth, host them on http://ftp.gnome.org. There’s no problem with that.

  31. Spudd86 says:

    Gentoo… Gentoo builds stuff from source, and one of the places it might obtain a tarball from is the original developer’s site. It NEEDS this tarball to always have the same hash. (Most of the time it comes from Gentoo mirrors, but for, say, an ebuild the user has written themselves to get a newer version, or for something Gentoo doesn’t package, the only place the tarball will be is the original release site.) See: http://blog.flameeyes.eu/2009/05/09/i-still-dislike-github

    Also it’s not really nice to make more work for packagers in general, they’re overworked already…

  32. Matt says:

    I agree with what Spudd86 wrote. I’ll add a bit on in the same general vein:

    While you might argue that distributions packaging your software independently would lead to less chance of malware being inadvertently included, I would counter that they’d be even less inclined to scrutinise tarballs, having been burdened with packaging tarballs for each and every piece of software – something that they currently need not do. The developer whose blog Spudd86 linked to is a prime example of a packager who definitely does not need an extra burden – a workaholic who, I gauge, holds a large part of Gentoo together, suffering health problems but, due to the sheer amount of work and the fundamental nature of some aspects thereof, essential to the project (and to us, as Gentoo users, of course).

    Furthermore, the chance of any individual user of a compromised package finding out about said compromise and thus being able to uncompromise their machine would be lower with the tarball being used by fewer people–after all, we would have many unique tarballs, all differing in their lineage (from repository to user) and hashes. In the situation of one, single tarball being packaged upstream and then checked and signed by each and every distribution, as is now the status quo, a single user finding their tarball containing malware would enable distributions and news websites to post comprehensive warnings for users to clean and upgrade to a non-compromised version of the software concerned. This was the case with a programme not long back, though my news reader appears to have deleted it now, so you’ll have to trust me on that. :P

    You might argue that you, being the author of the programme concerned, do not wish to expend further effort through packaging and that you have no ethical obligation to do so in place of distributions. That could be the case–though I do not profess to have thought through that possibility enough to say with any certainty one way or the other. However, it would seem consistent to follow from the general FOSS ethos and philosophy that doing a little extra–that is, building, hosting and distributing source tarballs–is doing a little extra good through which a relatively little effort is expended in favour of reducing the overall work needed by all involved, developers, distributions and users alike. I don’t mean to presume your stance, just a little independent aside.

  33. cjk says:

    git transfers (over git:// at least) do compression, so the 3rd comment by Crusher is not quite true.

    The point of tarballs – or archives for that matter – is because distro build systems usually do not allow outgoing network connections. Nor would it make sense to always redownload the repository on every build – the openSUSE BS starts one for each PRP (distro,arch) tuple.

  34. As for Gentoo users – while distributing sources via Gentoo devs is doable, it is *not* for those who want to use a brand new Banshee. Testing Banshee 1.7.x by Gentoo users? Forget it, as no one would host the tarball (and it is too big for Bugzilla).

  35. On *BSD, package maintainers do a git/svn/cvs checkout, make a source tarball, and host it on their respective sites. When the user does a make install, it downloads the tarball from the package maintainer’s website and does the install normally. I still think that going forward, having a VCS checkout would be ideal.

  36. behdad says:

    Hey. Behdad here. Please excuse me opining without reading all the comments first.

    So, there’s two things a tarball provides:

    – A single file to download instead of “git clone --depth 1”, and making sure a git server is running on the other side to begin with, and that git is available locally. Sure, a “git dist” would come in handy for addressing that. And Fedora recently switched to being able to build packages out of “git dist” tarballs instead of “make dist” ones.

    – “make dist” tarballs have some generated files that supposedly need a few more advanced tools than your users care to keep around. This is a really huge advantage if you have ever tried to build software on exotic or simply old systems – say, on your hosting provider’s Linux system. The difference is huge: it’s like having to build vala first (and all its dependencies, of course) or not having to, if you just want to compile a simple vala-based project.

    So, I’ve not lost all hope in tarballs just yet. But you do have a point.

    behdad

    PS. Enjoy GUADEC.

  37. I stand corrected (and I would like more people to be as well) because of some recent discoveries I’ve made about git features:
    - About the assumed higher amount of data to be downloaded from a tag instead of a tarball: as a previous commenter said, it’s sent compressed; and anyway I just found out about the --depth option, which lets you avoid downloading the full history.
    - About the assumed loss of security: it turns out git lets you sign tags!
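    Both points can be sketched concretely. This is a toy demonstration against a throwaway local repository (all names are made up): `--depth 1` fetches only the tip commit, and `git tag -s` GPG-signs a tag (shown with `-a` here, since signing requires a configured key):

```shell
# Toy demonstration of shallow clones; a local repo stands in for a
# remote, and the names are made up. `git tag -s` would GPG-sign the
# tag; -a is used here because signing requires a configured key.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q upstream
cd upstream
git config user.email you@example.com
git config user.name You
echo one > f; git add f; git commit -qm "first"
echo two > f; git commit -qam "second"
git tag -a v1.0 -m "release"   # with a key: git tag -s v1.0 -m "release"
cd ..

# Shallow clone: only the most recent commit is transferred.
git clone -q --depth 1 "file://$tmp/upstream" shallow
cd shallow
git rev-list --count HEAD   # prints: 1
```

    A signed tag can later be checked with `git tag -v v1.0`, which gives distributors a verification point comparable to a signed tarball.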
