A Manifesto for Preserving Content on the Web
This Page is Designed to Last
By Jeff Huang, published 2019-12-19, updated 2021-08-24
The end of the year is an opportunity to clean up and reset for the upcoming new semester. I found myself clearing out old bookmarks—yes, bookmarks: that formerly beloved browser feature that seems to have lost the battle to 'address bar autocomplete'. But this nostalgic act of tidying led me to despair.
Bookmark after bookmark led to dead link after dead link. What's vanished: unique pieces of writing on kuro5hin about tech culture; a collection of mathematical puzzles and their associated discussion by academics that my father introduced me to; Woodman's Reverse Engineering tutorials from my high school years, where I first tasted the feeling of control over software; even my most recent bookmark, a series of posts on Google+ exposing USB-C chargers' non-compliance with the specification. All have disappeared.
This is more than just link rot; it's the increasing complexity of keeping alive indie content on the web, leading to a reliance on platforms and time-sorted publication formats (blogs, feeds, tweets).
Of course, I have also contributed to the problem. A paper I published 7 years ago has an abstract that includes a demo link, which has been taken over by a spammy page with a pumpkin picture on it. Part of that lapse was laziness: I did not want to renew and keep a functioning web application running year after year.
I've recommended that my students push websites to Heroku and publish portfolios on Wix. Yet every platform with irreplaceable content dies off some day. Geocities, LiveJournal, what.cd, now Yahoo Groups. One day, Medium, Twitter, and even hosting services like GitHub Pages will be plundered then discarded when they can no longer grow or cannot find a working business model.
The problem is multi-faceted. First, content takes effort to maintain. The content may need updating to remain relevant, and will eventually have to be rehosted. A lot of content, what used to be the vast majority of content, was put up by individuals. But individuals (maybe you?) lose interest, so one day maybe you just don't want to deal with migrating a website to a new hosting provider.
Second, a growing set of libraries and frameworks is making the web more sophisticated but also more complex. First came jQuery, then Bootstrap, npm, Angular, Grunt, webpack, and more. If you are a web developer who is keeping up with the latest, then that's not a problem.
But if not, maybe you are an embedded systems programmer or startup CTO or enterprise Java developer or chemistry PhD student. Sure, you could probably figure out how to set up some web server and toolchain, but will you keep this up year after year, decade after decade? Probably not, and when next year you encounter a package dependency problem or have to figure out again how to regenerate your HTML files, you might just throw your hands up and zip up the files to deal with "later". Even simple technology stacks like static site generators (e.g., Jekyll) require a workflow and will stop working at some point. You fall into npm dependency hell, and forget the command to package a release. And having a website with multiple HTML pages is complex; how would you know how the pages link to each other? index.html.old, Copy of about.html, index.html (1), nav.html?
Third, and this has been touted by others already (and even rebutted), the disappearance of the public web in favor of mobile and web apps, walled gardens (Facebook pages), just-in-time WebSockets loading, and AMP decreases the proportion of the web on the world wide web, which now seems more like a continental web than a "world wide web".
So what can we do about these problems? This is not a problem simple enough to be solved in one article. The Wayback Machine and archive.org help keep some content around for longer. And sometimes an altruistic individual rehosts the content elsewhere.
But the solution needs to be multi-pronged. How do we make web content that can last and be maintained for at least 10 years? As someone studying human-computer interaction, I naturally think of the stakeholders we aren't supporting. Right now putting up web content is optimized for either the professional web developer (who uses the latest frameworks and workflows) or the non-tech-savvy user (who uses a platform).
But I think we should consider two other groups: 1) the casual web content "maintainer", someone who doesn't constantly stay up to date with the latest web technologies, which means the website needs to have low maintenance needs; and 2) the "archiver", meaning the crawlers and personal archivers who preserve the content, which means the website should be easy to save and interpret.
So my proposal is seven unconventional guidelines for how we handle websites designed to be informative, to make them easy to maintain and preserve. The guiding intention is that the maintainer will try to keep the website up for at least 10 years, maybe even 20 or 30 years. These are not necessarily controversial views, but they are aspirations that are not mainstream: a manifesto for a long-lasting website.
- Return to vanilla HTML/CSS – I think we've reached the point where HTML/CSS is more powerful and nicer to use than ever before. Instead of starting with a giant template filled with .js includes, it's now okay to just write plain HTML from scratch again. CSS Flexbox and Grid, canvas, Selectors, box-shadow, the video element, filter, etc. eliminate a lot of the need for JavaScript libraries. We can avoid jQuery and Bootstrap when they're not needed. The more libraries incorporated into the website, the more fragile it becomes. Skip the polyfills and CSS prefixes, and stick with the CSS properties that work across all browsers. And frequently validate your HTML; it could save you a headache in the future when you encounter a bug.
- Don't minimize that HTML – minimizing (minifying) your HTML and associated CSS/JS seems like it saves precious bandwidth and all the big companies are doing it. So why not? Well, you don't save much because your web pages should be gzipped before being sent over the network, so preemptively shrinking your content probably doesn't do much to save bandwidth, if anything at all. But even if it did save a few bytes (it's just text in the end), you now need to have a build process and to add this to your workflow, so updating a website just became more complex. If there's a bug or future incompatibility in the HTML, the minimized form is harder to debug. And it's unfriendly to your users; so many people got their start with HTML by smashing that View Source button, and minimizing your HTML prevents this ideal of learning by seeing what others did. Minimizing HTML does not preserve its educational quality, and what gets archived is only the resulting codejunk.
- Prefer one page over several – several pages are hard to maintain. You can lose track of which pages link to what, and it also leads to some system of page templates to reduce redundancy. How many pages can one person really maintain? Having one file, probably just an index.html, is simple and unforgettable. Make use of that infinite vertical scroll. You never have to dig around your files or grep to see where some content lies. And how should you version control that file? Should you use git? Shove old versions in an 'old/' folder? Well, I like the simple approach of naming old files with the date they are retired, like index.20191213.html. Using the ISO format of the date makes it so that it sorts easily, and there's no confusion between American and European date formats. If I have multiple versions in one day, I would use a style similar to that which is customary in log files, like index.20191213.1.html. A nice side effect is that you can then access an older version of the file if you remember the date, without logging into the web host.
- End all forms of hotlinking – this cautionary word seems to have disappeared from internet vocabulary, but it's one of the reasons I've seen a perfectly good website fall apart for no reason. Stop directly including images from other websites, stop "borrowing" stylesheets by just linking to them, and especially stop linking to JavaScript files, even the ones hosted by the original developers. Hotlinking is usually considered rude since your visitors use someone else's bandwidth, it makes the user experience slower, you let another website track your users, and worst of all, if the location you're linking to changes its folder structure or just goes offline, then the failure cascades to your website as well. Google Analytics is unnecessary; store your own server logs and set up GoAccess or cut them up however you like, giving you more detailed statistics. Don't give away your logs to Google for free.
- Stick with native fonts – we're focusing on content first, so decorative and unusual typefaces are completely unnecessary. Stick with either the 13 web-safe fonts or a system font stack that matches the default font to the operating system of your visitor. Using the system font stack might look a bit different between operating systems, but your layout shouldn't be so brittle that an extra word wrap will ruin it. Then you don't have to worry about the flashing font problem either. Your focus should be on delivering the content to the user effectively and making the choice of font invisible, rather than getting noticed to stroke your design ego.
- Obsessively compress your images – faster for your users, less space to archive, and easier to maintain when you don't have to back up a humongous folder. Your images can have the same high quality, but be smaller. Minify your SVGs, losslessly compress your PNGs, generate JPEGs to exactly fit the width of the image. It's worth spending some time figuring out the best way to compress and reduce the size of your images without losing quality. And once WebP gains support on Safari, switch over to that format. Ruthlessly minimize the total size of your website and keep it as small as possible. Every MB can cost someone real money, and in fact, my mobile carrier (Google Fi) charges a cent per MB, so a 25 MB website, which is fairly common nowadays, costs a quarter by itself, about as much as a newspaper when I was a child.
- Eliminate the broken URL risk – there are monitoring services that will tell you when your URL is down, preventing you from realizing one day that your homepage hasn't been loading for a month and the search engines have deindexed it. This matters because 10 years is longer than most hard drives or operating systems are meant to last. But to eliminate the risk of a URL breaking completely, set up a second monitoring service. If the first one stops for any reason (they move to a pay model, they shut down, you forget to renew something, etc.), you will still get one notification when your URL is down, and then realize the other monitoring service has stopped because you didn't get the second notification. Remember that we're trying to keep something up for over 10 years (ideally way longer, even 30 years), and a lot of services will shut down during this period, so two monitoring services are safer.
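As a rough illustration of the dated-filename idea from the "Prefer one page over several" guideline, here is a minimal Python sketch. The helper names `retired_name` and `retire` are hypothetical, made up for this example; the only things taken from the guideline are the naming pattern (index.20191213.html, then index.20191213.1.html) and the idea of keeping the old copy next to the live file.

```python
from datetime import date
from pathlib import Path


def retired_name(path: Path, retired_on: date) -> Path:
    """Return the first unused dated name for a page being retired.

    index.html -> index.20191213.html, then index.20191213.1.html,
    index.20191213.2.html, ... when earlier names are already taken.
    """
    stamp = retired_on.strftime("%Y%m%d")  # year-month-day, so names sort by date
    candidate = path.with_name(f"{path.stem}.{stamp}{path.suffix}")
    counter = 1
    while candidate.exists():
        candidate = path.with_name(f"{path.stem}.{stamp}.{counter}{path.suffix}")
        counter += 1
    return candidate


def retire(path: Path, retired_on: date) -> Path:
    """Copy the current page to its dated name and return that new path."""
    target = retired_name(path, retired_on)
    target.write_bytes(path.read_bytes())
    return target
```

Calling `retire(Path("index.html"), date.today())` before editing leaves a dated copy beside the live page, which can then be uploaded along with it. Copying by hand works just as well; the sketch only spells out the naming rule.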
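For the "End all forms of hotlinking" guideline, one way to check a page is to list every image, script, stylesheet, or embedded media file that loads from another host. The sketch below uses only Python's standard library; the names `HotlinkFinder` and `find_hotlinks`, and the set of tags inspected, are assumptions chosen for illustration rather than an exhaustive audit (it does not look inside CSS for `url(...)` references, for example).

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Tags that fetch a resource when the page loads, and the attribute naming it.
RESOURCE_ATTRS = {
    "img": "src", "script": "src", "iframe": "src",
    "video": "src", "audio": "src", "source": "src",
    "link": "href",
}


class HotlinkFinder(HTMLParser):
    """Collect (tag, url) pairs for resources served from another host."""

    def __init__(self, own_host: str):
        super().__init__()
        self.own_host = own_host
        self.hotlinks = []

    def handle_starttag(self, tag, attrs):
        attr_name = RESOURCE_ATTRS.get(tag)
        if attr_name is None:
            return  # ordinary tags, including <a> links, are not hotlinks
        attr_map = dict(attrs)
        if tag == "link" and "stylesheet" not in (attr_map.get("rel") or "").lower():
            return  # only <link> tags that load a stylesheet are checked
        url = attr_map.get(attr_name)
        if not url:
            return
        host = urlparse(url).hostname  # None for relative URLs and data: URIs
        if host is not None and host != self.own_host:
            self.hotlinks.append((tag, url))


def find_hotlinks(html: str, own_host: str) -> list:
    """Return every externally hosted resource referenced by the page."""
    finder = HotlinkFinder(own_host.lower())
    finder.feed(html)
    finder.close()
    return finder.hotlinks
```

Running `find_hotlinks(open("index.html").read(), "example.org")`, with your own hostname in place of the placeholder `example.org`, returns the tags to fix; an empty list means the page only loads resources it hosts itself.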
After doing these things, go ahead and place a bit of text in the footer, "The page was designed to last", linking to this page explaining what that means. The words promise that the maintainer will do their best to follow the ideas in this manifesto.
Before you protest, this is obviously not for web applications. If you are making an application, then make your web or mobile app with the workflow you need. I don't even know of any web applications that have kept functioning in a similar way for over 10 years, so it seems like a lost cause anyway (except Philip Guo's Python Tutor, due to his minimalist strategy for maintaining it). It's also not for websites maintained by an organization like Wikipedia or Twitter. The salaries for an IT team are probably enough to keep a website alive for a while.
In fact, it's not even that important you strictly follow the 7 "rules", as they're more of a provocation than strict rules.
But let's say some small part of the web starts designing websites to last for content that is meant to last. What happens then? Well, people may prefer to link to them since they have a promise of working in the future. People more generally may be more mindful of making their pages more permanent. And users and archivers both save bandwidth when visiting and storing these pages.
The effects are long term, but the achievements are incremental and can be implemented by website owners without being dependent on anyone else or waiting for a network effect. You can do this now for your website, and that already would be a positive outcome. Like using a recycled shopping bag instead of taking a plastic one, it's a small individual action.
This article is meant to provoke and lead to individual action, not propose a complete solution to the decaying web. It's a small simple step for a complex sociotechnical system. So I'd love to see this happen. I intend to keep this page up for at least 10 years.
Thanks to my Ph.D. students Shaun Wallace, Nediyana Daskalova, Talie Massachi, Alexandra Papoutsaki, my colleagues James Tompkin, Stephen Bach, my teaching assistant Kathleen Chai, and my research assistant Yusuf Karim for feedback on earlier drafts.
See discussions on Hacker News and reddit /r/programming