Search 101: How to Find the Lost Web

"Privacy activists have a heightened sense of the way censorship works hand in hand with surveillance to build the classic picture of Nineteen Eighty-Four. And when we know a search engine is capable of giving us accurate, relevant results, but doesn't, we realise we're seeing a form of censorship."

Google's lost the internet. You might have seen a few complaints. Whether they've come courtesy of anons in the underbelly of the Fediverse, or a viral soundbite from Edward Snowden, a growing catalogue of gripes is asserting that web search is no longer fit for purpose. Well, unless web search's purpose is to detect capitalism. In which case thumbs up. The search engines are better than ever at that. They now surface ecommerce, ad-tech, and affiliate-pumped listicle hell so reliably that we barely even need to enter a search term.

But the internet we used to know and love, brimming with offbeat gems from passionate authors... That's gone missing. And with it, the humour. The imagination. The individuality... Maybe we've just forgotten how to use a search engine?...

Nope, it's definitely not us. Change the nuance of the query. Add a tail. Use quotes... It doesn't seem to matter anymore. We get the same list of crap based on one or two commercially-associable keywords, more or less whatever else we type. What we won't get, is what we actually searched for. If we're looking for a specific piece of information, the average web search engine is going to ignore the specifics and hammer us with a scripted smorgasbord of abject capitalism, augmented with a token entry from Wikipedia - frequently the only useful result.

"The value of being able to filter spam chronologically is immense, and it can completely demolish virtually any myth built by the information machine."

But the thing is, Wikipedia has its own search engine, which is mercifully devoid of results from Amazon, eBay, and an army of affiliate-drones' listicle sites. So, if Wikipedia is often the only genuinely useful and non-commercial resource we're finding in the visible output of a web search engine, why are we still running to the likes of Google, Bing and DDG as a first resort? Why do we not just use Wikipedia Search as a basic source of general knowledge results, and supplement that with a range of other search facilities which are at least attempting to give us what we ask for?

That's exactly what this article is going to suggest.

DO WE REALLY NEED TO REPLACE WEB SEARCH?

It's no coincidence that many of the complaints about search quality are coming from privacy activists. Privacy activists have a heightened sense of the way censorship works hand in hand with surveillance to build the classic picture of Nineteen Eighty-Four. And when we know a search engine is capable of giving us accurate, relevant results, but doesn't, we realise we're seeing a form of censorship.

Search engines have nannied us for a long time, assuming by default that we mistyped any query that isn't verified as a popular topic. But we're beyond that now. We're no longer in the realm of "Are you sure you want that?". We've descended into...

"No, you don't want that."

"Yes I do."

"No you don't."

"Yes I do."

"No you don't."

Even if we still regard this as heavy nannying rather than censorship, do we really want to go through that ever-lengthening argument every time we run a web search? Just for the sake of our "framework of mind", as Dickie Valentino put it in the cult 1994 movie There's No Business..., we have to start finding better ways to access information. It's unlikely we'll break free from major web search engines entirely, but now is definitely the time to start reducing our dependency on them.

"Once upon a time you could search Google for the world's most useless product and dive into a motley collection of chucklesome mock ads - topped, if I remember rightly, by an enthusiastic promotion for... Well, is it a screw? Is it a nail? No - it's a scrail! Still makes me laugh today. But that same search produces wall to wall e-corp listicles now. And if you search for scrails you get Amazon trying, in all seriousness, to flog you an actual packet of scrails. It's like, is there any Google search at all you can now run that doesn't eventually lead to Amazon?"

FOR ACCURATE INFORMATION, TWITTER SEARCH IS NOW MORE USEFUL THAN GOOGLE SEARCH

It's almost incomprehensible that the worldbeating sophistication of Google Search could regress so far as to allow a micro-blogging site to provide more relevant information, but that's where we are. And one of the main reasons Twitter Search has become more popular than Google with many people who research for a living, is the platform's rigid protection of chronological integrity.

There are three components to this...

All Tweets are dated and uneditable.
Twitter Search allows us to define a date range.
One of the best ways to find a relevant search result is to filter out spam, and spam tends to come in waves, which are based on trends and current affairs. In other words, a reliable date filter can serve as a reliable spam filter.

When something becomes a talking point, the search results are overwhelmed by spammers, news sites, megablogs, etc, jumping on that talking point in a bid for search traffic. They know everyone is looking for info on that subject, so they produce content about it whether or not they have anything to say. This vast glut of very high ranking domains then squeezes out all of the previous results, and that usually makes finding previously published information through typical web search methods incredibly difficult - if not impossible. Some of these assaults of skimpily-researched verbal diarrhoea actually end up changing history, as the public accept bone idle journalism as truth and the reality is buried out of sight.

But Twitter's Advanced Search allows us to cut through the spam by defining a date range in a search query. Because no one can manipulate the dates of Tweets, or any of the information contained within them, filtering out the period of the spam assault can completely remove the spam.

For example, if you want to know what the consensus on vaccination was in summer 2019 before the covid pandemic, a web search engine will overwhelm you with spam about the covid vax. But if you search on Twitter, limiting the date range to summer 2019, you will get precisely the consensus that existed at the time, and nothing else. There's no contamination, because no one can fake a summer 2019 Tweet. If the Tweet is dated June, July or August 2019, then that's when it was published, and its content is what it contained back then. This is very different from the output of web search engines, where random third parties control the information sources, and you have all manner of people manipulating both post content and post dates in a bid to win traffic.
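The same date range can be typed straight into the ordinary Twitter search box using the since: and until: operators, which is handy if you want to bookmark a dated search. Here's a minimal sketch of that idea in Python. The function names are made up for illustration, and the search URL layout is an assumption based on how the site has behaved, so check it against the live site (including how the end date is treated at the boundary) before relying on it.

```python
from datetime import date
from urllib.parse import urlencode


def dated_twitter_query(terms: str, start: date, end: date) -> str:
    """Return a search string restricted to the given date range."""
    return f"{terms} since:{start.isoformat()} until:{end.isoformat()}"


def twitter_search_url(query: str) -> str:
    # Assumed URL layout; f=live asks for the chronological "Latest" tab.
    return "https://twitter.com/search?" + urlencode({"q": query, "f": "live"})


# Summer 2019, as in the vaccination example above.
query = dated_twitter_query("vaccination", date(2019, 6, 1), date(2019, 9, 1))
print(query)
print(twitter_search_url(query))
```

Paste the printed query into the search box, or bookmark the printed URL, and the spam wave that arrived after the end date simply isn't in the picture.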

BUSTIN' MYTHS

The value of being able to filter spam chronologically is immense, and it can completely demolish virtually any myth built by the information machine. In the 2010s I got curious as to the origins of the Twitter hashtag. I wanted to know who invented the idea. Wikipedia and a clutch of other sites assured me it was Chris Messina. How predictable, I thought. Twitter hashtag invented by a high-profile, privileged dude with connections galore. But my life experience told me that well-connected dudes with high public profiles are much better associated with taking credit for inventions than actually inventing them. So I decided to check out the story on Twitter itself.

Because of Twitter's chronological integrity and the fact that I could restrict the period of investigation to a time before Messina's claim, I was able to establish that Messina did not in fact invent the Twitter hashtag. I wrote a post documenting the truth back in 2016. Sadly, it's been one of the least visited posts I've ever written. The search engines are quite happy with wall to wall regurgitations of the Wikipedia line. But the post does demonstrate how much more accurate Twitter can be as an information source than a typical web search engine. And whilst single Tweets are limited (by character-count) in their ability to elaborate on a story, collectively they can prove extremely thorough in the picture they provide.

"These obscure search engines are incredibly refreshing to use, because they deliberately punish the exact, cash-crazed ideology that Google goes out of its way to reward."

Twitter also affords us a directional filter on information. By default, we only really see what influential voices are saying. But we can filter a Twitter Advanced Search to show only the replies TO those influential voices. That directional filter can serve as an ideological filter and quickly take us to the opposing views which a web search engine can easily hide.

This works brilliantly where marketing or propaganda is strong. For example, a brand is only ever going to tell you what it gets right. Never what it gets wrong. The brand will typically use SEO strategies with web search engines, to ensure that its official messaging occupies the whole front page, and that the more negative feedback is buried under a continuous spew of marketing. But using Twitter Search we can completely filter out the brand's own messaging and search only the replies to it. This gives a much truer picture of the brand's performance, and we additionally get to see whether the brand addresses issues raised by members of the public, or simply ignores them.
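As a rough sketch of that directional filter, the to: operator asks for Tweets addressed to an account, and -from: removes the account's own posts. The handle "examplebrand" and the helper name below are placeholders, and the URL layout is again an assumption rather than anything official.

```python
from urllib.parse import urlencode


def replies_to_query(account: str, terms: str = "") -> str:
    """Build a search string for Tweets sent to an account, minus its own posts."""
    handle = account.lstrip("@")
    parts = [f"to:{handle}", f"-from:{handle}"]
    if terms:
        parts.append(terms)
    return " ".join(parts)


# "examplebrand" is a placeholder handle, not a real account reference.
brand_query = replies_to_query("@examplebrand", "refund")
print(brand_query)
print("https://twitter.com/search?" + urlencode({"q": brand_query, "f": "live"}))
```

Adding a keyword such as "refund" or "broken" narrows the replies to the complaints a brand is least likely to publicise itself.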

"It's no longer about the consumer. It's 50% an elitist closed shop in which Amazon, eBay, YouTube and Co. win by default, and 50% a "which established e-corp can bribe the most PR7s and pump the most elaborate data graph into Silicon Valley?" contest."

CUSTOMISED SEARCH

Instances of the decentralised search engine Searx (listed here - page requires JavaScript) are often recommended as an alternative to bigger web search engines. But it's rarely explained how the search capabilities offered by Searx can be rigorously customised to focus on the best sources of information for a given subject.

Searx is all about metasearch. That is, compiling results from a variety of different search indexes. But with Searx, you can choose which indexes you want to query. If you've explored and tested various instances of Searx, you've probably noticed that the search results can be vastly different from one instance to the next. That's because each one is set up by its administrator to query a different selection of indexes. But the range of sources a Searx instance queries is also open to user-customisation. By going into the Preferences, you can define exactly whose results you want, and whose you don't.

I'll use Searx Belgium as an example, because I've found it to be reliable. There are tabs along the top of the results page that denote categories of search. Once you've entered a search term and have a results list on screen, you'll see that the results list is headed with horizontal selection options such as General, Images, Videos, News, etc. Unlike with Google, you can simultaneously choose as many or as few of these search categories as you like. Just select the tab or tabs you want and then re-click the Start Search button.

"The Searx Preferences page illustrates just how many different search resources there are, and names them so we can investigate them in their own right."

Let's say you de-selected the General tab - which is selected by default - and instead selected the Social Media tab. You'll see a dramatic change in the results. Rather than being sourced from Google, Wikipedia, etc (which are Searx Belgium's default sources for General search), the results are now solely coming from Reddit (which is Searx Belgium's default source for Social Media).

I really like having the option to get a selection of results solely from Reddit, because community Q&A discussion is broadly a lot more genuine than the output of some listicle merchant whose real goal is not to help you solve a problem, but to pocket some commission from Amazon. Even if the contributors on Reddit are not experts (and sometimes they are), collectively they're likely to get you closer to a real solution than an expert blogger who isn't even trying to help.

True, we could confine Google or DuckDuckGo search results to Reddit by prefixing our search term with site:reddit.com - and this is one of the only really reliable techniques left for filtering out the annoying spam on major web search engines. But we've come to expect greater convenience than having to type a website domain into a search box, and that's what the tab system on Searx gives us.
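If you'd rather not type the domain each time, the site: prefix is easy to script. This is only a sketch: the shortlist of sites and the function names are hypothetical, and the DuckDuckGo URL pattern is assumed from the address the site shows after a normal search.

```python
from urllib.parse import urlencode

# Hypothetical shortlist of sites worth searching directly.
TRUSTED_SITES = {"reddit": "reddit.com", "wikipedia": "en.wikipedia.org"}


def site_query(terms: str, site_key: str) -> str:
    """Prefix the search terms with a site: restriction."""
    return f"site:{TRUSTED_SITES[site_key]} {terms}"


def duckduckgo_url(query: str) -> str:
    # Assumed URL pattern for a plain DuckDuckGo search.
    return "https://duckduckgo.com/?" + urlencode({"q": query})


reddit_query = site_query("linux mint wifi dropping", "reddit")
print(reddit_query)
print(duckduckgo_url(reddit_query))
```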

Out of the box, the Searx instance in our example already offers some easy ways to customise the search results for specific needs. But by pitching into the Preferences, we can further tailor the sources for each of those category tabs. For example, we could restrict the image search sources solely to Unsplash, or Flickr. Then we filter out all of the news site spam and very predominantly find photography enthusiasts instead.
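The category and source choices can also be expressed in the search URL itself, which makes them bookmarkable. The sketch below assumes the q, categories and engines query parameters described in the Searx search API documentation. The instance address is a placeholder, and the category and engine names ("social media", "unsplash", "flickr") are assumptions that depend on what a given instance's administrator has enabled.

```python
from urllib.parse import urlencode

# Placeholder address; substitute the instance you actually use.
INSTANCE = "https://searx.example.org"


def searx_url(terms, categories=(), engines=(), instance=INSTANCE):
    """Build a Searx search URL limited to the chosen categories and engines."""
    params = {"q": terms}
    if categories:
        params["categories"] = ",".join(categories)
    if engines:
        params["engines"] = ",".join(engines)
    return f"{instance}/search?" + urlencode(params)


# Social Media tab only.
print(searx_url("heat pump noise", categories=["social media"]))
# Image search restricted to two photography-led sources.
print(searx_url("harbour at dawn", categories=["images"], engines=["unsplash", "flickr"]))
```

Unlike the Preferences page, a URL like this doesn't depend on cookies, so it survives a cleared browser profile. Whether a particular instance honours every parameter is something to test rather than assume.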

"Independence from major web search is something we can, and should, try to build progressively."

Incidentally, if you do make any changes in Searx Preferences, don't forget to scroll down and hit the Save button at the bottom of the page. Otherwise your changes won't register. You'll also need to have cookies enabled for the browser to remember your prefs.

The next step up from here is spinning up your own Searx instance. This requires the use of a server (although it's included in the pre-packaged installation options if you use FreedomBox). It does, however, afford you an even more detailed realm of customisation. Not everyone will go this far, but the option is there for those who want to take it to another level.

One of the other great benefits of the Searx Preferences page, beyond simply changing the searchable indexes, is that it illustrates just how many different search resources there are, and names them so we can investigate them in their own right. For instance, you might spot Wiby among the General search options. What's Wiby?...

UNDERGROUND SEARCH ENGINES

Wiby sits alongside Marginalia, representing a budding breed of search engines that shun the modern internet and focus on the simpler and more imaginative web of yesteryear. A time of enthusiasm, as opposed to pathological obsession with revenue. These obscure search engines are incredibly refreshing to use, because they deliberately punish the exact cash-crazed ideology that Google goes out of its way to reward. Within moments, the offbeat output from these underground resources illustrates just how tiresomely predictable Google Search and its derivatives have become.

"We simply can't trust a search engine to find a useful post again next week, so anything at all that we have serious intentions of revisiting, we realistically need to bookmark."

That small operators can build these products with limited indexes, and serve results which wake us up in a way that the mighty, multi-$billion Google has long since ceased doing, attests to a stark reality... Google no longer wants to stimulate us mentally. It just wants to haul us into a commercial brainwashing system and fire off its bullshit-ass lab-ratting schemes in every last corner of our itinerary.

If you've recently tried to use a major web search engine to find original, detailed, historical web analysis published in the 1990s or early 2000s, you'll know how deeply frustrating it can be to encounter nothing but 500-word SEO spins that some half-assed journalist wrote on a news site in 2020 or 2021. This is where underground engines like Marginalia and Wiby really come into their own. If you want to know what people were writing about Windows 98, in 1998, the best chance you have of achieving that with a minimum of hacks and advanced workarounds, is with a search engine like Marginalia or Wiby.

USING THE WIKIPEDIA CITATIONS LISTS AS SEARCH RESULTS

Another clever way to access high quality, vetted resources with zero spam, and zero advertising, is to employ a two-step process in which the Wikipedia citations lists serve as sets of search results. Search your query on Wikipedia, click through to the relevant page, then scroll straight to the bottom and review the References, Sources or External Links sections. Whilst a lot of entries will be hard copy books or links to other pages on Wikipedia itself, there are usually some links to definitive posts on other websites.
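For anyone who wants to automate the two-step process, here's a rough sketch using the MediaWiki API, which can return a page's external links via action=parse with prop=externallinks. The helper names and the list of hosts treated as "Wikipedia itself" are my own illustrative choices, and the fetch function makes a live network request, so it is defined but not called here.

```python
import json
from urllib.parse import urlencode, urlparse
from urllib.request import Request, urlopen

API = "https://en.wikipedia.org/w/api.php"
# Hosts treated as "Wikipedia itself" and skipped (an illustrative list).
INTERNAL = ("wikipedia.org", "wikimedia.org", "wikidata.org")


def external_links_url(page_title: str) -> str:
    """Build the API request for a page's external links."""
    params = {"action": "parse", "page": page_title,
              "prop": "externallinks", "format": "json"}
    return API + "?" + urlencode(params)


def is_internal(host: str) -> bool:
    return any(host == d or host.endswith("." + d) for d in INTERNAL)


def outside_sources(links):
    """Keep links pointing away from Wikimedia sites, de-duplicated, in order."""
    seen, kept = set(), []
    for link in links:
        host = urlparse(link).netloc.lower()
        if not host or is_internal(host) or link in seen:
            continue
        seen.add(link)
        kept.append(link)
    return kept


def fetch_sources(page_title: str):
    # Network call; the User-Agent string is a made-up example.
    request = Request(external_links_url(page_title),
                      headers={"User-Agent": "citation-sketch/0.1"})
    with urlopen(request) as response:
        data = json.load(response)
    return outside_sources(data["parse"]["externallinks"])


print(external_links_url("Windows 98"))
```

Calling fetch_sources("Windows 98") should then give a plain list of the outside sites the page cites, which can be read like a set of search results. Hard copy books and citations without links won't appear, so the References section itself is still worth a skim.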

Wikipedia operates in a parasitic manner, taking information from everywhere, whilst using "nofollow" link attributes to low-key strip its sources of validity in the eyes of Google. So what tends to happen is that the Wiki rises higher and higher in the search results, while the visibility of the sources steadily declines. This means that many of the sources cited in the Wikipedia References lists, even though extremely high quality, are no longer prominent on major web search engines. A perfect illustration of the problems with search engines, as well as the Machiavellian behaviour of Big Tech - of which Wikimedia is a component.

"You can create your own small search engine just by scaling up your bookmark collection. I would recommend this to anyone."

But we can use the Reference lists themselves as valuable pointers to quality sources, which lead us to real experts who can give us deeper insight. In general knowledge fields, this method can be a lot more productive (and certainly more reliable) than merely querying a major web search engine. In using this method and visiting original, high quality source sites, you're also helping to reward the people who have been screwed over by Wikimedia and Google.

COLLECTIVE STRENGTH

So, our search bookmarks now combine Wikipedia Search, Twitter Advanced Search, a customised Searx instance or two, and the retro-focused Wiby and/or Marginalia. This collective base gives us better access to truthful, useful and insightful information than we'll get from hopefully banging queries into Google, Bing or DuckDuckGo. And importantly it also helps free us from much of the timewasting irrelevance that formerly dominated our search results. Simply, we see the word "Amazon" a heck of a lot less, and if you're anything like me, that's a lifestyle-improvement in itself.

There will still be times when we need a more mainstream engine, but the mainstream engines are now too overwhelmed with what Google used to rank down as "webspam", to serve as a first resort. Now that Google positively loves and encourages "webspam", and instead spends its time ranking down sites that don't give friggin' Tag Manager enough gainful employment, it's no longer about the consumer. It's 50% an elitist closed shop in which Amazon, eBay, YouTube and Co. win by default, and 50% a "which established e-corp can bribe the most PR7s and pump the most elaborate data graph into Silicon Valley?" contest.

"This is a conspiracy. And conspirators rarely fool the public forever."

It's hard to stop relying on major search engines, because the one advantage they still do have is convenience. Being able to search everything from one place becomes a habit, and it's a hard habit to break. But we have to move on from a reliance on search engines like Google before things get even worse.

BUILDING A BANK OF NICHE RESOURCES

Independence from major web search can also be built progressively. If we make a point of looking for search facilities on sites that do provide us with good value, and then use those searches directly in future, we steadily reduce our reliance on the ad machine.

For most people, web search engines don't really serve that many specific purposes. So whilst we might imagine having to build a very long list of niche resources in order to replace something like Google, in fact, a relatively limited number of entries will cover most of the ground.

As a writer, one of my common queries is a synonym search. I became aware that I was searching for synonyms a lot, and that the sites I ended up visiting often gave poor matches, or had a grim user-experience. So the next time I found a site that gave me a good user experience and useful synonyms, I bookmarked the site - WordHippo - and used their internal search instead of constantly searching the whole web. Much quicker, no wading through ecommerce entries, and it's done the job.

BOOKMARKING - A MEASURE OF THE FAILED SEARCH

Through the twenty-tens I realised I was bookmarking more and more URLs, and I can see today that I do it obsessively. We've reached a point where we simply can't trust a search engine to find a useful post again next week, so anything at all that we have serious intentions of revisiting, we realistically need to bookmark.

But well-categorised bookmarking is another compound escape route from major web search engines. Often, we know exactly which site we want to go to, but we don't remember the URL or the precise domain name. Is it .com or .org? .net or .me? Or .co.uk?... Many of us just tap the site name into a search engine and hit the link in the results. And even then we don't necessarily make a mental note of the domain name. We just keep repeating the same behaviour. Running that site name through a search engine every time we want to visit. This might happen twenty times, forty times, or more. So just by adding one bookmark to a browser, we might save ourselves scores of web searches. It all adds up.
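A bookmark collection can even be searched like a tiny private index. The sketch below assumes the HTML export format that most browsers offer for bookmarks, and uses only Python's standard library. The class and function names are hypothetical, and the two sample entries are stand-ins for a real export file.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect (title, url) pairs from an exported bookmarks HTML file."""

    def __init__(self):
        super().__init__()
        self.bookmarks = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.bookmarks.append(("".join(self._text).strip(), self._href))
            self._href = None


def search_bookmarks(bookmarks, term):
    """Case-insensitive match against both the title and the URL."""
    term = term.lower()
    return [(t, u) for t, u in bookmarks if term in t.lower() or term in u.lower()]


# Stand-in for the contents of a real export file.
sample = ('<DT><A HREF="https://www.wordhippo.com/">WordHippo synonyms</A>\n'
          '<DT><A HREF="https://example.org/mint">Linux Mint notes</A>')
parser = BookmarkParser()
parser.feed(sample)
print(search_bookmarks(parser.bookmarks, "synonym"))
```

With a real export, read the file's text and pass it to feed() in place of the sample. Searching by a half-remembered site name then answers the ".com or .org?" question without touching a search engine.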

"If dismantling a computer is more convenient than completing what should be a straightforward search process, we really do need to think again."

The general rise in bookmarking is a measure of how little confidence we now have in web search. If you run a website, you're probably seeing a lot more visits from bookmarking resources than you did even just two or three years back. But I was struck a couple of weeks back by the lengths to which I was prepared to go in order to use a bookmark rather than a web search...

Recently I switched my day-to-day operating system setup from Bodhi Linux with a dual-booting Windows partition, to a standalone Linux Mint installation. I left the Bodhi/Windows hard drive in the PC, but disconnected it, then fitted a new disk, and installed Mint. Wicked. Perfectly happy with Mint - no desire to get back into Windows... Until I wanted to find an answer to a tech query that I'd seen on Reddit.

I tried two or three web searches and could see I was getting nowhere. I could have pitched into the usual cat and mouse game of trial and error, using strategic quotes, increasingly long tails, etc. But the thought of all that was actually so gruelling that I instead switched off the computer, unscrewed the casing, connected up the other disk, rebooted, and went into Windows, where I knew I'd saved the bookmark. We think of search engines as a convenience, but if dismantling a computer is more convenient than completing what should be a straightforward search process, we really do need to think again.

SEARCH CRISIS

The creeping perception that web search is not serving our needs is as much a crisis for search engines as it is for us. To date, it hasn't dented the top search engines' profitability, because Google just blasts more and more Big Tech real estate into the results, making the elitist cartel more and more money per query. But there's only so far that can go. If it reaches the point where the public can reliably predict which sites they're going to find in the search results, there is no longer any point in them using a web search engine.

We're already well down that road, and for Google, there's no turning back. If Google boots its own, its partners', its lobbying pals' (raise your hand, Wikimedia), and its corporate supplicants' domains out of the results to allow the wider web back into the picture and restore public faith, its profits are going to bomb. Google now relies on corrupt search algorithms to hit its financial targets, so the only question is how much of that increasingly unsightly road there is left to travel before people begin jumping off the cart in volume.

And at present? Well, it isn't that people don't realise how bad the search results now are. They absolutely do. In research on Twitter, I found a broad recognition that today's search results are worse than they used to be. It's just that people blame publishers, and not the search engines, for the decline.

But the thing is, the pages of yesteryear have not gone anywhere. Wonderfully entertaining sites I became aware of in the late 1990s are still up, still rigorously maintained and updated with high quality writing. But they never show up today in the search results. If I hadn't found out about them years ago, I wouldn't know they existed. And if you Google the names of their admins - the internet-famous of the AltaVista era - the results are topped not by their excellent sites, but by Facebook and LinkedIn accounts. Some belonging to randoms with the same name. So this is not a decline in publishing standards. This is a conspiracy. And conspirators rarely fool the public forever.