Daniel D. Beck

4 ways to keep PDF docs off Google search results

If you offer documentation as both web pages and PDFs, then there’s an annoying consequence for readers who search for your docs: Google might rank your PDFs as high as, or higher than, your regular web pages.

When a search engine results page ranks your PDFs highly, it draws traffic away from your freshest, most accessible documentation. It funnels traffic to a format that is “unfit for human consumption” on the web.

But you might not be able to get rid of those PDFs just yet. Maybe you need a print-only version of your docs or you can’t yet produce better offline formats (such as EPUB). Wouldn’t it be nice to continue to offer PDFs without compromising search results?

It’s SEO, but not the underhanded kind

You can’t manipulate search results directly, but you can give hints that stop Google and other search engines from driving traffic to your PDFs. It’s a little bit of search engine optimization, but not the underhanded kind.

There are at least four methods that can help:

  1. The canonical method: tell search engines to treat a PDF URL as equivalent to another URL using a special HTTP response header.
  2. The noindex method: tell search engines to ignore a PDF URL using an HTTP response header.
  3. The no-robots method: try to hide a URL from Google and other web crawlers using changes to your robots.txt, sitemap, and links.
  4. The password method: password protect the PDFs themselves, so Google ignores them.

Read the following sections to learn when each method works best and how to use it. And, whichever method you choose, don’t miss the general tips at the end.

The canonical method: tell Google to prefer web pages over PDFs

The first and best method is to tell Google’s crawlers (and other web clients) that there’s another, better URL for search results: the canonical link.

How it works: Set the Link header with the rel="canonical" parameter in responses to requests for PDFs. It’s like using a <link rel="canonical"> tag in an HTML file, but for non-HTML files.

Advantages: This method usually preserves or strengthens your search results ranking. When you serve a PDF with a canonical URL, the PDF URL’s ranking is added to the canonical URL’s ranking. In its index of sites, Google consolidates the two URLs and prefers to show the canonical URL in search results.

Disadvantages: This method assumes that there’s a single, preferred alternative to your PDF (such as a web-friendly page or a PDF gateway page) and that you have control over your web server or site hosting configuration. If you can’t satisfy both of these requirements, then you’ll need to use another method.

Example: Suppose you serve a PDF at https://docs.example.com/assets/user-guide-v3.2.1.pdf, which is a print alternative to the web content at https://docs.example.com/user-guide/. Serve the PDF with the following HTTP header:

Link: <https://docs.example.com/user-guide/>; rel="canonical"

Or here’s what it looks like in the context of an actual response:

HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Length: 99778
Content-Type: application/pdf
Date: Wed, 28 Jul 2021 09:21:52 GMT
Link: <https://docs.example.com/user-guide/>; rel="canonical"

Configuring this can be a bit tricky, since the details depend on your web server or site hosting service. For example, on Netlify, the configuration for Link in netlify.toml looks like this:

[[headers]]
  for = "/assets/user-guide-v3.2.1.pdf"

  [headers.values]
    Link = '<https://docs.example.com/user-guide/>; rel="canonical"'
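
If you manage your own server instead, a roughly equivalent nginx rule might look like this (a sketch, assuming the PDF is served at that exact path; the location block belongs inside your existing server block):

# Advertise the web page as the canonical URL for this PDF
location = /assets/user-guide-v3.2.1.pdf {
    add_header Link '<https://docs.example.com/user-guide/>; rel="canonical"';
}

Whichever server you use, you can confirm the header is actually being sent with a quick check, such as curl -sI https://docs.example.com/assets/user-guide-v3.2.1.pdf.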

Further reading: To learn more about how Google interprets this method, read Google’s Consolidate duplicate URLs docs.

The noindex method: tell Google to ignore your PDFs

The next best option is to tell search engines to explicitly ignore your PDFs with the noindex header value. This method is like the canonical method, except that it drops a URL from search indexes instead of consolidating it with another URL.

How it works: Set the X-Robots-Tag header with the noindex value in responses to requests for PDFs. This is like using a <meta name="robots" content="noindex"> tag in an HTML file, but for non-HTML files.

Advantages: This method usually removes a PDF from search results. It works well in situations where you’re continuing to serve a PDF with no web-friendly alternative and you don’t want new readers to stumble upon it.

This method also works well when you need a one-size-fits-all fix for PDFs in search results. If you have many PDFs and it’s impractical to figure out a canonical URL for each, then dropping them from search might be a practical solution.

Disadvantages: Like the canonical method, this method assumes you have control over your web server or site hosting configuration (though it will probably be less fussy for bulk use). If you can’t change your server’s headers, then you’ll need to use another method.

Example: Suppose you serve many PDFs at addresses starting with https://docs.example.com/assets/ and you want to remove them from search results. Serve the PDFs with the following HTTP header:

X-Robots-Tag: noindex

Or here’s what it looks like in the context of an actual response:

HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Length: 99778
Content-Type: application/pdf
Date: Wed, 28 Jul 2021 09:21:52 GMT
X-Robots-Tag: noindex

For example, on Netlify, the netlify.toml configuration to set X-Robots-Tag for every PDF looks like this:

[[headers]]
  for = "*.pdf"

  [headers.values]
    X-Robots-Tag = "noindex"
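
If you’re on Apache instead, a roughly equivalent rule might look like this (a sketch, assuming the mod_headers module is enabled; it can go in your server config or an .htaccess file):

# Ask search engines not to index any PDF
<FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex"
</FilesMatch>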

Further reading: To learn more about how Google interprets this method, read Google’s Block Search Indexing with ‘noindex’ docs.

The no-robots method: hide your PDF URLs from Google (if you can)

The next option is to obscure your PDFs from Google’s gaze in the hopes that it will de-rank PDFs in search results.

How it works: Minimize search crawlers’ ability to find URLs for PDFs and block them from visiting the URLs they do find. This combines a few individual techniques:

  1. Add Disallow rules to your robots.txt to block crawlers from visiting your PDF URLs.
  2. Remove the PDF URLs from your sitemap.
  3. Remove, or at least reduce, links to the PDFs from your own pages.

Advantages: This method doesn’t require you to have control over your web server’s HTTP headers, so it may be especially useful when you’re hosting PDFs on a service that doesn’t let you configure headers, such as GitHub Pages. It also doesn’t add any hurdles to opening your PDFs, as the next method does.

Disadvantages: This method requires the most effort to implement and it’s the least likely to succeed. Even if you forbid a search engine from visiting a PDF, it can still index that PDF (based on links to it, some of which you may not control) and show it in search results, though it will probably rank it lower.

Example: Suppose you serve a PDF at https://docs.example.com/assets/admin-guide.pdf. Do this:

  1. Add a Disallow rule for /assets/admin-guide.pdf to your robots.txt, as shown below.
  2. Remove https://docs.example.com/assets/admin-guide.pdf from your sitemap.
  3. Remove or reduce links to the PDF from your own pages.
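
For the robots.txt change, a minimal rule that blocks all compliant crawlers from fetching that one PDF looks like this:

# Keep crawlers away from the admin guide PDF
User-agent: *
Disallow: /assets/admin-guide.pdf

Remember that this only stops well-behaved crawlers from fetching the file; as noted above, Google can still index the URL based on links to it.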

Further reading: To learn more about how Google interprets this method, read Google’s documentation on robots.txt and sitemaps.

The password method: lock Google out of your PDFs

The last option is to password protect your PDFs, which ought to cause your PDFs to fall out of search results.

How it works: Strongly discourage Google from ranking your PDFs by applying password protection to them. Search engines don’t typically rank results that are known to require a password, so this approach has an effect similar to the noindex method.

Advantages: This method is easier to implement than the no-robots method, particularly if you don’t have control of your web server’s HTTP headers. It’s also more likely to work than the no-robots method.

Disadvantages: This method creates the poorest experience for your readers. To read your PDFs, they need to receive the password from you, know how to use it, and enter it every time they open the PDF. This is likely to annoy your readers and lead to many support requests.

If you have few PDF readers, then this method may be tolerable. If not, strongly reconsider one of the previous methods.

Example: Apply a password to your PDF, using your PDF software of choice, then replace the original PDF with the password-protected version.
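
For instance, with the open-source qpdf command-line tool, a sketch of applying a password might look like this (the passwords and filenames here are placeholders; the first password is the one your readers will type to open the file):

# Encrypt with AES-256; readers open the file with reader-password
qpdf --encrypt reader-password owner-password 256 -- user-guide.pdf user-guide-locked.pdf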

Further reading: To learn more about this method, read Google’s Control what you share with Google docs.

General tips

If you’re stuck serving PDFs, then hopefully one of these methods offers a least-worst way of dealing with PDFs in search results. But no matter which method you choose, keep these closing tips in mind:

Don’t combine strategies for a single URL. While you can use different methods for different URLs (such as canonical for one PDF and noindex for another), don’t try combining methods for the same URL. They may cancel each other out or lead to surprising results.

Be patient. Search results that include your PDF won’t change immediately. Search engines periodically crawl the web, looking for new pages and changes to existing pages, and your hints to Google won’t be reflected in search results until after your site is recrawled. That can take weeks, particularly if your site is new or low-traffic.

Focus on content. While it’s good to do a little SEO for things that cause a poor experience for your readers (like splitting results between equivalent PDF and non-PDF content), don’t get obsessed with optimizing results for individual URLs. The fundamentals of web content—such as writing headings that contain keywords relevant to your readers—matter more than tweaking Google’s results.