Update: This post got a ton of traffic on Hacker News today and Pinterest reached out to comment: “The claim that we scrape Google search results is false. We do not, and never have, scraped Google search results at any time.” The original article suggested Pinterest scrapes Google directly, but it seems more likely that Pinterest grabs data from Google through its Chrome extension. We’ll update this post as we learn more from them.
A few weeks ago in the Twitterverse, @SwiftOnSecurity outed Pinterest for using a somewhat surprising SEO tactic: for every image uploaded to Pinterest that doesn’t have any real metadata or description of the picture, Pinterest automatically performs a reverse image search on Google, scrapes all of the metadata and descriptions they can find for that image, and then uploads that content onto their site and pretends it’s from their own users.
This is interesting for a couple of reasons:
Content relevance is a ranking factor in Google. The more closely your description of a topic or image matches how Google itself understands it, the better your chances of ranking higher in their index. Will Google find this behavior flagrantly black hat and respond accordingly?
Pinterest has been super successful with SEO growth over the years. Their post on Demystifying SEO with Experiments was particularly inspirational for me in deciding to start RankScience, an SEO automation and A/B testing company. So any time I hear about programmatic SEO tactics that work on a site as large as Pinterest, with 800M+ pages indexed in Google, I’m intrigued. This is obviously a strategy they would never talk about doing publicly, so it’s fascinating to see it exposed and called out like this.
Voila! Instant unique and scalable SEO text content that maps directly to Google’s understanding of the photo. Google indexes the Pinterest page with the new text content and ranks it higher because of the strong relevance of the text on the page to its existing understanding of the photo. Rinse and repeat across millions of photos.
John Mu, Webmaster Trends Analyst at Google, and part of the webspam team responsible for policing SEO behavior, chimed in on the thread and offered support for the content available on Pinterest. He didn’t comment directly on this behavior, but I’d bet that the popularity of this thread alerted some people at Google and that there’s an investigation going on internally into this practice at Pinterest ($PINS). The only reason Google would let this slide is if they don’t view policing Image Search as a high priority.
Content relevance is an important ranking factor in Google search. It’s widely accepted that Google calculates relevance for individual URLs and pieces of content as they relate to a particular query or keyword, and that these quantitative relevance calculations play a role in its ranking algorithms. In the Pinterest example, they’re taking an image that Google already knows about and grabbing multiple text descriptions of that image from Google itself, then combining them in one place to provide one comprehensive page describing the image. This maps exactly to Google’s existing understanding of that image, so the page then likely achieves a very high content relevance score.
One way you can apply what Pinterest is doing to improve the rankings of content on your own site is to use an NLP method called TF-IDF (term frequency-inverse document frequency). This is a text analysis technique that helps reveal how important a word or phrase is to a document in a corpus (for example, a collection of URLs). You can either break out a spreadsheet and do this by hand, or use an advanced content optimization tool like RS Content Insights to do this analysis at scale.
Let’s say that you wanted to rank in Google for “google image search seo”. We already know which documents Google thinks are the most relevant and highest authority for this search term because those are the pages that show up in search results. So we can start by downloading the top 25 URLs ranking in Google for “google image search seo” and performing TF-IDF analysis across all of those documents to reveal key topic entities that are semantically related to the search term.
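If you’re curious what that analysis looks like under the hood, here’s a minimal sketch in plain Python. It assumes you’ve already downloaded the ranking pages and extracted their text. The function names (tokenize, tf_idf, top_terms) and the three sample documents are made up for illustration, and this is not how Content Insights is implemented. It also uses the most basic TF-IDF formula with the ranking pages themselves as the corpus, so a word that appears on every page scores zero; real tools may well compare against a much larger background corpus and look at multi-word phrases.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {term: score} dict per document.

    tf  = term count / total tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Count how many documents each term appears in.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on empty pages
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(documents, k=5):
    """Sum each term's TF-IDF score across all documents; return the k highest."""
    totals = Counter()
    for doc_scores in tf_idf(documents):
        totals.update(doc_scores)
    return [term for term, _ in totals.most_common(k)]


# Hypothetical stand-ins for the text of the top-ranking pages.
pages = [
    "alt text helps image search",
    "image file size matters",
    "stock photos image quality",
]
print(top_terms(pages, k=3))
```

In this toy corpus “image” shows up on every page, so its score is zero and it never makes the list, while the words that only some pages use float to the top.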
Here are the topic entities produced by TF-IDF when we ran this post that you’re reading right now through Content Insights for “google image search seo”.
You’ll see that TF-IDF analysis suggests using keywords like alt tags, alt text, image quality, file size, and stock photos, which are all associated with “google image search seo”, even though they are not replacements or alternatives to the keyword. This gets directly at Google’s understanding of the topic, and using this method you can get your post ranking higher thanks to a better content relevance score. It’s often surprisingly effective. In addition to SEO A/B testing, which everyone should be doing by now, using NLP and TF-IDF to refresh and update existing long-form content on your site is an incredibly effective way to grow search traffic and rankings in 2020, and an important tool in any marketing team’s toolkit.
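Once you have a list of suggested terms, a quick way to act on it is to check which ones your draft doesn’t mention yet. Here’s a rough sketch, assuming the suggested terms are the ones listed above; the missing_terms function and the sample draft are hypothetical. It only does exact whole-word matching, so it won’t catch plurals or other variants.

```python
import re


def normalize(text):
    """Lowercase the text and collapse it to space-separated word tokens."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def missing_terms(draft, suggested_terms):
    """Return the suggested terms that never appear in the draft as whole words."""
    padded = f" {normalize(draft)} "
    return [
        term for term in suggested_terms
        if f" {normalize(term)} " not in padded
    ]


# Terms taken from the TF-IDF suggestions mentioned above.
suggested = ["alt tags", "alt text", "image quality", "file size", "stock photos"]

# Hypothetical draft text.
draft = "Write descriptive alt text and keep the file size small."

print(missing_terms(draft, suggested))
# prints ['alt tags', 'image quality', 'stock photos']
```

Anything that comes back is a candidate topic to work into the post where it fits naturally, not a keyword to stuff in.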
Get Data-Driven about growing your traffic with RankScience.