When you add a link in a chat message or share it on social networks like Facebook, Twitter, or LinkedIn, you can see a small preview and a short description of the link. The main benefit of link previews is that users know what to expect before they open the link.
In this blog post, we will create a solution that converts a plain link into a rich preview with a title, a description, a domain name, and an image.
I released this solution as an npm package. You can check the source code on GitHub and the demo deployed on Heroku.
Nowadays, we can see the link preview feature in almost every social network or chat app where users can send or share URLs. In this blog post, I want to share with you how to create a link preview function without a third-party API. I’m going to describe the entire strategy of creating link previews, including an implementation that uses open source libraries in node.js.
Why did I decide to write this blog post?
When I needed to create a link preview function, I came across a lot of misleading or outdated information on this topic. When I did find a solution that worked, it was based on a paid third-party API. I hope this article saves you a lot of time figuring out how to build this function with open source libraries in any back-end language.
What should be included in a link preview?
A URL link preview usually contains the title, a description, the domain name, and an image. You can create even richer link previews by providing other information. For more details, see Additional Tips.
How to get data to preview a link?
Facebook launched the Open Graph protocol in 2010, and it is now managed by the Open Web Foundation. Its main goal is easier integration between Facebook and other websites. In practice, the Open Graph protocol allows you to control what information is used when a website is shared. If websites want to use the Open Graph protocol, they must have Open Graph meta tags in the <head> part of the website’s code.
Other social networks also take Open Graph Protocol into account. However, Twitter created its own tags for Twitter Cards, which are called Twitter Card Tags. They are based on the same conventions as the Open Graph protocol. When the Twitter card processor looks for tags on a page, it first looks for the Twitter-specific property, and if it’s not present, it falls back to the supported Open Graph property. More information can be found in the Twitter documentation.
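The fallback order described above can be sketched as a small lookup. This is only an illustration: resolveCardField and the tags object (a plain map from tag name to content) are hypothetical names and not part of any Twitter tooling.

```javascript
// Hypothetical helper: `tags` maps meta tag names/properties to their content,
// e.g. { "og:title": "Hello" }. It is not part of any Twitter API.
function resolveCardField(tags, field) {
  // 1. Prefer the Twitter-specific tag, e.g. twitter:title.
  const twitterValue = tags[`twitter:${field}`];
  if (typeof twitterValue === "string" && twitterValue.length > 0) {
    return twitterValue;
  }
  // 2. Fall back to the Open Graph tag, e.g. og:title.
  const ogValue = tags[`og:${field}`];
  if (typeof ogValue === "string" && ogValue.length > 0) {
    return ogValue;
  }
  return null;
}

const exampleTags = {
  "og:title": "Open Graph title",
  "og:description": "Open Graph text",
  "twitter:description": "Card text"
};
console.log(resolveCardField(exampleTags, "title"));       // Open Graph title
console.log(resolveCardField(exampleTags, "description")); // Card text
console.log(resolveCardField(exampleTags, "image"));       // null
```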
The following Open Graph tags are used to create link previews:
Open Graph Title
This tag works the same way as the <title> meta tag. It allows you to define the title of the content. If Facebook can’t find the og:title tag, it uses the <title> tag instead.
There is no hard limit on the number of characters, but the title should be between 60 and 90 characters, like a meta title. Otherwise, it may be shortened or truncated. For example, Facebook will truncate it to 88 characters.
Open Graph Description
This tag is similar to the meta description tag. This is where the content of the website is described. Similar rules apply to this tag as to the title tag: if a social media bot can’t find the og:description tag, it uses the meta description. There is no limit on the number of characters, but you should aim for around 200 characters.
Open Graph Image
An image is probably the most eye-catching element in the link preview. You can define the image with the og:image tag. The recommended resolution is 1200 x 627 pixels (1.91:1 aspect ratio), and the image size should not exceed 5 MB.
Open Graph URL
This tag defines the canonical URL of your page. The URL provided is not displayed in the Facebook newsfeed, only the domain is displayed.
You can find a full list of available og tags on the Open Graph website.
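To make the four tags concrete, here is a minimal sketch of a page head and a naive way to read the tags back. The sample values, the example.com URLs, and the readOpenGraphTags helper are all made up for illustration; the regular expression only handles this exact attribute order, so a real page should be read through the DOM instead.

```javascript
// Sample <head> markup; every value here is invented for illustration.
const sampleHead = `
<head>
  <title>Link previews without third-party APIs</title>
  <meta property="og:title" content="Link previews without third-party APIs" />
  <meta property="og:description" content="How to build link previews with open source libraries." />
  <meta property="og:image" content="https://example.com/cover.png" />
  <meta property="og:url" content="https://example.com/blog/link-previews" />
</head>`;

// Hypothetical, deliberately naive extractor: it expects `property` before
// `content` and double-quoted values. It is not a general HTML parser.
function readOpenGraphTags(html) {
  const pattern = /<meta\s+property="og:([^"]+)"\s+content="([^"]*)"\s*\/?>/g;
  const result = {};
  for (const [, name, content] of html.matchAll(pattern)) {
    result[name] = content;
  }
  return result;
}

const ogTags = readOpenGraphTags(sampleHead);
console.log(ogTags.title); // Link previews without third-party APIs
console.log(ogTags.url);   // https://example.com/blog/link-previews
```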
How to get data without metadata and og tags?
There are many websites without meta tags and basic og tags. What data should we preview in this case?
We can use data in the body of the document.
The title
If the website does not contain a meta title tag or og:title tag, we can consider a heading in the document body as the main title. The most important heading in the body of the document is <h1>. If the website does not contain an <h1> tag, we can search for <h2> tags.
The description
The strategy for getting the description of the website is similar to getting the title. If the document does not contain a meta description or og:description, we can consider the main text of the document as the description of the website.
The domain name
We will search for <link rel="canonical"> or og:url. If the document doesn’t contain either of these, we’ll use the url parameter.
The image
Of all the attributes mentioned, the image is the most complicated element.
Which image should represent the URL of the website if the HTML document doesn’t contain the og:image tag?
There is another way to specify the image of the website: a link tag with the rel="image_src" attribute, whose href attribute holds the URL of the image.
However, we can find many websites without the og:image or <link rel="image_src"> tag. In this case, we need to parse the images in the document body.
Raymond Camden described in his 2011 blog post how Facebook and Google+ used to determine which image to use for a link preview. Facebook used the og:image and <link rel="image_src"> tags, and Google+ used the first <img> tag in the HTML body. Neither of these strategies seems correct, because Facebook did not consider images in the document body and Google+ chose the first image, which could be an image used only for layout.
Slack published a blog post about how they create link previews, but they do not take images in the HTML body into account.
How does Facebook determine which images to display as thumbnails when posting a link?
Candidate images are filtered using javascript which removes all images less than 50 pixels tall or wide and all images with a ratio of longest dimension to shortest dimension greater than 3:1. Filtered images are sorted by area, and users are given a selection if multiple images exist.
quora.com
By removing the ability to customize link metadata (i.e. title, description, image) from all link sharing entry points on Facebook, we are removing a channel that has been abused for posting fake news.
developers.facebook.com
I think the described strategy works well. Images less than 50px tall or wide are probably icons, and images with an aspect ratio greater than 3:1 don’t fit well in previews. Images with a larger area are probably more important to the website content than smaller images.
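The filtering and sorting rules from the quote can be sketched as a pure function. This is an assumption-laden illustration rather than Facebook’s actual code: pickPreviewImage and the { src, width, height } candidate shape are hypothetical.

```javascript
// Hypothetical helper: each candidate is { src, width, height } in pixels.
function pickPreviewImage(candidates) {
  const MIN_SIDE = 50;  // anything smaller is probably an icon
  const MAX_RATIO = 3;  // longest side divided by shortest side

  const usable = candidates.filter(({ width, height }) => {
    if (width < MIN_SIDE || height < MIN_SIDE) {
      return false;
    }
    const ratio = Math.max(width, height) / Math.min(width, height);
    return ratio <= MAX_RATIO;
  });

  // Largest area first; filter() returned a copy, so the input is untouched.
  usable.sort((a, b) => b.width * b.height - a.width * a.height);
  return usable.length > 0 ? usable[0].src : null;
}

const sampleCandidates = [
  { src: "icon.png", width: 32, height: 32 },      // too small
  { src: "banner.png", width: 1200, height: 100 }, // ratio 12:1
  { src: "thumb.jpg", width: 300, height: 200 },
  { src: "photo.jpg", width: 800, height: 600 }
];
console.log(pickPreviewImage(sampleCandidates)); // photo.jpg
```

The sketch returns a single image; a user interface that lets people choose, as the quote describes, would return the whole sorted list instead.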
Implementation
You can find several attempts to create a library that implements link previews.
There is a node.js solution that runs on AWS Lambda. Unfortunately, the main library and its source code repository are no longer available.
Is there open source code for creating ‘link preview’ text and icons, such as on Facebook?
stackoverflow.com
I couldn’t find any open source implementations, so let’s build one.
Libraries used
If you want to implement the whole strategy for creating link previews, you should use a library that allows you to access the DOM structure of the html document. In the node.js environment, I found three libraries that allow you to access the DOM:
- JSDom simulates a web browser environment in node.js and allows you to access the DOM structure
- Puppeteer lets you control Chrome without a GUI from Node.js
- PhantomJS, a headless web browser scriptable with JavaScript
JSDom doesn’t work well for this purpose, because we need to know which elements are visible and JSDom doesn’t parse CSS styles well [1, 2].
If you need to choose between Puppeteer and PhantomJS, I would recommend using Puppeteer, because PhantomJS development has stopped and Puppeteer is faster and requires less memory.
Configuring Puppeteer for Web Scraping
Puppeteer has many options and allows you to configure Chrome with various settings. Therefore, using Puppeteer for the first time is not that simple. Before you can open websites in Puppeteer, you must configure it to extract data from websites.
Some websites do not want you to extract data. In this case, you can use puppeteer-extra-plugin-stealth, which uses various techniques to make it more difficult to detect a headless puppeteer.
If you want to interact with the website in Puppeteer, you must use the page.evaluate() function, where Puppeteer runs the script in the browser, not in node.js. If you have other modules or functions that you want to use inside the evaluate function, you should use page.exposeFunction(). Modules imported in node.js are not accessible in the Puppeteer browser, and page.exposeFunction() allows you to expose node.js functions to the browser.
When the browser makes a request to a website, it sends an HTTP header called “User Agent”. The user agent contains information about the web browser. Some websites do not provide meta tags for common user agents. In Puppeteer, you can configure the Facebook crawler user agent because, in most cases, websites want to provide metadata for Facebook.
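The configuration steps from this section can be sketched roughly as follows. The helper name openPageForScraping and its exposed parameter are hypothetical, and the sketch assumes that puppeteer, puppeteer-extra and puppeteer-extra-plugin-stealth are installed; the packages are loaded lazily so the file can be read and imported without them.

```javascript
// User agent string that Facebook documents for its crawler.
const FACEBOOK_CRAWLER_UA =
  "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)";

let stealthPuppeteer = null;

// Load puppeteer-extra once and register the stealth plugin a single time.
function getPuppeteer() {
  if (stealthPuppeteer === null) {
    const puppeteer = require("puppeteer-extra");
    const StealthPlugin = require("puppeteer-extra-plugin-stealth");
    puppeteer.use(StealthPlugin());
    stealthPuppeteer = puppeteer;
  }
  return stealthPuppeteer;
}

// Hypothetical helper: `exposed` maps names to node.js functions that should
// be callable from inside page.evaluate().
async function openPageForScraping(url, exposed = {}) {
  const browser = await getPuppeteer().launch({ headless: true });
  const page = await browser.newPage();

  // Pretend to be the Facebook crawler so the site serves its meta tags.
  await page.setUserAgent(FACEBOOK_CRAWLER_UA);

  for (const [name, fn] of Object.entries(exposed)) {
    await page.exposeFunction(name, fn);
  }

  await page.goto(url, { waitUntil: "domcontentloaded" });
  return { browser, page };
}
```

A caller would typically destructure the result, run its page.evaluate() calls, and then call browser.close() in a finally block so that Chrome does not keep running after an error.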
Strategy for getting individual elements for link preview
We are going to implement the following strategy in node.js, which should be applicable in all back-end languages.
The title
Find og:title in the document head. If og:title does not exist, look for the meta title tag in the document head. If the meta title doesn’t exist, look for the <h1> tag in the body of the document. If <h1> does not exist, look for the first occurrence of the <h2> tag in the document body.
    const getTitle = async page => {
      const title = await page.evaluate(() => {
        const ogTitle = document.querySelector('meta[property="og:title"]');
        if (ogTitle != null && ogTitle.content.length > 0) {
          return ogTitle.content;
        }
        const twitterTitle = document.querySelector('meta[name="twitter:title"]');
        if (twitterTitle != null && twitterTitle.content.length > 0) {
          return twitterTitle.content;
        }
        const docTitle = document.title;
        if (docTitle != null && docTitle.length > 0) {
          return docTitle;
        }
        // Fall back to the first <h1>, then the first <h2>. Either may be
        // missing, so check the element before reading its text.
        const h1 = document.querySelector("h1");
        if (h1 != null && h1.textContent.trim().length > 0) {
          return h1.textContent.trim();
        }
        const h2 = document.querySelector("h2");
        if (h2 != null && h2.textContent.trim().length > 0) {
          return h2.textContent.trim();
        }
        return null;
      });
      return title;
    };
Source: github.com
The description
Find og:description in the document head. If og:description doesn’t exist, look for the meta description tag in the document head. If the meta description tag doesn’t exist, parse the document body text and find the first visible paragraph, whose text then serves as the site description.
    const getDescription = async page => {
      const description = await page.evaluate(() => {
        const ogDescription = document.querySelector(
          'meta[property="og:description"]'
        );
        if (ogDescription != null && ogDescription.content.length > 0) {
          return ogDescription.content;
        }
        const twitterDescription = document.querySelector(
          'meta[name="twitter:description"]'
        );
        if (twitterDescription != null && twitterDescription.content.length > 0) {
          return twitterDescription.content;
        }
        const metaDescription = document.querySelector('meta[name="description"]');
        if (metaDescription != null && metaDescription.content.length > 0) {
          return metaDescription.content;
        }
        const paragraphs = document.querySelectorAll("p");
        let fstVisibleParagraph = null;
        for (let i = 0; i < paragraphs.length; i++) {
          if (
            // offsetParent is null when the element is not rendered
            paragraphs[i].offsetParent !== null &&
            // only accept paragraphs without nested elements
            paragraphs[i].childElementCount === 0
          ) {
            fstVisibleParagraph = paragraphs[i].textContent;
            break;
          }
        }
        return fstVisibleParagraph;
      });
      return description;
    };
Source: github.com
The domain name
Find <link rel="canonical"> or og:url. If the document does not contain either of these, use the url parameter.
    const getDomainName = async (page, uri) => {
      const domainName = await page.evaluate(() => {
        const canonicalLink = document.querySelector("link[rel=canonical]");
        if (canonicalLink != null && canonicalLink.href.length > 0) {
          return canonicalLink.href;
        }
        const ogUrlMeta = document.querySelector('meta[property="og:url"]');
        if (ogUrlMeta != null && ogUrlMeta.content.length > 0) {
          return ogUrlMeta.content;
        }
        return null;
      });
      // Use the URL found in the page, or the requested one as a fallback,
      // and strip a leading "www." from the hostname.
      return new URL(domainName != null ? domainName : uri).hostname.replace(
        /^www\./,
        ""
      );
    };
Source: github.com
The image
Find og:image in the document head. If og:image doesn’t exist, look for the <link rel="image_src"> tag in the head. If that tag does not exist either, search for all images in the body of the document. Discard all images that are less than 50 pixels in height or width, and all images with a longest dimension to shortest dimension ratio greater than 3:1. Return the image with the largest area.
    const util = require("util");
    const request = util.promisify(require("request"));
    const getUrls = require("get-urls");

    // Runs in node.js: true when the URL responds with an image content type.
    const urlImageIsAccessible = async url => {
      const correctedUrls = getUrls(url);
      if (correctedUrls.size === 0) {
        return false;
      }
      const urlResponse = await request(correctedUrls.values().next().value);
      const contentType = urlResponse.headers["content-type"] || "";
      return /^image\//.test(contentType);
    };

    // urlImageIsAccessible must be exposed to the browser beforehand with
    // page.exposeFunction("urlImageIsAccessible", urlImageIsAccessible).
    const getImg = async (page, uri) => {
      const img = await page.evaluate(async () => {
        const ogImg = document.querySelector('meta[property="og:image"]');
        if (
          ogImg != null &&
          ogImg.content.length > 0 &&
          (await urlImageIsAccessible(ogImg.content))
        ) {
          return ogImg.content;
        }
        const imgRelLink = document.querySelector('link[rel="image_src"]');
        if (
          imgRelLink != null &&
          imgRelLink.href.length > 0 &&
          (await urlImageIsAccessible(imgRelLink.href))
        ) {
          return imgRelLink.href;
        }
        const twitterImg = document.querySelector('meta[name="twitter:image"]');
        if (
          twitterImg != null &&
          twitterImg.content.length > 0 &&
          (await urlImageIsAccessible(twitterImg.content))
        ) {
          return twitterImg.content;
        }
        // Fall back to images in the body: drop small or elongated ones.
        const imgs = Array.from(document.getElementsByTagName("img")).filter(
          img => {
            const longest = Math.max(img.naturalWidth, img.naturalHeight);
            const shortest = Math.min(img.naturalWidth, img.naturalHeight);
            return shortest > 50 && longest / shortest <= 3;
          }
        );
        if (imgs.length === 0) {
          return null;
        }
        // Largest area first. img.src is already an absolute URL in the
        // browser, so uri is not needed to resolve relative paths.
        imgs.sort(
          (a, b) =>
            b.naturalWidth * b.naturalHeight - a.naturalWidth * a.naturalHeight
        );
        return imgs[0].src;
      });
      return img;
    };
Source: github.com
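Once the individual getters exist, they can be combined into a single preview object. The following is a minimal sketch under my own assumptions: collectPreview is a hypothetical helper, and treating a failed getter as null is a design choice of this sketch, not something the original library necessarily does.

```javascript
// Hypothetical glue code: `getters` maps a field name to a function that
// accepts (page, uri), for example { title: getTitle, img: getImg }.
async function collectPreview(page, uri, getters) {
  const names = Object.keys(getters);
  const values = await Promise.all(
    names.map(name =>
      // Wrap each call so that a throwing or rejecting getter yields null
      // instead of failing the whole preview.
      Promise.resolve()
        .then(() => getters[name](page, uri))
        .catch(() => null)
    )
  );

  const preview = {};
  names.forEach((name, index) => {
    preview[name] = values[index];
  });
  return preview;
}
```

With a Puppeteer page at hand, a call could look like collectPreview(page, uri, { title: getTitle, description: getDescription, domain: getDomainName, img: getImg }), which resolves to an object with those four keys.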
Testing
If you want to test your link preview implementation, you can use the Facebook sharing debugger. This is a free tool, which scrapes any web page hosted on a public server and shows how it would look when shared.
Additional Tips
Your link previews can still be richer and provide more information to users. For example, if the website contains the og:video tag, you can replace the image with video. There is other information that you can use in the previews. There are specific tags for articles, books, or profiles.
Consider setting up a proxy or using IP rotation for your server, as some websites try to detect web scraping and block it. Some websites block users from specific countries. If you need more tips to avoid web scraping detection, you can refer to this article.
Conclusion
In this article, we describe how social media and chat apps create previews of links. We then describe the implementation, which can be used in any back-end language. As an example, we implement the entire solution in node.js. The result is an open source node.js library and the demo is implemented on Heroku.
As you can see, creating a link preview function is easy if you use the right approach. You don’t need to depend on third-party APIs and pay for similar services.