Load a SPA webpage via AJAX

I'm trying to fetch an entire webpage using JavaScript by plugging in the URL. However, the website is built as a Single Page Application (SPA) that uses JavaScript / backbone.js to dynamically load most of it's contents after rendering the initial response.

So for example, when I route to the following address:

https://connect.garmin.com/modern/activity/1915361012

And then enter this into the console (after the page has loaded):

var $page = $("html")
console.log("%c✔: ", "color:green;", $page.find(".inline-edit-target.page-title-overflow").text().trim());
console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());

Then I'll get the dynamically loaded activity title as well as the statically loaded page footer:

Working Screenshot

However, when I try to load the webpage via an AJAX call with either $.get() or .load(), I only get delivered the initial response (the same as the content when over view-source):

view-source:https://connect.garmin.com/modern/activity/1915361012

So if I use either of the the following AJAX calls:

// jQuery.get()
var url = "https://connect.garmin.com/modern/activity/1915361012";
jQuery.get(url,function(data) {
    var $page = $("<div>").html(data)
    console.log("%c✖: ", "color:red;",   $page.find(".page-title").text().trim());
    console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});

// jQuery.load()
var url = "https://connect.garmin.com/modern/activity/1915361012";
var $page = $("<div>")
$page.load(url, function(data) {
    console.log("%c✖: ", "color:red;",   $page.find(".page-title").text().trim()    );
    console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});

I'll still get the initial footer, but won't get any of the other page contents:

Broken - Screenshot

I've tried the solution here to eval() the contents of every script tag, but that doesn't appear robust enough to actually load the page:

jQuery.get(url,function(data) {
    var $page = $("<div>").html(data)
    $page.find("script").each(function() {
        var scriptContent = $(this).html(); //Grab the content of this tag
        eval(scriptContent); //Execute the content
    });
    console.log("%c✖: ", "color:red;",   $page.find(".page-title").text().trim());
    console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});

Q: Any options to fully load a webpage that will scrapable over JavaScript?

Answer

August 30, 2017

Hugues M.

You will never be able to fully replicate by yourself what an arbitrary (SPA) page does.

The only way I see is using a headless browser such as PhantomJS or Headless Chrome, or Headless Firefox.

I wanted to try Headless Chrome so let's see what it can do with your page:

Quick check using internal REPL

Load that page with Chrome Headless (you'll need Chrome 59 on Mac/Linux, Chrome 60 on Windows), and find page title with JavaScript from the REPL:

% chrome --headless --disable-gpu --repl https://connect.garmin.com/modern/activity/1915361012
[0830/171405.025582:INFO:headless_shell.cc(303)] Type a Javascript expression to evaluate or "quit" to exit.
>>> $('body').find('.page-title').text().trim() 
{"result":{"type":"string","value":"Daily Mile - Round 2 - Day 27"}}

NB: to get chrome command line working on a Mac I did this beforehand:

alias chrome="'/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'"

Using programmatically with Node & Puppeteer

Puppeteer is a Node library (by Google Chrome developers) which provides a high-level API to control headless Chrome over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome.

(Step 0 : Install Node & Yarn if you don't have them)

In a new directory:

yarn init
yarn add puppeteer

Create index.js with this:

const puppeteer = require('puppeteer');
(async() => {
    const url = 'https://connect.garmin.com/modern/activity/1915361012';
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // Go to URL and wait for page to load
    await page.goto(url, {waitUntil: 'networkidle'});
    // Wait for the results to show up
    await page.waitForSelector('.page-title');
    // Extract the results from the page
    const text = await page.evaluate(() => {
        const title = document.querySelector('.page-title');
        return title.innerText.trim();
    });
    console.log(`Found: ${text}`);
    browser.close();
})();

Result:

$ node index.js 
Found: Daily Mile - Round 2 - Day 27