Read Information From a Website That Uses JavaScript

Client-side web scraping with JavaScript using jQuery and Regex

by Codemzy



When I was building my first open-source project, codeBadges, I thought it would be easy to get user profile data from all the main code learning websites.

I was familiar with API calls and get requests. I thought I could just use jQuery to fetch the data from the various APIs and use it.

    var name = 'codemzy';
    $.get('https://api.github.com/users/' + name, function(response) {
        var followers = response.followers;
    });

Well, that was easy. But it turns out that not every website has a public API that you can just grab the data you want from.

404: API not found

But just because there is no public API doesn't mean you need to give up! You can use web scraping to grab the data, with just a little extra work.

Let's see how we can use client-side web scraping with JavaScript.

As an example, I will grab my user information from my public freeCodeCamp profile. But you can use these steps on any public HTML page.

The first step in scraping the data is to grab the full page HTML using a jQuery .get request.

    var name = "codemzy";
    $.get('https://www.freecodecamp.com/' + name, function(response) {
        console.log(response);
    });

Awesome, the whole page source code just logged to the console.

Note: If you get an error at this stage along the lines of No 'Access-Control-Allow-Origin' header is present on the requested resource, don't fret. Scroll down to the Don't Let CORS Stop You section of this post.

That was easy. Using JavaScript and jQuery, the above code requests a page from www.freecodecamp.org, like a browser would. And freeCodeCamp responds with the page. Instead of a browser running the code to display the page, we get the HTML code.

And that's what web scraping is, extracting data from websites.

Ok, the response is not exactly as neat as the data we get back from an API.


But… we have the data, in there somewhere.

Once we have the source code, the information we need is in there. We just have to grab the data we need!

We can search through the response to find the elements we need.

Let's say we want to know how many challenges the user has completed, from the user profile response we got back.

At the time of writing, a camper's completed challenges are organized in tables on the user profile. So to get the total number of challenges completed, we can count the number of rows.

One way is to wrap the whole response in a jQuery object, so that we can use jQuery methods like .find() to get the data.

    // number of challenges completed
    var challenges = $(response).find('tbody tr').length;

This works fine and we get the right result. But it is not a good way to get the result we are after. Turning the response into a jQuery object actually loads the whole page, including all the external scripts, fonts and stylesheets from that page… Uh oh!

We only need a few bits of information. We really don't need the page to load, and certainly not all the external resources that come with it.

We could strip out the script tags and then run the rest of the response through jQuery. To do this, we could use Regex to look for script patterns in the text and remove them.
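For instance, a pattern along these lines could strip out the script blocks before any jQuery parsing. This is just a sketch against a made-up snippet of HTML, not the real freeCodeCamp page:

```javascript
// A made-up stand-in for the page source we got back.
var response = '<html><head><script src="app.js"></script></head>' +
  '<body><script>alert("hi");</script><p>Profile</p></body></html>';

// [\s\S]*? matches anything (including newlines) non-greedily, so each
// script block is removed only up to its own closing tag.
var stripped = response.replace(/<script[\s\S]*?<\/script>/gi, '');

console.log(stripped); // '<html><head></head><body><p>Profile</p></body></html>'
```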

Or better still, why not use Regex to find what we are looking for in the first place?

    // number of challenges completed
    var challenges = response.replace(/<thead>[\s|\S]*?<\/thead>/g, '').match(/<tr>/g).length;

And it works! By using the Regex code above, we strip out the table head rows (that did not contain any challenges), and then match all table rows to count the number of challenges completed.

It's even easier if the data you want is just there in the response in plain text. At the time of writing the user points were in the HTML like <h1 class="flat-top text-primary">[ 1498 ]</h1> just waiting to be scraped.

    var points = response.match(/<h1 class="flat-top text-primary">\[ ([\d]*?) \]<\/h1>/)[1];

In the above Regex pattern we match the h1 element we are looking for, including the [ ] that surrounds the points, and group any number inside with ([\d]*?). We get an array back; the first [0] element is the entire match and the second [1] is our group match (our points).
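As a standalone illustration of that array, here is the same pattern run against a made-up one-line snippet (the 1498 value is just an example):

```javascript
// A made-up snippet standing in for the full page response.
var response = '<h1 class="flat-top text-primary">[ 1498 ]</h1>';

// match() returns [entire match, first group] when the pattern matches.
var result = response.match(/<h1 class="flat-top text-primary">\[ ([\d]*?) \]<\/h1>/);

console.log(result[0]); // the whole matched <h1> element
console.log(result[1]); // '1498', just the grouped digits
```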

Regex is useful for matching all sorts of patterns in strings, and it is great for searching through our response to get the data we need.

You can use the same 3-step process to scrape profile data from a variety of websites:

  1. Use client-side JavaScript
  2. Use jQuery to scrape the data
  3. Use Regex to filter the data for the relevant information
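The three steps above can be sketched together, with the Regex filtering kept in a pure function. This assumes jQuery is loaded and that the markup still matches the tables described earlier:

```javascript
// Step 3 as a pure function: strip the table head, then count the rows.
function countChallenges(html) {
  var rows = html.replace(/<thead>[\s|\S]*?<\/thead>/g, '').match(/<tr>/g);
  return rows ? rows.length : 0;
}

// Steps 1 and 2: client-side JavaScript plus a jQuery .get for the raw page
// (guarded so the snippet also runs where jQuery isn't loaded).
if (typeof $ !== 'undefined') {
  $.get('https://www.freecodecamp.com/codemzy', function(response) {
    console.log(countChallenges(response));
  });
}
```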

Until I hit a problem, CORS.

CORS: Access Denied

Don't Let CORS Stop You!

CORS, or Cross-Origin Resource Sharing, can be a real problem with client-side web scraping.

For security reasons, browsers restrict cross-origin HTTP requests initiated from within scripts. And because we are using client-side JavaScript on the front end for web scraping, CORS errors can occur.

Hither'south an example trying to scrape profile data from CodeWars…

    var name = "codemzy";
    $.get('https://www.codewars.com/users/' + name, function(response) {
        console.log(response);
    });

At the time of writing, running the above code gives you a CORS related error.

If there is no Access-Control-Allow-Origin header from the place you're scraping, you can run into problems.

The bad news is, you need to run these sorts of requests server-side to get around this issue.

Whaaaaaaaat, this is supposed to be client-side web scraping?!

The good news is, thanks to lots of other wonderful developers that have run into the same issues, you don't have to touch the back end yourself.

Staying firmly within our front end script, we can use cross-domain tools such as Any Origin, Whatever Origin, All Origins, crossorigin and probably a lot more. I have found that you often need to test a few of these to find the one that will work on the site you are trying to scrape.

Back to our CodeWars example, we can send our request via a cross-domain tool to bypass the CORS issue.

    var name = "codemzy";
    var url = "http://anyorigin.com/go?url=" + encodeURIComponent("https://www.codewars.com/users/") + name + "&callback=?";
    $.get(url, function(response) {
        console.log(response);
    });

And just like magic, we have our response.



Source: https://www.freecodecamp.org/news/client-side-web-scraping-with-javascript-using-jquery-and-regex-5b57a271cb86/
