July 27, 2016 @ 17:18

Scraping pages that use __doPostBack()

Someone, somewhere, once had this great idea to handle page links with JavaScript. This is how many pages on the web that use asp.net generate their html.

Inside a button or pagination link, there will be a snippet of JavaScript calling __doPostBack(), which sends a request back to the server. The content you get back can still be static HTML, but your browser will need JS enabled for this second request to work.

This method of handling arguments from the client to the server is harder than it needs to be, but that's a whole other rant I won't have here.

The reason this is a problem

If your browser doesn't handle JS (I'll admit there aren't many of those), this is an issue.

If you want to scrape such a website, this use of JavaScript is a pain, but it can be done.

Let's say we're trying to get all the content from the pages behind those paginated links.

The trouble with most scraping libraries (like requests in Python and Mechanize in Ruby) is that they don't execute JavaScript: they just fetch and parse the HTML. If you really do need the JavaScript to run, you can drive a browser engine with Selenium WebDriver and PhantomJS.
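If you did want to go down that route, a rough sketch looks something like this (assuming the selenium-webdriver gem and a phantomjs binary are installed; the URL is a made-up placeholder):

require 'selenium-webdriver'

# Drive PhantomJS through Selenium so the page's JavaScript actually runs
driver = Selenium::WebDriver.for :phantomjs
driver.navigate.to "https://example.com/SomePageThatNeedsJS.aspx" # hypothetical URL
puts driver.page_source # the rendered DOM, after the scripts have executed
driver.quit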

Luckily for us, we don't have to do that since the only thing this __doPostBack() function does is send a request with arguments back to the server. It's truly something you could do with a query string. Indeed, why not just use a query string... no, I said no rants.
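To make that point concrete, here's a rough sketch of what that one request looks like with nothing but plain HTTP. The URL and control name here are made-up placeholders, and a real site may also want its session cookie sent back, which is part of what makes Mechanize (used below) the more convenient option:

require 'net/http'
require 'nokogiri'
require 'uri'

uri = URI("https://example.com/SomeAspNetPage.aspx") # hypothetical URL
doc = Nokogiri::HTML(Net::HTTP.get(uri))

# Copy every hidden field (__VIEWSTATE and friends), then set the two
# arguments that __doPostBack() would have filled in
params = {}
doc.css("form input[type=hidden]").each do |input|
  next unless input["name"]
  params[input["name"]] = input["value"].to_s
end
params["__EVENTTARGET"]   = "ctl00$SomeControl$grid" # hypothetical control name
params["__EVENTARGUMENT"] = "Page$2"

# Post back to the same URL (strictly, to the form's action attribute)
response = Net::HTTP.post_form(uri, params)
puts response.body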

An example of a website that uses asp.net and __doPostBack()

As an example, let's consider this planning applications website.

If you're looking at it in your browser and want to see what I'm talking about, try turning off JavaScript and then clicking one of the pagination links at the bottom. You won't get back anything that looks like useful content.

If you inspect one of those paginated links, however, you'll see this:

<a href="javascript:__doPostBack('ctl00$Content$cusResultsGrid$repWebGrid$ctl00$grdWebGridTabularView','Page$1')">1</a>

This is the function that passes arguments back to asp.net to retrieve the table. All it really does is copy its two arguments into the hidden __EVENTTARGET and __EVENTARGUMENT fields of the page's form and submit it.

Using Ruby and Mechanize to emulate __doPostBack()

I needed to write a scraper for this website.

We can solve this using Ruby and Mechanize, without worrying about JavaScript at all.

Mechanize lets us fill those arguments into the aspnetForm on the page and submit it ourselves.

require 'mechanize'

url = "https://eproperty.wyndham.vic.gov.au/ePropertyPROD/P1/eTrack/eTrackApplicationSearchResults.aspx?Field=S&Period=L28&r=P1.WEBGUEST&f=%24P1.ETR.SEARCH.SL28"
agent = Mechanize.new
#Get the page content
page = agent.get(url)

def get_page_content(page)
  table_rows = page.search("table#ctl00_Content_cusResultsGrid_repWebGrid_ctl00_grdWebGridTabularView tr")
  puts table_rows
end

#get first page content
get_page_content(page)

# Fill in the same two arguments that __doPostBack() would have set, then
# submit the form to ask for page 2 of the results
form = page.form("aspnetForm")
form.add_field!('__EVENTARGUMENT', 'Page$2')
form.add_field!('__EVENTTARGET', 'ctl00$Content$cusResultsGrid$repWebGrid$ctl00$grdWebGridTabularView')
page = agent.submit(form)

#get second page content, now we've submitted the form
get_page_content(page)

This retrieves the page for us, fills in the asp.net form (exactly what the JS function does) and sends the request back to the server.
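If you want every page of results rather than just the first two, the same trick works in a loop. Here's a rough sketch; the cap of ten pages is a made-up bound, and a real scraper would work out when to stop from the results themselves:

target = 'ctl00$Content$cusResultsGrid$repWebGrid$ctl00$grdWebGridTabularView'

(3..10).each do |page_number| # hypothetical upper bound on pages
  form = page.form("aspnetForm")
  break unless form

  # Set the same two arguments __doPostBack() would set; after a postback the
  # server may already render them as hidden inputs, so reuse them if they exist
  { '__EVENTTARGET' => target, '__EVENTARGUMENT' => "Page$#{page_number}" }.each do |name, value|
    if form.field_with(name: name)
      form[name] = value
    else
      form.add_field!(name, value)
    end
  end

  page = agent.submit(form)
  get_page_content(page)
end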

That wasn't so hard, was it? But honestly, I can't understand why people can't just use a query string with arguments. I said I wouldn't get into a rant, so I'll leave you with this question:

Why do people make things more complicated than they need to be?