Ruby web scraping

I want to share my recent experience with Ruby web scraping and RSS.

Often I found myself checking the comments for my game iBox3D on http://www.appcomments.com. The problem is that the site only displays comments per country, and so far my app only has comments in Germany, the USA and the UK. The site also does not show in which countries you have comments at all, so it was hard to tell whether there were new ones. So I decided to write a small Ruby script that compiles all comments and displays them. The script evolved as I kept optimizing it:

  • v1: I used WATIR to automate the internet explorer and to get the HTML from the website
  • v2: I used the built-in RSS lib to fetch the RSS
  • v3: I used simple-rss to fetch the RSS
  • v4: I used simple-rss to fetch the RSS and used threads to do that for all countries simultaneously

This post covers the following topics:

  • Ruby 1.9
  • Watir
  • RSS with the built-in RSS lib and simple-rss
  • UTF-8 vs. ASCII
  • Threading
  • JRuby and Ruby

Version 1: Watir

Watir is a library that automates the browser. To install Watir on Ruby 1.9 you need to install the Ruby development tools and then install the gem with the `--platform=ruby` option. Here is the script:

require 'watir'
 
### hide or show the internet explorer
$HIDE_IE = true
 
# declare some variables
source_url = "http://appcomments.com/app/iBox3D"
comments = Hash.new
country_links = Array.new
 
### open the internet explorer and navigate to the source url
ie = Watir::IE.new()
ie.goto(source_url)
 
### fetch the links to the different countries
ie.links.each do |link|
	if link.href =~ /iBox3D\?country=(\d*)$/
		country_links << link.href
	end
end
 
### loop over all country links
country_links.each do |link|
	# go to the country specific link
	ie.goto(link)
 
	# due to unknown reasons sometimes the DIV was not there => reload
	until  ie.div(:id => "review_dropdowns").exists? do
		sleep(1)
		ie.goto(link)
		sleep(5)
	end
 
	# get the DIV with id = review_dropdowns
	review = ie.div(:id => "review_dropdowns")
	# get the country text from the SPAN in the DIV
	country = review.divs.first.spans.first.text
	# loop over all DIVs
	ie.divs.each do |d|
		# check the DIV class, if it has "comment", then proceed
		if d.class_name.downcase.match(/\bcomment\b/i)
			# initialize a hash for the comment information
			inf = Hash.new
			# the first link in the div is the title/header
			inf[:header] = d.links[1].text
			# the second link in the div is the user
			inf[:user] = d.links[2].text				
			# loop over all DIVs in the DIV and find the comment_right class (rating)
			d.divs.each do |star_div|
				if star_div.class_name.downcase.match(/\bcomment_right\b/i)
					# count the star image tags
					inf[:stars] = star_div.images.length
				end
			end
			# get the description of the rating
			inf[:text] = d.ps.first.text
			# append the information to the list of comments for this country
			(comments[country] ||= []) << inf
		end
	end
end
 
### show the result
puts comments.inspect

To understand the code I suggest you use Firebug or the Internet Explorer Developer Tools to look at the HTML of http://appcomments.com/app/iBox3D. The comments should be enough to understand the code – at least I hope so. Nevertheless, I want to explain why I did not stop here and kept looking for alternatives:

  • if the HTML structure changes, the script stops working (no interface contract)
  • it takes very long to start Internet Explorer and to visit each page
  • the script is not platform independent; on other platforms you would have to use e.g. FireWatir
  • parallelising it with threads would be very memory-intensive, because many IE instances consume a lot of memory

Then I noticed the RSS feed on appcomments.com.

Version 2: Ruby and RSS

RSS in Ruby should be simple:

require 'rss/1.0'
require 'rss/2.0'
require 'open-uri'
 
source = 'http://appcomments.com/rss/376860218?country=143443'
 
content = "" # raw content of rss feed will be loaded here
open(source, :proxy => "http://proxy:8080") do |s| content = s.read end
rss = RSS::Parser.parse(content, false)
[...]

But try to run it! It will fail with the error:


C:/dev/runtime/Ruby191/lib/ruby/1.9.1/rss/rexmlparser.rb:24:in `rescue in _parse': This is not well formed XML (RSS::NotWellFormedError)
#encoding ::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string)
[...]
Exception parsing
Line: 12
Position: 462
Last 80 unconsumed characters:
<![CDATA[Brilliantes Spiel fürs iPhone]]>
[...]

German umlauts! (ä, ö, ü) Somehow Ruby 1.9, which claims to have UTF-8 support, introduces a lot of problems here. The German character ü cannot be parsed by the built-in REXML, because the bytes read via open-uri arrive tagged as ASCII-8BIT. I found the simple solution of forcing Ruby to treat the content string as UTF-8, using the force_encoding method. Then parsing works.
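To see the problem in isolation, here is a minimal sketch (the sample string just stands in for the raw feed content): matching a UTF-8 regexp against a binary-tagged string raises the very Encoding::CompatibilityError from the stack trace above, and relabeling the string with force_encoding fixes it without converting any bytes:

```ruby
# "\xC3\xBC" is the UTF-8 byte sequence for ü; the string stands in for
# the raw feed content read via open-uri, which arrives as ASCII-8BIT
content = "Brilliantes Spiel f\xC3\xBCrs iPhone".force_encoding('ASCII-8BIT')

begin
  content =~ /ü/                 # UTF-8 regexp vs. binary string
rescue Encoding::CompatibilityError => e
  puts e.class                   # the same error REXML runs into internally
end

content.force_encoding('utf-8')  # relabel only, no bytes are transcoded
puts content =~ /ü/ ? "match" : "no match"
```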

require 'rss/1.0'
require 'rss/2.0'
require 'open-uri'
 
source = 'http://appcomments.com/rss/376860218?country=143443'
 
content = "" # raw content of rss feed will be loaded here
open(source, :proxy => "http://proxy:8080") do |s| content = s.read end
content.force_encoding('utf-8')
rss = RSS::Parser.parse(content, false)

Version 3: Simple-RSS

Before I found that out, I tried the Ruby lib simple-rss, which could parse the ü (German umlaut). Nevertheless, I had to apply the same trick as above when accessing the parsed content. At this point I want to introduce the next evolution step of my script:

require 'rubygems'
require 'simple-rss'
require 'open-uri'
 
$base_html = <<EOF
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>@@title</title>
</head>
<body>
EOF
 
$base_html_2 = <<EOF
</body>
</html>
EOF
 
def get_content(url)
	content = String.new
	open(url, :proxy => "http://proxy:8080") do |s|
		content = s.read
	end
	return content
end
 
def utf8(string)
	return string.force_encoding('utf-8')
end
 
source = "http://appcomments.com/app/iBox3D"
countries = get_content(source).scan(/a href='\?country=(\d*)'.+?>(.+?)<\/a>/)

html = $base_html
countries.each do |country|
	# construct the RSS URL with the number of the country
	url = "http://appcomments.com/rss/376860218?country=" + country[0]
	rss = SimpleRSS.parse open(url, :proxy => "http://proxy:8080")

	# go to the next country if there are no reviews
	next if rss.items.size == 0
	html << "<hr/><h1>#{country[1]}</h1>"
	rss.items.each do |i|
		html << "<div><h3>#{utf8 i.title}</h3><p>#{utf8 i.description}</p></div>"
	end
end
 
### write everything to a file
html << $base_html_2
local_filename = "appcomments.html"
File.open(local_filename, 'w:utf-8') do |f|
	f.write(html)
end

Besides the advantages listed below I also added output to an HTML file. This is because the RSS description tag contains HTML, which can easily be written to an HTML file as-is.
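To illustrate this with the built-in parser (the feed snippet below is made up, not a real appcomments feed): the description element typically wraps its HTML in a CDATA section, and the parser hands that markup back verbatim, ready to be dropped into a page:

```ruby
require 'rss'

# a minimal RSS 2.0 feed with HTML inside the item description
feed = <<EOF
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"><channel>
<title>demo</title>
<link>http://example.com</link>
<description>demo feed</description>
<item>
  <title>Brilliantes Spiel</title>
  <description><![CDATA[<p>Macht <b>sehr</b> viel Spass!</p>]]></description>
</item>
</channel></rss>
EOF

rss = RSS::Parser.parse(feed, false)
item = rss.items.first
puts item.title        # plain text
puts item.description  # the HTML markup survives parsing verbatim
```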

Advantages:

  • RSS is a standardized protocol, so the structure won’t change in the future (interface contract)
  • instead of opening a browser to perform the scraping, open-uri is used, which is faster and consumes less memory
  • the script should run on many platforms, including Linux, Mac OS X and Windows
  • making the web request is much easier

Version 4: Threading

Still, my script took quite long. No wonder: it had to make one GET request per country, plus one for the overview page to get all the country codes. But with Ruby's built-in threads it is easy to run all requests in parallel, which speeds up the whole script:

require 'rubygems'
require 'simple-rss'
require 'open-uri'
 
### define the HTML basis
$base_html = <<EOF
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>@@title</title>
</head>
<body>
EOF
 
$base_html_2 = <<EOF
</body>
</html>
EOF
 
### get request to a URL
def get_content(url)
	content = String.new
	open(url, :proxy => "http://proxy:8080") do |s|
		content = s.read
	end
	return content
end
 
### forces the encoding in UTF-8
def utf8(string)
	RUBY_PLATFORM == 'java' ? string : string.force_encoding('utf-8')
end
 
### define the basic URLs and the filename
source_url = "http://appcomments.com/app/iBox3D"
rss_url = "http://appcomments.com/rss/376860218?country="
local_filename = "appcomments.html"
 
### get the main page to get all countries
countries = get_content(source_url).scan(/a href='\?country=(\d*)'.+?>(.+?)<\/a>/)

### start one thread per country
html = $base_html
threads = []
countries.each do |country|	
	threads << Thread.new do
		# construct the URL with the number of the country
		url = rss_url + country[0]
		# get the RSS feed from the URL
		rss = SimpleRSS.parse open(url, :proxy => "http://proxy:8080")
		# construct the HTML with the country information and the review
		country_html = String.new		
		# go to the next country if there are no reviews
		next if rss.items.size == 0
		# construct the country header
		country_html << "<hr/><h1>#{country[1]}</h1>"
		# construct a div for each review
		rss.items.each do |i|
			country_html << "<div><h3>#{utf8 i.title}</h3><p>#{utf8 i.description}</p></div>"
		end
		# set the thread variable for later access
		Thread.current["html"] = country_html
	end
end
 
### join the threads and construct the HTML
threads.each do |t|
	# join the threads
	t.join
	# construct one big HTML chunk out of the small HTML junks
	html << t["html"] unless t["html"].nil?
end
 
### write everything to a file
html << $base_html_2
File.open(local_filename, 'w:utf-8') do |f|
	f.write(html)
end

By the way, this code also works with JRuby; the only line I had to adjust was the following:

RUBY_PLATFORM == 'java' ? string : string.force_encoding('utf-8')

JRuby handles UTF-8 much better.

