Ruby web scraping
I want to share my recent experience with web scraping and RSS in Ruby.
I often found myself checking the comments for my game iBox3D on http://www.appcomments.com. The problem is that the site only displays comments one country at a time, and so far my app only has comments in Germany, the USA and the UK. The website does not show which countries have comments, so it was hard to tell whether there were any new ones. So I decided to write a small Ruby script that collects all comments and displays them. The script evolved as I continued to optimize it:
- v1: I used Watir to automate Internet Explorer and get the HTML from the website
- v2: I used the built-in RSS lib to fetch the RSS
- v3: I used simple-rss to fetch the RSS
- v4: I used simple-rss to fetch the RSS and used threads to do that for all countries simultaneously
This post covers the following topics:
- Ruby 1.9
- Watir
- RSS with the built-in RSS lib and simple-rss
- UTF-8 vs. ASCII
- Threading
- JRuby and Ruby
Version 1: Watir
Watir is a library that helps to automate the browser. If you want to install Watir on Ruby 1.9, you need to install the Ruby devtools first and then install the gem with the --platform=ruby option. Here is the script:
    require 'watir'

    ### hide or show the internet explorer
    $HIDE_IE = true

    # declare some variables
    source_url = "http://appcomments.com/app/iBox3D"
    comments = Hash.new
    country_links = Array.new

    ### open the internet explorer and navigate to the source url
    ie = Watir::IE.new()
    ie.goto(source_url)

    ### fetch the links to the different countries
    ie.links.each do |link|
      if link.href =~ /iBox3D\?country=(\d*)$/
        country_links << link.href
      end
    end

    ### loop over all country links
    country_links.each do |link|
      # go to the country specific link
      ie.goto(link)
      # due to unknown reasons sometimes the DIV was not there => reload
      until ie.div(:id => "review_dropdowns").exists? do
        sleep(1)
        ie.goto(link)
        sleep(5)
      end
      # get the DIV with id = review_dropdowns
      review = ie.div(:id => "review_dropdowns")
      # get the country text from the SPAN in the DIV
      country = review.divs.first.spans.first.text
      # loop over all DIVs
      ie.divs.each do |d|
        # check the DIV class, if it has "comment", then proceed
        if d.class_name.downcase.match(/\bcomment\b/i)
          # initialize a hash for the comment information
          inf = Hash.new
          # the first link in the div is the title/header
          inf[:header] = d.links[1].text
          # the second link in the div is the user
          inf[:user] = d.links[2].text
          # loop over all DIVs in the DIV and find the comment_right class (rating)
          d.divs.each do |star_div|
            if star_div.class_name.downcase.match(/\bcomment_right\b/i)
              # count the star image tags
              counter = 0
              star_div.images.each do |img|
                counter += 1
              end
              inf[:stars] = counter
            end
          end
          # get the description of the rating
          inf[:text] = d.ps.first.text
          # store the information in the comments hash
          # (append, so that earlier comments of the same country are kept)
          (comments[country] ||= []) << inf
        end
      end
    end

    ### show the result
    puts comments.inspect
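The link filter at the top of the script can be tried without a browser. The following is only a minimal sketch: the `hrefs` array is made-up sample data standing in for what `ie.links` would return, and the second country number is an assumption for illustration.

```ruby
# Made-up sample hrefs, standing in for the href values of ie.links.
hrefs = [
  "http://appcomments.com/app/iBox3D?country=143443",
  "http://appcomments.com/app/iBox3D?country=143441",
  "http://appcomments.com/app/iBox3D",
  "http://appcomments.com/about"
]

# Same pattern as in the script: the link must end in ?country=<digits>.
country_pattern = /iBox3D\?country=(\d*)$/

# Keep only the country links ...
country_links = hrefs.select { |href| href =~ country_pattern }
# ... and pull the captured number out of each of them.
country_codes = country_links.map { |href| href[country_pattern, 1] }

puts country_links
puts country_codes.inspect
```

Because the pattern is anchored with `$`, the overview page itself (without a `country` parameter) is skipped.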
To understand the code, I suggest using Firebug or the Internet Explorer Developer Tools to look at the HTML of http://appcomments.com/app/iBox3D. The comments in the script should be enough to follow it, at least I hope so. Nevertheless, I want to explain why I did not stop here and kept looking for alternatives:
- if the HTML structure changes, the script will not work any more (no interface contract)
- it takes very long to start Internet Explorer and to visit each page
- the script is not platform independent; on other platforms you would have to use e.g. FireWatir
- if I used threads for parallelisation, it would be very memory intensive, because many IE instances take a lot of memory
Then I noticed the RSS feed on appcomments.com.
Version 2: Ruby and RSS
RSS in Ruby should be simple:
    require 'rss/1.0'
    require 'rss/2.0'
    require 'open-uri'
    source = 'http://appcomments.com/rss/376860218?country=143443'
    content = "" # raw content of rss feed will be loaded here
    open(source, :proxy => "http://proxy:8080") do |s|
      content = s.read
    end
    rss = RSS::Parser.parse(content, false)
    [...]
But try to run it! It fails with this error:
C:/dev/runtime/Ruby191/lib/ruby/1.9.1/rss/rexmlparser.rb:24:in `rescue in _parse': This is not well formed XML (RSS::NotWellFormedError)
#<Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string)>
[...]
Exception parsing
Line: 12
Position: 462
Last 80 unconsumed characters:
<![CDATA[Brilliantes Spiel fürs iPhone]]>
[...]
German umlauts (ä, ö, ü)! Ruby 1.9, which claims to have UTF-8 support, causes a lot of problems here. The German character ü cannot be parsed by the bundled REXML parser: as the error message shows, the downloaded string is tagged as ASCII-8BIT while the parser matches it against a UTF-8 regexp. The simple solution I found is to tell Ruby that the content string is UTF-8 by calling the force_encoding method. Then parsing works.
    require 'rss/1.0'
    require 'rss/2.0'
    require 'open-uri'
    source = 'http://appcomments.com/rss/376860218?country=143443'
    content = "" # raw content of rss feed will be loaded here
    open(source, :proxy => "http://proxy:8080") do |s|
      content = s.read
    end
    content.force_encoding('utf-8')
    rss = RSS::Parser.parse(content, false)
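The encoding problem can be reproduced without any network access. This is a minimal sketch, assuming a Ruby version that has `String#b` (2.0 or later); the sample text is taken from the error output above, and the variable names are only for illustration.

```ruby
# Bytes as they might arrive from the network: valid UTF-8 content,
# but tagged as binary (ASCII-8BIT).
raw = "Brilliantes Spiel f\xC3\xBCrs iPhone".b
puts raw.encoding

# A regexp literal with a non-ASCII character is a UTF-8 regexp.
pattern = /fürs/

# Matching the UTF-8 regexp against the binary string raises.
error_message = nil
begin
  raw =~ pattern
rescue Encoding::CompatibilityError => e
  error_message = e.message
  puts "match failed: #{error_message}"
end

# force_encoding only changes the tag, the bytes stay the same.
text = raw.dup.force_encoding('utf-8')
puts text.encoding
puts text.valid_encoding?
puts((text =~ pattern).inspect)
```

Note that `force_encoding` does not convert anything. It is only correct when the bytes really are UTF-8, which `valid_encoding?` can confirm.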
Version 3: Simple-RSS
Before I found that out, I tried the Ruby library simple-rss, which could parse the ü (German umlaut). Nevertheless, I had to apply the same trick as above when I wanted to access the parsed content. At this point I want to introduce the next evolution step of my script:
    require 'rubygems'
    require 'simple-rss'
    require 'open-uri'

    $base_html = <<EOF
    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <title>@@title</title>
    </head>
    <body>
    EOF

    $base_html_2 = <<EOF
    </body>
    </html>
    EOF

    def get_content(url)
      content = String.new
      open(url, :proxy => "http://proxy:8080") do |s|
        content = s.read
      end
      return content
    end

    def utf8(string)
      return string.force_encoding('utf-8')
    end

    source = "http://appcomments.com/app/iBox3D"
    countries = get_content(source).scan(/a href='\?country=(\d*)'.+?>(.+?) [...]
    [...] "http://proxy:8080")
      next if rss.items.size == 0
      html << "<hr/><h1>#{country[1]}</h1>"
      rss.items.each do |i|
        html << "<div><h3>#{utf8 i.title}</h3><p>#{utf8 i.description}</p></div>"
      end
    end

    ### write everything to a file
    html << $base_html_2
    local_filename = "appcomments.html"
    File.open(local_filename, 'w:utf-8') do |f|
      f.write(html)
    end
Besides the advantages listed below, I also added output to an HTML file. This works well because the RSS description tag contains HTML, which can simply be dropped into an HTML file.
Advantages:
- RSS is a standardized format, so the structure should not change in the future (interface contract)
- instead of opening a browser to do the scraping, open-uri is used, which is faster and uses less memory
- this script should run on many platforms, including Linux, Mac OS X and Windows
- making the web request is much easier
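The script relies on `String#scan` returning one pair per country, which is why it reads `country[0]` and `country[1]`. Here is a small sketch of that behaviour. The markup in `page` and the closing `<\/a>` part of the pattern are assumptions for illustration, not the real page or the exact expression used in the script.

```ruby
# Hypothetical markup; the real overview page may look different.
page = <<~HTML
  <a href='?country=143443' class='flag'>Germany</a>
  <a href='?country=143441' class='flag'>USA</a>
  <a href='/about'>About</a>
HTML

# With two capture groups, scan returns one [number, name] pair per match.
countries = page.scan(/a href='\?country=(\d*)'.+?>(.+?)<\/a>/)

countries.each do |country|
  # country[0] is the number for the RSS URL, country[1] the display name.
  puts "#{country[1]}: #{country[0]}"
end
```

The link without a `country` parameter does not match and therefore never reaches the loop.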
Version 4: Threading
My script was still very slow. No wonder: it had to make one GET request per country, plus one more for the overview page to get all the country codes. But with Ruby's built-in threads it is easy to make all requests in parallel, which speeds up the whole script:
    require 'rubygems'
    require 'simple-rss'
    require 'open-uri'

    ### define the HTML basis
    $base_html = <<EOF
    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <title>@@title</title>
    </head>
    <body>
    EOF

    $base_html_2 = <<EOF
    </body>
    </html>
    EOF

    ### get request to a URL
    def get_content(url)
      content = String.new
      open(url, :proxy => "http://proxy:8080") do |s|
        content = s.read
      end
      return content
    end

    ### forces the encoding in UTF-8
    def utf8(string)
      RUBY_PLATFORM == 'java' ? string : string.force_encoding('utf-8')
    end

    ### define the basic URLs and the filename
    source_url = "http://appcomments.com/app/iBox3D"
    rss_url = "http://appcomments.com/rss/376860218?country="
    local_filename = "appcomments.html"

    ### get the main page to get all countries
    countries = get_content(source_url).scan(/a href='\?country=(\d*)'.+?>(.+?) [...]
    [...]
    threads = []
    countries.each do |country|
      threads << Thread.new do
        # construct the URL with the number of the country
        url = rss_url + country[0]
        # get the RSS feed from the URL
        rss = SimpleRSS.parse open(url, :proxy => "http://proxy:8080")
        # construct the HTML with the country information and the review
        country_html = String.new
        # go to the next country if there are no reviews
        next if rss.items.size == 0
        # construct the country header
        country_html << "<hr/><h1>#{country[1]}</h1>"
        # construct a div for each review
        rss.items.each do |i|
          country_html << "<div><h3>#{utf8 i.title}</h3><p>#{utf8 i.description}</p></div>"
        end
        # set the thread variable for later access
        Thread.current["html"] = country_html
      end
    end

    ### join the threads and construct the HTML
    threads.each do |t|
      # join the threads
      t.join
      # construct one big HTML chunk out of the small HTML chunks
      html << t["html"] unless t["html"].nil?
    end

    ### write everything to a file
    html << $base_html_2
    File.open(local_filename, 'w:utf-8') do |f|
      f.write(html)
    end
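The threading pattern itself (start one thread per country, store the result in a thread-local variable, read it after `join`) can be tried in isolation. In this sketch `fake_fetch` is a hypothetical stand-in that sleeps instead of making an HTTP request, and the country numbers are sample values.

```ruby
# Hypothetical stand-in for the HTTP request; sleeps instead of using the network.
def fake_fetch(code)
  sleep 0.2
  "<h1>Reviews for #{code}</h1>"
end

codes = %w[143443 143441 143444] # sample country numbers

started = Time.now
threads = codes.map do |code|
  Thread.new do
    # Store the result in a thread-local variable for later access.
    Thread.current["html"] = fake_fetch(code)
  end
end

html = String.new
threads.each do |t|
  # Wait for the thread, then collect what it stored.
  t.join
  html << t["html"] unless t["html"].nil?
end
elapsed = Time.now - started

puts html
puts "took %.2f s" % elapsed
```

Since the three sleeps overlap, the total time should be close to one sleep rather than three. The results are appended in the order of the `threads` array, so the output order does not depend on which thread finishes first. `Thread#value`, which joins and returns the block's result, would be an alternative to the thread-local variable.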
By the way, this code also works with JRuby; the only thing I had to adjust was the following line:
    RUBY_PLATFORM == 'java' ? string : string.force_encoding('utf-8')
In my experience JRuby handles UTF-8 strings much better, so the force_encoding call is skipped there.
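As a possible alternative to checking the platform, a helper could look at the string's own encoding tag instead. This is only a sketch and not what the script above does; `utf8_if_binary` is a hypothetical name, and I have not verified that it behaves the same on JRuby.

```ruby
# Retag a string as UTF-8 only if it is currently tagged as binary.
# Returns the string unchanged in every other case.
def utf8_if_binary(string)
  return string unless string.encoding == Encoding::ASCII_8BIT
  # dup, so that the caller's string keeps its original tag
  string.dup.force_encoding('utf-8')
end

binary = "f\xC3\xBCr".b # UTF-8 bytes with a binary tag
tagged = "für"          # already tagged as UTF-8

puts utf8_if_binary(binary).encoding
puts utf8_if_binary(tagged).equal?(tagged)
```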