A while ago I carefully crafted a Google searcher in sed and shell script. Making scripts that do HTTP requests is cool! (I say carefully because it is a pain to parse HTML with regular expressions.)
The Code
Nokogiri is a Ruby library for searching HTML, and a very simple one. On its start page there is a simple script that searches Google. It’s beautiful.
But it doesn’t go through all the results, just the first page. So I modified it a bit:
#!/usr/bin/env ruby
# http://iamstealingideas.wordpress.com/2010/05/23/ruby-enumerators-and-ascript-to-list-all-google-search-results/
require 'open-uri'
require 'nokogiri'

class Search
  include Enumerable

  def initialize(terms)
    # Percent-encode every character that is not URI-unreserved.
    def escape u
      URI.escape u, Regexp.new("[^#{URI::PATTERN::UNRESERVED}]")
    end
    # Escape each term, then join the terms with '+'.
    @terms = terms.map { |u| escape u }.reduce { |a, b| "#{a}+#{b}" }
  end

  # Fetch up to 10 pages of 100 results each.
  def each
    10.times do |n|
      url = "http://www.google.com/search?start=#{100*n}&num=100&q=#{@terms}"
      doc = Nokogiri::HTML(open(url))
      # Stop once n exceeds the number of page links in the navigation table.
      break if n > doc.css('table#nav a.fl').length
      # Yield the URL of each result link.
      doc.css('h3.r a.l').each { |p| yield p['href'] }
    end
  end
end

fail "Usage: #{$0} terms" if $*.empty?
Search.new($*).each { |p| puts p }
It was tested in Ruby 1.9.
It’s executable: save it and try ./script site:iamstealingideas.wordpress.com
Objects of the Search class have an each method, which takes a block and calls it for every search result. (In this case, the block just prints the URL with puts.)
But they also have other methods. Let’s suppose your terminal is being flooded with too many results. (Maybe I’m being too optimistic.) If you want just the first 7 results, you can use the first method. Replace the last line of the script with:
Search.new($*).first(7).each { |p| puts p }
(It may make no difference if there are fewer than 7 results.) What about counting the available results?
puts Search.new($*).count
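One detail worth seeing in isolation: first stops iterating as soon as it has enough elements, while count has to walk through everything. Since every page in the script above costs an HTTP request, that matters. Here is a minimal sketch; FakeSearch and pages_fetched are made-up names that stand in for the real class, and each "page" is just a range of numbers instead of a request.

```ruby
# Hypothetical stand-in for Search: no network access, each "page" is a
# range of 100 numbers, and @pages_fetched counts how many pages were produced.
class FakeSearch
  include Enumerable

  attr_reader :pages_fetched

  def initialize
    @pages_fetched = 0
  end

  def each
    10.times do |n|
      @pages_fetched += 1 # stands in for one HTTP request
      (100 * n...100 * (n + 1)).each { |i| yield "result #{i}" }
    end
  end
end

s = FakeSearch.new
p s.first(3)          # ["result 0", "result 1", "result 2"]
puts s.pages_fetched  # 1, because first broke out of each early

t = FakeSearch.new
puts t.count          # 1000
puts t.pages_fetched  # 10, because count needs every page
```

So first(7) on the real class should only need the first page, while count goes through all of them.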
Enumerable classes
The trick here is to include the Enumerable mixin. This will give a lot of methods to the class, all based on your each method. Just like, say, an Array:
[1, 2, 3].each { |p| puts p }
puts [1, 2, 3].count
In fact, it has all these methods, but it doesn’t have, say, []. You can convert it to an Array using entries. Replace the last line with:
puts Search.new($*).entries[723]
You will get the result at index 723 (the 724th one, since arrays are zero-indexed). Yeah, you might try a keyword with more results, like ./script love.
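To see the mixin without any networking in the way, here is a small sketch with a toy class. Countdown is a hypothetical example, not part of the script; the only method it defines itself is each, and everything else comes from Enumerable.

```ruby
# Countdown is a made-up class for illustration only.
class Countdown
  include Enumerable

  def initialize(from)
    @from = from
  end

  # The only iteration method written by hand; Enumerable builds on it.
  def each
    @from.downto(1) { |n| yield n }
  end
end

c = Countdown.new(5)
p c.count               # 5
p c.map { |n| n * 2 }   # [10, 8, 6, 4, 2]
p c.select(&:even?)     # [4, 2]
p c.respond_to?(:[])    # false, Enumerable does not provide []
p c.entries[1]          # 4, indexing works once it is an Array
```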
More
If you have trouble understanding the code, you may want to read about blocks and methods, mixins (more on mixins), and enumerators. Or ask in the comments.
IMO, Ruby seems like a nice substitute for shell scripts.
Some random things:
- Another way is to subclass the Enumerator class. Mixins seem to have higher precedence than the superclass at method lookup, so you may prefer it if you want to implement some of those methods yourself.
- That escape code was found there.
- Stopping the search was tricky. I’m counting the number of pages.
- &safe=off&filter=0 is probably useful. Also see this Google help.
- Map and reduce are used a lot in functional programming. (Map is also known as collect; reduce, as inject or fold.) Do you think this code is readable?
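On the readability question, it may help to compare the map and reduce chain with a plain join. The sketch below is only an illustration: it uses URI.encode_www_form_component as a stand-in for the script's escape helper (an assumption on my part; URI.escape itself was removed in Ruby 3.0), and the search terms are made up.

```ruby
require 'uri'

# Assumption: encode_www_form_component stands in for the post's escape
# helper. It percent-encodes reserved characters such as ":".
def escape(term)
  URI.encode_www_form_component(term)
end

terms = ['site:example.com', 'ruby', 'enumerable'] # made-up terms

# The style used in the script: fold the escaped terms together with "+".
with_reduce = terms.map { |t| escape(t) }.reduce { |a, b| "#{a}+#{b}" }

# The same result using join.
with_join = terms.map { |t| escape(t) }.join('+')

puts with_reduce              # site%3Aexample.com+ruby+enumerable
puts with_join == with_reduce # true
```

The two differ only on an empty list: reduce without an initial value returns nil, while join returns an empty string. The script exits before that can happen, since it fails when no terms are given.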