Tapestry Central: ruby

Showing posts with label ruby. Show all posts

Monday, April 27, 2009

Ruby and CouchDB

I've started looking into CouchDB of late; I think it may be a great solution to some of our common issues around collecting "survey" style data from users. Our current approach uses a big nasty hierarchy of Hibernate entities.

I wanted to have some real, and real-world, data. Since I'm a big board gamer, I thought it would be nice to have a local copy of the BoardGameGeek database. I wrote a program that scrapes data from the BGG site and loads it into CouchDB.

#!/usr/bin/ruby

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'thread'
require 'couchrest'


# Extension to Hpricot

class Hpricot::Elem
  # content at
  # if block given, return value comes from yielding
  # actual content to the block
  def cat(expr)
    result = self.at(expr)

    return nil unless result

    content = result.inner_html.fixup

    return (yield content) if block_given?

    return content
  end

  # content-at converted to int (via to_i)

  def cati(expr)
    self.cat(expr) { |content| content.to_i }
  end

  def search_to_text(expr)
    self.search(expr).map { |e| e.inner_text.fixup}
  end
end

$BGG = "http://boardgamegeek.com"


$catalog_pages = 0
$game_pages = 0
$games_added = 0

# Access to couch db.  From what I can tell, the Databsae is multithreaded.

$DB = CouchRest.database!("http://127.0.0.1:5984/board-game-geek")

# The thread pool.

$POOL_SIZE = 10

$QUEUE = Queue.new

workers = (1..$POOL_SIZE).map do |i|
  Thread.new("worker #{i}") do
    begin
      proc = $QUEUE.deq
      proc.call()
    end until $QUEUE.empty?
  end
end

def enqueue &action
  $QUEUE << action
end


def parse_browser_page(url)
end

class String
  def fixup()
    self.gsub("&#039;", "'").gsub("&amp;", "&")
  end
end

def parse_and_load_game(game_id)

  $game_pages += 1

  page = Hpricot.XML(open("#$BGG/xmlapi/boardgame/#{game_id}?comments=1&stats=1"))

  bg = page.at("//boardgame")

  ratings = bg.at("//ratings")

  comments = (bg/"comment").map do |e|
    {
            "user" => e[:username].fixup,
            "comment" => e.inner_text.fixup
    }
  end

  doc = {
          "_id" => game_id,
          "title" => bg.cat("name[@primary='true']"),
          "description" => bg.cat("description"),
          "designers" => bg.search_to_text("boardgamedesigner"),
          "artists" => bg.search_to_text("boardgameartist"),
          "publishers" => bg.search_to_text("boardgamepublisher"),
          "published" => bg.cati("yearpublished"),
          "categories" => bg.search_to_text("boardgamecategory"),
          "mechanics" => bg.search_to_text("boardgamemechanic"),
          "images" =>{
                  "url", bg.cat("image"),
                  "thumbnailUrl", bg.cat("thumbnail"),
                  },
          "players" => {
                  "min" => bg.cati("minplayers"),
                  "max" => bg.cati("maxplayers"),
                  "age" => bg.cati("age")
          },
          "stats" => {
                  "rank" => ratings.cati("rank"),
                  "averageRating" => ratings.cat("average") { |content| content.to_f },
                  "ownedCount" => ratings.cati("owned")
          },
          "comments" => comments
  }

  enqueue do
    $games_added += 1

    $DB.save_doc(doc)
  end

end


def process_game(game_id)

  begin
    doc = $DB.get(game_id)

    # Found, do nothing

  rescue RestClient::ResourceNotFound

    # Not in the database yet, so fire off a request to parse its page.

    enqueue { parse_and_load_game game_id }

  end
end

def parse_catalog_page(url)

  puts("[%24s] %4d catalog pages, %4d/%4d games parsed/added (%4d actions queued)" % [Time.now.ctime, $catalog_pages,
       $game_pages, $games_added, $QUEUE.length])

  doc = Hpricot(open(url))

  $catalog_pages += 1


  doc.search("//table[@id='collectionitems']/tr/td[3]//a") do |elem|
    href = elem[:href]
    game_id = href.split('/').last()
    enqueue { process_game game_id }
  end

  next_page_link = doc.at("//a[@title='next page']")

  return unless next_page_link

  next_page = next_page_link[:href]

  # Add this last, to give the other actions a chance to operate.
  # A better mechanism would be a priority-based queue, where the catalog
  # page parse action is lower priority than the other actions.

  enqueue { parse_catalog_page($BGG + next_page) }

end


# Kick it off

start_page = $ARGV[0] || ($BGG + "/browse/boardgame")

enqueue { parse_catalog_page start_page }

# Wait for all the workers to complete

workers.each { |th| th.join }

puts "BoardGameGeek loader complete."

puts "Queue not empty!" unless $QUEUE.empty?

Key pieces of this is Hpricot to parse HTML and XML, and CouchRest to get the data into CouchDB.

Did I go overboard? I don't think so ... this runs in about two hours, and created a database of nearly 40,000 documents (about 340MB). Using a thread pool just seemed to make sense, since (outside of the XML parsing), every aspect of this is I/O bound: pulling data from BGG or pushing data to CouchDB.

The code is naive about some threading issues and simply crashes if there's an error. Oh well.

Next up: learning how to build views against this data and deciding how to use it all. I'm thinking Cappuccino.

Friday, February 20, 2009

Ruby script to rebuild Eclipse .classpath

I'm in the process of rebuilding my Tapestry training workshop to use Eclipse instead of IDEA. One part of this is that I use the Maven Ant tasks to handle dependencies from an Ant build file. The end result are folders of JAR files:

lib/runtime: Runtime JARs
lib/runtime-src: Corresponding source JARs (where available)
lib/test: Test-only JARs
lib/test-src: Corresponding source JARs

With Tapestry + Hibernate, that's like 20 JAR files ... way too many to maintain by hand (especially across five similar projects). I mean, the Eclipse API for this is not designed for frequent changes ... I should be able to point Eclipse at a directory and say "take everything you find" (like I can in IDEA).

I tend to chase the Tapestry release, updating the Workshop project as each new Tapestry release is available. I can't be doing this by hand all the time!

What to do, what to do ... ah, a Ruby script! I've actually gotten very rusty with Ruby, so there's probably easier ways to do this, but the basics are there. I love that Ruby has lots of ways of quoting text that aren't intrusive:

#!/usr/bin/ruby
# Rebuild the .classpath file based on the contents of lib/runtime, etc.

# Probably easier using XML Generator but don't have the docs handy

def process_scope(f, scope)
 # Now find the actual JARs and add an entry for each one.
 
 dir = "lib/#{scope}"

 return unless File.exists?(dir)
 
 Dir.entries(dir).select { |name| name =~ /\.jar$/ }.sort.each do |name|
   f.write %{  <classpathentry kind="lib" path="#{dir}/#{name}"}
   
   srcname = dir + "-src/" + name.gsub(/\.jar$/, "-sources.jar") 

   if File.exist?(srcname)
      f.write %{ sourcepath="#{srcname}"}
   end
   
   f.write %{/>\n}
 end
end

File.open(".classpath", "w") do |f|
  f.write %{<?xml version="1.0" encoding="UTF-8"?>
<classpath>
  <classpathentry kind="src" path="src/main/java"/>
  <classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER"/>
}

 process_scope(f, "runtime")
 process_scope(f, "test")
 
 f.write %{
  <classpathentry kind="lib" path="src/main/resources"/>
  <classpathentry kind="output" path="target/classes"/>
</classpath>
}
    
end

I integrated this script into my Ant build file, so that it runs after I pull down dependencies.

Once I get to the new workshop session on testing, I may need to update this to take account of testing (including compiling src/test/java to target/test-classes, etc.). Still, 45 minutes of tweaking (which would have been 10 minutes if I wasn't so rusty) will save me lots of time in the future, dividends payable immediately.

Thursday, January 10, 2008

Creating new T5 projects the easy way

The Maven archetype for Tapestry projects is really useful for getting up and running quickly.

However, that's not what I use day-to-day. That command line is so long and ugly!

I use the following Ruby script as a wrapper around Maven:

#!/usr/bin/ruby

require 'optparse'

$group = nil
$artifact = nil
$package = nil
$archetypeVersion = nil
$version = "1.0.0-SNAPSHOT"
$offline = false

$opts = OptionParser.new do |opts|
  
  opts.banner = "Usage: new-project.rb [options]"
  
  opts.on("-g", "--group GROUP", "The group id for the new project") do |value|
    $group = value
  end

  opts.on("-a", "--artifact ARTIFACT", "The artifact for the new project") do |value|
    $artifact = value
  end
  
  opts.on("-p", "--package PACKAGE", "The root package for source code in the new project") do |value|
    $package = value
  end
  
  opts.on("-v", "--version VERSION", "The version number of the new project") do |value|
    $version = value
  end
  
  opts.on("-o", "--offline", "Execute Maven in offline mode") { $offline = true }
  
  opts.on("-V", "--archetype-version VERSION", "Version of the Tapestry quickstart archetype") do |value|
    $archetypeVersion = value
  end
  
  opts.on("-h", "--help", "Help for this command") do
    puts opts
    exit
  end
end

def fail(message)
  puts "Error: #{message}"
  puts $opts
  exit
end


begin
  $opts.parse!
rescue OptionParser::InvalidOption
  fail $!
end

fail "Unexpected command line argument" if ARGV.length > 0
fail "Must specify group" unless $group
fail "Must specify artifact" unless $artifact

$package = $package || "#$group.#$artifact"

command = ["mvn"]

command << "-o" if $offline

command << [
  "archetype:create",
  "-DarchetypeGroupId=org.apache.tapestry",
  "-DarchetypeArtifactId=quickstart",
  "-DgroupId=#$group",
  "-DartifactId=#$artifact",
  "-DartifactVersion=#$version",
  "-DpackageName=#$package"]

if $archetypeVersion
  command << "-DarchetypeVersion=#$archetypeVersion"
end

command = command.join ' '

exec command

A typical usage is thus: new-project -g com.example -a myapp ... and we're off and running.

Saturday, September 15, 2007

Mac OS X bundles vs. Subversion

If you work on Mac OS X, you may have noticed how cool Macs deal with complex documents, things like Keynote presentations or applications themselves. They're stored as directories. The Finder hides this, making them look and act like individual files. This works nicely, often the contents of a bundle are simple text and XML files ... whereas the equivalent under Windows is either a very proprietary (and potentially fragile) binary format, or multiple files and folders that YOU have to treat as a unit.

Alas, this all breaks down when using Subversion. You can't just check in MyPresentation.key into SVN ... it will create those pesky .svn directories inside the bundle, and those will be destroyed every time you save your presentation.

My solution to this is to convert the bundles into an archive, and check in the archive. The bundle folders are marked as svn:ignore. I guess this reveals that I mostly use SVN as a safe, structured backup.

In any case, manually creating those archives can be a pain. So ... out comes my solution to many problems: Ruby.

The goal here is to find bundles that need to be archived; do it efficiently (only update the archive if necessary) and do it recursively, seeking out bundles in sub-directories.

#!/usr/bin/ruby

# Used to prepare a directory for commit to Subversion. This is necessary for certain file types on Mac OS X because what appear to be files in the Finder
# are actually directories (Mac uses the term "bundle" for this concept). It is useless to put the .svn folder inside such a directory, because it will
# tend to be deleted whenever the "file" is saved.  
#
# Instead, we want to compress the directory to a single archive file; the bundle will be marked as svn:ignore.
#
# We use tar with Bzip2 compression, which is resource intensive to create, but 
# compresses much better than GZip or PKZip.
#
# The trick is that we only want to create the acrhive version when necessary; when 
# the archive does not exist, or when any file
# in the bundle is newer than the archive.

require 'optparse'

# Set via command line options

$extensions = %w{pages key oo3 graffle}
$recursive = true
$dry_run = false

# Queue of folders to search (for bundles)

$queue = []

def matching_extension(name)
  dotx = name.rindex('.')
  
  return false unless dotx
  
  ext = name[dotx + 1 .. -1]
  
  return $extensions.include?(ext)
end


# Iterate over the directory, identify bundles that may need to be compressed and (if recursive) subdirectories
# to search.
#
# path: string path for a directory
def search_directory(dirpath)
  
  Dir.foreach(dirpath) do |name|
    
    # Skip hidden files and directories
    next if name[0..0] == "."
    
    path = File.join(dirpath, name)
        
    next unless File.directory?(path)
                  
    if matching_extension(name)
      update_archive path
      next
    end
    
    if $recursive
      $queue << path
    end
    
  end
  
end


def needs_update(bundle_path, archive_path)
  
  return true unless File.exists?(archive_path)
  
  archive_mtime = File.mtime(archive_path)
  
  # The archive exists ... can we find a file inside the bundle thats newer?
  # This won't catch deletions, but that's ok.  Bundles tend to get completly
  # overwritten when any tiny thing changes.
  
  dirqueue = [bundle_path]

  until dirqueue.empty?
    
    dirpath = dirqueue.pop
    
    Dir.foreach(dirpath) do |name|
      
      path = File.join(dirpath, name)
      
      if File.directory?(path)
        dirqueue << path unless [".", ".."].include?(name)
        next
      end
      
      # Is this file newer?
      
      if File.mtime(path) > archive_mtime
        return true
      end
      
    end
    
  end
  
  return false
end

def update_archive(path)
  archive = path + ".tar.bz2"
  
  return unless needs_update(path, archive)

  if $dry_run
    puts "Would create #{archive}"
    return
  end

  puts "Creating #{archive}"
    
  dir = File.dirname(path)
  bundle = File.basename(path)
    
  # Could probably fork and do it in a subshell
  system "tar --create --file=#{archive} --bzip2 --directory=#{dir} #{bundle}"

end

$opts = OptionParser.new do |opts|
  
  opts.banner = "Usage: prepsvn [options]"

  opts.on("-d", "--dir DIR", "Add directory to search (if no directory specify, current directory is searched)") do |value|
    $queue << value
  end

  opts.on("-e", "--extension EXTENSION", "Add another extension to match when searching for bundles to archive") do |value|
    $extensions << value
  end
  
  opts.on("-N", "--non-recursive", "Do not search non-bundle sub directories for files to archive") do
    $recursive = false
  end
  
  opts.on("-D", "--dry-run", "Identify what archives would be created") do
    $dry_run = true
  end
  
  opts.on("-h", "--help", "Help for this command") do
    puts opts
    exit
  end
end

def fail(message)
    puts "Error: #{message}"
    puts $opts
end

begin
    $opts.parse!
rescue OptionParser::InvalidOption
    fail $!
end

# If no --dir specified, use the current directory.

if $queue.empty?
  $queue << Dir.getwd
end

until $queue.empty? 
  search_directory $queue.pop
end

I do love Ruby syntax, it is so minimal, and lets me follow my personal mantra less is more.

I'm sure there's some edge cases that aren't handle well, such as spaces in path names and maybe issues related to permissions. But it works for me.

You do need to have tar installed, in order to build the archives. I can't remember if that's built in to Mac OS X (probably) or whether I obtained it using Fink.

In any case, you need to remember to execute prepsvn in your workspace, to spot file bundles that need archiving, before you check in. It would be awesome if Subversion supported some client-side check-in hooks to do this automatically.

Friday, August 31, 2007

The Blindness of James Gosling: Java as A First Language

I think Java is an excellent all-around-language. It is truly ubiqitous, well specified, highly performant and well accepted by the industry.

But there's a big difference between those credentials, and the credentials of the first language a developer learns. Regardless, James Gosling feels obligated to recommend Java as the first language anyone learns.

Java as a first language? Please! Yes, Java is simpler than C++, C, Lisp, assembler and all the languages that seasoned veterans, such as James Gosling, are familiar with. But the fact that Java shields you from pointers, malloc() and zero-terminated strings does not make it a good first language!

Java is extremely monolithic: in order to understand how to run a simple Hello World program, you'll be exposed to:

Classes
Java packages
Static vs. instance methods
Source files vs. compiled classes
Editing vs. Execution
Using a compiler or an IDE
Method return types and method parameter types
The magic (for newbies) that is "System.out.println()"

... in fact, this list is far from exhaustive. A lot of that has fallen off our collective radar ... we're blind to it from seeing it so much over the last ten years or more.

What's important to understand is that people new to programming don't really have a way of understanding the difference between a document and a program, between source code and an application. Certainly the web (with HTML and JavaScript all mixed together) hasn't helped people understand the division intuitively. It's very hard for any of us to think like someone new to our entire world.

I'm rarely in a position to teach people how to program, but when I think about it, only two languages make sense to me: Ruby and Inform.

Ruby, because it is interactive yet fully object oriented. You can start with an interactive puts "Hello World" and quickly work up from there. You get to interact with the Ruby world as objects before you have to define your own objects. There's a nice clean slope from one-off quickies to eventual discipline as a true developer, without any huge leaps in the middle.

Inform, used to create interactive text adventures, is also intriguing. It's specifically designed for non-programmers, and has a laser-focused UI. It is interactive: you type a little bit about your world and hit Run to play it. Again, you experience an object oriented environment in an incredibly intuitive way long before you have to define to yourself, or to the language, what an object is.

Inform is truly some amazing software, given that advanced Inform games will make use of not just object oriented features, but also aspect oriented. Here's a little example I brewed up about a room built on top of a powerful magnet:

"Magnet Room" by Howard Lewis Ship

A thing is either magnetic or non-ferrous. Things are normally non-ferrous.

The Test Lab is a room. "A great circular space with lit by a vast array of ugly
flourescent lights.  In the center of the room is a white circular pedestal with a
great red switch on top. Your feet echo against the cold steel floor."

The red switch is a device in the test lab. It is scenery.

The double-sided coin is in the test lab. "Your lucky double sided half dollar rests
on the floor." The coin is magnetic.  Understand "dollar" as the coin.

A rabbits foot is in the test lab. "Your moth-eaten rabbits foot lies nearby."

Magnet Active is a scene.

Magnet Active begins when the red switch is switched on. 

When Magnet Active begins: say "A menacing hum begins to surge from beneath the
floor."

Magnet Active ends when the red switch is switched off.

When Magnet Active ends: say "The menacing hum fades off to nothing."

Instead of taking a thing that is magnetic during Magnet Active: say "[The noun]
is firmly fixed to the floor."

That's not a description of my program; that's the actual program. It's literate; meaning equally readable, more or less, by the compiler and the author. It's still a formal language with all that wonderful lexical and BNF stuff, it's just a bit richer than a typical programming language.

What I like about Inform is that you get a complete little game from this:

Magnet Room
An Interactive Fiction by Howard Lewis Ship
Release 1 / Serial number 070831 / Inform 7 build 4W37 (I6/v6.31 lib 6/11N) SD

Test Lab
A great circular space with lit by a vast array of ugly flourescent lights.  In the
center of the room is a white circular pedestal with a great red switch on top. Your
feet echo against the cold steel floor.

Your lucky double sided half dollar rests on the floor.

Your moth-eaten rabbits foot lies nearby.

>take dollar
Taken.

>drop it
Dropped.

>turn switch on
You switch the red switch on.

A menacing hum begins to surge from beneath the floor.

>take dollar
The double-sided coin is firmly fixed to the floor.

>take foot
Taken.

>turn switch off
You switch the red switch off.

The menacing hum fades off to nothing.

>take dollar
Taken.

>

The way rules work, especially in the context of scenes, is very much an aspect oriented approach. It is a pattern-based way to apply similar rules to a variety of objects. I've seen this kind of thing with aspect oriented programming in Java, but also with pattern based languages such as Haskell and Erlang.

The point is, using Ruby or Inform, you can learn the practices of a programmer ... dividing a complex problem into small manageable pieces, without being faced with a huge amount of a-priori knowledge and experience in order to get started. Both Ruby and Inform are self-contained environments designed for quick and easy adoption. That's something Java missed the boat on long, long ago!

Thursday, March 15, 2007

Ruby script for creating new Tapestry 5 projects

One thing with building lots of demos for upcoming presentations, labs, & etc. is that I'm having to use mvn archetype:create a lot and that command line is just hideous.

So I took a deep breath, stepped back, and wrote a Ruby script, newproj to simplify the process:

#!/usr/bin/ruby

require 'getoptlong'

opts = GetoptLong.new(
  [ "--group", "-g", GetoptLong::REQUIRED_ARGUMENT ],
  [ "--artifact", "-a", GetoptLong::REQUIRED_ARGUMENT ],
  [ "--package", "-p", GetoptLong::OPTIONAL_ARGUMENT ],
  [ "--version", "-v", GetoptLong::OPTIONAL_ARGUMENT ]
)

group = nil
artifact = nil
package = nil
version = "1.0.0-SNAPSHOT"
error = false

begin
  opts.each do | opt, arg |
    case opt
      when "--group" 
        group = arg
      when "--artifact" 
        artifact = arg
      when "--package" 
        package = arg
      when "--version" 
        version = arg
    end
end
rescue GetoptLong::Error
  error = true
end

if error || ARGV.length != 0 || group == nil || artifact == nil
  puts "newproj: --group groupId --artifact arifactId [--package package] [--version version]"
  exit 0
end

package = package || "#{group}.#{artifact}"

command = "mvn archetype:create -DarchetypeGroupId=org.apache.tapestry -DarchetypeArtifactId=quickstart -DarchetypeVersion=5.0.3"
command << " -DgroupId=#{group} -DartifactId=#{artifact} -DpackageName=#{package} -Dversion=#{version}"

puts command

Kernel::exec(command)

I'll eventually add to this; it needs the option to control the mvn -o (offline) flag, and further in the future, the ability to choose the correct archetype (once we add more than just quickstart). But this sure beats copying and pasting out of the documentation, like I've been doing.

Tapestry Central

Tapestry Training -- From The Source