11 October 2010

Playing with the Wordle algorithm: a tag cloud of Mesh Terms

The paper describing Wordle has been recently published (http://www.research.ibm.com/visual/papers/wordle_final2.pdf ). The algorithm was briefly described: “The most distinctive geometric aspect of a Wordle is the layout algorithm, which packs words to make efficient use of space. While many space-filling visualizations exist, they typically work by recursiveLayout proceeds according to this pseudocode:

sort words by weight, decreasing
for each word w:
w.position := makeInitialPosition(w);
while w intersects other words:
updatePosition(w);

The two key procedures here are "makeInitialPosition" and "updatePosition". The makeInitialPosition routine picks a point at random according to a distribution that takes into account the desired overall shape, and, if desired, alphabetical order. The updatePosition routine moves the word on a spiral of increasing radius, radiating from the word's starting position. The updatePosition routine is aware of constraints on the overall shape of the Wordle. Constraining the layout to a rectangular shape causes updatePosition to prefer positions inside of the strict boundaries of the playing field; a blobby overall shape accepts boundary violations. The rectangular constraint is relaxed when the spiral radius exceeds either playing field dimension. ”
Your browser does not support the <CANVAS> element !


Just for fun , I've implemented my own version of the Wordle Alogrithm. The Java code is available on github at http://github.com/lindenb/jsandbox/blob/master/src/sandbox/MyWordle.java. I won't describe the program here, but I'll just say that the code invokes a java.awt.font.TextLayout class to get the shape of the text:
Graphics2D g=(...)
FontRenderContext frc = g.getFontRenderContext();
Font font=new Font("Dialog",Font.BOLD,fontSize);
TextLayout textLayout=new TextLayout(w.getText(), font, frc);
Shape shape=textLayout.getOutline(null);
and a java.awt.geom.Area to test if two shapes intersects.

Ok, let's test this code. FIrst I'm going to dump a pubmed query as XML with another simple tool named PubmedDump. This XML file is then parsed with a javascript program called from the SAX parser saxstream.jar (previously described here):
mesh.js
importPackage(Packages.sandbox);
importPackage(Packages.java.io);
importPackage(Packages.java.awt);
var content=null;
var mesh2count={};

function startElement(uri,localName,name,atts)
{
if(name=="DescriptorName")
{
content="";
}
}
function characters(s)
{
if(content!=null) content+=s;
}
function endElement(uri,localName,name)
{
if(content!=null)
{
var count=mesh2count[content];
if(count===undefined) count=0;
mesh2count[content]=count+1;
}
content=null;
}
function endDocument()
{
var w= new MyWordle();
for(var s in mesh2count)
{
var word= new MyWordle.Word(s,mesh2count[s]);
w.add(word);
}
w.setUseArea(true);/* use shape area instead of bounding boxes */
w.setAllowRotate(true);
w.setSortType(1);/* sort by weight */
w.doLayout();

var f=new File("result.svg");
w.saveAsSVG(f);
}
This javascript code counts the occurence of each MESH term (<DescriptorName>), and when the document has been parsed, a new instance of MyWordle class is created, filled and the result is saved to a SVG file.

Invocation


Here is an example for the query "Rotavirus NSP3 NSP1":
java -jar pubmeddump.jar "Rotavirus NSP3 NSP1" |\
java -cp mywordle.jar:saxscript.jar org.lindenb.tinytools.SAXScript -n -f mesh.js


Result





That's it,
Pierre

3 comments:

Anonymous said...

Excellent work and very fine analysis!

Anonymous said...

Thanks a LOT! Yours was the only implementation of a word cloud purely in Java that I could find!

Regu said...

Pierre,

Good work there. I was looking for a Java based implementation of tag cloud. Your source code provided a great start. I have created an OSS project for unstructured data analysis called Sift (https://github.com/regunathb/Sift) and tag cloud is one of the supported features.

Thanks
Regu