Wordjumper

Marcus Uneson, maarcus.uneson@ling.lu.se NOTE: delete one of the 'a':s
Dept of Linguistics, Lund University

Q & A


What does the program do?
What's the point?
Why do I see misspelt words (...) etc.?
Why do I see the same word appearing more than once in the randomized path?

What does the program do?

It takes the vocabulary of a language as defined in some wordlist (currently only English and Swedish are available, with some 35,000 words each), together with a start word and a target word of your choice. It then finds a path from start to target via 'orthographic neighbours': words which may be derived from a given word by replacing exactly one letter with another (or, optionally, by deleting or adding exactly one letter).
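
As an illustration of what counts as an orthographic neighbour, here is a minimal Python sketch. It is not the program's actual code: the function name neighbours, the allow_indel switch and the plain a-z alphabet are all assumptions made for the example (a Swedish wordlist would need å, ä and ö added to the alphabet).

```python
import string

def neighbours(word, vocabulary, allow_indel=False,
               alphabet=string.ascii_lowercase):
    """Return the words in `vocabulary` exactly one edit away from `word`.

    Hypothetical helper: replacement is always allowed; deletion and
    insertion of a single letter only when `allow_indel` is True.
    """
    candidates = set()
    for i in range(len(word)):
        # Replace the letter at position i with every other letter.
        for letter in alphabet:
            if letter != word[i]:
                candidates.add(word[:i] + letter + word[i + 1:])
        if allow_indel:
            # Delete the letter at position i.
            candidates.add(word[:i] + word[i + 1:])
    if allow_indel:
        # Insert a letter at every position, including the very end.
        for i in range(len(word) + 1):
            for letter in alphabet:
                candidates.add(word[:i] + letter + word[i:])
    # Keep only candidates that are real words in the wordlist.
    return sorted(candidates & set(vocabulary))

vocabulary = {"cat", "cot", "cut", "at", "cart", "dog"}
print(neighbours("cat", vocabulary))                    # ['cot', 'cut']
print(neighbours("cat", vocabulary, allow_indel=True))  # ['at', 'cart', 'cot', 'cut']
```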

You may let the program find the shortest path (or paths; there may be several equally short ones) by a breadth-first search, or else let it generate a path at random without repeating itself. You may also let it perform all possible derivations of all possible derivations ... of all possible derivations of the start word. In this last case, the last word will equal the start word.
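
The shortest-path search can be sketched as follows. This is only one way of writing a breadth-first search that keeps all equally short paths, not necessarily how the program does it. The name shortest_paths and the graph format (a dict mapping each word to a list of its neighbours) are assumptions made for the example.

```python
from collections import deque

def shortest_paths(graph, start, target):
    """Return every shortest path from start to target, found breadth-first."""
    distance = {start: 0}   # number of steps from start to each word seen
    parents = {start: []}   # predecessors lying on some shortest path
    queue = deque([start])
    while queue:
        word = queue.popleft()
        if word == target:
            break  # all of the target's shortest-path parents are known by now
        for nxt in graph.get(word, ()):
            if nxt not in distance:
                distance[nxt] = distance[word] + 1
                parents[nxt] = [word]
                queue.append(nxt)
            elif distance[nxt] == distance[word] + 1:
                parents[nxt].append(word)  # another equally short way in
    if target not in parents:
        return []  # the target cannot be reached from the start

    def unwind(word):
        # Follow the parent links back to the start, one path per branch.
        if word == start:
            return [[start]]
        return [path + [word] for parent in parents[word]
                for path in unwind(parent)]

    return sorted(unwind(target))

graph = {"cat": ["bat", "cot"], "bat": ["cat", "bot"],
         "cot": ["cat", "bot"], "bot": ["bat", "cot"], "dog": []}
print(shortest_paths(graph, "cat", "bot"))
# [['cat', 'bat', 'bot'], ['cat', 'cot', 'bot']]
```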

It may be helpful to think of the words as 35,000 or so cities connected by some road network, and of the derivations as journeys from one city to another. The program then acts as a travel planner. If a route exists at all, it will generate an itinerary from A to B: either the shortest possible one or just one at random. And if you so wish, the itinerary will let you travel each and every road available within the road network exactly once in each direction, ending up where you started.
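
The round trip along every road, once in each direction, can be sketched with Hierholzer's algorithm for Eulerian circuits. That is a standard technique swapped in here for illustration; the page does not say which algorithm the program uses. The sketch assumes a symmetric graph (if A lists B, then B lists A) given as a dict of neighbour lists, and the name full_tour is made up for the example. It covers only the roads that can be reached from the start word.

```python
def full_tour(graph, start):
    """Travel every link reachable from start once per direction (Hierholzer)."""
    # Copy the neighbour lists so that each directed link is consumed once.
    unused = {word: list(nbrs) for word, nbrs in graph.items()}
    stack, tour = [start], []
    while stack:
        word = stack[-1]
        if unused.get(word):
            # There is still an untravelled road out of this word: take it.
            stack.append(unused[word].pop())
        else:
            # Nothing left to explore here: back out and record the word.
            tour.append(stack.pop())
    return tour[::-1]

graph = {"cat": ["bat", "cot"], "bat": ["cat", "bot"],
         "cot": ["cat", "bot"], "bot": ["bat", "cot"]}
tour = full_tour(graph, "cat")
print(" > ".join(tour))  # starts and ends with 'cat', 8 steps in between
```

Because every road counts once in each direction, each word has as many ways in as ways out, which is why such a round trip always exists within the part of the network reachable from the start.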

See also the separate page for a little more on networks and relations (very basic).

What's the point?

For the user: diversion (maybe). For the author: diversion (definitely), and a piece of coursework including some exercise in network unwinding algorithms.

Why do I see misspelt words, words occurring twice with different spellings, inflected forms of some but not all words, abbreviations, proper nouns, bound prefixes, etc.?

Wordjumper depends on databases of possible links. The databases are prepared in advance by letting a simple link-analysing program go through some wordlist, investigating all possible exits for each and every word. The completeness and correctness of a database is therefore entirely dependent on the completeness and correctness of the wordlist used.
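
A link analysis of this kind might look roughly like the sketch below. It is an assumption-laden illustration, not the actual link-analysing program: it handles replacement links only, the name build_links is invented, and it assumes that the placeholder character "_" never occurs in the wordlist. Words are grouped under patterns with one letter blanked out, so that all words sharing a pattern are neighbours of one another.

```python
from collections import defaultdict
from itertools import combinations

def build_links(words):
    """Map each word to the sorted list of words differing in exactly one letter."""
    buckets = defaultdict(set)
    for word in set(words):
        for i in range(len(word)):
            # "c_t" collects cat, cot, cut, ...: all mutual neighbours.
            buckets[word[:i] + "_" + word[i + 1:]].add(word)
    links = {word: set() for word in words}
    for group in buckets.values():
        for a, b in combinations(group, 2):
            links[a].add(b)
            links[b].add(a)
    return {word: sorted(nbrs) for word, nbrs in links.items()}

print(build_links(["cat", "cot", "cut", "dog"]))
# {'cat': ['cot', 'cut'], 'cot': ['cat', 'cut'], 'cut': ['cat', 'cot'], 'dog': []}
```

Whatever is wrong with the wordlist shows up directly in the result: a misspelt entry simply becomes one more word with its own exits.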

A good wordlist for this application should contain neither proper nouns nor abbreviations. It should also be fairly comprehensive (for one thing, not primarily based on word frequency data), lemmatized (containing no inflected forms) and standardized (listing only the most accepted spelling for a given lexeme, if there are variations). Any wordlist, of course, should be reliable with regard to spelling. The ones currently used are free but unfortunately exhibit some deficiencies and inconsistencies in other respects. I hope I'll find something better.

Why do I see the same word appearing more than once in the randomized path?

Because it is not the nodes (the words) that are unique and never occur more than once: it is the links between them.
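
A small Python sketch may make the difference concrete. It is a guess at the general idea, not the program's own code: the name random_path is invented, the graph is assumed to be a dict of neighbour lists, and a link is assumed to count separately for each direction. The walk remembers which links it has used, not which words it has visited, so a word may well turn up again.

```python
import random

def random_path(graph, start, target, rng=None):
    """Wander at random from start, never using the same directed link twice.

    Returns the path as a list of words, or None if the walk runs out of
    unused links before reaching the target.
    """
    rng = rng or random.Random()
    used = set()      # directed links already travelled, as (from, to) pairs
    path = [start]
    while path[-1] != target:
        word = path[-1]
        options = [n for n in graph.get(word, ()) if (word, n) not in used]
        if not options:
            return None  # dead end: every link out of this word is used up
        nxt = rng.choice(options)
        used.add((word, nxt))
        path.append(nxt)
    return path

graph = {"cat": ["bat", "cot"], "bat": ["cat", "bot"],
         "cot": ["cat", "bot"], "bot": ["bat", "cot"]}
print(random_path(graph, "cat", "bot"))  # e.g. cat, bat, cat, cot, bot
```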

Marcus Uneson, September 2002