Catty v3 - artificial futility
------------------------------

  Copyright (C) 2004 by Michal Zalewski
  http://lcamtuf.coredump.cx/catty.shtml

1) What is this?
----------------

Catty is a novelty AI bot that does not, in fact, make the slightest effort
to implement or simulate intelligent behavior. Whereas other bots attempt to
analyze language, react to keywords, or do other tricks, Catty does not.

Instead, the bot uses web pages around the world - as indexed by Google - as
the source of the text it replies with. The exact sentence to be used is
found by applying a trivial, language-blind word matching algorithm. This is
rather unheard of in the world of AI, but common in the world of politics.

As a result, you talk to the Internet. Catty's responses are sometimes
incoherent or offensive - just the way the web is - but most of the time you
will be surprised, amused, or left feeling that you are having a meaningful
conversation. As a general rule, you can expect Catty to stand out in that
its responses are not canned, repetitive, or predictable.

Catty v3 works on Linux and *BSD systems (and should be fairly portable).
Its learning module requires a set of GNU tools (awk, sed, grep, tr, bash)
and a properly compiled Lynx browser.

2) What's new and why?
----------------------

Catty v1 (2001) was a trivial sentence-matching bot, originally written for
IRC and then ported to the WWW. It used fairly mediocre algorithms and a
poor HTML parsing / text selection algorithm. Its knowledge database was,
due to performance concerns, limited to circa 200 000 sentences.

Catty v2 (2002) used a far more sophisticated set of algorithms, its own
memory allocator and string comparison routines, and a far better text
matcher that favored runs of consecutive matching words, thus responding
more coherently. Its database peaked at 2 000 000 sentences.
It became a popular destination on my website, and some folks even used it
to write movie scripts ;-)

Catty v2+c (2003) used a slightly improved scoring algorithm, and, most
importantly, did not fully reset sentence scores in each iteration,
implementing something that resembled staying on topic.

Catty v3 (2004) uses a more structured database of sentences (grouping them
by "subjects" and trying to stay on topic). It also deploys a much more
selective sentence collector. The algorithms are far from perfect at this
point, but it seems to work even better than v2.

3) How do I build and use it?
-----------------------------

Simply issue 'make'. When the code is compiled, you can execute ./catty3 and
talk to the bot, prefixing every line of your input with ':'.

The reason for this requirement is that the bot is primarily intended to be
used on the web; if you put a unique keyword before the ':', the bot's
response will begin with that very keyword. This allows you to easily feed
queries and process responses for a number of concurrent clients from a CGI
script, while a single instance of Catty is running in the background.

4) How do I teach it?
---------------------

Firstly, you need a fairly recent version of lynx in $PWD or in /usr/bin,
custom-compiled by modifying src/LYGlobalDefs.h so that MAX_COLS is defined
as 30000, rather than the default 999. IT IS STRONGLY RECOMMENDED TO
RECOMPILE LYNX WITH THIS TWEAK.

Then, assuming you have all the aforementioned GNU utils in place, you
should run ./cronman. The script should go through every entry in the cron/*
directory and attempt a Google search on the issue, adding results to the
knowledge base. Proper output from the script should look like this:

  [+] Attempting to learn about 'you think would be on': 25 hits.
    + http://www.colinthompson.com/page6.htm: 20
    + http://www.crescatsententia.org/archives/2004_03_22.html: 180
    + Entry count goal achieved, bailing out...
    `- Total: 200 unique entries.

If you see errors instead, chances are you are missing some tools.

Catty v3 adds certain strings to the cron/* directory as it chats with the
user, to later broaden its horizons. You can, however, force the bot to
learn about a specific issue by doing:

  touch cron/text+to+be+looked+up

Note that plus signs must be used in place of spaces. You are also advised
to use all-lowercase, text-only (no punctuation, etc) keywords. The number
of words used must be less than MAXKEY (6 by default, see config.h). Use
only phrases that are likely to be found by a web search engine, and ones
that bear some relevance to the subject of the webpages to be found.

WARNING: Catty will issue a single Google lookup for every phrase to be
searched for. Although this is usually not a lot of traffic, it is also
against Google ToS to run automated lookups (as they lower Google's
advertisement click-through ratio). We should be using the Google API
instead - but I am yet to find a sane interface that could be used from a
shell script :-(

5) How do I start with a blank database?
----------------------------------------

The default knowledge database used by Catty v3 is the result of teaching it
from the transcripts of Catty v2. If you are uncomfortable with the quality
or maturity of Catty's responses, have made any changes to the HTML parsing
engine, or just want to build a bot oriented at a specific topic or
language, you should start with a blank database.

To remove all database entries, do the following:

  echo -n >data/knowledge >data/learned >data/visited

At this point, Catty v3 will refuse to start. You need to add several
'seeds' for the database by creating cron/* entries manually (as described
in section 4). You should then run ./cronman and let the bot learn.

The bot needs to index 1000-2000 topics (around 5000 web pages, 100 000
sentences) on average to be eloquent. With only a couple of phrases, it will
remain hopelessly clueless, and will resort to generic excuses most of the
time.

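Seeding by hand means creating a number of cron/* entries that follow the
naming rules from section 4 (lowercase, no punctuation, plus signs instead of
spaces, fewer than MAXKEY words). A minimal sketch of a helper for that
follows. It is not part of Catty: the function name cron_name is made up for
illustration, keeping digits is my own choice, and MAXKEY is simply assumed
to match the default of 6 stated above.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not shipped with Catty: turn a free-form phrase into
# a name suitable for a cron/* seed entry.
# MAXKEY is assumed to mirror the config.h default quoted in this README.
MAXKEY=${MAXKEY:-6}

cron_name() {
  local phrase words count
  # Lowercase the input, then turn every character that is not a lowercase
  # letter, a digit, or a space into a space (this drops punctuation).
  phrase=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -c 'a-z0-9 ' ' ')
  # Split on whitespace into an array; repeated spaces collapse here.
  read -r -a words <<< "$phrase"
  count=${#words[@]}
  # The README asks for fewer than MAXKEY words; empty input is rejected too.
  if [ "$count" -eq 0 ] || [ "$count" -ge "$MAXKEY" ]; then
    echo "cron_name: need 1 to $((MAXKEY - 1)) words, got $count" >&2
    return 1
  fi
  # Join the words with plus signs.
  local IFS='+'
  printf '%s\n' "${words[*]}"
}
```

A seed could then be created with something like:

  name=$(cron_name 'Why is the sky blue?') && touch "cron/$name"

Rejecting over-long phrases, rather than truncating them, keeps the choice
of which words to drop with the person picking the seeds.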
Because entering thousands of topics is rather impractical, one way to grow
the database is to manually feed Catty v3 around 20 topics, then run the
resulting database itself through the bot:

  grep '^ ' data/knowledge | awk '{print ": " $0}' | ./catty3 >/dev/null

This command may be repeated a number of times, until the count of subjects
in cron/* is at around 100 or so. At this point, those should be briefly
reviewed if possible, and another cycle of learning (./cronman) should be
started. When it finishes, another run of "loopback feeding" should yield
several hundred phrases to search for, and another learning cycle can be
initiated - until the desired number of topics is indexed.

6) What can I tweak?
--------------------

There are several parameters you might want to adjust:

  * Google URL line in ./cronman script - by adding extra parameters to the
    www.google.com/search?q=... invocation (such as specifying a language,
    changing the number of returned hits, etc), you can narrow or widen
    page selection criteria. This is particularly useful if you want the
    bot to speak only a single language.

  * AIMFOR variable in ./cronman script - this variable controls the optimal
    number of sentences the bot will attempt to collect per topic. Keeping
    it low makes the bot more casual in its responses; keeping it high
    usually provides it with more in-depth knowledge.

  * POPWORDS variable in ./cronman script - this variable must be kept lower
    than MAXPOP in config.h. See MAXPOP below.

  * IGNORE variable in ./cronman script - this line lists generic common
    words that should be ignored when creating a list of the most prominent
    words for every subject (this list is later used to find out what topic
    the user is talking about).

  * MAXTOPICS in config.h - the maximum number of topics we plan to have. If
    the knowledge database contains more, it will be refused. The default,
    10000, is probably rather hard to exceed, but you might want to lower
    it for memory-conservative applications.

  * MAXPHRASES in config.h - the maximum number of sentences we allow per
    topic. This should be kept above AIMFOR x 2; if it isn't, you risk
    that the bot will refuse to run.

  * MAXWORDS in config.h - the maximum number of words per sentence we
    allow. The default is very generous. If any sentence has more than
    MAXWORDS words, it will simply be truncated silently.

  * MAXWLEN in config.h - the maximum length of a single word we are
    expecting to see. See MAXWORDS.

  * MAXKEY in config.h - the maximum number of keywords allowed per topic.
    All entries created in cron/* by Catty v3 will have fewer than MAXKEY
    words; all manually created entries must also follow this rule.

  * MAXPOP in config.h - the maximum number of popular keywords indexed per
    topic. Popular keywords are used to better approximate what subject
    the user is talking about. This must be more than POPWORDS defined in
    ./cronman!

  * LINEBUF in config.h - the maximum acceptable line length for most
    operations (including reading sentences, popular words, and user
    input).

  * REPEAT_LOG in config.h - the number of other sentences Catty v3 must
    say before an already used sentence can be repeated.

  * KEEP_CTX in config.h - the number of sentences we want to keep to
    maintain the context of the current conversation; used only to score
    individual phrases (topics are scored in a different way, with a
    residual score fall-through).

  * KBASE, EXCUSES in config.h - pathnames to data files, obviously.

  * MAXEXC in config.h - the maximum number of excuses allowed.
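
Two of the settings above depend on values kept in a different file:
MAXPHRASES must stay above AIMFOR x 2, and POPWORDS must stay below MAXPOP.
The sketch below checks both before a rebuild. It is only an illustration:
check_limits is a made-up name, and the values are passed in by hand because
this README does not say how config.h and ./cronman spell their definitions.

```shell
#!/usr/bin/env bash
# Hypothetical sanity check, not shipped with Catty.
# Usage: check_limits AIMFOR POPWORDS MAXPHRASES MAXPOP
# Nothing is parsed from config.h or ./cronman; copy the numbers in yourself.
check_limits() {
  local aimfor=$1 popwords=$2 maxphrases=$3 maxpop=$4 status=0
  # MAXPHRASES should be strictly above twice AIMFOR.
  if [ "$maxphrases" -le $((aimfor * 2)) ]; then
    echo "MAXPHRASES ($maxphrases) should be above AIMFOR x 2 ($((aimfor * 2)))" >&2
    status=1
  fi
  # POPWORDS must be strictly lower than MAXPOP.
  if [ "$popwords" -ge "$maxpop" ]; then
    echo "POPWORDS ($popwords) must be lower than MAXPOP ($maxpop)" >&2
    status=1
  fi
  return "$status"
}
```

For example, with invented numbers, 'check_limits 200 10 500 20' succeeds
quietly, while 'check_limits 200 10 400 20' complains about MAXPHRASES and
returns a non-zero status.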