TextSTAT - Simple Text Analysis Tool

© 2001/2002 - Matthias Hüning

Version 1.51 (01.12.2002)

1. Introduction
2. Creating Your Own Corpora
3. Save Corpus / Open Corpus
4. Internet Corpus Tool
5. Word Forms
6. Search / Concordance
7. Citation
8. Searching With Regular Expressions
9. Print Results / Export / Save
10. History Of The Program
11. Contact

1. INTRODUCTION

TextSTAT is a concordance program designed to be user friendly and to provide simple Internet functionality. Texts can be combined to form corpora, which can also be stored as such. The program analyses these text corpora and displays word frequency lists and concordances for search terms. The program is written in Python and offered here as a Windows program. TextSTAT is freeware.

With TextSTAT you can search any amount of text you like (well, not quite: the amount is limited by your RAM). You can find out how often a certain word occurs and in which contexts it is used. Word combinations can also be examined.

TextSTAT is currently available in three languages: German, English and French. You can select the language via the menu entry 'Sprache ändern / Change language' (under 'Options'). You have to restart the program to activate the new setting.

The program has been tested under Windows 98 (SE) and Windows XP. It should also work on other Win32 versions.

2. CREATING YOUR OWN CORPORA

When you open TextSTAT, you will see a window with a menu bar and several 'tab sheets'. In the foreground is the tab sheet 'Corpus'. Here you can add files and, in this way, put together a corpus. You have the following options:
- 'Add file' (via the menu entry or button)
- 'Add HTML page' (via the menu entry or button)

In both cases, the file to be added must consist of plain text without any formatting. MS Word files, for example, cannot be read in; such texts have to be converted to plain text first. As a rule, plain text will be available as ASCII or 'Latin-1' encoded text ('Latin-1' is also the default encoding for HTML pages), and 'Latin-1' is the default setting in TextSTAT. A different encoding can also be used, but it has to be set before the file is read in (via the menu entry 'Options > File encoding'). Internally, TextSTAT processes all texts as Unicode.
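
The role of the file encoding can be illustrated with a few lines of Python. This is only a sketch and not TextSTAT's actual code: the function name read_text and the sample file are made up for the example. It shows that the bytes of a file only become usable (Unicode) text once they are decoded with the right encoding, and that a wrong setting can fail outright.

```python
import os
import tempfile


def read_text(path, encoding="latin-1"):
    """Read a plain-text file and decode its bytes to a Unicode string."""
    with open(path, "rb") as handle:
        return handle.read().decode(encoding)


# Demo: write a Latin-1 encoded file, then read it back.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "sample.txt")
    with open(sample_path, "wb") as handle:
        handle.write("Matthias Hüning".encode("latin-1"))

    decoded = read_text(sample_path)  # default setting: Latin-1

    # The same bytes are not valid UTF-8, so the wrong setting fails.
    utf8_worked = True
    try:
        read_text(sample_path, encoding="utf-8")
    except UnicodeDecodeError:
        utf8_worked = False
```

This is why the encoding has to be chosen before a file is read in: once the bytes have been decoded with the wrong setting, the resulting text is either garbled or cannot be produced at all.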

HTML files can be read in directly from the Internet or from your own hard disk. In the first case, the complete WWW address (= URL) has to be entered, including 'http://', and you need access to the Internet. In the second case, you simply enter the file name including the path (e.g. 'c:\directory\file.htm'). By default, the HTML code is removed from the files. This can, however, be deactivated.
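
How HTML code can be removed from a page is sketched below with Python's standard html.parser module. This is an assumption about one possible approach, not a description of how TextSTAT itself does it; the names TextExtractor and strip_html are hypothetical, and skipping the contents of script and style elements is a choice made for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of an HTML page, ignoring tags, scripts and styles."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(html):
    """Return the plain text of an HTML string with whitespace normalised."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


page = "<html><head><style>p{}</style></head><body><p>Hello <b>world</b>!</p></body></html>"
plain = strip_html(page)
```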

3. SAVE CORPUS / OPEN CORPUS

You can save the opened files so that you can use them again as a corpus at a later stage (via the appropriate button or menu entry). You can choose the name of the file that is created. We recommend storing the corpora in a separate folder.

4. INTERNET CORPUS TOOL

If you want to put files from the Internet together to form a corpus, the TextSTAT Corpus Tool can be useful. It can download any number of WWW pages from a website or any number of postings from a newsgroup. You start this helper program via the menu entry 'Corpus > New corpus (web/news)'.

If you enter a URL under 'Web', it is taken as the starting point: the links on that page are followed, and the pages found are added to the corpus as well. You should make sure that the seek area is limited either to the server or to the subdirectory of the starting page. For example, if you take <http://www.onzetaal.nl/advies/index.html> as the starting point, the first option (server) gives you pages beginning with <http://www.onzetaal.nl/>. The second option (subdirectory), however, only gives you pages beginning with <http://www.onzetaal.nl/advies/>. You can also specify the file encoding of the website(s) here. As a rule, the default, 'Latin-1', will be the correct setting.
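
The difference between the two seek areas can be expressed as a simple prefix test on each link. The following Python sketch is illustrative only; the function names scope_prefix and in_scope and the scope labels 'server' and 'subdirectory' are assumptions made for this example, not TextSTAT's internals.

```python
from urllib.parse import urlsplit


def scope_prefix(start_url, scope="server"):
    """Return the URL prefix that a link must start with to be followed."""
    parts = urlsplit(start_url)
    if scope == "server":
        return f"{parts.scheme}://{parts.netloc}/"
    if scope == "subdirectory":
        # Drop the file name, keep the directory of the starting page.
        directory = parts.path.rsplit("/", 1)[0]
        return f"{parts.scheme}://{parts.netloc}{directory}/"
    raise ValueError(f"unknown scope: {scope!r}")


def in_scope(url, start_url, scope="server"):
    """True if url lies inside the seek area defined by start_url and scope."""
    return url.startswith(scope_prefix(start_url, scope))


start = "http://www.onzetaal.nl/advies/index.html"
server_prefix = scope_prefix(start, "server")
subdir_prefix = scope_prefix(start, "subdirectory")
```

With the starting point above, a link such as http://www.onzetaal.nl/index.html would be followed in server scope but skipped in subdirectory scope.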

With 'News', you first have to specify a news server to which you have access. Then you have to specify the name of a newsgroup (available on this server) and the number of postings that should be read in (e.g. 500). By default, quotes in the postings (= lines beginning with '>') are removed. News postings are always treated as 'Latin-1' encoded.
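
Removing quoted lines amounts to dropping every line that begins with '>'. A minimal Python sketch of this idea follows; the function name remove_quotes and the sample posting are invented for the illustration.

```python
def remove_quotes(posting):
    """Drop all lines of a posting that start with '>' (quoted text)."""
    kept = [line for line in posting.splitlines() if not line.startswith(">")]
    return "\n".join(kept)


posting = "> Is this word common?\n>> I doubt it.\nYes, it is quite common.\nRegards"
cleaned = remove_quotes(posting)
```

Dropping quotes keeps repeated passages from being counted several times in the frequency list.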

5. WORD FORMS

After compiling a corpus from one or several files, or after loading an existing corpus, you can obtain frequency information on the word forms in the corpus on the tab sheet 'Word Forms'. To do this, click on the button 'Analyze corpus'.

By default, all words are converted to lower case and then displayed in order of decreasing frequency. You can also have the word forms analyzed with upper and lower case kept as they appear in the text. This, however, causes problems when the words are sorted alphabetically, since capital letters precede small letters. Retrograde sorting (sorting by word endings) enables you, for example, to find out which words in the corpus have a particular suffix. You can also limit the frequency range to be displayed. Note that '0' means no restriction (so if min.=0 and max.=0, all word forms will be displayed). After changing the display options, you have to click 'Update list'.
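
The display options described above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not TextSTAT's implementation: the function names are hypothetical, and splitting the text into words with the pattern \w+ is a simplification chosen for the example.

```python
import re
from collections import Counter


def word_frequencies(text, lowercase=True):
    """Count word forms; by default everything is lower-cased first."""
    if lowercase:
        text = text.lower()
    return Counter(re.findall(r"\w+", text))


def frequency_list(counts, min_freq=0, max_freq=0):
    """Sort by decreasing frequency; a limit of 0 means 'no restriction'."""
    rows = [
        (word, n)
        for word, n in counts.items()
        if (min_freq == 0 or n >= min_freq) and (max_freq == 0 or n <= max_freq)
    ]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


def retrograde(words):
    """Sort words by their endings (compare the reversed spellings)."""
    return sorted(words, key=lambda word: word[::-1])


counts = word_frequencies("The cat saw the dog. The dog ran.")
all_forms = frequency_list(counts)             # min.=0, max.=0: everything
frequent = frequency_list(counts, min_freq=2)  # at least two occurrences
by_ending = retrograde(["walking", "sing", "walked", "talked"])
mixed_case = sorted(["apple", "Zebra"])        # capitals sort first
```

In the retrograde list, 'talked' and 'walked' end up next to each other, as do 'walking' and 'sing', which is what makes suffix questions easy to answer. The last line shows the sorting problem mentioned above: 'Zebra' is placed before 'apple'.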

If you double-click on a word form, then it will be searched for in the corpus and a concordance will be created.

6. SEARCH/CONCORDANCE

The tab sheet 'Search/Concordance' shows a word form or a keyword in context. The hits can be sorted according to different criteria, and the length of the context to be displayed can be set. By default, the search term is displayed in upper case. This marking can be deactivated.

When you enter a search string, it is assumed by default that a complete word has been entered. This setting ('whole words only') can be deactivated. A new search or a change in the display options is carried out with the button 'Search/Update'.

When searching, you can use regular expressions (see below).
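
What such a concordance (keyword in context) involves can be sketched in a few lines of Python. The sketch is only an approximation under assumptions: the function name concordance, its parameters, and the case-insensitive matching are choices made for this example and are not taken from TextSTAT.

```python
import re


def concordance(text, term, context=20, whole_words=True, mark=True):
    """Return (left context, hit, right context) for every match of term.

    term is treated as a regular expression; context is the number of
    characters shown on each side of the hit.
    """
    pattern = rf"\b(?:{term})\b" if whole_words else term
    lines = []
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - context):match.start()]
        right = text[match.end():match.end() + context]
        hit = match.group(0).upper() if mark else match.group(0)
        lines.append((left, hit, right))
    return lines


text = "My sister sold me her house. Her brother is here."
whole = concordance(text, "her", context=5)
partial = concordance(text, "her", context=5, whole_words=False)
```

With 'whole words only', the search for 'her' yields two hits; without it, the matches inside 'brother' and 'here' are reported as well.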

If you double-click on a line of the concordance, the citation (the text passage with more context) will be displayed.

7. CITATION

The tab sheet 'Citation' displays a text passage that shows the string you searched for in more context. The name of the file from which the passage is taken is displayed as well. The position (in characters) of the passage in the original file is given in brackets.

A double-click on the file name opens the original file with the program that is associated with its file extension. In the case of web pages, the original page is opened in the browser; this requires an Internet connection.

8. SEARCHING WITH REGULAR EXPRESSIONS

When defining the search term (in 'Search/Concordance'), you can use so-called 'regular expressions'. Admittedly, these are not particularly user friendly, but they are extremely powerful. They allow you to define even very complex search requests. The most important special characters are listed below:

'.'
(the dot) stands for any single character
'\w'
stands for any alphanumeric character (letter, digit or underscore)
'\W'
stands for any non-alphanumeric character (e.g. space, punctuation marks)
'+'
the preceding character occurs one or more times
'*'
the preceding character occurs any number of times, including zero
'*?', '+?'
make '*' and '+' 'non-greedy', i.e. they match as little as possible (see examples)
'|'
stands for 'or'
'[]'
square brackets define a set of characters, any one of which may match.

Examples:

b\wt
finds 'but', 'bit', 'bet' and 'bat'
b\w+t
finds 'but', 'bit', 'bet', 'bat', 'boat' and 'built'
w[ao]nder
finds 'wander' and 'wonder'
(this|that)
finds 'this' or 'that'
so.+e
finds the string 'sold me her house' in the text: 'My sister sold me her house'
so.+?e
finds the string 'sold me' in the text: 'My sister sold me her house'
s.+r
finds the string 'sister sold me her' in the text: 'My sister sold me her house'
s\w+r
finds the string 'sister' in the text: 'My sister sold me her house'
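
The examples with the sample sentence can be tried out in Python, whose re module uses the same notation. The snippet is only meant as a way to experiment; the variable names are arbitrary.

```python
import re

sentence = "My sister sold me her house"

# '+' is greedy: it extends the match as far as possible.
greedy = re.search(r"so.+e", sentence).group(0)

# '+?' is non-greedy: it stops at the first possible 'e'.
lazy = re.search(r"so.+?e", sentence).group(0)

# '.' also matches spaces, so the match runs across several words.
across_words = re.search(r"s.+r", sentence).group(0)

# '\w' does not match spaces, so the match stays inside one word.
one_word = re.search(r"s\w+r", sentence).group(0)

# A character set and an alternation.
vowels = re.findall(r"w[ao]nder", "wander wonder winder")
either = re.findall(r"(?:this|that)", "this and that")

print(greedy)        # sold me her house
print(lazy)          # sold me
print(across_words)  # sister sold me her
print(one_word)      # sister
```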

As already stated, regular expressions are not easy, but they are extremely powerful. The examples shown here can only hint at the possibilities. Much more is possible! A search with Google for 'tutorial regular expressions' will give you a list of useful websites.

9. PRINT RESULTS / EXPORT / SAVE

Word forms and concordances can be transferred directly to an MS Word document. There, they can be edited and printed (TextSTAT itself cannot print results directly). The entry 'Results > MS Word' in the 'File' menu opens the word processing program with an empty document and transfers word forms and concordances to this document.

Alternatively, or in addition, TextSTAT lets you save the results in a text file. (The encoding of the text depends on the setting under 'Options > File encoding'.)
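
Saving results in a chosen encoding could look roughly like the following Python sketch. The file layout (one word form per line, followed by a tab and its frequency) and the function name save_results are assumptions for the illustration; the format TextSTAT actually writes is not described here.

```python
import os
import tempfile


def save_results(rows, path, encoding="latin-1"):
    """Write (word form, frequency) pairs to a text file, one pair per line."""
    with open(path, "w", encoding=encoding, newline="") as handle:
        for word, count in rows:
            handle.write(f"{word}\t{count}\n")


# Demo: save a small frequency list and look at the bytes on disk.
with tempfile.TemporaryDirectory() as folder:
    result_path = os.path.join(folder, "results.txt")
    save_results([("über", 2), ("text", 1)], result_path)
    with open(result_path, "rb") as handle:
        saved_bytes = handle.read()
```

A program that opens the saved file later has to use the same encoding, otherwise characters such as 'ü' will not be shown correctly.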

Finally, TextSTAT lets you export your frequency data directly to MS Excel.

10. HISTORY OF THE PROGRAM

First experimental version: September 2000

Version 0.8: July 20, 2001
- websites can be added
- an opened corpus can be saved

Version 0.9: July 24, 2001
- improved process for removing HTML code from websites
- now any number of corpora can be saved (instead of only one)

Version 1.0: July 26, 2001
- first 'public version'
- HTML pages now also read from the hard disk
- removal of HTML codes improved once more
- default font for interface changed to Verdana
- menu entry with link to homepage added

Version 1.1: August 14, 2001
- sort functions changed: locale.strcoll() is now used, so sorting follows the rules of the language (of the operating system)
- the file in the citation window can now be opened with a double-click
- several HTML files can now be added simultaneously
- the corpus is no longer managed as a dictionary but as a list (to preserve the sequence etc.). As a result, however, the corpora stored with the previous version can no longer be used... :-(
- some options are saved upon 'Exit'

Version 1.2: December 08, 2001
- the corpus is no longer reanalyzed each time a file is added (takes too long)
- 'Corpus Tool' added (Web Spider, News Grabber -> Corpus)
- 'Statistics' removed (since meaningless...)
- 'Progress Bar' added
- string module replaced by string methods

Version 1.2a: December 12, 2001 (bug fixes)
- corpora can now be combined
- 'Analyze corpus' no longer counts twice...

Version 1.3: January 11, 2002
- the program now works (internally) with Unicode. As a result, texts in encodings other than Latin-1 can now also be processed; in each case, though, the file encoding has to be specified when the file is read in (see new menu item 'Options > File encoding'). The corpus tool also converts everything to Unicode
- language of the program can now be changed (Options > Change language)
- new option in the corpus tool: seek area of the spider can now be changed (server or subdirectory)
- first version of a documentation created (= this text)

Version 1.4: February 20, 2002
- option 'Save results' added (encoding depends on the option 'File encoding')
- option 'Results > MS Word' added: if Word is available on the system, the program is launched, and the contents of the tab sheets are transferred to an empty document
- option 'Results > MS Excel' added: if Excel is available on the system, the program is launched, and the contents of the Word Forms sheet are transferred to an empty document
- in the frequency list of the word forms, the frequency range to be displayed can now be limited

Version 1.5: October 07, 2002
- new option: add all files of a directory to a corpus
- some bug fixes

Version 1.51: December 01, 2002
- bug fixes

11. CONTACT

If you have any questions about (or problems with) TextSTAT, you can contact the author:

Matthias Hüning, <mhuening@zedat.fu-berlin>

Download and information on the program: TextSTAT-Homepage

Last changes to this text on 20.02.2002 - MH
