A Corpus Worker’s Toolkit

 

© Hongyin Tao 2005

 

A Corpus Worker's Toolkit (ACWT) is a collection of NoteTab clips, Perl scripts, and other utilities for Chinese and English text processing. The tools handle quick-and-dirty corpus and discourse linguistic work for those who cannot afford sophisticated but expensive commercial software. Most of them work like macros in a word processor, but they can do much more, and they run in a relatively simple text processing environment.

 

Major tools included in the Toolkit so far:

Text Utilities 文本处理

• Merge Files

• HTML<-->Text Conversion

• Tagged Text --> Plain Text Conversion

• File comparison/sizes/counts/split/join

• Character Spacing/Word Segmentation/POS Tagging

 

Search & Analysis 检索统计

• Basic Chinese Concordance

• Basic English Concordance

• Word List/Frequency

• Mutual Info/T-Scores/Z-Score/Log-likelihood

• Normed Freq/Ratio/Lexical Density

 

Interactive Text Tagging 互动加码

• L2 Errors - The CLEC Tags

• Discourse Structure - Samples

• Semantics & Pragmatics - Samples

• Sociolinguistics - Samples

• Syntax - Samples

 

Discourse Transcription 口语转写

• The Du Bois (updated) System (Aug-2005)

• Header Info

• Intonation Units/Sequence

• Manners: Voice/Prosody

• Metatranscription

➢ User Guide (Eng, Chin-GB, Chin-Big5)

➢ Download

➢ Screen shots

➢ Support forum: http://www.corpus4u.com

➢ Email: < ht_ling at sbcglobal dot net >

➢ Last updated: 08-Sept-2005

➢ Upgrade Information (Sept-07-2005)