Using GATE to Segment and Tag Chinese Texts

Zhu Xiaomin

Segmenting Chinese Texts (encoded in UTF-8) using the Chinese Segmenter in GATE 5

Step One: Download GATE 5.1 from https://sourceforge.net/projects/gate/files/gate/5.1/gate-5.1-build3431-installer-win.exe/download

 

Step Two: Install the program

 

Step Three: Locate the file C:/Program Files/GATE-5.1/plugins/Lang_Chinese/resources/models/model-paum-pku-utf8.zip and unzip it to a location that you can remember
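If you prefer to unzip the model with a script rather than an archive manager, the following Python sketch may help. It is only an illustration: the function name unzip_model and the target folder in the usage note are assumptions, not part of GATE.

```python
import zipfile
from pathlib import Path


def unzip_model(zip_path, target_dir):
    """Extract every entry of zip_path into target_dir and return the entry names."""
    target = Path(target_dir)
    # Create the target folder (and any missing parents) if it does not exist yet
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return archive.namelist()
```

For example, unzip_model("C:/Program Files/GATE-5.1/plugins/Lang_Chinese/resources/models/model-paum-pku-utf8.zip", "C:/gate-models") would extract the model into a hypothetical C:/gate-models folder.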

 

Step Four:

       1. Start GATE

       2. Click File, then open Manage CREOLE Plugins

       3. Find Lang_Chinese, tick the box under “Load Now”, and click OK

       4. Right-click Processing Resources, New, select Chinese Segmenter, OK

       5. Right-click Language Resources, New, GATE Corpus, name the corpus (for example, Chinesecorpus), OK

       6. Right-click Chinesecorpus and choose Populate. Browse to the folder that contains the corpus so that its path appears in Directory URL. Click the pencil symbol, type txt, click Add, then OK. For Encoding, type utf-8, then click OK.

       7. Right Click Applications, select Corpus Pipeline, OK

       8. Double-click Corpus Pipeline, then double-click Chinese Segmenter under Loaded Processing Resources so that it moves into Selected Processing Resources. Make sure the following parameters are set correctly:

 

       learningAlg = PAUM

       learningMode = SEGMENTING

       modelURL = model-paum-pku-utf8 (the folder where you unzipped the file in Step Three)

       textCode = utf8

       textFilesURL = (browse to the corpus folder)

 

       9. Click on Run this Application. This can take some time, depending on the size of the corpus (approximately 5 minutes for 40 texts).
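Because the steps above assume that every corpus file is encoded in UTF-8, it can be worth checking the folder before populating the corpus. The Python sketch below is one possible check; the function name find_non_utf8_files is an assumption made for illustration and is not part of GATE.

```python
from pathlib import Path


def find_non_utf8_files(folder, pattern="*.txt"):
    """Return the names of files in folder that cannot be decoded as UTF-8."""
    bad = []
    for path in sorted(Path(folder).glob(pattern)):
        try:
            # Decoding raises UnicodeDecodeError if the bytes are not valid UTF-8
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path.name)
    return bad
```

An empty list means every matching file decoded cleanly. Any file that is listed should be converted to UTF-8 before it is added to the corpus.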

(Provided by Zhu Xiaomin on June 11, 2010)

 

Tagging segmented Chinese Texts (encoded in UTF-8) using the Stanford POS Tagger in GATE 5

       GATE was developed by Hamish Cunningham et al. at the University of Sheffield (http://gate.ac.uk/), 2001-2010.

 

Step One: Install Java (JRE) on your computer

       You can download Java from http://sdlc-esd.sun.com/ESD6/JSCDL/jre/6u18-b79/jxpiinstall.exe?AuthParam=1269156422_b6361febd3fd5bf0c616837bde692629&GroupName=JSC&FilePath=/ESD6/JSCDL/jre/6u18-b79/jxpiinstall.exe&File=jxpiinstall.exe&BHost=javadl.sun.com

       Check your Java version:

       1. Click Start

       2. Type cmd and press enter

       3. This will open the command prompt window

       4. Type java -version and press Enter

       5. You should see a message such as: java version “1.6.0_17”
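The same check can be scripted. The Python sketch below runs java -version and extracts the version string from the report, which Java writes to the error stream rather than to standard output. The function names are assumptions made for illustration.

```python
import re
import subprocess


def parse_java_version(text):
    """Return the quoted version string from a 'java -version' report, or None."""
    match = re.search(r'version "([^"]+)"', text)
    return match.group(1) if match else None


def installed_java_version():
    """Run 'java -version' and return the version string, or None if Java is not found."""
    try:
        # java -version prints its report to stderr, so capture both streams
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    except FileNotFoundError:
        return None
    return parse_java_version(result.stderr)
```

If installed_java_version() returns None, Java is probably not installed or is not on your PATH.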

 

Step Two

       Download the Stanford POS Tagger from http://nlp.stanford.edu/software/stanford-postagger-full-2010-05-26.tgz

 

Step Three

       Extract the archive to a location of your choice using an archive manager such as WinRAR, 7-Zip, or WinZip.

       You might want to rename the extracted folder to stanTagger, because the original name (stanford-postagger-full-2010-05-26) is long to type.

 

Step Four

       In the stanTagger folder, create two folders to hold your files, e.g. myCorpus and myTaggedCorpus. Now put some text files (or your corpus) in myCorpus. Make sure there are no spaces in your file names. For example, use writtenArgument.txt instead of written Argument.txt
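If the corpus contains many files, removing spaces by hand is tedious. The Python sketch below renames every file in a folder whose name contains a space. The function name remove_spaces_from_names is an assumption made for illustration, and the script refuses to overwrite a file that already has the new name.

```python
from pathlib import Path


def remove_spaces_from_names(folder):
    """Rename files in folder so their names contain no spaces.

    Returns a list of (old_name, new_name) pairs for the files that were renamed.
    """
    renamed = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and " " in path.name:
            new_path = path.with_name(path.name.replace(" ", ""))
            if new_path.exists():
                # Do not silently overwrite an existing file
                raise FileExistsError(f"{new_path} already exists")
            path.rename(new_path)
            renamed.append((path.name, new_path.name))
    return renamed
```

For example, remove_spaces_from_names("myCorpus") would rename written Argument.txt to writtenArgument.txt.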

 

Step Five

       1. Start your command window as described in Step One

       2. Go to the folder that contains the Stanford Tagger:

       This is how you do it:

 

       cd <folder where you unzipped the Stanford POS Tagger>\stanTagger

 

       3. Run the program using your command prompt window:

       For tagging one segmented Chinese text:

 

       java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/chinese.tagger -textFile myCorpus\<name of the file to be tagged>.txt > myTaggedCorpus\<name of the output file>.txt

 

       For tagging more than one segmented Chinese text:

 

FOR %a IN (<folder where you unzipped the Stanford POS Tagger>\stanTagger\myCorpus\*.txt) DO java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/chinese.tagger -textFile myCorpus\%~nxa > myTaggedCorpus\%~nxa

 

       4. After typing the command above, press Enter
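As an alternative to the FOR loop, the batch run can be sketched in Python. This is only an illustration: it assumes that it is run from inside the stanTagger folder, that Java is on your PATH, and that the folders are named myCorpus and myTaggedCorpus as in Step Four. The function names are hypothetical. The Java options are the same ones used in the commands above.

```python
import subprocess
from pathlib import Path


def build_tagger_command(text_file, model="models/chinese.tagger",
                         jar="stanford-postagger.jar"):
    """Return the java command (as a list of arguments) that tags one text file."""
    return [
        "java", "-mx300m", "-classpath", jar,
        "edu.stanford.nlp.tagger.maxent.MaxentTagger",
        "-model", model,
        "-textFile", str(text_file),
    ]


def tag_folder(corpus_dir="myCorpus", out_dir="myTaggedCorpus"):
    """Tag every .txt file in corpus_dir and write the results to out_dir."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for text_file in sorted(Path(corpus_dir).glob("*.txt")):
        # The tagger prints tagged text to standard output, so redirect it to a file
        with open(out / text_file.name, "wb") as out_file:
            subprocess.run(build_tagger_command(text_file), stdout=out_file, check=True)
```

Calling tag_folder() from the stanTagger folder would produce one tagged file per input file, with the same file name, in myTaggedCorpus.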

(June 11, 2010)