Using PowerGrep and ParaConc to Process English and Chinese Texts

 

Zhu Yubin

I. Instructions for the cleaning-up and tagging of English texts

II. Instructions for the cleaning-up and tagging of Chinese texts

III. Tagging for ParaConc

 

 

I.     Instructions for the cleaning-up and tagging of English texts

1. Place the cursor in the search box of PowerGrep and press the Enter key four times; four line-break marks (‘¶’) will appear in a vertical line in the search box. In the replacement box, put

                                                               </P>¶

                                                               <P>

(In this way, every paragraph ending is tagged with </P> and the paragraph that follows it is tagged with <P>.)

 

2. If a line in the text is separated from the next line by two line-break marks (‘¶’), put two ‘¶’s in the search box (press the Enter key twice) and one ‘•’ in the replacement box (press the space bar once). (This joins lines that were needlessly split within one paragraph.)

 

3. If the sentences in the English text are separated by two spaces, put ‘••’ (press the space bar twice) in the search box and </S>•<S> in the replacement box.

 

4. In the search box, put </P>; in the replacement box, put </S>•</P>.

 

5. In the search box, put <P>; in the replacement box, put <P>•<S>.

 

After these five steps, every sentence in the text begins with <S> and ends with </S>, and every paragraph begins with <P> and ends with </P>.

 

6. Check the beginning and the end of the text and correct any wrong or missing tags.
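
For readers who prefer a script, steps 1 to 5 can be sketched in Python as a chain of plain string replacements. This is only an illustration, not part of the PowerGrep procedure: the function name tag_english is hypothetical, and the sketch assumes that every line-break mark corresponds to a single "\n" once Windows line endings have been normalised.

```python
def tag_english(text: str) -> str:
    """Apply steps 1-5 in order, using plain string replacement."""
    # Assumption: normalise Windows line endings so one line break is "\n".
    text = text.replace("\r\n", "\n")
    # Step 1: four line breaks mark a paragraph boundary.
    text = text.replace("\n\n\n\n", "</P>\n<P>")
    # Step 2: two line breaks split a line inside a paragraph; join with one space.
    text = text.replace("\n\n", " ")
    # Step 3: two spaces separate sentences.
    text = text.replace("  ", "</S> <S>")
    # Step 4: close the last sentence of each paragraph.
    text = text.replace("</P>", "</S> </P>")
    # Step 5: open the first sentence of each paragraph.
    text = text.replace("<P>", "<P> <S>")
    return text


sample = "First one.  Second\n\nline here.\n\n\n\nNext para."
print(tag_english(sample))
```

The very first paragraph gets no opening tags and the very last one no closing tags, which is why the manual check in step 6 is still needed.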

 

II.    Instructions for the cleaning-up and tagging of Chinese texts

1. Put

                                                                      ¶

                                                                      ••••

(press the Enter key once and the space bar four times) in the search box, and then put

                                                                      </P>¶

                                                                      ¶

                                                                      <P>

in the replacement box.

 

2. Put

”^[\w|]

in the search box, and put

”</S>•<S>.

in the replacement box.

(In PowerGrep’s regular expressions, ^ stands for exclusion, that is, negation, while | stands for alternation between options.)

 

3.1 Put

^[”]

in the search box, and put

</S>•<S>

in the replacement box.

 

3.2 Put

^[”]

in the search box, and put

</S>•<S>

in the replacement box.

 

3.3 Put

^[”]

in the search box, and put

</S>•<S>

in the replacement box.

 

4. Replace <P> with <P>•<S>.

 

5. Replace </P> with </S>•</P>.

 

6. Use TextPro to join lines that were split within a paragraph.
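
The paragraph-level steps (1, 4 and 5) can be sketched in Python as follows. This is an illustration under stated assumptions, not the PowerGrep procedure itself: the function name tag_chinese_paragraphs is hypothetical, a new paragraph is assumed to start with a line break followed by four ordinary spaces, and the sentence-level steps 2 and 3 are deliberately left out.

```python
def tag_chinese_paragraphs(text: str) -> str:
    """Apply steps 1, 4 and 5 (paragraph tagging only)."""
    # Assumption: normalise Windows line endings so one line break is "\n".
    text = text.replace("\r\n", "\n")
    # Step 1: a line break plus a four-space indent starts a new paragraph.
    text = text.replace("\n    ", "</P>\n\n<P>")
    # Step 4: open the first sentence of each paragraph.
    text = text.replace("<P>", "<P> <S>")
    # Step 5: close the last sentence of each paragraph.
    text = text.replace("</P>", "</S> </P>")
    return text


sample = "    第一段。\n    第二段。"
print(tag_chinese_paragraphs(sample))
```

As with the English texts, the first and last paragraphs of the file are left without their outer tags and have to be fixed by hand.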

 

Note: In the above instructions, • stands for one space, that is, one press of the space bar. TextPro is provided by

 

III.   Tagging for ParaConc

After the above steps, all sentences and paragraphs are delimited with tags. To align and search the parallel texts with ParaConc, the following additional steps are necessary.

 

1. Change </S>•<S> to

                                                                      </S>•¶

                                                                      ¶<S>

 

2. Change <P>•<S> to

                                                                      <P>•¶

                                                                      ¶<S>

 

3. Change </S>•</P> to

                                                                      </S>•¶

                                                                      ¶</P>

 

4. Check the result by hand; human intervention is necessary for correct alignment.
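
Steps 1 to 3 above can be sketched in Python as three string replacements. This is a hedged illustration only: the function name prepare_for_paraconc is hypothetical, and the sketch assumes that each of the two line-break marks in a replacement stands for one "\n", so that adjacent tags end up separated by a blank line.

```python
def prepare_for_paraconc(text: str) -> str:
    """Put every tagged sentence and paragraph tag on its own line."""
    # Assumption: the two line-break marks in each replacement are two "\n".
    # Step 1: break between adjacent sentences.
    text = text.replace("</S> <S>", "</S> \n\n<S>")
    # Step 2: break between a paragraph opening and its first sentence.
    text = text.replace("<P> <S>", "<P> \n\n<S>")
    # Step 3: break between the last sentence and the paragraph closing.
    text = text.replace("</S> </P>", "</S> \n\n</P>")
    return text


sample = "<P> <S>A.</S> <S>B.</S> </P>"
print(prepare_for_paraconc(sample))
```

The replacements do not interfere with one another, because each search string pairs a different combination of tags.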