Final assignment
Note: I have not yet put the information about week 15 on the web
site. Sorry about that; I will do it very soon. The shell scripts that I
showed you on Wednesday are on Hagen, under /users1/birot/Examples-week14.
If you have any further questions, please send me an email.
About the final test
The final test is on Tuesday January 20th, at 2 p.m. sharp, in the
Unix lab (H12.102).
As I explained to you in class, the final test will consist of two parts:
- A written test, on paper. No aids may be used.
- A task to be solved on the computer, based on your solution to the
final assignment. This means that you must have your solution
with you (either on the machine, or on a diskette). You can use
any non-human help: your own books, your own notes, the web, ... But you
cannot exchange information: lending your books or your notes
to somebody else, sending emails, etc.
Final assignment
Write a shell script that receives file names as its arguments (any
number of them), and that prints the type-token ratio of each
file (the number of distinct word types divided by the number of
word tokens).
- The input files are treated as text files. If a file is not a text file,
the script is not expected to print any error message. That is the user's
problem.
- It would be nice if your solution behaved like many of the filters
that we have encountered (cat, rev, grep, sed...). That is, if no
argument is given, the script uses standard input as the input file.
I suggest that you first write your program without
this extra; once you are done, check whether the extra is
in fact satisfied automatically. If so, try to understand why. (If not,
do not worry.)
- If you create intermediate files, delete them at the end. Tips: 1.
Do not delete the intermediate files as long as your script does not work
perfectly. Analyzing the intermediate files will help you debug your
program. 2. When you create intermediate files, give them names such that
no other file in your directory can be damaged.
- Eliminate lines that contain only upper case letters (such as the
author's name in the Federalist papers): these words should not count towards
the type-token ratio.
- Eliminate case differences: "the" and "The" belong to the same type.
- Eliminate punctuation: "therefore" and "therefore," belong to the
same type.
- Keep hyphens, but only where they connect the parts of a compound word:
"back-up" is one type, different from "backup", and it is not two words
("back" + "up").
- Keep the numbers: "1776" is to be seen as a word; "3-meter-long"
is a different type from "2-meter-long".
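The cleaning rules listed above can be sketched as a pipeline of the filters seen in class. This is only one possible reading of the rules, not the expected solution: the function names tokenize and ratio are made up for this example, a run of two or more hyphens is assumed to be a dash rather than a hyphen, empty lines are dropped together with the upper-case-only lines, and an apostrophe splits a word like any other punctuation mark.

```shell
#!/bin/sh
# Sketch only: the function names and the exact cleaning rules are assumptions.
LC_ALL=C; export LC_ALL   # byte-wise character classes, for predictable results

# Print one normalized word token per line for the given files.
tokenize() {
    cat "$@" |
    sed '/^[[:upper:][:space:]]*$/d' |   # drop empty and all-upper-case lines
    tr '[:upper:]' '[:lower:]' |         # "The" and "the" become one type
    sed 's/---*/ /g' |                   # assumption: "--" is a dash, not a hyphen
    tr -cs '[:alnum:]-' '\n' |           # every other character separates words
    sed 's/^-*//; s/-*$//; /^$/d'        # strip edge hyphens, drop empty tokens
}

# Read one token per line; print distinct tokens divided by all tokens.
ratio() {
    awk '!seen[$0]++ { types++ }
         { tokens++ }
         END { if (tokens) printf "%.4f\n", types / tokens; else print "n/a" }'
}

# Print the ratio of each file named on the command line.
for file in "$@"; do
    printf '%s: ' "$file"
    tokenize "$file" | ratio
done
```

For instance, the two lines "The the back-up, backup." and "HAMILTON" give four tokens (the, the, back-up, backup) of three types, so the sketch prints 0.7500 for them.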
In addition: if the first argument is "-b", then the script should not return
the type-token ratio of the words, but the type-token ratio of the bigrams
at the word level. For instance, the sentence
"Joe likes the colour green, and Judith
likes the colour red." consists of 10 (or 11) bigram tokens, but the
bigram types "likes the" and "the colour" are
each represented twice (two tokens).
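To make the bigram count concrete, here is a small sketch that pairs each word with the word before it. The helper name bigrams is made up for this example, and the input is assumed to be already cleaned, with one word per line. Bigrams are not formed across the beginning or the end of the text, which corresponds to the count of 10 rather than 11.

```shell
#!/bin/sh
# Sketch only: bigrams is a hypothetical helper name.
# Reads one word token per line; prints each pair of adjacent tokens on one line.
bigrams() {
    awk 'NR > 1 { print prev, $0 }   # pair the previous token with this one
         { prev = $0 }               # remember this token for the next line'
}

# The example sentence, already normalized to one token per line.
# uniq -c prefixes each distinct bigram with the number of times it occurs.
printf '%s\n' joe likes the colour green and judith likes the colour red |
    bigrams | sort | uniq -c | sort -rn
```

The 11 words give 10 bigram tokens of 8 types; "likes the" and "the colour" are the only ones listed with a count of 2.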
Remark: I really ask you not to speak with each other about
the assignment. You should be able to solve it if you have solved the
weekly assignments and paid attention during last week's
lecture. If you have any problems, please ask me instead.
Good luck with the assignment!