Final assignment


Note: I have not yet put the information about week 15 on the web site. Sorry about that; I will do it very soon. The shell scripts that I showed you on Wednesday are on Hagen, under /users1/birot/Examples-week14. If you have any further questions, please send me an email.



About the final test

The final test is on Tuesday January 20th, at 2 p.m. sharp, in the Unix lab (H12.102).

As I explained in class, the final test will consist of two parts:

  1. A written test, on paper. No aids may be used.
  2. A task to be solved on the computer, based on your solution to the final assignment. This means that you must have your solution with you (either on the machine, or on a diskette). You can use any non-human help: your own books, your own notes, the web, etc. But you cannot exchange information: lending your books or your notes to somebody else, sending emails, etc.


Final assignment

Write a shell script that receives file names as its arguments (any number of them), and that returns the type-token ratio of each file.

  1. The input files are treated as text files. If a file is not a text file, the script is not expected to return an error message. That is the user's problem.
  2. It would be nice if your solution behaved like many of the filters that we have encountered (cat, rev, grep, sed...). That is, if no argument is given, the script should use standard input as the input file. I suggest that you write your program without this extra, and once you are done with it, check whether this extra is in fact satisfied automatically. If so, try to understand why. (If not, do not worry.)
  3. If you create intermediate files, delete them at the end. Two tips: (a) Do not delete the intermediate files as long as your script does not work perfectly. Analyzing the intermediate files will help you to debug your program. (b) When you create intermediate files, give them names such that no other files in your directory can be damaged.
  4. Eliminate lines that contain only upper case letters (such as the author's name in the Federalist papers): the words on these lines should not count toward the type-token ratio.
  5. Eliminate case differences: "the" and "The" belong to the same type.
  6. Eliminate punctuation: "therefore" and "therefore," belong to the same type.
  7. Keep hyphens, but only when they connect the parts of a compound word: "back-up" is one type, different from "backup", and it is not two words ("back" + "up").
  8. Keep the numbers: "1776" is to be seen as a word; "3-meter-long" is a different type from "2-meter-long".
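
To make the requirements above concrete, here is a minimal sketch of one possible reading of them. It is not a model solution. It takes the type-token ratio to be the number of distinct types divided by the number of tokens. The function names tokens and ttr are hypothetical, as is the temporary file name. The sketch also makes several assumptions that the text above does not settle: a line counts as "only upper case" if it has at least one upper case letter and otherwise only upper case letters, spaces and punctuation; runs of two or more hyphens are treated as dashes; an apostrophe splits a word ("author's" becomes "author" and "s"); and mktemp, which is widely available but not part of POSIX, is used for the intermediate file.

```shell
#!/bin/sh
# Sketch only: one possible reading of requirements 2-8, not a model solution.

# tokens: print one normalised token per line.
# With no arguments, cat reads standard input, so the pipeline acts as a filter.
# Pipeline stages, in order:
#   grep -v : drop lines made of upper-case letters, spaces, punctuation only
#             (assumption: at least one upper-case letter must be present)
#   sed     : turn runs of two or more hyphens (dashes) into a space
#   tr      : fold upper case to lower case
#   tr -cs  : turn every run of characters other than letters, digits and
#             hyphens into a single newline (this also splits at apostrophes)
#   sed     : strip hyphens at the edges of a token, delete empty lines
tokens() {
    cat "$@" |
    grep -v '^[[:upper:][:space:][:punct:]]*[[:upper:]][[:upper:][:space:][:punct:]]*$' |
    sed -e 's/---*/ /g' |
    tr '[:upper:]' '[:lower:]' |
    tr -cs '[:alnum:]-' '\n' |
    sed -e 's/^-*//' -e 's/-*$//' -e '/^$/d'
}

# ttr: print the type-token ratio of one file (or of standard input).
ttr() {
    # Hypothetical temporary file; mktemp picks a name that is not in use.
    tmp=$(mktemp "${TMPDIR:-/tmp}/ttr.XXXXXX") || return 1
    tokens "$@" > "$tmp"
    n_tokens=$(wc -l < "$tmp")
    n_types=$(sort -u "$tmp" | wc -l)
    rm -f "$tmp"    # requirement 3: remove the intermediate file
    # The shell only does integer arithmetic, so awk does the division.
    LC_ALL=C awk -v t="$n_types" -v n="$n_tokens" \
        'BEGIN { if (n > 0) printf "%.4f\n", t / n; else print "undefined" }'
}
```

Under these assumptions, a loop such as `for f in "$@"; do ttr "$f"; done` would print one ratio per file, and `ttr` with no argument would read standard input. While debugging, commenting out the rm line keeps the intermediate file available for inspection.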

In addition: if the first argument is "-b", then the script should return not the type-token ratio of the words, but the type-token ratio of the word-level bigrams. For instance, the sentence "Joe likes the colour green, and Judith likes the colour red." consists of 10 (or 11) bigram tokens, but the bigram types "likes the" and "the colour" are each represented twice (two tokens each).
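
The bigram example can be checked with a short sketch. It assumes that the text has already been reduced to one lower-case word per line. The function names bigrams and demo_tokens and the variable mode are hypothetical, and awk is only one of several tools that could do the pairing. The sketch follows the 10-token reading of the example and does not attempt the 11-token one.

```shell
#!/bin/sh
# Sketch only: word-level bigrams from a stream with one token per line.

# Option handling: consume a leading -b, if present.
mode=words
if [ "${1:-}" = "-b" ]; then
    mode=bigrams
    shift
fi

# bigrams: pair each token with the one before it, one bigram per line.
# n input tokens give n - 1 bigrams.
bigrams() {
    awk 'NR > 1 { print prev " " $0 } { prev = $0 }'
}

# demo_tokens: the example sentence, already lower-cased and without
# punctuation, one word per line (hypothetical demo input).
demo_tokens() {
    printf '%s\n' joe likes the colour green and judith likes the colour red
}
```

Running `demo_tokens | bigrams | sort | uniq -c | sort -rn` lists each bigram type with its number of tokens, with "likes the" and "the colour" at the top. The bigram type-token ratio is then the number of lines of `sort -u` output divided by the number of lines of `bigrams` output.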



Remark: I ask you not to speak with each other about the assignment. You should be able to solve it if you have solved the weekly assignments and paid attention in last week's lecture. If you have any problems, please ask me instead.



Good luck with the assignment!