Consider applying PLSI to the following corpus (each line is a separate document):

ABACAA
BCABABB
CACBAB
Furthermore, assume there are two topics and that A, B, and C are the only word types in the vocabulary.
(a) Suppose the words were initially assigned to topics as shown above (black for topic 1, red/underline for topic 2). Calculate the topic-word vectors and the document-topic vectors.
(b) Use the vectors generated in part (a) to calculate the topic posterior probability for each word in the corpus.
(c) Use the result of part (b) to recalculate the topic-word vectors and the document-topic vectors.
(d) Determine whether the vectors from part (c) are better for this set of documents, e.g., by comparing the corpus log-likelihood before and after the update.
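The steps above amount to one EM iteration of PLSI. The sketch below walks through them on this corpus. Since the original color-coding of the initial assignment is not reproducible in plain text, it assumes a hypothetical alternating topic assignment within each document; with the actual assignment from the figure, only the starting counts change.

```python
import math
from collections import defaultdict

# Corpus from the exercise: three documents over the vocabulary {A, B, C}.
docs = ["ABACAA", "BCABABB", "CACBAB"]
topics = [0, 1]

# HYPOTHETICAL initial hard assignment (stand-in for the lost color-coding):
# alternate topics 0, 1, 0, 1, ... within each document.
assign = [[i % 2 for i in range(len(d))] for d in docs]

def m_step(posteriors):
    """Recompute P(w|z) and P(z|d) from (soft) topic counts P(z|d,w)."""
    p_wz = {z: defaultdict(float) for z in topics}
    p_zd = [defaultdict(float) for _ in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            for z in topics:
                p = posteriors[d][i][z]
                p_wz[z][w] += p
                p_zd[d][z] += p
    for z in topics:                       # normalize topic-word vectors
        total = sum(p_wz[z].values())
        for w in p_wz[z]:
            p_wz[z][w] /= total
    for d, doc in enumerate(docs):         # normalize document-topic vectors
        for z in topics:
            p_zd[d][z] /= len(doc)
    return p_wz, p_zd

# Part (a): a hard assignment is a one-hot posterior for every token.
hard = [[[1.0 if assign[d][i] == z else 0.0 for z in topics]
         for i in range(len(doc))] for d, doc in enumerate(docs)]
p_wz, p_zd = m_step(hard)

# Part (b): E-step -- P(z|d,w) is proportional to P(z|d) * P(w|z).
soft = []
for d, doc in enumerate(docs):
    rows = []
    for w in doc:
        scores = [p_zd[d][z] * p_wz[z][w] for z in topics]
        norm = sum(scores)
        rows.append([s / norm for s in scores])
    soft.append(rows)

# Part (c): M-step again, now with the soft posteriors.
p_wz2, p_zd2 = m_step(soft)

# Part (d): compare corpus log-likelihood before and after the update.
def log_lik(p_wz, p_zd):
    return sum(math.log(sum(p_zd[d][z] * p_wz[z][w] for z in topics))
               for d, doc in enumerate(docs) for w in doc)
```

EM guarantees that the log-likelihood after the update in part (c) is no worse than before, which is the comparison part (d) asks for.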
Now consider this corpus (each line is a separate sentence):

ABCCC
ADBB
CDADD
CABB
DACB
Suppose we want to build a bigram model from the corpus above. Assume each sentence is padded with a begin-of-sentence symbol and an end-of-sentence symbol.
Calculate the perplexity of each sentence (separately) for each of the two cases:
1. The base case (no smoothing).
2. Laplace (add-one) smoothing.
Also show the probability of each bigram (preferably as a 2-D matrix with histories as rows and continuations as columns).
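The bigram counting, the two probability estimates, and the per-sentence perplexity can be sketched as below. This assumes `<s>`/`</s>` as the begin/end symbols and that `</s>` is included among the possible continuations when adding one to every bigram count for Laplace smoothing.

```python
import math
from collections import Counter

# Corpus from the exercise: five sentences over the vocabulary {A, B, C, D}.
sentences = ["ABCCC", "ADBB", "CDADD", "CABB", "DACB"]
vocab = ["A", "B", "C", "D"]
BOS, EOS = "<s>", "</s>"

# Collect bigram counts and history (unigram) counts, padding each
# sentence with the begin/end symbols.
bigrams, unigrams = Counter(), Counter()
for s in sentences:
    toks = [BOS] + list(s) + [EOS]
    unigrams.update(toks[:-1])      # histories: </s> never starts a bigram
    bigrams.update(zip(toks, toks[1:]))

def prob(w1, w2, smooth=False):
    """P(w2 | w1); Laplace smoothing adds 1 to every bigram count."""
    V = len(vocab) + 1              # possible continuations: vocab plus </s>
    if smooth:
        return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)
    return bigrams[(w1, w2)] / unigrams[w1]

def perplexity(sentence, smooth=False):
    """Per-bigram perplexity of one padded sentence."""
    toks = [BOS] + list(sentence) + [EOS]
    n = len(toks) - 1               # number of bigram probabilities
    ll = sum(math.log(prob(a, b, smooth)) for a, b in zip(toks, toks[1:]))
    return math.exp(-ll / n)

# 2-D matrix of smoothed bigram probabilities:
# rows are histories (<s> and vocab), columns are continuations (vocab and </s>).
matrix = {h: {w: prob(h, w, smooth=True) for w in vocab + [EOS]}
          for h in [BOS] + vocab}
```

In the unsmoothed case every bigram of a training sentence has a nonzero count, so the base-case perplexities are well defined here; smoothing only matters once an unseen bigram appears.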