README

Data description:


Penn Treebank Corpus
- should be free for research purposes
- the data is processed the same way as in many LM papers, including "Empirical Evaluation and Combination of Advanced Language Modeling Techniques"
- ptb.train.txt: train set
- ptb.valid.txt: development set (use it only for tuning hyper-parameters, not for training)
- ptb.test.txt: test set for reporting perplexity
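
Since the test set is meant for reporting perplexity, the following is a minimal sketch of how that number is usually computed from per-token probabilities. It is not taken from the original toolkit. The helper names `read_tokens` and `perplexity` are hypothetical, and so is the `<eos>` marker appended at the end of each line (an assumed convention, not something this README specifies). The sketch also assumes the files are whitespace-tokenized, one sentence per line.

```python
import math

def read_tokens(lines, eos="<eos>"):
    """Flatten whitespace-tokenized lines into one token stream.

    Appending an end-of-sentence marker per line is an assumption,
    not something stated in this README.
    """
    tokens = []
    for line in lines:
        tokens.extend(line.split())
        tokens.append(eos)
    return tokens

def perplexity(tokens, probs):
    """Perplexity = exp of the negative mean log-probability per token.

    `probs` maps each token to the probability a model assigns to it.
    A context-free lookup is used here only to keep the sketch short;
    a real language model would condition on the preceding tokens.
    A token missing from `probs` raises KeyError.
    """
    log_sum = sum(math.log(probs[t]) for t in tokens)
    return math.exp(-log_sum / len(tokens))

# Toy data standing in for the contents of ptb.test.txt.
toy_lines = ["a b", "b a"]
toy_tokens = read_tokens(toy_lines)

# A uniform model over the observed vocabulary (illustration only).
toy_vocab = sorted(set(toy_tokens))
uniform = {w: 1.0 / len(toy_vocab) for w in toy_vocab}
toy_ppl = perplexity(toy_tokens, uniform)
```

With a uniform model the perplexity equals the vocabulary size, which makes a handy sanity check before plugging in a real model. To run this on the actual data, pass the lines of ptb.test.txt to `read_tokens` in place of `toy_lines`.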

- ptb.char.*: the same data rewritten as sequences of characters, with spaces replaced by '_'; useful for training character-based models, as shown in example 9
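
The character-level rewriting described above can be sketched as follows. This is an illustration under assumptions, not the script that produced ptb.char.*. The function names are hypothetical, the characters are assumed to be separated by single spaces, and no leading or trailing '_' is added, so the real files may differ in those details.

```python
def to_char_sequence(line):
    """Rewrite a word-level line as space-separated characters.

    Word boundaries become '_'. Assumes no token contains '_' itself,
    otherwise the mapping is not reversible.
    """
    words = line.split()
    # "_".join(words) marks the word boundaries; " ".join(...) then
    # iterates over that string and separates every character.
    return " ".join("_".join(words))

def from_char_sequence(chars):
    """Invert to_char_sequence under the same assumptions."""
    return "".join(chars.split()).replace("_", " ")

example = to_char_sequence("no it was")
restored = from_char_sequence(example)
```

Here `example` is `n o _ i t _ w a s`, and `restored` is the original word-level line.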