You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
|
3 years ago | |
---|---|---|
.. | ||
README | 3 years ago | |
ptb.test.txt | 3 years ago | |
ptb.train.txt | 3 years ago | |
ptb.valid.txt | 3 years ago |
README
Data description:
Penn Treebank Corpus
- should be free for research purposes
- the same processing of data as used in many LM papers, including "Empirical Evaluation and Combination of Advanced Language Modeling Techniques"
- ptb.train.txt: train set
- ptb.valid.txt: development set (should be used just for tuning hyper-parameters, but not for training)
- ptb.test.txt: test set for reporting perplexity
- ptb.char.*: the same data, just rewritten as sequences of characters, with spaces rewritten as '_' - useful for training character based models, as is shown in example 9