Data description:

Penn Treebank Corpus
- should be free for research purposes
- the same processing of data as used in many LM papers, including "Empirical Evaluation and Combination of Advanced Language Modeling Techniques"
- ptb.train.txt: training set
- ptb.valid.txt: development set (use it only for tuning hyper-parameters, not for training)
- ptb.test.txt: test set for reporting perplexity

- ptb.char.*: the same data rewritten as sequences of characters, with spaces rewritten as '_'; useful for training character-based models, as shown in example 9
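
To illustrate the character-level rewriting described for ptb.char.*, here is a minimal Python sketch. It is not the script that produced the files. The function names are hypothetical, and the assumption that symbols are separated by single spaces, one sentence per line, is ours and should be checked against the actual files.

```python
def to_char_line(line: str, boundary: str = "_") -> str:
    """Rewrite one word-level line as a space-separated character sequence.

    Assumption: each character becomes its own symbol and the space between
    words becomes the boundary symbol ('_' in the description above).
    """
    words = line.split()
    # Glue the words together with the boundary marker, then put a single
    # space between every character of the resulting string.
    return " ".join(boundary.join(words))


def from_char_line(char_line: str, boundary: str = "_") -> str:
    """Invert to_char_line.

    Assumption: the word-level text itself contains no boundary character,
    otherwise the round trip is ambiguous.
    """
    return "".join(char_line.split()).replace(boundary, " ")


if __name__ == "__main__":
    print(to_char_line("aer banknote"))  # a e r _ b a n k n o t e
```

With this layout a character model sees '_' as an ordinary symbol, so word boundaries are predicted like any other character.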