I have often wondered about what I'll call, for lack of a better term, 'phonotactic coverage'.

----

That is, of all the possible lemmas according to a language's phonotactics, how many actually exist in the language?

----

I suppose you could also call it phonotactic *density*.

----

Coincidentally, I was just thinking about this exact thing a couple of weeks ago in the context of Greek roots. I thought I'd start by exploring English first, and the entry on my todo list reads:

> calculate density of English words amongst the phonotactically possible

so it's quite amazing to see zdsmith talking about the same thing with the same words :-)

----

My plan was to write a quick script that took an online word list and simultaneously derived the phonotactic rules and calculated the coverage.

----

My sister pointed out to me that you'd need a word list with pronunciations for it to be phonotactic (as opposed to graphotactic) density.

----

A related but not identical problem: of the trigrams you would predict from the attested bigrams, what proportion actually occur (and so on for higher-order n-grams)?

----

I wrote up a quick program to look at trigram density using the FreeBSD word list on OS X: https://gist.github.com/jtauber/e0e861011d91f7005feb

----

This gives the results:

```
1000 bigrams
23875 predicted trigrams
11231 trigrams
0.470408376963 density
```

(Yes, there were exactly 1,000 bigrams.)

----

Note that the `/usr/share/dict/words` file contains a lot of words (mostly proper nouns?) that aren't English, so this should not be taken as a measure for English, just a proof of concept.

----

Now, as stated above, this is not the original question, which was how many of the phonotactically possible words in a given language are actually words in that language.

----

A sub-question is: what proportion of possible syllables are one-syllable words in the given language?
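For anyone wanting to play with the trigram density idea from a few cards back, here is a minimal sketch of one way to compute it. It is not the code from the gist. It assumes a trigram "abc" counts as predicted whenever both "ab" and "bc" are attested bigrams, and it lowercases every word, which is my own choice, so it will not necessarily reproduce the figures above. The names `ngrams`, `trigram_density` and `density_from_file` are made up for illustration.

```python
def ngrams(word, n):
    """Return the set of contiguous n-grams in a single word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def trigram_density(words):
    """Return (bigram count, predicted trigram count, attested trigram count, density).

    Assumption: a trigram "abc" is predicted when "ab" and "bc" are both
    attested bigrams. Words are lowercased, which is an arbitrary choice.
    """
    bigrams = set()
    trigrams = set()
    for word in words:
        w = word.strip().lower()
        bigrams |= ngrams(w, 2)
        trigrams |= ngrams(w, 3)

    # Index bigrams by their first symbol so overlapping pairs are cheap to find.
    by_first = {}
    for bg in bigrams:
        by_first.setdefault(bg[0], set()).add(bg)

    # Chain every bigram "ab" with every bigram "bc" that starts where it ends.
    predicted = {ab + bc[1] for ab in bigrams for bc in by_first.get(ab[1], ())}

    density = len(trigrams) / len(predicted) if predicted else 0.0
    return len(bigrams), len(predicted), len(trigrams), density


def density_from_file(path):
    """Convenience wrapper for a one-word-per-line file."""
    with open(path, encoding="utf-8") as f:
        return trigram_density(line for line in f if line.strip())
```

Calling `density_from_file("/usr/share/dict/words")` would run it over the system word list. Every attested trigram is necessarily among the predicted ones, so the density always falls between 0 and 1.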
----

[The Moby](http://icon.shef.ac.uk/Moby/) website includes a link to *175,000 entries fully International Phonetic Alphabet coded*. Therein are two different text files, both of which have English words rendered in a uniform phonetic style. They would go a long way towards building a tree of all phonotactic English words.

----

Regarding the **number of English syllables**, I found:

* http://english.stackexchange.com/questions/64506/is-there-a-list-of-syllables-contained-in-us-english
* http://semarch.linguistics.fas.nyu.edu/barker/Syllables/index.txt

----

The latter link gives a figure of 15,831 syllables but, as pointed out on Stack Exchange, it has some problematic inclusions. Still, the rough numbers discussed in the Stack Exchange post seem to suggest on the order of 3,000 one-syllable English words (although one person claims 10,000).

----

The Moby list [referenced by zdsmith](https://thoughtstreams.io/zdsmith/phonotactic-density/7501/) looks useful but doesn't include syllabification, so that would still have to be done.

----

Even without syllabification, though, it would be interesting to check the trigram density and compare it with my earlier results based on spelling in the FreeBSD word list.

----

Moby does include a hyphenation dictionary, which might serve as a crude proxy for syllabification, but of course it's based on spelling, not pronunciation.
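Once syllabified pronunciations are available, the one-syllable sub-question could be approached roughly as sketched below. This is only an illustration under strong assumptions: each syllable is already split into an (onset, nucleus, coda) triple, and any attested onset is treated as freely combinable with any attested nucleus and coda, which over-generates compared with real phonotactic rules. The function name `syllable_coverage` and the toy data are invented for the example and do not come from Moby.

```python
def syllable_coverage(syllables):
    """Return (attested count, possible count, coverage).

    `syllables` is an iterable of (onset, nucleus, coda) string triples.
    Assumption: "possible" means every combination of an attested onset,
    an attested nucleus and an attested coda.
    """
    attested = set(syllables)
    onsets = {onset for onset, _, _ in attested}
    nuclei = {nucleus for _, nucleus, _ in attested}
    codas = {coda for _, _, coda in attested}

    possible = len(onsets) * len(nuclei) * len(codas)
    coverage = len(attested) / possible if possible else 0.0
    return len(attested), possible, coverage


# Invented toy data: "cat", "bat", "kit", "bid", "at".
toy = [
    ("k", "æ", "t"),
    ("b", "æ", "t"),
    ("k", "ɪ", "t"),
    ("b", "ɪ", "d"),
    ("", "æ", "t"),
]
attested, possible, coverage = syllable_coverage(toy)
print(attested, possible, coverage)
```

Deriving the inventories from the data itself mirrors the idea of working out the rules and the coverage in one pass, but a serious version would need proper constraints on which onsets and codas can co-occur.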