Coincidentally, I was just thinking about this exact thing a couple of weeks ago in the context of Greek roots.
I thought I'd start exploring English first and the entry on my todo list reads:
calculate density of English words amongst the phonotactically possible
so quite amazing to see zdsmith talking about the same thing with the same words :-)
My plan was to write a quick script that took an online word list and simultaneously derived the phonotactic rules and calculated the coverage.
My sister pointed out to me that you'd need a word list with pronunciation for it to be phonotactic (as opposed to graphotactic) density.
A related but not identical problem: what proportion of the trigrams you would expect to be possible, given the possible bigrams, actually occur (and likewise for higher-order n-grams)?
I wrote up a quick program to look at trigram density using the FreeBSD wordlist on OS X:
This gives the results:
23875 predicted trigrams
(Yes, there were exactly 1,000 bigrams)
The /usr/share/dict/words file contains a lot of words (mostly proper nouns?) that aren't English, so this should not be taken as a measurement of English, just a proof of concept.
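For anyone wanting to try this kind of calculation, here is a minimal Python sketch. It is not the program that produced the numbers above. It assumes a trigram "abc" counts as predicted when both of its overlapping bigrams, "ab" and "bc", occur somewhere in the word list, which is one plausible reading of "possible from the possible bigrams". The function names (`load_words`, `ngrams`, `trigram_counts`) are made up for illustration.

```python
def load_words(path):
    """Read one word per line, keeping lowercased purely alphabetic entries."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip().isalpha()}


def ngrams(word, n):
    """Return the set of contiguous letter n-grams in a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def trigram_counts(words):
    """Return (number of bigrams, predicted trigrams, observed trigrams).

    Assumption: a trigram 'abc' is predicted when both 'ab' and 'bc'
    are attested bigrams somewhere in the word list.
    """
    bigrams, observed = set(), set()
    for w in words:
        bigrams |= ngrams(w, 2)
        observed |= ngrams(w, 3)

    # Index bigrams by first letter so each 'ab' can be chained with every 'b?'.
    by_first = {}
    for b in bigrams:
        by_first.setdefault(b[0], set()).add(b)

    predicted = {ab + bc[1] for ab in bigrams for bc in by_first.get(ab[1], ())}
    return len(bigrams), len(predicted), len(observed)
```

Calling `trigram_counts(load_words("/usr/share/dict/words"))` would give the three counts for the system word list; the density is the observed count divided by the predicted count. Every observed trigram is necessarily also a predicted one, so the ratio cannot exceed 1.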
Now, as stated above, this is not the original question, which was: for a given language, how many of the phonotactically possible words are actually words in that language?
A sub-question is: what proportion of possible syllables are one-syllable words in the given language?
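The sub-question could be approached roughly as in the sketch below. It assumes you already have monosyllabic words as sequences of phoneme symbols plus a set saying which symbols are vowels. Neither comes from any of the word lists mentioned here, and the function names are hypothetical. It derives the onsets, nuclei and codas from the data and treats every onset + nucleus + coda combination as "possible". That is a deliberately crude model, since real phonotactics also restricts which parts can combine.

```python
def split_syllable(phones, vowels):
    """Split a one-syllable word into (onset, nucleus, coda) tuples.

    Returns None if the word has no vowel or its vowels are not contiguous
    (i.e. it does not look like a single syllable under this crude model).
    """
    idx = [i for i, p in enumerate(phones) if p in vowels]
    if not idx or idx != list(range(idx[0], idx[-1] + 1)):
        return None
    first, last = idx[0], idx[-1]
    return tuple(phones[:first]), tuple(phones[first:last + 1]), tuple(phones[last + 1:])


def syllable_coverage(words, vowels):
    """Return (possible syllables, attested syllables).

    Assumption: 'possible' means any attested onset combined with any
    attested nucleus and any attested coda.
    """
    onsets, nuclei, codas, attested = set(), set(), set(), set()
    for phones in words:
        parts = split_syllable(phones, vowels)
        if parts is None:
            continue  # skip anything that is not a single syllable
        onset, nucleus, coda = parts
        onsets.add(onset)
        nuclei.add(nucleus)
        codas.add(coda)
        attested.add(parts)
    return len(onsets) * len(nuclei) * len(codas), len(attested)


# Toy data for illustration only: cat, bat, kit, sit, bid.
toy_words = [("k", "æ", "t"), ("b", "æ", "t"), ("k", "ɪ", "t"),
             ("s", "ɪ", "t"), ("b", "ɪ", "d")]
toy_vowels = {"æ", "ɪ"}
possible, attested = syllable_coverage(toy_words, toy_vowels)
```

With the toy data there are 3 onsets, 2 nuclei and 2 codas, so 12 possible syllables, of which 5 are attested. On real data the "possible" figure will be an overestimate for the reason given above.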
The Moby website includes a link to 175,000 entries fully coded in the International Phonetic Alphabet. Therein are two different text files, both of which have English words rendered in a uniform phonetic style. They would go a long way toward building a tree of all phonotactic English words.
Regarding the number of English syllables, I found:
The latter link gives a figure of 15,831 syllables but, as pointed out on Stack Exchange, has some problematic inclusions.
Still, the rough numbers discussed in the Stack Exchange post seem to suggest on the order of 3,000 one-syllable English words (although one person claims 10,000).
The Moby list referenced by zdsmith looks useful but doesn't include syllabification, so that would still need to be done.
Even without syllabification, though, it would be interesting to check the trigram density and compare it with my earlier results based on spelling in the FreeBSD word list.
Moby does include a hyphenation dictionary, which might be a crude proxy for syllabification, but of course it's based on spelling, not pronunciation.