Phonotactic Density

Combined Stream

jtauber • zdsmith

18 thoughts
last posted Feb. 6, 2015, 4:22 a.m.


Phonotactic Density

by jtauber

repost from Language by zdsmith

I have often wondered about what I'll call, for lack of a better term, 'phonotactic coverage'.

created Feb. 4, 2015, 3:54 p.m.

repost from Language by zdsmith

That is, for all the possible lemmas according to a language's phonotactics, how many of them actually exist in the language?

created Feb. 4, 2015, 3:54 p.m.

repost from Language by zdsmith

I suppose you could also call it phonotactic density.

created Feb. 4, 2015, 3:54 p.m.

Phonotactic Density

by jtauber

Coincidentally, I was just thinking about this exact thing a couple of weeks ago in the context of Greek roots.

I thought I'd start exploring English first and the entry on my todo list reads:

calculate density of English words amongst the phonotactically possible

so quite amazing to see zdsmith talking about the same thing with the same words :-)

created Feb. 4, 2015, 3:57 p.m.

My plan was to write a quick script that took an online word list and simultaneously derived the phonotactic rules and calculated the coverage.

created Feb. 4, 2015, 3:58 p.m.

My sister pointed out to me that you'd need a word list with pronunciation for it to be phonotactic (as opposed to graphotactic) density.

created Feb. 4, 2015, 3:59 p.m.

A related but not identical problem: of the trigrams you would predict to be possible from the attested bigrams, what proportion actually occur (and so on for higher-order n-grams)?

created Feb. 5, 2015, 7:20 a.m.

I wrote up a quick program to look at trigram density using the FreeBSD wordlist on OS X:
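
(The program itself isn't included here. What follows is only a minimal sketch of the idea, not the script that produced the numbers below. It assumes that a trigram "abc" counts as predicted when both "ab" and "bc" are attested bigrams, and that words are used exactly as they appear in the list, with no case folding and no word-boundary markers. The names `ngrams`, `trigram_density` and `load_words` are made up for illustration.)

```python
def ngrams(word, n):
    """Return the set of character n-grams occurring in a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def trigram_density(words):
    """Return (bigrams, predicted trigrams, attested trigrams, density)."""
    bigrams, trigrams = set(), set()
    for word in words:
        bigrams |= ngrams(word, 2)
        trigrams |= ngrams(word, 3)

    # Index bigrams by their first character so overlaps are cheap to find.
    by_first = {}
    for bigram in bigrams:
        by_first.setdefault(bigram[0], set()).add(bigram)

    # Assumption: "abc" is predicted when both "ab" and "bc" are attested.
    predicted = {
        ab + bc[1]
        for ab in bigrams
        for bc in by_first.get(ab[1], ())
    }

    density = len(trigrams) / len(predicted) if predicted else 0.0
    return len(bigrams), len(predicted), len(trigrams), density


def load_words(path="/usr/share/dict/words"):
    """Read one word per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```

Calling `trigram_density(load_words())` would run it over the system word list. Under this definition every attested trigram is also a predicted one, so the density can never exceed 1.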

created Feb. 5, 2015, 7:36 a.m.

This gives the results:

1000 bigrams
23875 predicted trigrams
11231 trigrams
0.470408376963 density

(Yes, there were exactly 1,000 bigrams)

created Feb. 5, 2015, 7:37 a.m.

Note that the /usr/share/dict/words file contains a lot of words (mostly proper nouns?) that aren't English, so this should not be taken as a measure for English, just a proof of concept.

created Feb. 5, 2015, 7:39 a.m.

Now, as stated above, this is not the original question, which was: for a given language, how many of the phonotactically possible words are actually words in that language?

created Feb. 5, 2015, 7:43 a.m.

A sub-question is: what proportion of possible syllables are one-syllable words in the given language?
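
As a toy illustration of that sub-question, here is a minimal sketch that builds every onset + nucleus + coda combination and checks how many are attested as words. The inventories and the "attested" set are invented for the example, not real English data, and real phonotactics has constraints between the three slots that a plain cross product ignores.

```python
from itertools import product


def possible_syllables(onsets, nuclei, codas):
    """All onset + nucleus + coda combinations, as plain strings."""
    return {o + n + c for o, n, c in product(onsets, nuclei, codas)}


def syllable_density(possible, attested):
    """Fraction of possible syllables that are attested one-syllable words."""
    if not possible:
        return 0.0
    return len(possible & set(attested)) / len(possible)


# Hypothetical toy inventories ("" means an empty onset or coda).
onsets = ["", "k", "s", "st"]
nuclei = ["a", "i"]
codas = ["", "t", "n"]

# Hypothetical one-syllable "words"; "dog" is not generable from the toy inventories.
attested = {"kat", "sit", "stan", "in", "dog"}

possible = possible_syllables(onsets, nuclei, codas)
density = syllable_density(possible, attested)
print(len(possible), "possible syllables, density", round(density, 3))
```

With these toy inventories there are 24 possible syllables, 4 of which are attested, giving a density of about 0.167.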

created Feb. 5, 2015, 7:45 a.m.

Phonotactic Density

by zdsmith

The Moby website includes a link to 175,000 entries fully International Phonetic Alphabet coded. Therein are two different text files, both of which have English words rendered in a uniform phonetic style. They would go a long way to building a tree of all phonotactically possible English words.

created Feb. 6, 2015, 4:01 a.m.

Phonotactic Density

by jtauber

Regarding the number of English syllables, I found:

created Feb. 6, 2015, 4:07 a.m.

The latter link gives a figure of 15,831 syllables but, as pointed out on Stack Exchange, it has some problematic inclusions.

Still, the rough numbers discussed in the Stack Exchange post seem to suggest on the order of 3,000 one-syllable English words (although one person claims 10,000).
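
Taking those figures at face value (and bearing in mind that the 15,831 count is itself questionable), a back-of-the-envelope calculation of the proportion looks like this. The function name `proportion` is just for illustration.

```python
def proportion(one_syllable_words, possible_syllables):
    """Share of possible syllables that are one-syllable words."""
    return one_syllable_words / possible_syllables


TOTAL_SYLLABLES = 15831  # figure from the latter link, with its caveats

for claimed in (3000, 10000):
    share = proportion(claimed, TOTAL_SYLLABLES)
    print(f"{claimed} / {TOTAL_SYLLABLES} = {share:.3f}")
```

So somewhere between roughly 19% and 63%, depending on whose word count you believe.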

created Feb. 6, 2015, 4:13 a.m.

The Moby list referenced by zdsmith looks useful but doesn't include syllabification, so that would still need to be done.

created Feb. 6, 2015, 4:17 a.m.

Even without syllabification, though, it would be interesting to check the trigram density and compare it with my earlier results based on spelling in the FreeBSD word list.
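
A minimal sketch of what that check could look like, working on sequences of phoneme symbols rather than letters. The pronunciations below are hand-written toy data, not taken from the Moby files (whose exact format isn't assumed here), and a trigram is again assumed to be predicted when its two overlapping bigrams are both attested.

```python
def ngram_set(sequences, n):
    """Set of n-grams (as tuples) over a collection of symbol sequences."""
    return {
        tuple(seq[i:i + n])
        for seq in sequences
        for i in range(len(seq) - n + 1)
    }


def phoneme_trigram_density(sequences):
    """Return (bigrams, predicted trigrams, attested trigrams, density)."""
    sequences = list(sequences)
    bigrams = ngram_set(sequences, 2)
    trigrams = ngram_set(sequences, 3)

    # Assumption: (a, b, c) is predicted when (a, b) and (b, c) are both attested.
    predicted = {
        (a, b, c)
        for (a, b) in bigrams
        for (b2, c) in bigrams
        if b == b2
    }

    density = len(trigrams) / len(predicted) if predicted else 0.0
    return len(bigrams), len(predicted), len(trigrams), density


# Hypothetical toy pronunciations, one list of phoneme symbols per word.
pronunciations = {
    "cat": ["k", "æ", "t"],
    "tack": ["t", "æ", "k"],
    "act": ["æ", "k", "t"],
}

result = phoneme_trigram_density(pronunciations.values())
print(result)
```

Keeping each phoneme as a separate list item means multi-character symbols are handled the same way as single characters, which wouldn't be true if the transcription were treated as a plain string.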

created Feb. 6, 2015, 4:18 a.m.

Moby does include a hyphenation dictionary, which might be a crude proxy for syllabification, but of course it's based on spelling, not pronunciation.

created Feb. 6, 2015, 4:22 a.m.
