Phonotactic Density

17 thoughts
last posted Feb. 6, 2015, 4:22 a.m.

Moby does include a hyphenation dictionary, which might be a crude proxy for syllabification, but of course it's based on spelling, not pronunciation.

Even without syllabification, though, it would be interesting to check the trigram density and compare it with my earlier results based on spelling in the FreeBSD word list.

The Moby list referenced by zdsmith looks useful but still doesn't include syllabification, so that would have to be done separately.

The latter link gives a figure of 15,831 syllables but, as pointed out on Stack Exchange, has some problematic inclusions.

Still, the rough numbers discussed in the Stack Exchange post seem to suggest on the order of 3,000 one-syllable English words (although one person claims 10,000).

A sub-question is: what proportion of possible syllables are one-syllable words in the given language?
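
A minimal sketch of how that proportion could be computed, assuming you already have a syllable inventory and a list of one-syllable words in the same transcription. The function name and the toy data are hypothetical, purely for illustration:

```python
def monosyllable_share(syllables, one_syllable_words):
    """Fraction of the syllable inventory that is also a one-syllable word.

    Both arguments are iterables of strings in the same transcription.
    """
    inventory = set(syllables)
    if not inventory:
        return 0.0
    # Only count words that are actually in the inventory.
    return len(inventory & set(one_syllable_words)) / len(inventory)


# Toy, made-up data: four syllables, two of which are words on their own.
toy_syllables = {"kat", "dog", "blik", "strin"}
toy_words = {"kat", "dog"}
share = monosyllable_share(toy_syllables, toy_words)
```

The answer depends heavily on where the inventory comes from: syllables attested inside longer words give a different denominator than syllables that are merely permitted by the phonotactics.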

Now, as stated above, this is not the original question, which was how many of the phonotactically possible words in a given language are actually words in that language.

Note that the /usr/share/dict/words file contains a lot of words (mostly proper nouns?) that aren't English, so this should not be taken as a measure for English, just a proof of concept.

This gives the results:

1000 bigrams
23875 predicted trigrams
11231 trigrams
0.470408376963 density

(Yes, there were exactly 1,000 bigrams)

I wrote up a quick program to look at trigram density using the FreeBSD wordlist on OS X:
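
A minimal sketch of one way such a count could be done; this is an illustration under stated assumptions, not the program referred to above. It assumes a trigram abc counts as "predicted" when both ab and bc occur as bigrams somewhere in the list, and that density is observed trigrams divided by predicted trigrams. The function names are made up, and the sketch does no case folding or word-boundary marking, so it need not reproduce any particular figures:

```python
def ngrams(word, n):
    """Return the set of contiguous n-character substrings of word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def load_words(path):
    """Read one word per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def trigram_density(words):
    """Return (bigram count, predicted trigram count, trigram count, density)."""
    bigrams, trigrams = set(), set()
    for word in words:
        bigrams |= ngrams(word, 2)
        trigrams |= ngrams(word, 3)

    # Index bigrams by first character so overlapping pairs are easy to find.
    by_first = {}
    for bigram in bigrams:
        by_first.setdefault(bigram[0], set()).add(bigram)

    # Assumption: "abc" is predicted when "ab" and "bc" are both attested.
    predicted = {
        left + right[1]
        for left in bigrams
        for right in by_first.get(left[1], ())
    }

    density = len(trigrams) / len(predicted) if predicted else 0.0
    return len(bigrams), len(predicted), len(trigrams), density
```

To try it on a system word list, something like `trigram_density(load_words("/usr/share/dict/words"))` should work wherever that file exists. Every observed trigram is also a predicted one under this definition, so the density always falls between 0 and 1.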

A related, but not identical, problem is what proportion of the trigrams you might expect to be possible, given the possible bigrams, are actually possible (and so on for higher-order n-grams).

My sister pointed out to me that you'd need a word list with pronunciation for it to be phonotactic (as opposed to graphotactic) density.

My plan was to write a quick script that took an online word list and simultaneously derived the phonotactic rules and calculated the coverage.

Coincidentally, I was just thinking about this exact thing a couple of weeks ago in the context of Greek roots.

I thought I'd start exploring English first and the entry on my todo list reads:

calculate density of English words amongst the phonotactically possible

so quite amazing to see zdsmith talking about the same thing with the same words :-)

repost from Language by zdsmith

I suppose you could also call it phonotactic density.

repost from Language by zdsmith

That is, for all the possible lemmas according to a language's phonotactics, how many of them actually exist in the language?

repost from Language by zdsmith

I have often wondered about what I'll call, for lack of a better term, 'phonotactic coverage'.