Category Archives: Linguistics

The Translation Problem: People vs. Computers

In my last post, I introduced the topic of natural language processing and discussed how the context of a piece of language has an enormous impact on its translation into another language.  In this post, I want to address another issue with translation.  Specifically, I want to talk about how language is really an integrated function of the way the human brain models the world, and why this might make it difficult to create a machine translator isolated from the rest of an artificial intelligence.

When a human uses language, they are expressing things that are based upon an integrated model of the universe in which they live.  There is a linguistic model in their brain that divides up their concept of the world into ideas representable by words.  For example, let’s look at the word “pit bull”.  (It’s written with two words, but as a compound word, it functions as a single noun.)  Pit bull is a generic term for a group of terrier dog breeds.  Terriers are dogs.  Dogs are mammals.  Mammals are animals.  This relationship is called a hypernym/hyponym relationship: the more general term is the hypernym, and the more specific term is the hyponym.  All content words (nouns, verbs, adjectives) are part of a hierarchical tree of hypo-/hyper-nym relationships.
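As a rough illustration of the pit bull example, here is a minimal Python sketch of a hypernym chain.  The `HYPERNYM` table and the function names are made up for this post, and a real vocabulary would need a far richer structure (words with several senses, for one thing).

```python
# Toy hypernym table: each word maps to its immediate hypernym (its parent).
# The entries come from the pit bull example; the table is illustrative only.
HYPERNYM = {
    "pit bull": "terrier",
    "terrier": "dog",
    "dog": "mammal",
    "mammal": "animal",
}


def hypernym_chain(word):
    """Return every hypernym of `word`, nearest first."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain


def is_a(word, candidate):
    """True if `candidate` appears somewhere above `word` in the tree."""
    return candidate in hypernym_chain(word)


print(hypernym_chain("pit bull"))  # ['terrier', 'dog', 'mammal', 'animal']
print(is_a("terrier", "animal"))   # True
```

Storing only the immediate parent keeps the data entry small; everything further up the tree is recovered by walking the chain.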

So when you talk about a pit bull, you’re invoking the tree to which it belongs, and anything you say about a pit bull will trigger the conversational participants’ knowledge and feelings about not only pit bulls, but all the other members of the tree to which it belongs.  It would be fairly trivial programming-wise, although possibly quite tedious data-entry-wise, to create a hypo-/hyper-nym tree for the couple hundred thousand or so words that make up the core vocabulary of English.  But to codify the various associations to all those words would be a lot more difficult.  Such a tree would be a step towards creating both a world-model and a knowledge-base, aspects of artificial intelligence not explicitly related to the problem of machine translation.  That’s because humans use their whole brain when they use language, and so by default, they use more than just a bare set of grammar rules when parsing language and translating between one language and another.

One use of such a tree and its associations would be to distinguish between homographs or homonyms.  For example, if the computer sees a word it knows is associated with animals, it could work through the hypernym tree to see if “animal” is a hypernym of, or associated with, say, the word “horse”.  Or, if it sees the word “grain”, it could run through the trees of other words to see if they are farming/crop-related or wood-related.  Or, perhaps, crossing language boundaries, if one language has one word that covers all senses of “ride”, and the other language distinguishes between riding in a car and riding a horse, the program could use the trees to search for horse- or car-related words that might let it make a best guess on which verb is appropriate in a given context.
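The “ride” case can be sketched in a few lines of Python.  Everything here is an assumption made for illustration: the tiny `HYPERNYM` table, the two invented target-language verbs in `RIDE_SENSES`, and the simple vote-counting rule.  It shows the shape of the idea, not a workable translator.

```python
# Toy hypernym table (word -> immediate hypernym); illustrative only.
HYPERNYM = {
    "horse": "mammal",
    "camel": "mammal",
    "mammal": "animal",
    "car": "vehicle",
    "bus": "vehicle",
}

# Hypothetical target-language verbs, keyed by the hypernym that selects them.
RIDE_SENSES = {"animal": "ride-astride", "vehicle": "ride-inside"}


def ancestors(word):
    """Return the set of all hypernyms above `word`."""
    found = set()
    while word in HYPERNYM:
        word = HYPERNYM[word]
        found.add(word)
    return found


def choose_ride_verb(context_words):
    """Pick the verb whose selecting category has the most support in context."""
    votes = {}
    for word in context_words:
        for category in ancestors(word) & set(RIDE_SENSES):
            votes[category] = votes.get(category, 0) + 1
    if not votes:
        return None  # no evidence either way; a real system needs a fallback
    best = max(votes, key=votes.get)
    return RIDE_SENSES[best]


print(choose_ride_verb(["she", "rode", "the", "horse"]))  # ride-astride
print(choose_ride_verb(["he", "rode", "the", "bus"]))     # ride-inside
```

When the context contains no horse- or car-related word at all, the function returns `None`, which is exactly the situation where extra-linguistic knowledge has to take over.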

The long and short of the case I intend to make is that a true and accurate translation program cannot be written without taking enormous steps down the path of artificial intelligence.  A purely rule-based system, no matter how many epicycles are added to it, cannot be entirely accurate, because even a human being with native fluency in both languages and extensive knowledge and experience of translating cannot be entirely accurate.  Language is too malleable and allows too many equivalent forms to always allow for a single definitive translation of anything reasonably complex, and this is why it is necessary to make value judgements based on extra-linguistic data, which can only be comprehensively modeled by using techniques beyond pure grammatical rules.


In the next post, I’ll talk about statistical methods of machine translation, and hopefully I’ll be following that up with a critique and analysis of the spec fic concept of a universal translator.



SpecLing #2: A Language Without Nouns?

Better late than never, I thought I’d talk today about the possibility of a language without nouns.  Last time, I talked about a language without verbs, and delved into what exactly defines a part of speech.  Here’s a quick recap:

  1. Parts of speech can be defined in a few ways: lexically, where a given root is only acceptable as one part of speech; syntactically, where a word’s category is determined by its location in the sentence and the words surrounding it, and there may be no lexical distinction involved; and morphologically, where a category of roots undergoes a specific set of morphological processes.
  2. Nouns are content words, meaning they have a meaning that can exist independently of a sentence.
  3. Verb and noun roots in English can in fact switch categories.  You can bag your groceries by putting them in a bag, and rope you some cattle with a rope.


There have been several languages and language families put forward as lacking nouns: Tongan, Riau Indonesian, and the Salishan languages of the Pacific Northwest.  In the case of Riau, it seems words are lexically underspecified–that is, they can be used in any category.  Salishan languages are often analyzed as having a verbal category but no nominal one.  So, the word for “dog” is actually a verb meaning “to be a dog”.  The same goes for being a man.  One mans.


A question arises here:  While “man”-ness is a verb syntactically and morphologically in Salishan languages, is it possible to argue that these “verbs” are just nouns in another form?  In the previous paragraph, I used the word “man” as a “verb” in English.  Are such verbs in Salishan merely placeholders for a true noun?  One difference in using verbs as opposed to nouns is the removal of the tedious “to be” constructions in English.  “He is a man.” requires more words than “He mans”.  That brings us back to the issue of the multiple definitions of a part of speech.  Lexically, it’s reasonable to say a language with such constructions lacks nouns.  Morphologically, if a root undergoes the same processes as words that are verbs, it’s reasonable to conclude it’s a verb.  The only argument to be had in this case is syntactic.  A predicate requires a verb.  If a Salishan pseudo-verb can be a predicate all on its own, then doesn’t that imply it’s actually a bona fide verb?  But verbs must be nominalized to become arguments of another verb, in which case you could argue they aren’t.  Now, the truth is that a noun/verb distinction has never been 100% delineable, so I think it can be argued in good faith that these roots are truly verbs.

In which case, it’s much simpler to conclude that we can have a language without nouns than that we can have a language without verbs.


As far as methods to construct a noun-less grammar, we have:

  1. Stative verbs as in Salishan
  2. I don’t know?  Any suggestions?


The Difference Between Spoken and Written Language: Acronym Edition

Something I’ve noticed recently online is the issue of the indefinite articles: “a” vs. “an”.  Many people probably know the rule for this, and many people probably just do it unconsciously.  Essentially, you have “a” before a word beginning with a consonant sound (not consonant letter!), and “an” before a word beginning with a vowel sound (not vowel letter!).  This kind of thing is called “allomorphy”, made up of the Greek roots (morphemes) “allo”, meaning “other”, and “morph”, meaning “shape” (form).  The two articles are different forms of the same morpheme, chosen according to the word that follows.

Now, there’s an interesting intersection between written and spoken language here:

1. Often, people taught the rule explicitly put “a” before any written word with an initial consonant letter, and “an” before any written word with an initial vowel letter.  There are a few variations of this.  And with dialects, there can be differences, as well.  The “a historical”/”an ‘istorical” debate is still raging, for example.  And then you have the examples like “an apron”, which was originally “a napron”, but because of the ambiguity in speech, people reanalyzed the morpheme boundary to get our modern usage.  The “an (vowel)-” beginning was just so much more common than the “a n–” combination, so people who were hearing the phrase for the first time just assumed one analysis based on their past experience.

2. There is the issue of whether an acronym should be read as its individual letters, as its whole-word pronunciation, or as the entire phrase that it represents.  For example, should the indefinite article for the new age category in publishing, “New Adult”–acronym “NA”–be written “an NA” or “a NA”?  The first version would be correct if it were being read “en ey”, but the second would be correct if it were read “New Adult”, despite being written in acronym form.  And although it doesn’t apply in this case, if “NA” were a true acronym instead of an initialism, you could argue it should be “a nah”.
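To make the dependence on the reading concrete, here is a small Python sketch.  The function name and its `read_as` parameter are invented for this example.  The letter-by-letter branch relies on which English letter names begin with a vowel sound; the spelled-out branch checks only the first letter of the expansion, which is the very shortcut described in point 1, so treat it as a placeholder for a real pronunciation lookup.

```python
# Letters whose English names start with a vowel sound: "ay", "ee", "ef",
# "aitch", "eye", "el", "em", "en", "oh", "ar", "ess", "ex".
# "U" is left out because its name starts with a consonant sound ("you").
VOWEL_SOUND_LETTERS = set("AEFHILMNORSX")


def article_for_initialism(abbr, expansion=None, read_as="letters"):
    """Choose "a" or "an" for an abbreviation, given how it is read aloud."""
    if read_as == "letters":
        # Read letter by letter: "NA" is spoken "en ay".
        return "an" if abbr[0].upper() in VOWEL_SOUND_LETTERS else "a"
    if read_as == "expansion":
        # Read as the full phrase: "NA" is spoken "New Adult".
        # Crude assumption: the first letter stands in for the first sound.
        return "an" if expansion[0].lower() in "aeiou" else "a"
    raise ValueError("read_as must be 'letters' or 'expansion'")


print(article_for_initialism("NA"))                            # an
print(article_for_initialism("NA", "New Adult", "expansion"))  # a
print(article_for_initialism("UFO"))                           # a
```

The same written string gets a different article depending on an argument that never appears on the page, which is the whole disagreement in miniature.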


Personally, I would never read “NA” as “New Adult” out loud, and so seeing “a NA novel” confuses the heck out of me for a second or two.  But other people seem to think that’s a legitimate reading, and who am I to gainsay them?  I wonder how this might apply in an editing situation, where the editor and the writer disagree about which is the proper way to read an acronym.  Or in a critique?


Posted by on February 11, 2014 in Linguistics



SpecLing #1: A Language Without Verbs?

This is the first in a series of posts on the subject of speculative linguistics, the study of language in a speculative context.  That includes, for example, studying constructed languages (conlangs), possible forms of alien communication, languages which violate earthly linguistic universals, etc.  Basically, it’s the application of real-world linguistics to non-real-world linguistic occurrences.

In this post, I’m going to talk about an interesting hypothetical situation involving a human-usable language without verbs.  I am going to get a bit technical, so to start I’ll give a short overview of the issues involved, and a refresher on some basic terms:

Parts of speech:  A verb is a part of speech, along with things like nouns, adjectives, adverbs, etc.  It is generally considered that all human languages have at least two parts of speech, verbs and nouns.  When linguists study pidgins–contact languages developed by two groups who speak unrelated languages–there are almost invariably nouns and verbs, the suggestion being that these two categories are required for human language.

Content words vs. function words:  Verbs, like nouns and adjectives, are “content words”.  That means they contain some inherent meaning.  Function words are things like prepositions and articles, which have a grammatical use, but don’t contain basic concepts like nouns and verbs do.

However, if you look at verbs, you can see that they do in fact carry some grammatical elements beyond the basic concept they represent.  Tense, mood, aspect, person, number, etc., are all functions of verbs in various languages.  You can abstract out these features into function words, and in fact some languages do.

Something else to consider is that most languages have a very restricted pool of function words, whereas they can usually contain any number of content words–one for every concept you can devise.  And yet not all languages have the same number or even a similar set of function words.  So the question becomes, could you, by expansion of the categories of function words of various types and with assistance from other content categories, split up the responsibilities of the verb category?

Each part of speech consists, in the most basic sense, of a set of responsibilities for the expression of thought.  The only difference between function words and content words is whether there are some higher concepts overlaid on top of those responsibilities.  Now, there are, to an extent, a finite number of responsibilities to be divided among the parts of speech in a language.  Not all languages have the same parts of speech, either.  This suggests that we can decide a priori how to divide out responsibilities, at least to an extent.  Assuming that a part of speech is merely a set of responsibilities, and knowing that these sets can vary in their reach from language to language, it is possible that we could divide the responsibilities between sets such that there is no part of speech sufficiently similar to the verb to allow for that classification.

Even that conclusion is assuming we’re restricted to similar categories as used by currently known human languages, or even just similar divisions of responsibility.  However, that isn’t necessarily the case.  There are, to my mind, two major ways to create a verb-less language:

1. Vestigial Verbs: As this is a topic and a challenge in language that has interested me for a long time, I’ve made several attempts at creating a verb-less language, and over time, I like to think they have gotten less crude.  One of my early efforts involved replacing verbs with a part of speech I called “relationals”.  They could be thought of as either verbs reduced to their essence, or verbs atrophied over time into a few basic relationships between nouns.  Basically, they are a new part of speech replacing verbs, with a slightly different responsibility set but an otherwise similar syntax.  I was very much surprised, then, while researching for this post, to come across a conlang by the name of Kēlen, created by Sylvia Sotomayor.  She also independently developed the idea of a relational, and even gave it the same name.  Great minds think alike?

Although our exact implementations differed, our ideas of a relational were surprisingly similar.  Basically, it’s what it says on the tin: it expresses a relationship between nouns (noun phrases).  However, relationals have features of verbs, such as valency–the number of arguments required by a verb.  Kēlen also included tense inflections to represent time, although my own did not, and rather placed temporal responsibility on a noun-like construction representing a state of being.

An example of a relational, one that appears to be the basic relational for both Sotomayor’s Kēlen and my own conlang, is that of “existence”.  In English we would use the verb “to be”: “there is a cat.”  Japanese has the two animacy-distinct verbs “iru” and “aru”: “Neko ga iru.”  Kēlen makes use of the existential relational “la”: “la jacela” for “there is a bowl.”  In my conlang, the existential relational was mono-valent, somewhat equivalent to an intransitive verb, but Kēlen can express almost any “to be” construction: “The bowl is red.”: “la jacēla janēla”, which takes a subject and a subject complement, and is thus bi-valent.  In English we have a separate category for this kind of verb, “linking” verbs, as opposed to classifying them as transitive, but both categories are bi-valent, taking two arguments.

2. No Verbs: Another experiment of mine in a verb-less language took what I consider to be the second approach, which is to simply eliminate the verb class, and distribute its responsibilities among the other parts of speech.  Essentially, you get augmented nouns or an extra set of “adverbial” words/morphemes (an odd name considering there are no verbs, but it’s the closest equivalent among the standard parts of speech).  This requires thinking of “actions” differently, since we no longer have a class of words that explicitly describe actions.

My solution was to conceive of an action as a change in state.  So to carry the equivalent of a verb’s information load, you have two static descriptions of a situation, and the meaning is carried by the contrast of the two states.  A simple, word-for-word gloss using English words for the verb “to melt” might be a juxtaposition of two states, one describing a solid form of the substance, and the other a liquid form: “ present.water”.  There are all sorts of embellishments, such as a “manner” or “instrumental” clause that could be added: “ present.water instrument.heat”, for example.  (The word after the period is the content word, and before is some grammatical construction expressing case or tense.)


There are probably many more methods of creating a verb-less language.  A relational language would probably be the easiest for the average person to learn, because of the similarity to a verbed language.  However, a stative language doesn’t seem impossible to use, and depending on the flexibility of morphology and syntax in regards to which responsibilities require completion in a given sentence, could be an effective if artificial method of human communication.


Next time, I’m going to consider the possibility of a noun-less language.  I’ve never tried one before, and honestly I don’t have high hopes for the concept.  Especially if it had normal verbs.  How would verb arguments be represented in a language without nouns?  Well, that’s really a question for the next post.

If anyone has some thoughts on the usability of a verb-less language, or the structure, or can recommend me some natlangs or conlangs that eschew verbs, I’d love to hear about it in the comments.


Posted by on November 11, 2013 in atsiko, Conlanging, Linguistics, Speculative Linguistics



Linguistics and SFF: Appropriation and Dialect

Last time on Linguistics and SFF: Orthography and Etymology

An oft-debated topic in all fiction is the subject of using dialect in dialogue.  Many famous writers have done it, and many not-so-famous writers have tried it, to varying degrees of success.  Since dialects are a very linguisticky topic, I thought I’d take a look at why and how writers use them, some of the effects of using them, and how it all relates to the whole debate on cultural appropriation.

First, a few thoughts on dialect:


1. A dialect is a unique language system characteristic of a group of speakers.

2. A dialect is a variety of a major language carrying connotations of social, cultural, or economic subordination to the culture which speaks the dominant language.

These two definitions exist simultaneously.  For our purposes, the second one is the most relevant.

Dialects under the second definition are culturally, socially, and economically stigmatized by the dominant culture.  Speaking a dialect is often portrayed by dominant cultural institutions as just “bad [Dominant Language]”.  For example, “bad English”. (We’ll keep this example, since I’m discussing literature from primarily English-speaking countries.)  Many children are taught in school to speak “proper English”, and punished for using their native dialect.  “Proper English” actually takes a few forms:  In America, it is a dialect known as “Standard American English (SAE)”, which is most similar to Midwestern dialects of American English.  In England, there is Received Pronunciation (RP).

Most other countries with official “languages”, have a similar pattern of official and unofficial dialects.  What is considered a language is often up to whichever dialects can get state support, and it has been said that “A language is merely a dialect with an army and a navy.”  Or, in the case of France, “a dialect with a national Academy.”

Almost all other dialects are usually considered inferior or degraded versions of the official dialect.

So, onto the use of dialect in fiction.

For the most part, dialogue in English-language novels is written in the standard form of written English, which reflects more or less the standard form of spoken English in the country in which it is printed.  Although, depending on the orthography used, this reflection could be rather cloudy or warped.  Dialect, then, is represented in an attempt at “phonetic” spelling and non-standard vocabulary and grammar.

Most commonly, because the author is rarely a native speaker of the dialect they are attempting to represent, dialect in fiction falls back on stereotypes of usage related to the cultural perception of the spoken dialect.  This can lead to a continuation of prejudice and stereotypes, and is also a form of linguistic and cultural appropriation, as a member of the dominant culture makes use of minority culture for their own ends.  Rarely in the cases we’re examining are these ends malicious.  But they are often still quite problematic.

There are many English dialects that have been popularized in mass culture, with varying degrees of difference presented.  For example: Italian American English, Chinese American English, African American English, Cockney English, Appalachian English, and Southern English.  In fact, they are so parodied, mocked, and appropriated that they have “accents” associated with them.  The cheesy Italian accent a la Mario, the “Oy Guvnah” of Cockney, and “tree dorra” of Chinese American English.

Some of these “dialects” are actually accents or inter-languages, rather than stable dialects.  However, they are all commonly referred to as “dialect” (or occasionally “accent”) in regards to their representation in fiction.  And for the most part, rather than actual depictions of the stated dialect, what is really present is the set of stereotypical markers associated with the dialect by mainstream culture.

Next time, I’ll look at some examples, both made-up and used in novels, of dialect appropriation.

Next time on Linguistics and SFF: Artemis Fowl and the Eternity Code


Posted by on August 9, 2013 in Cultural Appropriation, Linguistics



Linguistics and SFF: Orthography and Etymology

Last time on Linguistics and SFF: Orthography and Vowel Systems

In this post, I’ll be discussing orthography and etymology.  Etymology is the study of the history of words–basically tracing the forms of a word through time all the way back to its origins.  Since I’m writing in English, I’ll mostly be using English examples for etymology.  And that’s great, because English has such a complex past.

Now, where the orthography part comes in is that you can often tell the origin of a root in English by the orthography.  English spelling will use a letter combination for words from one language to represent a sound, even though we already have a perfectly good letter for that sound.  And it’s based partially on how orthographies interact, and partially on language change.

So, English has several alternate spellings based on etymology.  For example, the use of “ch” instead of “k” for the /k/ sound.  An example is the word “chemical”, which comes from Medieval Latin “alchimicus” through the word “chemic” in the late 1500s.  It goes back to the Greek form “khemeioa” through Arabic “al-kimiya”.

“Ch” is called a “digraph”, meaning two graphemes used to represent a single phoneme.  It is common to many roots received originally from the Greek.  These roots often passed through Latin, which used the “ch” digraph to represent the Greek /kh/, an aspirated voiceless velar stop, which Latin lacked.  The pronunciation became /k/ in Late Latin, and is thus the one we’ve inherited into English today.

I go into detail here for a few reasons.  First, understanding your own orthography can be useful in designing others.  Second, most novels use strictly English orthography, so options are limited.  However, by considering the connotations of various loan word roots in English, you can achieve a certain amount of meaning.  For example, in order to differentiate a con-word in a novel, you can use common digraphs such as “ch”, even if the word would be pronounced with a normal /k/ sound.  In this way, you can make a word seem older, foreign, etc., without resorting to special symbols, or that mainstay of conlang/foreign language typography, the italic word/phrase.

You can use similar digraphs/letter choices to create other differences.  Although it’s arguable that this is an appropriation rather than a proper usage, Jay Kristoff, in his book Stormdancer, spells the Japanese word “salaryman” as “sarariman”, in imitation of the most common Japanese romaji transliteration system, in order to make it seem more foreign.  I’d argue, however, that the clear English root makes a good case for spelling the word in English, to make clear its loan word status.

And this leads us to the concept of transliteration, which is spelling a word in Roman letters that normally would not be so spelled.  Chinese and Japanese both have a multitude of English transliteration schemes, such that the same word can look vastly different and be pronounced completely differently by a native English speaker.

This fact can be used quite effectively to manipulate orthography to create certain effects.  Really the only limit is your imagination.

Next time, we’ll be taking a bit of a side-trip while I talk about the use of “dialect” in speculative fiction–it probably applies to any type of fiction, though.

Next time on Linguistics and SFF: Appropriation and Dialect


Posted by on July 25, 2013 in Linguistics



Linguistics and SFF: Orthography and Vowel Systems

Last time on Linguistics and SFF: Orthography and Exoticism

Last time, I discussed some basics on orthography and its function in writing systems.  I’ll be doing a bit more of that in this post, but in the context of using that insight for the purposes of writing speculative fiction.  I’ll be focusing, as the title suggests, on what the vowel system in a language suggests about the best orthography for that language.

First, considering the structure of the three major types of writing system, they have advantages and disadvantages, and these are in part based on the structure of the language to be represented.

The difference between alphabets and syllabaries is much smaller than the difference between them and a logography.  They both represent the sounds of a language rather than its meaning.  So I’ll be focusing first on them, and why you might want to use one over the other.

An alphabet is best for a language with a large number of sounds(phonemes).  English, for example, has a great many sounds, and not only that, it has far more vowel sounds than most languages.  English is most commonly described as having about 16 distinct vowel sounds.  The most common system of vowel sounds in the world is a system of 5: /a/ /e/ /i/ /o/ /u/.  English has equivalents of all those vowels.  I bet you can guess what they are.

The number of vowels matters for a very specific reason.  Vowels differ from consonants in that a consonant phone or phoneme has a very limited variety in its articulation.  Consonants have four major features:

1. Place of articulation describes where in the mouth you place your tongue when making the sound.  These are generally divided into laryngeal (the back of the throat), velar (the soft palate), palatal (the hard palate), alveolar (the ridge behind the teeth), dental (the teeth), and labial (the lips).

2. Manner of articulation describes how the breath is released during speech: stop/plosive, where the tongue/lips create a closure or “stop” of air, and the air is then released; affricate, where there is a stop, followed by a protracted release; fricative, where there is no closure, but rather a proximity of the tongue/lips to the place of articulation and a protracted release of air; and sonorant, which (basically) is everything else.

3. Voicing describes whether the vocal folds (vocal cords) are vibrating during sound production.  The major voicings are voiced, with a clear vibration, and unvoiced/voiceless, with no vibration; there are also some other, less common voicings.

4. Aspiration describes the strength of the airflow.  It’s generally divided into aspirated, with a noticeable puff of air, and unaspirated, with no noticeable puff.
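One way to picture these four features is as a small record, where two consonants are compared feature by feature.  This is a minimal Python sketch; the class, the field names, and the two sample entries are simplified assumptions for illustration, not a full phonological description of any language.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Consonant:
    place: str       # e.g. "labial", "alveolar", "velar"
    manner: str      # e.g. "stop", "affricate", "fricative", "sonorant"
    voiced: bool     # are the vocal folds vibrating?
    aspirated: bool  # is there a noticeable puff of air?


# Simplified sample entries for a plain /p/ and a plain /b/.
P = Consonant("labial", "stop", voiced=False, aspirated=False)
B = Consonant("labial", "stop", voiced=True, aspirated=False)


def differing_features(a, b):
    """List the names of the features on which two consonants differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


print(differing_features(P, B))  # ['voiced']
```

Flipping a single field turns one consonant into another, which is a compact way of saying how little room each feature leaves for variation.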

Now, these features are in general fairly binary, either “on” or “off”.  And that means you can’t vary your phones too much without changing your phonemes.  But vowels are different.  Vowels have basically two features.  However, there is quite a bit of room for variance within them, and so a sound that one speaker would perceive as an English short “e” might be considered by another to be an English short “i”.

The point being, that most languages have smaller systems of vowels because they divide the space up less finely.  It is possible to have a language with as few as two vowels, and three is not uncommon.  There are also many systems with seven vowels.  So, English is rather uncommon in the number of vowels it has.

You might have figured out already why the number of vowels is significant: even English, with its many vowels and large number of consonants, has only about 50 phonemes at most.  Fifty letters is not an over-large number to memorize.  But what about syllables?  With 12 vowels, 25 consonants, and just a (C)V syllable structure, there are about 300 possible syllables.  And English has fairly permissive structural rules for syllables, so the real number is much greater.  By contrast, Japanese, a language which uses a syllabary, has that five-vowel system we mentioned before, and about 12 consonants.  That’s a little over 60 possible (C)V syllables, or roughly twice that once the optional coda (n) is counted.  And even then, Japanese has shortcuts.  Consonant voicing uses a diacritic mark rather than a whole second set of symbols, and rather than having separate symbols for syllables with the coda (n), it uses a single symbol for the coda, which is simply added after the (C)V symbol.  So in order to suit a syllabary, a language must have relatively simple phonotactics and few phonemes, especially vowels, since vowel and consonant numbers are multiplicative under a syllabary.  Hawaiian, for example, has five vowels, eight consonants, and a (C)V syllable structure, and is thus perfect for a syllabary, whereas German, with its many vowels and complicated consonant clusters, is not.
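The arithmetic behind those syllable counts is simple enough to check in a few lines of Python.  The function below is a sketch written for this post; it uses the rough phoneme counts given above (12 vowels and 25 consonants for English, 5 vowels and about 12 consonants for Japanese) and ignores every real-world restriction on which combinations actually occur.

```python
def count_syllables(vowels, consonants, optional_codas=0):
    """Count possible (C)V syllables, optionally followed by one of N codas.

    The onset is any one consonant or nothing, so there are consonants + 1
    onset choices; each optional coda multiplies the total again.
    """
    onsets = consonants + 1
    return vowels * onsets * (1 + optional_codas)


print(count_syllables(12, 25))    # English-like counts, (C)V only: 312
print(count_syllables(5, 12))     # Japanese-like counts, (C)V only: 65
print(count_syllables(5, 12, 1))  # Japanese-like counts, (C)V(n): 130
```

Because the counts multiply, each extra vowel adds a whole new row of symbols to a syllabary, while it adds only a single letter to an alphabet.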

A logography, of course, has no such issues, since it doesn’t care about sounds.  Logographies are great for languages with little morphology, since they aren’t alterable for tense or number, but require separate symbols for each.  They do, of course, have the same issues with massive memorization that an English syllabary would have.

Since writing systems tend to develop based on the languages they represent, you can do some fun and efficient world-building based on how you choose to represent a language orthographically.  Obviously you can’t really do much about an actual writing system in a novel, but it is something to keep in mind, especially when using a real-world language that doesn’t actually use English orthography.  When you think about that language and how the various characters feel about it, you can use that in description.

For example, many people think Chinese orthography is particularly beautiful or elegant, and the discipline of Chinese calligraphy arose from the type of writing system Chinese uses.  English, by contrast, has forms of calligraphy and also medieval illumination, but its writing is not seen in quite the same light.

In the next post, I’ll be discussing two things.  The first is etymology, and orthography’s effect on both the reality and the perception of it.  The second is how you can use your chosen orthography to illuminate language and culture, and how, when writing a real-world language you aren’t a native speaker of, you may misunderstand or misrepresent it because of prejudices and preconceptions that come from your own language having a different writing system.

Next time on Linguistics and SFF: Orthography and Etymology


Posted by on July 20, 2013 in Linguistics

