CODA*: Conventional Orthography for Dialectal Arabic

1. CODA* Mission

  • Dialectal Arabic (DA) refers to the day-to-day vernaculars spoken in the Arab world. DA lives side-by-side with the official language, Modern Standard Arabic (MSA). DA differs from MSA on all levels of linguistic representation, from phonology and morphology to lexicon and syntax. Unlike MSA, DA has no standard orthography since there are no Arabic dialect academies, nor is there a large edited body of dialectal literature that follows the same spelling standard.
  • CODA* (pronounced CODA Star, as in, for any dialect) is a conventional orthography for dialectal Arabic. It is designed primarily for the purpose of developing computational models of Arabic dialects.

2. CODA* Goals & Intentions

  1. CODA* is an internally consistent and coherent convention for writing DA.
  2. CODA* is created for computational purposes.
  3. CODA* uses the Arabic script.
  4. CODA* is a unified framework for writing all DAs.
  5. CODA* aims to strike an optimal balance by maintaining a level of dialectal uniqueness while establishing conventions based on MSA-DA similarities.
  6. CODA* strives to be easily learnable and readable.

3. CODA* Design Principles

  1. CODA* is an ad hoc convention. There are numerous decisions that could have been made differently, especially when it comes to the phonology/orthography interface. These principles make CODA* comparable to English spelling (a bit phonological, a bit historical, with some exceptions). In some cases, we followed decisions that have been made by previously published efforts.
  2. CODA* uses only the inventory of Arabic script characters, including the diacritics used for writing MSA. CODA* does not use extended Arabic characters, e.g. from Persian or Urdu. CODA* can be written undiacritized or diacritized.
  3. Each DA word has a unique orthographic form in CODA* that represents its phonology, morphology, and lexical semantics [meaning].
  4. As a general rule, CODA* uses MSA-like orthographic decisions (rules, exceptions and ad hoc choices), e.g., cliticizing single letter particles, using Shadda for phonological gemination, using Ta-Marbuta, Alif Maqsura, silent Alif in Waw-Alif of plurality, and spelling the definite article Al morphemically.
  5. CODA* generally preserves the phonological form of dialectal words given the unique phonological rules of each dialect (e.g., vowel shortening), and the limitations of Arabic script (e.g., using a diacritic and a glide consonant to write a long vowel). Two important ad hoc exceptions pertain to specific root radical letters that happen to be highly variant across dialects, e.g. ق، ث، ذ، ظ، ج , etc. and to long pattern vowels that can be shortened deterministically in the dialects, e.g., the pattern 1awA2iy3 فواعيل. For these cases, the word is written using the MSA cognate root radicals or pattern.
  6. CODA* preserves dialectal morphology (e.g., dialectal clitics حتقول instead of ستقول). The only exception here is separating the negation and indirect object pronouns although they are part of the word: e.g. (Cairo): ما قلت لهاش /m a # 2 u l t # i l h aa sh/ ‘I did not tell her’.
  7. CODA* preserves dialectal syntax, i.e. there is no change in word order.
  8. CODA* aims to be easy to learn and write, encouraging high inter-annotator agreement; the more CODA* looks like what a dialect speaker may write, the better.
  9. CODA* rules are dialect independent. (Note that dialect-specific exception lists from previous CODA versions have been redesigned and unified into dialect-independent sets of specific rules.)
  10. CODA* specific rules override general rules and apply to certain pre-defined classes of words – roughly corresponding to closed class or highly marking frequent words. The aim is to preserve important morphological information, maintain dialect integrity, and ensure overall readability.

4. CODA* Rules

4.1 Introduction

CODA* (pronounced CODA Star, as in, for any dialect) is a conventional orthography for dialectal Arabic. It is designed primarily for the purpose of developing computational models of Arabic dialects. See the CODA* Main Page for a description of the CODA* mission statement and design guidelines.

Some sections of these guidelines are continuously researched and updated as more dialectal data is incorporated.

(CODA* version: 0.43)

4.2 Basic Terminology

4.2.1 Sounds - Letters - Diacritics

The term sounds is used in the context of pronunciation (phonology), while letters and diacritics are used in the context of writing (orthography). Sounds can be consonants or vowels; they are represented using the CAPHI representation (see Phonology Reference) and are bounded by forward slashes when necessary. Letters and diacritics are symbols used in the Arabic script to write words. In Arabic, letters must always be written, while diacritics are optional.

The space for consonants and vowels is shared by letters and diacritics, neither of which is exclusive to either category of symbols. To understand this shared space, keep in mind the following:


Letters can be used to write:

  • Consonants - /b/, written "ب"; /y/, written "ي", e.g., /b aa b/, باب, 'door'; or /y i k t u b/, يكتب, 'he writes'
  • Vowels - /i/, written "ي", e.g., /k i t aa b i/, كتابي, 'my book'

Diacritics can be used to specify:

  • No vowel - “ ْ ” (Sukuun), e.g., /k a l b/, كلْب, 'dog'
  • Double consonants - “ ّ ” (Shadda), e.g., /k a s s a r/, كسّر, 'he broke'
  • Vowels - “ َ ” (Fatha), e.g., /k a t a b/, كَتَب, 'he wrote'
  • Vowels+consonants - “ ً ” (Nunation), /a n/, e.g., /f i 3 l a n/, فعلاً, 'verily'
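
Because diacritics are optional, a tool that compares spellings often needs to reduce a diacritized word to its letters only. The following is a minimal Python sketch of that step, not part of the CODA* specification. The function name `strip_diacritics` and the choice of the Unicode range U+064B to U+0652 (Fathatan through Sukun) as "the diacritics" are assumptions made for illustration.

```python
# Assumption: "diacritics" means the eight marks U+064B..U+0652
# (Fathatan, Dammatan, Kasratan, Fatha, Damma, Kasra, Shadda, Sukun).
ARABIC_DIACRITICS = frozenset(chr(cp) for cp in range(0x064B, 0x0653))


def strip_diacritics(text: str) -> str:
    """Return text with Arabic diacritics removed; letters are kept as-is."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


if __name__ == "__main__":
    # Diacritized examples taken from the list above.
    for word in ["كَتَب", "كسّر", "كلْب", "فعلاً"]:
        print(word, "->", strip_diacritics(word))
```

Note that removing the Shadda loses the gemination information, so the mapping from diacritized to undiacritized text is one-way.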

4.2.2 Patterns - Other Morphemes

Arabic’s templatic morphology makes common reference to the concept of the root, a typically tri-consonantal abstraction capturing a general meaning about the word. For example, the root ك.ت.ب 'writing-related' appears in words like مكتب 'office' and كتاب 'book'. Each sound in the root is referred to as a radical. The general complement of the root is the pattern, which in the examples above are ma12a3 and 1i2A3 (here, 1, 2, 3 are slots for the root radicals). In addition to the root and pattern templatic morphemes, Arabic uses numerous other concatenative morphemes.
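
The slot notation for patterns can be illustrated with a short Python sketch. It only performs plain substitution of radicals into numbered slots, using the Latin-letter notation of the examples above; the function name `interdigitate` and the dot-separated root format are assumptions for illustration. Real stems may also undergo phonological changes (for instance with weak radicals) that this sketch does not model.

```python
def interdigitate(root: str, pattern: str) -> str:
    """Fill the numbered slots of a pattern with the radicals of a root.

    root    -- radicals separated by dots, e.g. "k.t.b"
    pattern -- template in which the digits 1, 2, 3, ... mark radical slots
    """
    radicals = root.split(".")
    out = []
    for ch in pattern:
        if ch in "123456789" and int(ch) <= len(radicals):
            out.append(radicals[int(ch) - 1])  # slot -> matching radical
        else:
            out.append(ch)  # pattern vowel or consonant, copied unchanged
    return "".join(out)


if __name__ == "__main__":
    print(interdigitate("k.t.b", "ma12a3"))  # pattern of 'office'
    print(interdigitate("k.t.b", "1i2A3"))   # pattern of 'book'
```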

4.2.3 Words - Base Words - Clitics

We define an Arabic base word to consist of a stem and the minimal number of concatenative affixes needed to specify the obligatory features for its part of speech (POS). A stem can be non-templatic or it can be composed from the interdigitation of a root and a pattern. The pattern may specify the features fully, as in broken plurals. Base words are as such the smallest fully formed words. Examples include: كتابين 'two books' and يكتبون 'they write'. Clitics are syntactically independent but phonologically dependent morphemes that are attached to the word phonologically. Words can be base words or base words with added clitics. We use the term word to refer to the phonological utterance or the orthographic string, and we specify as needed. In CODA, phonological words typically map one-to-one to orthographic words; but there are many exceptions, pertaining mostly to clitics that are spelled as separate orthographic words.

Example (Dialect: Cairo)

  Pronunciation (sounds): /w i m a # b i y 2 u l h aa sh/
  Orthography (letters): "وما بيقولهاش"
  Meaning: ‘and he does not say it’

  Morphology:
    Proclitics: و+ /w i/, ما /m a/, ب+ /b i/
    Base Word: يقول /y i 2 uu l/
    Enclitics: +ها /h aa/, +ش /sh/

  Base word structure:
    Prefixes: ي /y/
    Stem: قول /2 u l/
    Suffixes: (مستتر) //

  Stem structure:
    Root: q.w.l
    Pattern: 1u23


Example (Dialect: Cairo)

  Pronunciation (sounds): /w i 7 a y i k t i b uu h a/
  Orthography (letters): "وحيكتبوها"
  Meaning: ‘and they will write it’

  Morphology:
    Proclitics: و+ /w i/, ح+ /7 a/
    Base Word: يكتبوا /y i k t i b uu/
    Enclitics: +ها /h a/

  Base word structure:
    Prefixes: ي /y i/
    Stem: كتب /k t i b/
    Suffixes: وا /uu/

  Stem structure:
    Root: k.t.b
    Pattern: 12i3


4.3 General

4.4 Basic Phonology to Orthography Mapping

4.4.1 Hamza Rules

Hamza (Glottal Stop) spelling follows from the same rules as those of MSA and is unchanged from previous CODA versions.

(For a detailed explanation of Hamza spelling rules in MSA, you can refer to chapter 7 of the QALB annotation guidelines)


Note on base word initial Hamza: In previous versions of CODA, and in MSA spelling, base word initial Hamzas have complex spelling rules. The rule is now simplified to normalize and not spell base word initial Hamzas, though it remains possible to treat the Hamzation (أ, إ) at the beginning of a word as optional.


Note on word initial Madda: Any word that starts with a long vowel of the quality /2 aa/ is spelled with a Madda آ.


Note on DA divergence from MSA cognates: Hamza spelling matches the sound. In other words, dialectal words that have MSA cognates containing Hamza but do not contain an /2/ in their phonology are spelled without the Hamza.

CODA CAPHI Gloss MSA Cognate Dialect Examples; Comments
الف 2 a l f thousand ألف Sanaa Base word initial Hamza is normalized (not written)
مألوف m a 2 l uu f familiar مألوف Sanaa
لا مؤاخذة l a # m u 2 a kh z a excuse me لا مؤاخذة Cairo
فئة f i 2 e denomination فئة Sanaa
يأنتر y i 2 a n t i r he is using the internet يستعمل الإنترنت Sanaa
بريء b a r ii 2 innocent بريء Sanaa
بير b ii r well بئر Sanaa Spelling follows dialect phonology: Hamza is dropped
سما s a m e sky سماء Sanaa Hamza is dropped from dialect, but final vowel spelling maintains the Alif taking etymology into account
ما m aa water ماء Algiers Hamza is dropped from dialect, but final vowel spelling maintains the Alif taking etymology into account
آسف 2 aa s i f sorry آسف Sanaa Madda rule
تأمين t e 2 m ii n insurance تأمين Beirut
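
The base word initial Hamza normalization can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not an official implementation: the function name `normalize_initial_hamza` is hypothetical, the input is assumed to be a base word with proclitics already separated, and the Hamzated Alif is assumed to be encoded as a single precomposed character (U+0623 or U+0625).

```python
ALIF = "\u0627"                        # ا  bare Alif
HAMZATED_ALIFS = {"\u0623", "\u0625"}  # أ and إ


def normalize_initial_hamza(base_word: str) -> str:
    """Replace a base-word-initial أ or إ with a bare Alif.

    Word-internal and word-final Hamzas are left untouched, and an initial
    Madda (آ, U+0622) is kept, in line with the Madda note above.
    """
    if base_word[:1] in HAMZATED_ALIFS:
        return ALIF + base_word[1:]
    return base_word


if __name__ == "__main__":
    for word in ["ألف", "مألوف", "آسف"]:
        print(word, "->", normalize_initial_hamza(word))
```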

4.4.3 Diacritics

While Arabic diacritics are optional in general, they can be crucial for disambiguation in certain contexts. Arabic diacritics are primarily used for representing short vowels, or the absence of vowels. However, the Shadda diacritic is used to represent consonantal gemination, e.g. كَتَّب /k a t t a b/ ‘he dictated’. As such, the Shadda interacts with the number of letters in a word. The general Shadda rule states that it is used within the base word (including suffixes and prefixes), but not across word-clitic boundaries. Any exceptions must be specified in the specific rules (see the specific rule "The Definite Article" for an example of where the general Shadda rule is overridden).

CODA CAPHI Gloss Tokenized CODA Dialect NON-CODA examples; Comments
جنَّنَاهم g a n n a n n aa h u m we made them crazy جنننا+هم Cairo Note that this نا is an obligatory suffix referring to the verbal feature [1p] and is part of the base word
جنَّنَنا g a n n a n n a he/it made us crazy جنَن+نا Cairo
يبارككم b aa r i k k u m [he] congratulates you يبارك+كم Cairo
واحشيننا w aa 7 sh i n n a we miss you واحشين+نا Cairo

4.4.4 Long/Short Vowel Spelling

In many dialects, base word long vowels may be shortened in certain contexts. Generally, the rule is to prefer the long letter-based spelling over the shortened diacritic spelling.

CODA CAPHI Gloss MSA Cognate Dialect Examples; Comments
قانون 2 a n uu n law قانون Cairo So long as the vowel is the same quality as the MSA, we can keep the MSA spelling
قولوا له 2 u l uu # l u tell him [2p] قولوا له Cairo This is because قول is not always short in Cairene and can be pronounced long in different contexts

4.4.5 Root Radical Spelling

Dialectal word root radicals which have MSA cognates will be spelled using the MSA cognate radical if the dialectal radical sound and the MSA radical sound are paired according to a specific set of common sound changes.

Our list of allowed pairings is presented in the table below:

CODA Dialectal Sound Variant(s) MSA Sound
ت t., d, d. t
ث t, t., s th
ج j, tsh, gy, y dj, g
د t., d. d
ذ d, dh., z dh
ر gh r
ز s, s. z
س s., z s
ش tsh sh
ص s, z s.
ض d, dh., z. d.
ط t t.
ظ d., z. dh.
ق j, dz, dj, k, g, qh, 2 q
ك ts, tsh, g k
ن m n

CODA CAPHI Gloss Letter Dialectal Sound MSA Sound Dialect NON-CODA examples; Comments
برتقان b u r t u 2 aa n orange ق 2 q Jeddah برتئان، برتقال; only the sound-letter pairs that are in the list of permitted changes are changed
ثامن t aa m i n eighth ث t th Jeddah تامن
فستان f u s t. aa n dress ت t. t Amman فسطان، فصطان
ذيل d ee l tail ذ d dh Beirut ديل
ضحك d i 7 i k he laughed ض d d. Cairo دحك
ضفدعة d. u f d. a 3 a frog د d. d Cairo ضفضعة
ظهرية d. u h r i y y a shadow ظ d. dh. Jeddah ضهرية
ذوق dh. oo g tasting ذ dh. dh Baghdad ظوق
ضغط dh. a gh a t. he pressed ض dh. d. Jerusalem ظغط
ثورة s a w r a revolution ث s th Cairo سورة
الزعيم 2 a s s e 3 ii m the boss ز s z Sana’a السعيم
صايغ s aa y e gh jeweler ص s s. Cairo سايغ
سلطة s. a l. a t. a salad س s. s Khartoum صلطة
ذبيحة z a b ii 7 a meat ذ z dh Muscat زبيحة
اسبوع 2 i z b uu 3 week س z s Khartoum ازبوع
صغير z gh ii r small ص z s. Beirut زغير
مضبوط m a z. b uu t. correct ض z. d. Jeddah مزبوط
عظيم 3 a z. ii m great ظ z. dh. Damascus عزيم
طريق t. a r ii j road ق j q Baghdad طريج
سمك s i m a ts fish ك ts k Riyadh(B) سمتس
كتابج k i t aa b i ts your [2fs] book ج ts dj, g Riyadh(B) كتابتس، كتابك; Remember that root radicals only apply to the root of the base word, and that clitics such as the possessive pronoun ج+ may have their own rules (see specification tables)
جزاير dz aa y i r Algeria ج dz dj, g Algiers دزاير
طريق t. a r ii dz road ق dz q Baghdad طريدز
عيونج 3 y uu n i tsh your eyes [2fs] ج tsh dj, g Doha عيونش، عيونتس; Remember that root radicals only apply to the root of the base word, and that clitics such as the possessive pronoun ج+ may have their own rules (see specification tables)
شاف tsh aa f he saw ش tsh sh Doha تشاف
سمك s i m a tsh fish ك tsh k Riyadh(B) سمج، سمتش
طريق t. i r ii dj road ق dj q Baghdad طريج
رقم r a k a m number ق k q Jerusalem(R) ركم
قال g aa l he said ق g q Aswan جال، كال
بنك b. a n g bank ك g k Baghdad بنج، بنق
رقم r a qh a m number ق qh q Jerusalem(R) رغم، رجم
غسالة kh a s s ee l a washer غ kh gh Tunis خسالة; while there is no /kh/ to غ mapping in the root radical cognate table, this switch is allowed in this particular case because Tunisian use of /kh a s s ee l a/ and /gh a s s ee l a/ is in free variation
اريد 2 a gh ii d I want ر gh r Mosul اغيد
طريق t. a r ii 2 road ق 2 q Baghdad طريء
جنب j a m b besides ن m n Jeddah جمب
جلس y a l a s he sat down ج y dj, g Abu Dhabi يلس
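
The table of allowed pairings lends itself to a simple lookup. The Python sketch below transcribes the pairing table from this section into a dictionary and answers two questions: which CODA letters could a given dialectal sound be written with, and is a given letter-sound pairing in the permitted list. The names `RADICAL_PAIRINGS`, `candidate_letters`, and `is_allowed_pairing` are hypothetical, and word-specific exceptions noted in the example comments (such as the Tunisian غسالة case) are deliberately not modeled.

```python
# CODA letter -> (dialectal sound variants, MSA sound), transcribed from the
# pairing table in this section. Sounds are written in CAPHI.
RADICAL_PAIRINGS = {
    "ت": ({"t.", "d", "d."}, "t"),
    "ث": ({"t", "t.", "s"}, "th"),
    "ج": ({"j", "tsh", "gy", "y"}, "dj, g"),
    "د": ({"t.", "d."}, "d"),
    "ذ": ({"d", "dh.", "z"}, "dh"),
    "ر": ({"gh"}, "r"),
    "ز": ({"s", "s."}, "z"),
    "س": ({"s.", "z"}, "s"),
    "ش": ({"tsh"}, "sh"),
    "ص": ({"s", "z"}, "s."),
    "ض": ({"d", "dh.", "z."}, "d."),
    "ط": ({"t"}, "t."),
    "ظ": ({"d.", "z."}, "dh."),
    "ق": ({"j", "dz", "dj", "k", "g", "qh", "2"}, "q"),
    "ك": ({"ts", "tsh", "g"}, "k"),
    "ن": ({"m"}, "n"),
}


def is_allowed_pairing(letter: str, dialect_sound: str) -> bool:
    """True if the table lists dialect_sound as a variant of the CODA letter."""
    variants, _msa_sound = RADICAL_PAIRINGS.get(letter, (set(), ""))
    return dialect_sound in variants


def candidate_letters(dialect_sound: str) -> list:
    """All CODA letters (in table order) that list dialect_sound as a variant."""
    return [
        letter
        for letter, (variants, _msa_sound) in RADICAL_PAIRINGS.items()
        if dialect_sound in variants
    ]


if __name__ == "__main__":
    print(candidate_letters("2"))    # glottal stop as a root radical
    print(candidate_letters("tsh"))  # ambiguous: several letters qualify
```

A lookup like this only narrows the candidates; choosing among them still requires knowing the MSA cognate of the word.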

4.4.6 Pattern Spelling

Dialectal words with patterns that are cognates of MSA patterns will retain the spelling choice of the MSA pattern if the difference in pronunciation can be expressed using diacritics (for vowel change or absence), or if the pronunciation is a shortened form of the MSA pattern vowels.

CODA CAPHI Gloss Letter Dialectal Sound MSA Sound Dialect NON-CODA examples; Comments
اتضرب 2 i d. d. a r a b he got hit ت d. t Cairo ادضرب، اضّرب
اتدشدش 2 i d d a sh d a sh he got smashed ت d t Cairo ادّشدش

4.4.7 Alif Maqsura

The MSA rules for spelling the Alif-Maqsura (ى), which are sometimes based on roots and sometimes on patterns, apply in CODA*.

CODA CAPHI Gloss Dialect NON-CODA examples; Comments
اعطى 2 a 3 t. a give [3ms] Abu Dhabi اعطا
بغى b gh a want [3ms] Rabat بغا
حكى 7 a k e chat [3ms] Beirut حكي

4.4.8 Clitic Spelling

The general rule on phonological clitic spelling is that clitics that are mapped into single letters (with possible diacritics) will be spelled attached to the word, and will not interact with the spelling of the word. (For examples of specific rules that override this rule, see "The Definite Article" and "Nominative Pronouns".)

CODA CAPHI Gloss Tokenized CODA Dialect Comments
ولحد ما يجي w l a 7 a d d # m a # y ii g i and until he/it comes و+ل+حد ما يجي Cairo Notice how و and ل (single letter clitics) attach to adjacent morphemes while ما does not.

5. Specific Rules

5.1 The Definite Article

The Arabic definite article is always written as a proclitic ال+ـ, regardless of how it is pronounced (keep in mind that in DA, lunar/solar letters are not always the same as in MSA). As with MSA spelling, general cliticization rules apply except when following the proclitic ل+ـ, where the article is spelled without its ا. The general Shadda rule is overridden in the specific context of ل+ ال+ـ followed by an ل-initial base word.

CODA CAPHI Gloss Tokenized CODA Dialect Examples; Comments
القمر 2 i l 2 a m a r the moon ال+قمر Cairo
الشمس 2 i sh sh a m e s the sun ال+شمس Jerusalem
الكتاب 2 i k k i t aa b the book ال+كتاب Cairo Note how Cairene sometimes treats /k/ as a solar letter
البيت l b ee t the house ال+بيت Jerusalem
البيوت l e b y uu t the houses ال+بيوت Jerusalem
بالبيت b e l b ee t at home ب+ال+بيت Jerusalem
بالبيوت b l e b y uu t at the houses ب+ال+بيوت Jerusalem
للبيت l a l b ee t for the house ل+ال+بيت Jerusalem
للبيوت l a l e b y uu t for the houses ل+ال+بيوت Jerusalem
للشمس l a sh sh a m e s to the sun ل+ال+شمس Jerusalem
للشموس l a l e sh m uu s to the suns ل+ال+شموس Jerusalem
اللجنة 2 e l l a g n a the committee ال+لجنة Cairo
للجنة l e l l a g n a for the committee ل+ال+لجنة Cairo Note how it is not spelled لللجنة and the general Shadda rules are overridden (see general "Diacritics" rules).
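
The relation between the "Tokenized CODA" column and the "CODA" column in the tables above can be sketched as a small Python function. It is a hedged illustration only: the function name `attach_clitics` is hypothetical, the input is assumed to use "+" between morphemes and spaces between orthographic words, and only the general attachment rule plus the two definite-article exceptions described in this section are handled (negation, pronoun, and other specific rules are not).

```python
ALIF = "\u0627"        # ا
LAM = "\u0644"         # ل
ARTICLE = ALIF + LAM   # ال


def attach_clitics(tokenized: str) -> str:
    """Join '+'-separated morphemes into CODA orthographic words."""
    words = []
    for token in tokenized.split():
        parts = token.split("+")
        pieces = []
        for i, part in enumerate(parts):
            follows_lam = i > 0 and parts[i - 1] == LAM
            if part == ARTICLE and follows_lam:
                next_part = parts[i + 1] if i + 1 < len(parts) else ""
                if next_part.startswith(LAM):
                    # ل + ال + Lam-initial base word: only two Lams are written.
                    part = ""
                else:
                    # After the proclitic ل, the article loses its Alif.
                    part = LAM
            pieces.append(part)
        words.append("".join(pieces))
    return " ".join(words)


if __name__ == "__main__":
    for example in ["ب+ال+بيت", "ل+ال+بيت", "ل+ال+لجنة", "و+ل+حد ما يجي"]:
        print(example, "->", attach_clitics(example))
```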

5.2 The Ta-Marbuta

The Ta-Marbuta (ة) is a secondary letter of the Arabic alphabet used to represent a particular suffix morpheme that is often (but not exclusively) associated with the feminine-singular feature (Alkuhlani and Habash, 2011). This morpheme has a number of allomorphs with differing pronunciations. Most notably, it appears as a vowel at the end of nominals, and changes to a /t/ when followed by possessive pronominal enclitics. The Ta-Marbuta should be written as ة in word-final positions, regardless of its pronunciation, and following general CODA rules in non-word-final positions.

CODA CAPHI Gloss Dialect
حاجة 7 aa g a something Cairo
حاجتي 7 aa g t i my thing Cairo
حاجتها 7 aa g i t h a her thing Cairo
طاولة t. aa w l e table Jerusalem
غزالة gh a z ee l i gazelle Beirut
معلمة مدرسة m 3 a l m i t # m a d r a s e school teacher Jerusalem
معلمة مدرسة m 3 a l m e # m a d r a s e she taught a school Jerusalem
معلمتهم m 3 a l m i t h u m their teacher Jerusalem
معلماهم m 3 a l m aa h u m she taught them Jerusalem

5.3 The Plural Waw

Verbal suffixes that indicate the feature plural subject [2p] & [3p] and end with the sounds (/u/, /uu/, /o/, /oo/, and /aw/) will represent those sounds as وا+ (‘Waw of Plurality’ واو الجماعة) in word-final positions, and also when followed by other attached clitics. This rule is similar to the MSA rule, except for expanding the phonetic definition.

CODA CAPHI Gloss Dialect
قالوا 2 aa l u they said Cairo
بيقولوا b i y 2 uu l u they say Cairo
نقولوا n q uu l u we say Tunis
قالوا g aa l a w they said Abu Dhabi
قالوها 2 a l uu h a they said it Cairo
ما قالوش m a # 2 a l uu sh they did not say Cairo
قالوا له 2 a l uu # l u they said to him Cairo

5.4 Negation Clitics

The negation particle (/m a/, /m aa/) has phonologically become a proclitic in many dialects. However, it is always written as a separate particle ما except when overridden by other specification rules. One example of such a rule is the case of negated pronouns (see "Nominative Pronouns"), in which ما is written attached. Another example involves some negated existentials (see "Existentials") in which the Alif can be elided.

CODA CAPHI Gloss Dialect
ما قال m aa # 2 aa l he did not say Damascus
ما قالش m a # 2 a l sh he did not say Cairo
ما بدناش m a # b i d d n aa sh we do not want Amman
مانيش m a n ii sh I am not Cairo
ماهياش m a h i y y aa sh she is not Cairo

5.5 Prepositional Enclitics

Post-verbal and post-nominal prepositions that have phonologically become enclitics will nonetheless be spelled separately from the words they follow. The most prominent such case is the preposition ل+ـ ‘to, for’ which introduces indirect verb objects in a number of dialects.

CODA CAPHI Gloss Dialect
قالوها لي 2 a l u h aa # l i they said it to me Cairo
ما قالوا ليش m a # 2 a l u # l ii sh they did not say it to me Cairo
بالنسبة له b i n n i s b aa # l u as for him Cairo

5.6 Numbers

The words for numbers in Arabic dialects are amongst the most rich in phonological variety. The rules of writing number words in CODA* add the following exceptions to the general rules:

  • The sometimes reduced historical Ta-Marbuta in the middle of the teens (11-19) is always written as ت regardless of its pronunciation as /t/ or /t./. It is never reduced to a Shadda diacritic.
  • The sometimes reduced historical /3/ ع in numbers such as عشر ‘ten, -teen’, and تسع ‘nine’ will always be spelled as ع even if completely elided or turned into a vowel.
  • The sometimes reduced or altered final letter of عشر ‘ten, -teen’ will be written as pronounced. The variation in this form marks different syntactic construction in some dialects.
  • The hundreds will be written as a single word only if the hundred part is singular in form.
  • The remnant /t/ of the historical Ta-Marbuta appearing only before Alif-initial words after number words will not be written.

The above rules apply to all number words, whether ordinal, cardinal, or fractions. Number words sometimes have different masculine and feminine forms that are used according to different dialect-specific rules. CODA guidelines do not interact with these dialect-specific decisions.

CODA CAPHI Gloss Dialect
ثمانية th a m a n y e eight Salt
ثمانية t a m a n y e eight Amman
ثمانة t a m aa n e eight Damascus
ثمان t a m a n eight X Amman
ثمان t m aa n eight X Damascus
ثمانتعش t a m a n t a 3 sh eighteen Amman
ثمانتعش t m a n t a 3 sh eighteen Damascus
ثمنتعش th m u n t. a 3 i sh eighteen Baghdad
ثمانتعشر t a m a n t aa sh a r eighteen Cairo
ثمانتعشر t a m a n t a 3 sh a r eighteen X Amman
ثمانتعشن th m a n t. aa sh e n eighteen X Tunis
اربعمية 2 a r b a 3 m i y y e 400 Amman
اربعمية 2 a r b a 3 m ii t 400 X Amman
ربعمية r u b 3 u m i y y a 400 Cairo
اربع الاف 2 a r b a 3 # t a l aa f 4,000 Cairo
خمس ارباع kh a m a s # t i r b aa 3 five-fourths Cairo

5.7 Pronominal Enclitics

This section gives the set of specifications for the pronominal clitics, which can serve as possessive pronouns. Some of the decisions follow from the general rules, but for the most part they are intended to normalize the spelling as close as possible to the MSA variety without adding unnecessary and unresolvable ambiguity (e.g., using diacritics). It is important to point out again that this list is not dialect specific, but rather, it lists all the phonological forms of the pronominal morphemes in all dialects. The CODA spelling for a dialect will depend on the phonology-morphology pair it corresponds to. Some of these pronouns have a large number of variants that can be ambiguous cross-dialectally. An interesting example is the case of the morpheme pronunciation /a/ which can be 3rd masculine singular in Gulf Arabic, but 3rd feminine singular in North Levantine: /k t aa b + a/ can correspond to كتاب+ه ‘his book’ (Abu Dhabi) or to كتاب+ها ‘her book’ (Damascus). The CODA specification does not address how a particular dialect may organize the use of the different forms in terms of morphotactics, e.g., the possessive 2nd person singular feminine pronominal clitic is always ك+ in Tunis, and always كي+ in Mosul; however, in Amman, it is كي+ post-vocalically, and ك+ otherwise. The underspecification of some features is intentional as some pronominal clitics may be used with different associated genders in different dialects, e.g. كن+ is 2nd person plural feminine in Doha, but it is gender ambiguous in Beirut.


Here is the specification table for Pronominal Clitics:

CODA CAPHI Morpheme Features
ني /n i/, /n e/, /n ii/, /n ee/ 1st Person Singular
ي /i/, /ii/, /e/, /ee/, /y/, /y a/, /y e/ 1st Person Singular
ك /k/, /i k/, /e k/, /k a/ 2nd Person Singular
كي /k i/, /k e/, /k ii/, /k ee/ 2nd Person Singular Feminine
ج /tsh/, /i tsh/, /ts/, /i ts/ 2nd Person Singular Feminine
ه /h/, /h u/, /u/, /o/, /a/, /a h/, /u h/, [length] 3rd Person Singular Masculine
ها /h a/, /h aa/, /a/, /aa/, /h e/, /h ee/ 3rd Person Singular Feminine
نا /n a/, /n aa/, /n e/, /n ee/ 1st Person Plural
كم /k u m/, /k o m/ 2nd Person Plural
كن /k u n/, /k o n/, /tsh i n/ 2nd Person Plural
هم /h u m/, /h o m/, /u m/, /o m/ 3rd Person Plural
هن /h u n/, /h o n/, /u n/, /o n/ 3rd Person Plural

CODA CAPHI Gloss Dialect
كتابها k i t aa b a her book Damascus
كتابه k i t aa b a his book Abu Dhabi
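
The cross-dialectal ambiguity of the pronominal clitics can be made concrete with a lookup over the specification table. The Python sketch below transcribes that table and returns every CODA spelling whose CAPHI list contains a given pronunciation. The names `PRONOMINAL_CLITICS` and `coda_candidates` are hypothetical, and the "[length]" variant of the 3rd person masculine singular clitic is omitted because it is not a segment string.

```python
# (CODA spelling, morpheme features, CAPHI pronunciations), transcribed from
# the specification table for pronominal clitics in this section.
PRONOMINAL_CLITICS = [
    ("ني", "1st Person Singular", ["n i", "n e", "n ii", "n ee"]),
    ("ي", "1st Person Singular", ["i", "ii", "e", "ee", "y", "y a", "y e"]),
    ("ك", "2nd Person Singular", ["k", "i k", "e k", "k a"]),
    ("كي", "2nd Person Singular Feminine", ["k i", "k e", "k ii", "k ee"]),
    ("ج", "2nd Person Singular Feminine", ["tsh", "i tsh", "ts", "i ts"]),
    ("ه", "3rd Person Singular Masculine", ["h", "h u", "u", "o", "a", "a h", "u h"]),
    ("ها", "3rd Person Singular Feminine", ["h a", "h aa", "a", "aa", "h e", "h ee"]),
    ("نا", "1st Person Plural", ["n a", "n aa", "n e", "n ee"]),
    ("كم", "2nd Person Plural", ["k u m", "k o m"]),
    ("كن", "2nd Person Plural", ["k u n", "k o n", "tsh i n"]),
    ("هم", "3rd Person Plural", ["h u m", "h o m", "u m", "o m"]),
    ("هن", "3rd Person Plural", ["h u n", "h o n", "u n", "o n"]),
]


def coda_candidates(caphi: str) -> list:
    """Return (CODA spelling, features) pairs matching a CAPHI pronunciation.

    The pronunciation may be given with or without surrounding slashes.
    """
    key = " ".join(caphi.strip("/").split())  # normalize slashes and spacing
    return [
        (coda, features)
        for coda, features, sounds in PRONOMINAL_CLITICS
        if key in sounds
    ]


if __name__ == "__main__":
    # /a/ is ambiguous: the dialect decides between the two spellings.
    for coda, features in coda_candidates("/a/"):
        print(coda, features)
```

The lookup returns candidates only; picking one requires knowing which morpheme the dialect associates with that pronunciation.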

5.8 Nominative Pronouns

Nominative pronouns are spelled as pronounced using general phonological rules, except for the final vowels, which are spelled in two different ways:

  • (a) as diacritical forms if (i) the vowel is short and (ii) the spelling of the undiacritized base word exactly matches the MSA form for هو، هي، هم، هن and انتَ [masculine singular].
  • Else (b) as letter form (ا، و، ي) regardless of their length: احنا or انتو.

Decision notes:

  • انتي - as in /2 i n t i/, ‘you [fs]’ is spelled with the final “ي” always to reflect its common usage and to distinguish it from the undiacritized form of انت.
  • نحنا - as in /n i 7 n a/, ‘we’ is spelled with a final “ا” and not as نِحْنَ because the MSA form is perceived to be highly associated with the final vowel /u/ as in /n a 7 n u/.
  • The rule treats هو، هي، هم، هن exceptionally because adding additional letters used for vowel marking (ا،و،ي،ه) can cause added ambiguity: هوا، هما،هيا
  • The Lebanese Arabic /h u w w i/, ‘he’ is spelled as هُوِّ as the final vowel /i/ is a short vowel and as such it follows from Rule Part (a).
  • Negated pronouns may become a single word when negated using a circumfix negation, e.g. /m a n ii sh/ “مانيش”.
    • Multiples of Alif: In the cases of negation particles ending with “ا” attaching to pronouns starting with “ا”, only one “ا” should remain, e.g. /m a 7 n aa sh/ → ماحناش → ما+احنا+ش.

CODA CAPHI Gloss Dialect
احنا 2 i 7 n a we Cairo
انا 2 a n a I Cairo
انت 2 i n t a you [2m] Cairo
انتو 2 i n t u you [2p] Cairo
انتي 2 i n t i you [2fs] Cairo
اني 2 a n i I Tripoli
نحنا n i 7 n a we Cairo
هم h u m m a they Cairo
هو h u w w a he, it Cairo
هي h i y y a she, it Cairo

CODA CAPHI Gloss Tokenized CODA Dialect
ماحناش m a 7 n aa sh we are not ما+احنا+ش Cairo
مانتاش m a n t aa sh you are not ما+انت+ش Cairo
مانتوش m a n t uu sh you are not ما+انتو+ش Cairo
مانتيش m a n t ii sh you are not ما+انتي+ش Cairo
مانيش m a n ii sh I am not ما+اني+ش Cairo
ماهماش m a h u m m aa sh they are not ما+هم+ش Cairo
ماهواش m a h u w w aa sh he is not, it is not ما+هو+ش Cairo
ماهياش m a h i y y aa sh she is not, it is not ما+هي+ش Cairo

5.9 Vocative Familial Expressions

Some of the vocative expressions used primarily for familial reference have vocalic endings that are homophonous with pronominal suffixes. These endings are spelled following the general phonology-to-orthography rules. For example, the word /3 a m m o/ in the dialect of Amman can mean ‘uncle!’ (spelled in CODA as عمو) or ‘his uncle’ (spelled in CODA as عمه).

CODA CAPHI Gloss Dialect
عمو 3 a m m o uncle! Amman
عمه 3 a m m o his uncle Amman

5.10 Relative Pronouns

Some forms containing relative pronouns can have different spelling rules, primarily to disambiguate differences in meaning.

CODA CAPHI Gloss Dialect Examples; Comments
اللي 2 i l l i who, which, whom Cairo Follows general spelling rules
يا اللي y a l l i O you who.. Cairo Following general rules, the vocative morpheme يا is spelled separately connoting a vocative expression
يا اللي y e l l i O you who.. Tunis Following general rules, the vocative morpheme يا is spelled separately connoting a vocative expression
ياللي y a l l i who, which, whom Beirut هو ياللي اخترته; Note from the meaning how this is not a vocative expression, and the يا is simply a part of the relative pronoun in some dialects of the Levant
لاللي l i l l i to whom.. Riyadh Following general cliticization rules, the ل prepositional clitic attaches to the relative pronoun without changing its spelling

5.11 Waqt/Sa3a forms

Some forms related to time, such as the Cairene /d i l w a 2 t i/, دلوقتي, 'now', are frozen forms which etymologically transform and combine various morphemes, in this case: هذا + الوقت. Instead of spelling it phonologically دلوئتي, we preserve etymological information by considering the nominal part of the word and spelling it according to general root radical cognate rules.

CODA CAPHI Gloss Dialect
لسا l i s s a not yet Abu Dhabi, Beirut
هسا h a s s a now, just now Baghdad, Amman
هسا h i s s e now, just now Baghdad
هسعتا h a s s a 3 t a now, just now Mosul
هلقيت h a l 2 ee t now, just now Jerusalem
هلقيت h a l k ee t now, just now Jerusalem(R)
شوقت sh w a q i t when Mosul
شوقت sh w a k i t when Baghdad
وقتيش w a 2 t ee sh when Jerusalem
وقتيش w a k t ee sh when Jerusalem
ابساع 2 i b s aa 3 quickly Baghdad
دلوقتي d i l w a 2 t i now, just now Cairo
فيسع f ii s a 3 quickly Tunis
لسع l i s s a 3 not yet Jeddah
هلق h a l l a 2 now, just now Beirut, Jerusalem, Amman, Damascus
فوقاش f uu q aa sh when Rabat
وقتاش w a q t aa sh when Rabat
وقتاش w a q t ee sh when Tunis

5.12 Words with the name of God ‘Allah’

  • Words containing the name of God ‘Allah’ will maintain its MSA spelling, e.g., /b a l. l. a/, “بالله”.

Decision notes:

  • We make an exception for /y a l. l. a/ to accept both the etymological spelling “يالله” and the phonological spelling “يلا”.
  • It is observed that “يلا” is sometimes preferred over “يالله” to avoid using the name of God in contexts that might be considered indecent.
  • In some dialects such as Tunisian, “يلا” inflects as an imperative verb, hence the spelling “يلا” is more appropriate.

CODA CAPHI Gloss Dialect
يلا y a l. l a hurry up, come on! Abu Dhabi
يلا y a l. l. a hurry up, come on! Cairo
يلا y a l l a hurry up, come on! Rabat
يالله y a l. l a hurry up, come on! Abu Dhabi
يالله y a l. l. a hurry up, come on! Cairo
يالله y a l l a hurry up, come on! Beirut, Rabat
ان شاء الله 2 i n # sh aa # l. l a in God’s will Abu Dhabi
ان شاء الله 2 i n # sh aa # l l a in God’s will Beirut
ان شاء الله 2 i n # sh aa 2 # a l. l. aa h in God’s will Cairo
ان شاء الله 2 i n # sh a # l. l. a in God’s will Cairo
يا الله y a # 2 a l. l. a oh God! Cairo, Abu Dhabi, Rabat, Beirut

5.13 Existentials

  1. The existential expression /f ii/, فيه, 'there is', is attached to the pronominal clitic [3ms] ه+, known in Arabic grammar as Dhameer Al Sha'n, ضمير الشأن, though it is often not pronounced.
  2. Another existential rule involves negated existentials, as in the Cairene /m a f ii sh/, مفيش, 'there isn't', in which the negation clitic is shortened and attached.

CODA CAPHI Gloss Dialect
فيه f ii there is Cairo
به b e h there is Sanaa

CODA CAPHI Gloss Tokenized CODA Dialect Examples; Comments
مفيش m a f ii sh there isn't ما+في+ه+ش Cairo
مابش m aa b i sh there isn't ما+ب+ه+ش Sanaa Note how long vowels are never shortened

5.14 Demonstrative Pronouns

Demonstrative pronouns can be found in three forms:

  • Simple pronouns: the demonstrative pronoun on its own, e.g., /d oo l/, دول, 'these, those'
  • Extended pronouns: extensions of the simple pronoun, e.g., /h a d oo l/, هدول, 'these, those'
  • Complex pronouns: a demonstrative pronoun and a personal pronoun attached together, e.g., /h a h a w w a/, هاهو, 'there he is, there it is'

They are spelled as follows:

  • If simple, spell phonetically according to the general rules, with the following in mind:
    • Pronoun consonants that might have the MSA cognate “ذ”, such as in EGY /d oo l/, are spelled phonetically, as in “دول”. Emphatic variants of the cognate are not considered: for example, LEV /h aa dh./ is spelled with the cognate letter, as “هاذ”.
    • Some pronouns might vary in vowel length only; in such cases, the long vowel variant is preferred.
  • If extended, spell the extensions according to the general rules.
  • If complex, determine the personal pronoun and spell it according to the rules in "Nominative Pronouns", and spell the demonstrative pronoun part according to the simple pronoun rules.

CODA CAPHI Gloss Dialect
اهو 2 a h a w w a there he is, there it is Tunis
اهو 2 a h o there he is, there it is Cairo
اهوكا 2 a h a w k a there he is (unseen) Tunis
اهوكم 2 a h a w k u m there they are (unseen) Tunis
اهوما 2 a h a w m a there they are Tunis
اهي 2 a h a y y a there she is, there it is Tunis
اهي 2 a h e there she is, there it is Cairo
اهيكا 2 a h a y k a there she is (unseen) Tunis
داك d aa k that Rabat
داك الشي d aa k # i sh sh i that thing Rabat
ده d a h this, that Cairo
دوك d uu k those Rabat
دوكهم d uu k h u m these, those Cairo
دول d oo l these, those Cairo
دول d uu l these, those Cairo
دي d i this, that Cairo
ديك d ii k that Rabat
ذاك dh aa k that Abu Dhabi
ذول dh oo l these, those Abu Dhabi
ذي dh ii this Abu Dhabi, Rabat
ذيلا dh ee l a these, those Abu Dhabi
ذيلن dh ee l i n these, those Abu Dhabi
راك r aa k there you are Rabat
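
The vowel-length rule for simple pronouns (when variants differ only in vowel length, prefer the long-vowel variant) can be sketched over CAPHI strings. This is a hedged illustration, not a CODA* tool: the helper names are hypothetical, the code assumes CAPHI writes long vowels as doubled symbols (aa, ii, uu, ee, oo), and the pair /d i/ vs. /d ii/ used in the usage lines is an illustrative input rather than a pair attested in the table.

```python
# Assumption: CAPHI writes long vowels as doubled symbols.
LONG_VOWELS = {"aa", "ii", "uu", "ee", "oo"}


def _short(phone: str) -> str:
    """Map a long vowel to its short counterpart; leave other phones unchanged."""
    return phone[0] if phone in LONG_VOWELS else phone


def _long_count(caphi: str) -> int:
    """Count the long vowels in a space-separated CAPHI string."""
    return sum(phone in LONG_VOWELS for phone in caphi.split())


def differ_only_in_vowel_length(a: str, b: str) -> bool:
    """True if two CAPHI strings are identical once vowel length is ignored."""
    pa, pb = a.split(), b.split()
    return len(pa) == len(pb) and all(_short(x) == _short(y) for x, y in zip(pa, pb))


def prefer_long_variant(a: str, b: str) -> str:
    """Hypothetical helper: return the variant with more long vowels."""
    if not differ_only_in_vowel_length(a, b):
        raise ValueError("variants differ in more than vowel length")
    return a if _long_count(a) >= _long_count(b) else b


if __name__ == "__main__":
    print(prefer_long_variant("d i", "d ii"))
    print(differ_only_in_vowel_length("d oo l", "d uu l"))
```

Variants such as /d oo l/ and /d uu l/ differ in vowel quality, not only in length, so the helper rejects them instead of choosing between them.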

5.15 Adverbials, like this/that

  1. Similar to demonstrative pronouns, this class of adverbials follows the general CODA rules, except for "د", which is spelled as pronounced. Also note that "كده" is spelled with ه rather than ا (counter to the general rules) because that form appears much more frequently.

CODA CAPHI Gloss Dialect
كي tsh ii y like this, like that Abu Dhabi
كي tsh ii like this, like that Abu Dhabi
كده k i d a like this, like that Cairo, Jedda, Khartoum
كده k e d a like this, like that Cairo
كذي tsh i dh i like this, like that Abu Dhabi, Kuwait, Manama
هكا h a k k a like this, like that Rabat, Tunis
هكاك h a k k aa k like this, like that Rabat
هكايا h e k k ee y a like this Tunis
هكدا h a k d a like this, like that Rabat
هكداك h a k d aa k like this, like that Rabat
هكيكا h a k ee k a like this, like that Tunis
هيك h a y k like this, like that Beirut
كذه k i dh a like this, like that Sanaa
كذيه k i dh e y y e like this Sanaa
هكي h i k i like this, like that Benghazi
هكـي h i k k i like this, like that Mosul
هيك h ii tsh like this, like that Baghdad
هيك h ee k like this, like that Amman, Damascus, Jerusalem

6. CODA* Seed Lexicon

A large and growing database containing verified examples of CODA* spelling for dialectal words, including affixes and clitics, is available here.

7. Publications and Resources About CODA

  • Unified Guidelines and Resources for Arabic Dialect Orthography, 2018 [1]
  • A Conventional Orthography for Tunisian Arabic, 2014 [2]
  • A Conventional Orthography for Maghrebi Arabic, 2016 [3]
  • Palestinian Arabic Conventional Orthography Guidelines, 2015 [4]
  • A Conventional Orthography for Algerian Arabic, 2015 [5]
  • Conventional Orthography for Dialectal Arabic, 2012 [6]
  • Conventional Orthography for Dialectal Arabic (CODA) Version 0.1, 2011 [7]

8. Publications and Resources Using CODA

  • Noise-Robust Morphological Disambiguation for Dialectal Arabic, 2018 [8]
  • Addressing Noise in Multidialectal Word Embeddings, 2018 [9]
  • SUAR: Towards Building a Corpus for the Saudi Dialect, 2018 [10]
  • Joint Diacritization, Lemmatization, Normalization, and Fine-Grained Morphological Tagging, 2019 [11]
  • TArC: Incrementally and Semi-Automatically Collecting a Tunisian Arabish Corpus, 2020 [12]
  • Targeted Topic Modeling for Levantine Arabic, 2020 [13]
  • Part-of-Speech Tagging for Arabic Tweets Using CRF and Bi-LSTM, 2020 [14]
  • Morphological Tagging and Disambiguation in Dialectal Arabic Using Deep Learning Architectures, 2020 [15]
  • Transcription de Corpus Oraux d’Arabe Parlé en Interaction. Convention AraPI et Annexes, 2019 [16]
  • Learn Palestinian Arabic, 2016 [17]

9. Contributors


Project Lead

  • Nizar Habash, New York University Abu Dhabi, UAE.

Current Maintainer

  • Fadhl Eryani, New York University Abu Dhabi, UAE.

Team

  • Dana Abdulrahim, University of Bahrain, Bahrain.
  • Emad Adel, Sbikha 1979 High School, Tunisia.
  • Diyam Akra, Birzeit University, Palestine.
  • Faeq Alrimawi, Birzeit University, Palestine.
  • Mahdi Arar, Birzeit University, Palestine.
  • Eric Bartolotti.
  • Lamia Belguith, Université de Sfax, Tunisia.
  • Rahma Boujelbane, Université de Sfax, Tunisia.
  • Tariq Daouda, Université de Montréal, Canada.
  • Mona Diab, George Washington University, USA.
  • Mariem Ellouze, Université de Sfax, Tunisia.
  • Alex Erdmann, New York University Abu Dhabi, UAE.
  • Sara Hassan, New York University Abu Dhabi, UAE.
  • Mustafa Jarrar, Birzeit University, Palestine.
  • Salam Khalifa, New York University Abu Dhabi, UAE.
  • Abir Masmoudi, Université de Sfax, Tunisia.
  • Owen Rambow, Stony Brook University, USA.
  • Nassim Regragui, Copenhagen Business School, Denmark.
  • Houda Saadane, Université de Marne-la-Vallée, France.
  • Houcemeddine Turki, Université de Sfax, Tunisia.
  • Nasser Zalmout, New York University Abu Dhabi, UAE.
  • Inès Zribi, Université de Sfax, Tunisia.

  1. Nizar Habash, Fadhl Eryani, Salam Khalifa, Owen Rambow, Dana Abdulrahim, Alexander Erdmann, Reem Faraj, Wajdi Zaghouani, Houda Bouamor, Nasser Zalmout, Sara Hassan, Faisal Al shargi, Sakhar Alkhereyf, Basma Abdulkareem, Ramy Eskander, Mohammad Salameh, and Hind Saddiki. Unified guidelines and resources for Arabic dialect orthography. In Proceedings of the Language Resources and Evaluation Conference (LREC). Miyazaki, Japan, 2018. URL: https://www.aclweb.org/anthology/L18-1574.pdf

  2. Ines Zribi, Rahma Boujelbane, Abir Masmoudi, Mariem Ellouze, Lamia Belguith, and Nizar Habash. A conventional orthography for Tunisian Arabic. In Proceedings of the Language Resources and Evaluation Conference (LREC). Reykjavik, Iceland, 2014. URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/219_Paper.pdf

  3. Houcemeddine Turki, Emad Adel, Tariq Daouda, and Nassim Regragui. A Conventional Orthography for Maghrebi Arabic. In Proceedings of the Language Resources and Evaluation Conference (LREC). Portorož, Slovenia, 2016. URL: https://www.researchgate.net/profile/Houcemeddine_Turki/publication/311589181_A_Conventional_Orthography_for_Maghrebi_Arabic/links/584fd40608ae4bc8993b35ae.pdf

  4. Nizar Habash, Mustafa Jarrar, Faeq Alrimawi, Diyam Akra, Nasser Zalmout, Eric Bartolotti, and Mahdi Arar. Palestinian Arabic conventional orthography guidelines. Technical Report, Birzeit University and New York University Abu Dhabi, 2015. URL: http://www.jarrar.info/publications/HR15.pdf

  5. Houda Saadane and Nizar Habash. A Conventional Orthography for Algerian Arabic. In Proceedings of the Workshop for Arabic Natural Language Processing (WANLP), 69. Beijing, China, 2015. URL: https://www.aclweb.org/anthology/W15-3208.pdf

  6. Nizar Habash, Mona Diab, and Owen Rambow. Conventional Orthography for Dialectal Arabic. In Proceedings of the Language Resources and Evaluation Conference (LREC), 711–718. Istanbul, Turkey, 2012. URL: http://www.lrec-conf.org/proceedings/lrec2012/pdf/579_Paper.pdf

  7. Nizar Habash, Mona Diab, and Owen Rambow. Conventional Orthography for Dialectal Arabic (CODA) Version 0.1. Technical Report CCLS-11-02, Columbia University Center for Computational Learning Systems, 2011. URL: https://academiccommons.columbia.edu/doi/10.7916/D8V69SG2/download

  8. Nasser Zalmout, Alexander Erdmann, and Nizar Habash. Noise-robust morphological disambiguation for dialectal Arabic. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). New Orleans, Louisiana, USA, 2018. URL: https://www.aclweb.org/anthology/N18-1087.pdf

  9. Alexander Erdmann, Nasser Zalmout, and Nizar Habash. Addressing noise in multidialectal word embeddings. In Proceedings of the Conference of the Association for Computational Linguistics (ACL). Melbourne, Australia, 2018. URL: https://www.aclweb.org/anthology/P18-2089.pdf

  10. Nora Al-Twairesh, Rawan Al-Matham, Nora Madi, Nada Almugren, Al-Hanouf Al-Aljmi, Shahad Alshalan, Raghad Alshalan, Nafla Alrumayyan, Shams Al-Manea, Sumayah Bawazeer, Nourah Al-Mutlaq, Nada Almanea, Waad Bin Huwaymil, Dalal Alqusair, Reem Alotaibi, Suha Al-Senaydi, and Abeer Alfutamani. SUAR: towards building a corpus for the Saudi dialect. In Proceedings of the International Conference on Arabic Computational Linguistics (ACLing). 2018. URL: http://fac.ksu.edu.sa/sites/default/files/suar.pdf

  11. Nasser Zalmout and Nizar Habash. Joint diacritization, lemmatization, normalization, and fine-grained morphological tagging. arXiv preprint arXiv:1910.02267, 2019. URL: https://arxiv.org/abs/1910.02267

  12. Elisa Gugliotta and Marco Dinarelli. TArC: incrementally and semi-automatically collecting a Tunisian Arabish corpus. arXiv preprint arXiv:2003.09520, 2020. URL: http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.770.pdf

  13. Shorouq Zahra. Targeted topic modeling for Levantine Arabic. 2020. URL: https://www.diva-portal.org/smash/get/diva2:1439483/FULLTEXT01.pdf

  14. Wasan AlKhwiter and Nora Al-Twairesh. Part-of-speech tagging for Arabic tweets using CRF and Bi-LSTM. Computer Speech & Language, 65:101138, 2020. URL: https://www.sciencedirect.com/science/article/pii/S0885230820300711?casa_token=drTyaeip1vwAAAAA:I0QYQTH6-j7geAmGb2x_0JzydWAERNQGOCs-5g6g_lumIZWlvAH5TiGm6nzPdsJm3fRe9zKZ

  15. Nasser Zalmout. Morphological Tagging and Disambiguation in Dialectal Arabic Using Deep Learning Architectures. PhD thesis, New York University Tandon School of Engineering, 2020. URL: https://search.proquest.com/docview/2385667717?pq-origsite=gscholar&fromopenview=true

  16. Lina Choueiri, Loubna Dimachki, Catherine Pinon, and Véronique Traverso. Transcription de corpus oraux d’arabe parlé en interaction. Convention AraPI et annexes. 2019. URL: https://hal.archives-ouvertes.fr/hal-02153116/document

  17. Learn Palestinian Arabic, the spoken dialect of Palestine. 2016. URL: http://www.learnpalestinianarabic.com/