CADIM Consortium

Computational Approaches to Arabic & Arabic Dialect Modeling

The CADIM Consortium

The Computational Approaches to Arabic & Arabic Dialect Modeling (CADIM) Consortium comprises three research labs at The George Washington University, New York University Abu Dhabi, and Stony Brook University. CADIM started in 2005 as the Columbia Arabic Modeling Group with Drs. Mona Diab, Nizar Habash, and Owen Rambow at the Columbia University Center for Computational Learning Systems (CCLS). In 2014, the CADIM Group was renamed the CADIM Consortium as the three founders separated, with Diab and Habash joining The George Washington University and New York University Abu Dhabi, respectively. In 2017, CCLS officially closed and Dr. Rambow joined Elemental Cognition; he then joined Stony Brook University in 2020.

Natural Language Processing of Arabic and its Dialects

The focus of CADIM has been natural language processing for Arabic and its dialects. Arabic is in fact a collection of dialects with important phonological, morphological, lexical, and syntactic differences. Throughout the Arab world, however, the standard written language is the same: Modern Standard Arabic (MSA), which is also used in some official spoken communication (newscasts, parliamentary debates). MSA is based on Classical Arabic and is not itself a native spoken language. This situation has important negative consequences for Arabic automatic speech recognition (ASR) and natural language processing (NLP): because the spoken dialects are not officially written, it is costly to obtain adequate corpora for training the kinds of ASR and NLP tools commonly in use today, such as language models for ASR. Experience has shown that using MSA text for language models is ineffective in improving dialect ASR.

The CADIM group has produced a couple of hundred publications and a number of popular tools, resources, and standards for Arabic processing.