Studying Arabic at the Advanced-Mid/Advanced-High Level

Applying Advanced Arabic to Research, Part I: Intro to Arabic Language Data Analysis

Introduction to the Series

Our Pressbooks site already contains extensive materials covering the fundamentals of Arabic language learning at all levels. This entry is the first in a series that branches out to cover research applications of the language, an essential part of continued learning.

The Arabic Language & Data Analysis

Despite immense progress in recent years in integrating non-Latin scripts into many data manipulation platforms, analyzing Arabic text can present challenges, especially for those at the beginning stages of learning coding languages like R. The following general tips on how to get started assume a basic understanding of R that a graduate student in the social sciences or sciences likely acquires within the first month of their methods sequence. This first installment focuses on basic data manipulation: loading packages into R, renaming variables, and other key tips to get you started.

  1. Before moving into the R portion, there is an important background skill at play here that many Arabic learners struggle with because of how the language is taught in Western countries: typing. Typing in Arabic will be an essential skill in the long run, but it is often surprisingly difficult to get used to and learn effectively. If you are on a tight timeline with the project, forcing yourself to type in Arabic might not be the best use of your time, and you may want to copy and paste words instead. However, at some point, be sure to set aside time during a break to learn this skill: it saves time in the long run and only becomes more essential as you progress and engage with the language more in your work.
  2. The first step is to install and load a package that can read Arabic data into R.
  3. Check your Arabic language dataset and be sure to convert it into a regular .csv file encoded in UTF-8.
  4. Even though RStudio can now read Arabic variables into your environment, BE WARNED: writing the actual Arabic words in RStudio is a fool’s errand. It is better to write out your code in a Word document and then paste it into RStudio to see if it runs.
  5. Because of #4, depending on what you are trying to accomplish, it may be worthwhile to recode your variables in Latin script to facilitate the coding process and allow you to test code as you go along. In general, it’s a good idea to do this because you never know how you’ll end up reusing this data. There are two things to keep in mind, however, if you do this:
    1. Be sure to keep a table of the translations in case you need to publish the final dataset in Arabic or need to merge with other Arabic language data in the future. It is also essential for the integrity of the codebook.
    2. Because of the previous step, DO NOT CHANGE THE NAMES OF THE NEW LATIN VARIABLES. Doing so will cause immense confusion if you need to translate them back to Arabic later.
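
The steps above (reading a UTF-8 .csv, recoding variable names into Latin script, and keeping a translation table) might be sketched roughly as follows. This is only an illustration under several assumptions: it uses base R's `read.csv` rather than any particular package, and the file, the Arabic column names, the Latin names (`age`, `city`), and the `codebook` table are all hypothetical. The Arabic strings are written as `\u` escapes, with the Arabic shown in comments, so the code itself stays in Latin script.

```r
# Hypothetical example file: a tiny UTF-8 .csv written to a temporary path
# so the sketch runs on its own. In practice, point `path` at your dataset.
path <- tempfile(fileext = ".csv")
lines <- c(
  "\u0627\u0644\u0639\u0645\u0631,\u0627\u0644\u0645\u062f\u064a\u0646\u0629", # العمر,المدينة
  "25,\u0627\u0644\u0642\u0627\u0647\u0631\u0629",                             # 25,القاهرة
  "31,\u062a\u0648\u0646\u0633"                                                # 31,تونس
)
writeLines(lines, path, useBytes = TRUE)  # write the UTF-8 bytes unchanged

# Read the file as UTF-8; check.names = FALSE keeps the Arabic headers as-is.
dat <- read.csv(path, fileEncoding = "UTF-8", check.names = FALSE,
                stringsAsFactors = FALSE)

# Translation table (hypothetical): one row per variable, Arabic and Latin.
codebook <- data.frame(
  arabic = c("\u0627\u0644\u0639\u0645\u0631",              # العمر
             "\u0627\u0644\u0645\u062f\u064a\u0646\u0629"), # المدينة
  latin  = c("age", "city"),
  stringsAsFactors = FALSE
)

# Rename Arabic -> Latin, stopping if any column is missing from the table.
idx <- match(names(dat), codebook$arabic)
stopifnot(!anyNA(idx))
names(dat) <- codebook$latin[idx]

# The same table recovers the Arabic names later, as long as the Latin
# names have not been changed in the meantime.
arabic_names <- codebook$arabic[match(names(dat), codebook$latin)]
```

Saving `codebook` alongside the dataset (for example with `write.csv`) keeps the translation table available for the codebook and for any later merge with other Arabic-language data.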

After completing these steps, you should be ready to get started on simple univariate and bivariate analysis, and even run the vast majority of more complex quantitative models on your data.
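
As a rough illustration of what simple univariate and bivariate analysis can look like once variable names are in Latin script, here is a minimal base R sketch. The data frame `survey`, its columns, and its values are invented for the example; the Arabic values are left intact and written as `\u` escapes, with the Arabic shown in comments.

```r
# Hypothetical data: Latin-script variable names, Arabic values left intact.
cairo <- "\u0627\u0644\u0642\u0627\u0647\u0631\u0629"  # القاهرة
tunis <- "\u062a\u0648\u0646\u0633"                    # تونس
survey <- data.frame(
  age      = c(25, 31, 40, 22),
  city     = c(cairo, tunis, cairo, tunis),
  employed = c("yes", "no", "yes", "yes"),
  stringsAsFactors = FALSE
)

# Univariate: frequency of each city and a numeric summary of age.
city_counts <- table(survey$city)
age_summary <- summary(survey$age)

# Bivariate: cross-tabulate city by employment, and mean age by city.
city_by_employed <- table(survey$city, survey$employed)
mean_age_by_city <- tapply(survey$age, survey$city, mean)
```

Storing the Arabic values in named objects such as `cairo` means later code can refer to them without retyping the Arabic.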




Resources for Self-Instructional Learners of Less Commonly Taught Languages Copyright © by University of Wisconsin-Madison Students in African 671 is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.