Studying Arabic at the Advanced-Mid/Advanced-High Level

Applying Advanced Arabic to Research Part II: Intro to Collecting Event Data


While the last section focused on the basics of handling pre-existing datasets, this section will teach the advanced Arabic learner more about the process of compiling a dataset in Arabic. To do so, I will share the basics of collecting sources for an event dataset. I use this example because event datasets have a wide range of uses and applications, and because they require language skills that make the process distinct from what early-career researchers might be used to in their native languages.

While considerable advances have been made recently in automated data scraping, there are still many instances where manual data collection and hand coding produce better data and actually end up being more efficient. This guide focuses on manual collection. Before I proceed to the more language-focused portion, the next two sections provide some key background information.

What is event data? Why collect your own?

Event data seeks to capture occurrences of a specific type of event, usually with at least temporal and spatial variables included, and often many more. In the social sciences, these are usually human behavioral events. Event data is particularly popular among scholars of social movements/contentious politics, political violence, and institutional behavior. Event data can be collected by direct observation, though it is usually assembled using secondary sources from media and government records.
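To make the idea of an event record concrete, here is a minimal Python sketch of what one row in such a dataset might hold. The class name, the field names, and the example values are all my own illustrative assumptions, not a standard schema; your own variables will depend on your research question.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class EventRecord:
    """One observed event. All field names here are hypothetical."""
    event_type: str      # what kind of event, e.g. "protest"
    event_date: date     # temporal variable
    location: str        # spatial variable (city, district, etc.)
    country: str
    source_url: str      # where the report was found
    extra: dict = field(default_factory=dict)  # any further variables you code


# A made-up placeholder row, only to show the shape of a record.
example = EventRecord(
    event_type="protest",
    event_date=date(2011, 1, 5),
    location="placeholder city",
    country="placeholder country",
    source_url="https://example.com/article",
)
```

Keeping the temporal and spatial variables as their own fields, rather than buried in free text, is what later makes it possible to sort, filter, and map the events.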

While there are several aggregators like ACLED that collect data on the topics above, they have only recently started to fully integrate Arabic-language news sources. As such, particularly when studying periods prior to 2015, collecting one’s own event data can be a useful contribution to knowledge that may merit a publication. That said, be sure not to spend time re-collecting the sources ACLED has already compiled!

While an entire separate guide could be written about the methodological and ethical quandaries involved in event data collection, one key consideration should guide your project from start to finish. All event data is extremely biased unless collected by direct observation (which still involves bias). Remember that, ultimately, *you* are not selecting observations into the dataset; the news editors and government officials who created the primary sources did. Know the media landscape of the countries you are studying.

How to Get Started 

After picking what type of event data you are interested in and what the parameters of inquiry are, here are tips, not necessarily in order, to guide your search:

  1. Browse the advanced search settings on Google. Particularly helpful options are country of origin, language, and time range. While Arabic is important and will probably be the language of most of your sources, keep in mind that relevant events may be reported in English, French, and other languages too!
  2. Map out the vocabulary you want to use for your key search terms, and run multiple rounds of searches using synonymous/adjacent phrases for each time period you are interested in. For example, you definitely know the word “احتجاج” (protest), but make sure you try phrases like “مسيرة” (march) too.
  3. Many countries in the region have news aggregators. Here is an example of a particularly strong one for Algeria. If you identify one that seems to be capturing a large number of the events you are interested in, set your advanced search to just that website. Alternatively, you may want to manually scroll through that website’s archives, though there are many cases where Google’s advanced search is better.
  4. Even with great Arabic typing skills, many variables will likely be easier to copy and paste from the articles into an Excel sheet. Always convert the text you are copying into plain text before pasting it into your sheet. The easiest way I have found to do this is to paste what I want into the search bar of my web browser, which only uses plain text, and then copy it again from there. This will save an immeasurable amount of time later on and keep your Excel sheet cleaner.
  5. While most variables are up to you, keeping the links is of course non-negotiable. In addition, ALWAYS COPY THE FULL TEXT OF THE SOURCES INTO YOUR SPREADSHEET. Links break, and you never know whether you might want to analyze the data in a different way. This is especially important for data not in your native language, where you may learn that terms you were coding originally didn’t quite mean what you thought they did! Of course, check copyright laws to make sure this is OK, especially before publishing the dataset, but this practice usually falls under fair use in U.S. law if you’re just archiving it for your research.
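For readers comfortable with a little scripting, several of the tips above can be sketched in Python. This is only a rough illustration under stated assumptions, not a replacement for the manual workflow: the function names, the column names, and the example.com site are hypothetical, and the list of invisible characters that I strip is my own guess at the usual culprits in text copied from Arabic web pages. The sketch builds one search string per term and time period (using Google's `site:` operator; the time range still has to be set in the advanced search settings), cleans pasted text into plain text, and saves rows with the link and the full text.

```python
import csv
import unicodedata
from itertools import product

# Invisible characters that often ride along with copied web text:
# zero-width space, directional marks, embedding/isolate controls, BOM.
# (Assumed list; adjust it to what you actually see in your sources.)
INVISIBLE = set("\u200b\u200e\u200f\u202a\u202b\u202c\u202d\u202e"
                "\u2066\u2067\u2068\u2069\ufeff")

# Hypothetical column names; the link and the full text are always kept.
FIELDS = ["event_type", "date", "location", "url", "full_text"]


def build_search_plan(terms, periods, site=None):
    """Return one planned search per (term, period) combination."""
    plan = []
    for term, (start, end) in product(terms, periods):
        query = f'"{term}"'
        if site:
            query += f" site:{site}"  # restrict results to one website
        plan.append({"query": query, "from_year": start, "to_year": end})
    return plan


def to_plain_text(raw):
    """Normalize the text, drop invisible characters, collapse whitespace."""
    text = unicodedata.normalize("NFC", raw)
    text = "".join(ch for ch in text if ch not in INVISIBLE)
    # split() with no arguments also treats non-breaking spaces and
    # line breaks as whitespace, so the result is a single clean line.
    return " ".join(text.split())


def write_rows(path, rows):
    """Write rows to a CSV file that Excel should open as UTF-8."""
    # "utf-8-sig" adds a byte order mark, which helps Excel detect UTF-8.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


# Two synonymous search terms from the tips above, one hypothetical period.
plan = build_search_plan(["احتجاج", "مسيرة"], [(2011, 2012)],
                         site="example.com")
```

Writing the plan down before searching also gives you a record of which term and period combinations you have already covered, which is worth documenting given how much the choice of search terms shapes what ends up in the dataset.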

Finally, the most important tip, as with everything: try to connect with folks who have done similar work using Arabic sources and ask for their advice!



Resources for Self-Instructional Learners of Less Commonly Taught Languages Copyright © by University of Wisconsin-Madison Students in African 671 is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.