How to Survey the Literature | Bingsheng Arthur Yao

Chapter 1 Episode 1

If you ask me which research skill is the single most important for a PhD student to master, my answer is the literature survey. A literature survey is a way to discover knowledge on your own: it teaches you where to find what you need and how to make sense of it without waiting for someone to hand it to you. All the knowledge you need to build on (the findings your field has established, the methods that have been validated, and the questions that remain open) lives in the published papers in your area. Starting with this article, I will introduce several research methodologies that I repeatedly teach to research assistants.

Starting the Search

Every survey starts from a seed: a keyword or a set of keywords that reflects the high-level ideas you are interested in exploring. At this stage, the keywords do not need to be perfect. In fact, they are almost always too broad or too narrow at first, and that is fine, because the search itself is how you learn how the field talks about the things you care about. Run your initial keywords on Google Scholar and scan what comes back. As you find papers that look relevant, pay attention to the specific terms those papers use in their titles, abstracts, and keyword lists, since the vocabulary the field has settled on is often different from the vocabulary you started with. Each round of searching should refine your keywords, and the papers you find will become more targeted as a result. The process of looking is what teaches you how to look, so do not wait for the perfect query on day one; start searching and let the iteration do its work.
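If you keep the titles of the papers you marked as relevant in a plain list, one round of keyword refinement can be sketched in a few lines of Python. This is only an illustration of the idea, not a tool I am prescribing: the function name suggest_terms, the stopword list, and the sample titles are all made up for the example, and counting title words is a crude stand-in for actually reading the abstracts.

```python
import re
from collections import Counter

# Hypothetical stopword list; extend it for your own area.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def suggest_terms(titles, current_keywords, top_n=5):
    """Return the title words that appear in the most papers
    and are not already part of the current query."""
    known = {w for kw in current_keywords for w in kw.lower().split()}
    counts = Counter()
    for title in titles:
        # Use a set so each paper counts a word at most once.
        words = set(re.findall(r"[a-z][a-z\-]+", title.lower()))
        counts.update(words - STOPWORDS - known)
    # Sort by descending count, then alphabetically, for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]


# Made-up titles standing in for one round of relevant search results.
titles = [
    "Clinician trust in decision support tools",
    "Decision support for sepsis: a clinician study",
    "Explaining decision support recommendations to nurses",
]
candidate_terms = suggest_terms(titles, ["AI clinical"], top_n=2)
print(candidate_terms)
```

Here the query started from "AI clinical", but the relevant papers keep saying "decision support", which is the kind of vocabulary shift you want to fold into the next round of searching.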

When you scan your search results, pay attention to where each paper was published, because the venue tells you which research community is speaking. In fact, a practical problem can attract interest from several communities at once, and what separates them is the particular aspect each community cares about. If you search for something like “AI-assisted clinical decision-making,” for example, you will find papers from HCI venues like CHI and CSCW that focus on how clinicians interact with AI recommendations and how the AI-assisted workflow should be designed; papers from NLP venues like ACL and EMNLP that focus on processing clinical text and building language models for medical records; and papers from medical informatics venues like AMIA and JAMIA that focus on integrating AI into hospital systems and measuring patient outcomes. All of these communities work on the same broad problem, but each approaches it from a different angle, using different methods and having different conversations. Identify which venues belong to your research community and focus your reading on papers from those venues, since those are the papers whose methods, questions, and standards will be most directly relevant to your own work. You will still encounter and learn from papers published in neighboring communities, but knowing which conversation is yours keeps the survey anchored and prevents the scope from drifting into territory you cannot meaningfully engage with.

Snowballing

Once you have a set of seed papers from your initial search, the next step is to expand your coverage systematically. Snowballing is one of the few literature discovery techniques with formal validation behind it, and it is the core method for building out your map from the seeds. Each paper you read points you to other papers through its references and through the papers that have cited it, and following those pointers in both directions is what makes the map grow. During snowballing, you will need to classify what you have collected so that you can later apply a different reading strategy to each class, a process I introduce at the end of this post and cover in full in the next.

Backward snowballing means working through the reference list of a key paper and following the citations that appear relevant. A reference list is a curated pointer to the foundations of the area, since the authors selected those references as the work they considered important enough to build on. When you notice a paper citing five or six earlier works that seem central to the problem, those earlier works should go into your reading queue. Each of those papers will have its own reference list, and following this chain is what pulls you back through the intellectual history of the question you care about. (A side note: you will eventually read most, if not all, of the foundational papers during your PhD journey, so I suggest reading each one when you first encounter it instead of putting it aside.)

Forward snowballing works in the opposite direction. Given a key paper, you find all the papers that have cited it since it was published, which Google Scholar supports directly through its “Cited by” link. Forward snowballing gives you the recent and concurrent work that responded to the same foundations your seed builds on, and this is often where your closest competitors and potential collaborators are publishing. Alternating between backward and forward passes from multiple seed papers is what fills in the map rapidly, and after two or three cycles you will start to notice the same names and the same references appearing repeatedly, which is the first sign that the core of the field is becoming visible to you.
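The alternating backward and forward passes described above can be sketched as a small loop over a citation graph. This is a minimal sketch under loud assumptions: the graph is a hand-written toy dictionary with made-up paper IDs, the function names are mine, and in practice "backward" means reading a reference list and "forward" means clicking "Cited by", with your own judgment playing the role of is_relevant.

```python
# Toy citation graph: paper -> papers it cites. All IDs are made up.
REFERENCES = {
    "seed": ["found1", "found2"],
    "recent1": ["seed", "found1"],
    "recent2": ["seed"],
    "found1": [],
    "found2": [],
}


def backward(paper):
    """Papers that `paper` cites (its reference list)."""
    return set(REFERENCES.get(paper, []))


def forward(paper):
    """Papers that cite `paper` (the "Cited by" direction)."""
    return {p for p, refs in REFERENCES.items() if paper in refs}


def snowball(seeds, is_relevant, max_cycles=3):
    """Alternate backward and forward passes from the seeds until
    a cycle adds nothing new or max_cycles is reached."""
    collected = set(seeds)
    frontier = set(seeds)
    for _ in range(max_cycles):
        found = set()
        for paper in frontier:
            found |= backward(paper) | forward(paper)
        # Keep only papers that are new and pass the relevance check.
        new = {p for p in found if p not in collected and is_relevant(p)}
        if not new:
            break
        collected |= new
        frontier = new
    return collected


# Accept everything here; in a real survey this is your own screening.
papers = snowball(["seed"], is_relevant=lambda p: True)
print(sorted(papers))
```

Starting from the single seed, the first cycle picks up the two foundational papers it cites and the two recent papers that cite it, and the second cycle finds nothing new, so the loop stops on its own.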

You can also run the same searches directly on venue-specific databases like ACM Digital Library or ACL Anthology, which can catch papers that Google Scholar’s indexing occasionally misses, and browsing the recent proceedings of your target venues gives you a feel for what the field is publishing right now.

Knowing When to Stop

A literature survey has no formal finish line, so you need a practical heuristic for when you have covered enough ground to move forward. The clearest signal is reference saturation, which you recognize when new papers you find keep citing the same set of references you already know and new keyword searches return papers you have already read. At that point, you should be able to articulate, in a few sentences, what has been done, what the dominant approaches are, and where the open questions lie. The gap you are looking for lives in that last piece, the questions the field’s own papers acknowledge as limitations or future work and that no one has addressed well. If you cannot articulate the gap, your map still has holes and you should keep searching.
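If you track what you have already seen, the saturation signal can be made a little more concrete. The sketch below is one way to phrase it, assuming you record each new paper's reference list (or each new search's results) as a batch of IDs. The names known_fraction and looks_saturated and the 0.8 threshold are my own illustrative choices, not an established cutoff.

```python
def known_fraction(items, known):
    """Share of `items` that are already in the `known` set."""
    items = set(items)
    if not items:
        return 0.0
    return len(items & known) / len(items)


def looks_saturated(recent_batches, known, threshold=0.8):
    """True if every recent batch consists mostly of items already known.
    The 0.8 default is an assumed, arbitrary cutoff."""
    if not recent_batches:
        return False
    return all(known_fraction(b, known) >= threshold for b in recent_batches)


# Made-up IDs: what you already know, and the last two batches you checked.
known = {"a", "b", "c", "d", "e"}
recent = [["a", "b", "c", "d", "e"], ["a", "b", "c", "d", "x"]]
saturated = looks_saturated(recent, known)
print(saturated)
```

A check like this only covers the mechanical half of the heuristic. Whether you can state the gap in a few sentences is still something you have to test by writing it down.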

Classifying What You Found

Once you have collected a body of papers through your search and snowballing, the next step is to classify them by relevance to your work, because the classification determines how deeply you read each one. I recommend classifying papers as you search rather than in a separate pass afterward, since that saves time and effort. A practical three-way classification works well at this stage: papers that are highly relevant to your proposed work or direct interest, papers that are loosely relevant to your broader area but not directly connected to your specific question, and papers that are clearly outside your scope, which you set aside. You can make this initial classification quickly based on title, abstract, and a glance at the introduction, and you should not agonize over borderline cases, since the classification is a working tool you will revise as your understanding sharpens.

In the next post, I will cover the following questions in full: how to read the papers in each class, how to take notes that remain useful months later, and how to turn the entire survey process into a method for teaching yourself the craft of research.