Tracing policy signals in text

Exploring development priorities in World Bank operational documents with R

WB
NLP
TextAnalytics
ML
DigitalHumanities
Rstats

The idea of analyzing language as data has always intrigued me. In this deep dive, I focus on ~4,000 World Bank Projects & Operations, zooming in on the short texts that describe the Project Development Objectives (PDOs)—an abstract of sorts for the Bank’s operations.
This exploratory analysis revealed fascinating—and surprising—insights, uncovering patterns and correlations in the text, but also pointing to ways to enhance the quality of the projects’ data themselves.

(This is an ongoing project, so comments, questions, and suggestions are welcome.)

Author
Affiliation

Luisa M. Mimmi

Published

October 29, 2024

Modified

April 5, 2025

MOTIVATION

I have always been fascinated by the idea of analyzing language as data and I finally found some time to study Natural Language Processing (NLP) and Text Analytics techniques.

For this learning project, I explore a dataset of World Bank Projects & Operations, focusing on the text data contained in the Project Development Objective (PDO) section of World Bank projects (loans, grants, technical assistance). A PDO concisely outlines the proposed objectives of an operation, as defined in the early stages of the World Bank project cycle.

Normally, a few objectives are listed in paragraphs that are a couple of sentences long. Table 1 shows two examples.

Table 1: Illustrative PDOs text in Projects’ documents
Project_ID Project_Name Project_Development_Objective
P127665 Second Economic Recovery Development Policy Loan This development policy loan supports the Government of Croatia's reform efforts with the aim to: (i) enhance fiscal sustainability through expenditure-based consolidation; and (ii) strengthen investment climate.
P179010 Tunisia Emergency Food Security Response Project To (a) ensure, in the short-term, the supply of (i) agricultural inputs for farmers to secure the next cropping seasons and for continued dairy production, and (ii) wheat for uninterrupted access to bread and other grain products for poor and vulnerable households; and (b) strengthen Tunisia’s resilience to food crises by laying the ground for reforms of the grain value chain.

The dataset also includes some relevant metadata about the projects, including country, fiscal year of approval, project status, main sector, main theme, environmental risk category, and lending instrument.

I retrieved the data from the WBG Projects page. The data are classified by the World Bank as “public” and accessible under a Creative Commons Attribution 4.0 International License.

DATA

The original dataset, retrieved on August 31, 2024, included 22,569 World Bank projects approved from Fiscal Year 1947 through 2025. Approximately 68.7%—15,495 projects—had a viable Project Development Objective (PDO) text (i.e., not blank or labeled as “TBD”, etc.), all approved after the mid-1980s. From this group, some projects were excluded due to missing key variables.

This left 11,278 projects as usable observations for analysis.

Interestingly, within this refined subset, 2,461 projects share only 1,097 unique PDOs: recycled PDOs often appear in follow-up projects or in components of a larger parent project.

Finally, from these 11,278 projects, a representative sample of 4,425 projects with PDOs was selected for further analysis.

First, it is important to note that none of the 3,676 projects approved before FY1984 had any PDO text available (evidently, PDOs became a requirement only later on).

The exploratory analysis of the PDO text data revealed some interesting findings:

  1. PDO text length: The PDO text is quite short, with a median of 2 sentences and a maximum of 9 sentences.
  2. PDO text missingness: besides 7,030 projects with missing (i.e., NA) PDOs, 44 projects had invalid PDO values, namely:
    • 12 have PDO as one of: “.”, “-”, “NA”, “N/A”
    • 7 have PDO as one of: “No change”, “No change to PDO following restructuring.”, “PDO remains the same.”
    • 20 have PDO as one of: “TBD”, “TBD.”, “Objective to be Determined.”
    • 5 have PDO as one of: “XXXXXX”, “XXXXX”, “XXXX”, “XXX”, “a”

Of the remaining 15,495 projects with a valid PDO, more projects were excluded from the analysis for incompleteness:

  • 1,420 projects without “project status”
  • 2,420 projects without “board approval FY”
  • 377 projects approved in FY2024 or later (whose approval stage is still incomplete)

Lastly (and this was quite surprising to me), the remaining 11,278 viable, unique projects were matched by only 9,914 unique PDOs! In fact, 2,461 projects share 1,097 NON-UNIQUE PDO texts in the clean dataset. Why? Apparently, the same PDO is re-used for multiple projects (from 2 to as many as 9 times), likely in cases of follow-up phases of a parent project or components of the same lending program.

In sum, the cleaning process yielded a usable set of 11,278 projects, which was split into a training subset (4,425) used to explore and test models, and a testing/validation subset (6,853) held out for post-prediction evaluation.
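For illustration, below is a minimal dplyr sketch of the filtering logic just described (raw_projects and the column names are hypothetical placeholders, not the actual names in the source file):

```r
# Minimal sketch of the cleaning steps above; column names are assumed,
# not the actual names in the downloaded dataset.
library(dplyr)

invalid_pdos <- c(".", "-", "NA", "N/A", "TBD", "TBD.",
                  "Objective to be Determined.")

projects_clean <- raw_projects |>
  filter(!is.na(pdo), !pdo %in% invalid_pdos) |>  # drop missing/invalid PDOs
  filter(!is.na(status), !is.na(approval_fy)) |>  # drop incomplete metadata
  filter(approval_fy < 2024)                      # drop FY2024+ projects
```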

Preprocessing the PDO text data

Cleaning text data entails extra steps compared to numerical data. A key process is tokenization, which breaks text into smaller units like words, bigrams, n-grams, or sentences. After that, a common cleaning task is normalization, where text is standardized (e.g., converting to lowercase). Similarly, data reduction techniques like stemming and lemmatization simplify words to their root form (e.g., “running,” “ran,” and “runs” become “run”). This can help to reduce dimensionality, especially with very large datasets, when the word form is not relevant.

After tokenization, it is very common to remove irrelevant elements like punctuation or stop words (unimportant words like “the”, “ii)”, “at”, or words repeated in context like “PDO”) which add noise to the data.

In contrast, data enhancement techniques like part-of-speech (POS) tagging add value by identifying grammatical components, allowing focus on meaningful elements like nouns, verbs, or adjectives.
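As a minimal illustration of these steps, a tidytext pipeline along these lines could produce the counts used below (pdo_df and its columns project_id and pdo are assumed names):

```r
# Tokenization, stop word removal, and stemming with tidytext + SnowballC;
# pdo_df is an assumed data frame with columns project_id and pdo.
library(dplyr)
library(tidytext)
library(SnowballC)

pdo_tokens <- pdo_df |>
  unnest_tokens(word, pdo) |>                  # tokenize + lowercase (normalization)
  anti_join(get_stopwords(), by = "word") |>   # remove standard English stop words
  filter(!word %in% c("pdo", "ii", "iii")) |>  # remove context-specific noise
  mutate(stem = wordStem(word))                # stem: "running"/"runs" -> "run"

# Frequency count of stems
pdo_tokens |> count(stem, sort = TRUE)
```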

TERM FREQUENCY PATTERNS

Figure 1 shows the most recurrent tokens and stems in the PDO text data.

Words and stems

Evidently, after stemming, more words (or stems) reach the frequency threshold of 800, since different word forms have been combined under a common root. Even after pre-processing the PDOs’ text data, these are not particularly informative words.

Figure 1

Bigrams

Figure 2 shows the most frequent bigrams in the PDO text data. The top-ranking bigrams align with expectations, featuring phrases like “increase access”, “service delivery”, “institutional capacity”, “poverty reduction” at the top. Notably, while “health” appears in several bigrams (e.g., “health services”, “public health”, “health care”), “education” is absent from the top 25. Another noteworthy observation is the frequent mention (over 100 instances) of “eligible crisis”, which was somewhat unexpected.

Figure 2

Trigrams

Figure 3 shows the most frequent trigrams in the PDO text data. Here, the recurrence of phrases involving “health” is confirmed, along with a few phrases revolving around “environmental” goals, as well as terms that inherently belong together, like “water resource management”, “social safety net”, etc.

Figure 3
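For reference, the bigram and trigram counts behind Figure 2 and Figure 3 can be sketched as follows (same assumed pdo_df as above; stop word filtering keeps only content-bearing n-grams):

```r
# n-gram counting with tidytext; drops n-grams containing stop words.
library(dplyr)
library(tidyr)
library(tidytext)

count_ngrams <- function(df, n) {
  df |>
    unnest_tokens(ngram, pdo, token = "ngrams", n = n) |>
    separate(ngram, into = paste0("w", 1:n), sep = " ") |>
    filter(if_all(starts_with("w"), \(w) !w %in% get_stopwords()$word)) |>
    unite(ngram, starts_with("w"), sep = " ") |>
    count(ngram, sort = TRUE)
}

count_ngrams(pdo_df, 2)  # bigrams, e.g. "increase access"
count_ngrams(pdo_df, 3)  # trigrams, e.g. "water resource management"
```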

Sectors in the PDO text

To focus on a meaningful set of tokens, I examined the frequency of sector-related terms within the PDO text data. To capture the broader concept of “sector”, I created a comprehensive SECTOR variable that encompasses all the relevant words within an expanded definition.

The “sector” term discussed here is not the sector variable available in the data; rather, it is an artificial construct reflecting the occurrence of terms that refer to the same sector’s semantic field. Besides conceptual association, these definitions are rooted in the World Bank’s own classification of sectors and sub-sectors.

Below are the “broad SECTOR” definitions used in this analysis:

  • WAT_SAN = water|wastewater|sanitat|sewer|sewage|irrigat|drainag|river basin|groundwater
  • TRANSPORT = transport|railway|road|airport|waterway|bus|metropolitan|inter-urban|aviation|highway|transit|bridge|port
  • URBAN = urban|housing|inter-urban|peri-urban|waste manag|slum|city|megacity|intercity|inter-city|town
  • ENERGY = energ|electri|hydroele|hydropow|renewable|transmis|grid|transmission|electric power|geothermal|solar|wind|thermal|nuclear power|energy generation
  • HEALTH = health|hospital|medicine|drugs|epidem|pandem|covid-19|vaccin|immuniz|diseas|malaria|hiv|aids|tb|maternal|clinic|nutrition
  • EDUCATION = educat|school|vocat|teach|univers|student|literacy|training|curricul|pedagog
  • AGR_FOR_FISH = agricultural|agro|fish|forest|crop|livestock|fishery|land|soil
  • MINING_OIL_GAS = minin|oil|gas|mineral|quarry|extract|coal|natural gas|mine|petroleum|hydrocarbon
  • SOCIAL_PROT = social protec|social risk|social assistance|living standard|informality|insurance|social cohesion|gig economy|human capital|employment|unemploy|productivity|wage lev|intergeneration|lifelong learn|vulnerab|empowerment|sociobehav
  • FINANCIAL = bank|finan|investment|credit|microfinan|loan|financial stability|banking|financial intermed|fintech
  • ICT = information|communication|ict|internet|telecom|cyber|data|ai|artificial intelligence|blockchain|e-learn|e-commerce|platform|software|hardware|digital
  • IND_TRADE_SERV = industry|trade|service|manufactur|tourism|trade and services|market|export|import|supply chain|logistic|distribut|e-commerce|retail|wholesale|trade facilitation|trade policy|trade agreement|trade barrier|trade finance|trade promotion|trade integration|trade liberalization|trade balance|trade deficit|trade surplus|trade war|trade dispute|trade negotiation|trade cooperation|trade relation|trade partner|trade route|trade corridor
  • INSTIT_SUPP = government|public admin|institution|central agenc|sub-national gov|law|justice|governance|policy|regulation|public expenditure|public investment|public procurement
  • GENDER_EQUAL = gender|women|girl|woman|femal|gender equal|gender-base|gender inclus|gender mainstream|gender sensit|gender respons|gender gap|gender-based|gender-sensitive|gender-responsive|gender-transform|gender-equit|gender-balance
  • CLIMATE = climate chang|environment|sustain|resilience|adaptation|mitigation|green|eco|eco-|carbon|carbon cycle|carbon dioxide|climate change|ecosystem|emission|energy effic|greenhouse|greenhouse gas|temperature anomalies|zero net|green growth|low carbon|climate resilient|climate smart|climate tech|climate variab
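A sketch of how such definitions can be applied to the text follows, assuming the same pdo_df plus a hypothetical approval-year column fy_approved (only two patterns shown for brevity):

```r
# Flagging broad-SECTOR mentions in each PDO via the regex patterns above
# (two sectors shown for brevity; the others follow the same pattern).
library(dplyr)
library(stringr)
library(purrr)

sector_patterns <- c(
  WAT_SAN = "water|wastewater|sanitat|sewer|sewage|irrigat|drainag|river basin|groundwater",
  HEALTH  = "health|hospital|medicine|drugs|epidem|pandem|covid-19|vaccin|immuniz|diseas"
)

# One logical column per broad sector
flags <- map_dfc(sector_patterns,
                 \(p) str_detect(str_to_lower(pdo_df$pdo), p))
pdo_flags <- bind_cols(pdo_df, flags)

# Yearly counts of projects mentioning each broad sector (as in Figure 4)
pdo_flags |>
  group_by(fy_approved) |>
  summarise(across(all_of(names(sector_patterns)), sum))
```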

The occurrence trends over time for key sector terms are shown in Figure 4.

Interestingly, all the broadly defined “sector terms” in the PDOs present one or more peaks at some point in time. For the (broadly defined) HEALTH sector, it is likely that Covid-19 triggered the peak in 2020. What about the other sectors? What could be the driving reason?

Figure 4

A possible explanation is that the PDOs may echo themes from the World Development Reports (WDR), the World Bank’s flagship annual publication that analyzes a key development issue each year. Far from being speculative research, each WDR is grounded in the Bank’s field-based insights and, in turn, it informs the Bank’s policy and operational priorities. This would suggest a likely alignment between WDR themes and project objectives in the PDOs.

To some extent, visual exploration (see examples below) seems to support this hypothesis: thematically relevant WDRs consistently appear in close proximity to peaks in sector-related term frequencies. However, further validation is necessary. Additionally, preparing each WDR typically takes 2-3 years, so a temporal alignment with project documents may include some lag.

Examples of sectors-term trend

Figure 5 shows a “combined sector” that is quite broadly defined (AGRICULTURE, FORESTRY, FISHING) with the highest peak in 2010, two years after the publication of the WDR on “Agriculture for Development”. Perhaps the “alignment” hypothesis is not very meaningful with such a broadly defined sector.

Figure 5

Figure 6, tracking frequency of CLIMATE-related terms, shows how the highest peak coincided with the publication of the WDR on “Development and Climate Change” in 2010.

Figure 6

Figure 7 reports two WDR publications relevant to EDUCATION, which seemingly preceded two peaks in the sector-related terms in the PDOs:

  • in 2007, on “Development and the Next Generation”
  • in 2018, on “Learning to Realize Education’s Promise”

Figure 7

Figure 8 shows that the highest frequency of terms related to GENDER EQUALITY was instead recorded a couple of years before the publication of the WDR on “Gender Equality and Development” in 2012.

Figure 8

Comparing PDO text against variable sector

The available data includes not only text but also relevant metadata, such as the sector1 variable, which captures the project’s primary sector. Do the terms in the PDO text align with this sector label? To examine this, I applied the two-sample Kolmogorov-Smirnov test to compare the distribution of sector-related terms in the PDO text with the distribution of sector1.

The Kolmogorov-Smirnov test is non-parametric and makes no assumptions about the underlying distributions, making it a versatile tool for comparing distributions. The null hypothesis is that the two samples are drawn from the same distribution. Hence, if the p-value is less than the significance level (0.05), the null hypothesis is rejected, suggesting the observed distributions are in fact different. The test statistic is the maximum difference between the cumulative distribution functions (CDF) of the two samples.

  • KS statistic: the vectors of observed frequencies were rescaled (bringing n_pdo and n_tag to a [0, 1] range) before applying the Kolmogorov-Smirnov (KS) test. This is useful when distributions differ substantially in scale or units, as it makes them directly comparable in relative terms.
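In base R, the comparison for a single sector boils down to something like this (n_pdo and n_tag are assumed to be yearly frequency vectors):

```r
# Two-sample KS test on rescaled yearly frequencies for one sector;
# n_pdo and n_tag are assumed numeric vectors (counts per FY).
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

ks_res <- ks.test(rescale01(n_pdo), rescale01(n_tag))
ks_res$statistic  # maximum distance between the two empirical CDFs
ks_res$p.value    # < 0.05 -> reject H0 of equal distributions
```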

As shown in Table 2, the results indicate similar distributions across most sectors. This is promising, as it suggests that in cases where metadata is lacking, sector assignments can be reasonably inferred from the PDO text.

Table 2: Comparing the frequency distributions of SECTOR in text and metadata
SECTORS KS statistic KS p-value Distributions
MINING_OIL_GAS 0.5882 0.0030 Dissimilar
ENERGY 0.4091 0.0452 Dissimilar
EDUCATION 0.3182 0.1836 Similar
TRANSPORT 0.3182 0.1976 Similar
HEALTH 0.2273 0.6009 Similar
ICT 0.2000 0.7909 Similar
WAT_SAN 0.1818 0.8479 Similar

Below is a graphical representation of two illustrative sectors, showing the most similar and the most dissimilar distributions of the sector as deduced from the text data versus the proper metadata sector label.

Figure 9 shows the distributions of the TRANSPORT sector in the PDOs’ text and in the metadata. The two distributions are among the most similar, as confirmed by the Kolmogorov-Smirnov test with a p-value of 0.641.

Figure 9

Figure 10 compares visually the distributions of the ENERGY sector in the PDOs’ text data and the metadata. The two distributions are among the most dissimilar, as the Kolmogorov-Smirnov test confirms with a p-value of 0.0001.

Figure 10

Comparing PDO text against variable amount committed

A similar question is: do the word trends observed in PDOs also reflect the allocation of funds by sector? I explored this question with the same approach as before, but this time I compared the distribution of sector-related terms in the PDOs’ text against the distribution of the total amount committed in the corresponding projects (i.e., filtered by sector1 category). Given the very different ranges, I compared rescaled values (using the Kolmogorov-Smirnov two-sample test) to evaluate whether the two distributions differ.

As shown in Table 3, the results indicate less homogeneity of the distributions across key sectors, something that could be further investigated.

Table 3: Comparing the distributions of SECTOR in text and in corresponding amounts committed
SECTORS KS statistic KS p-value Distributions
ICT 0.6818 0.0000 Dissimilar
MINING_OIL_GAS 0.5909 0.0007 Dissimilar
EDUCATION 0.5000 0.0069 Dissimilar
ENERGY 0.2727 0.3867 Similar
HEALTH 0.2727 0.3937 Similar
WAT_SAN 0.2273 0.6276 Similar
TRANSPORT 0.2273 0.6324 Similar

Let us pick a couple of examples of specific sectors to check visually.

WATER & SANITATION sector: words v. funding

The distributions in the “WATER & SANITATION” sector are among the most similar pairs (K-S test p-value = 0.4218).

Figure 11

ICT sector: words v. funding

The distributions in the ICT sector are among the least similar (K-S test p-value = 0.0001).

Figure 12

Concordances: a.k.a. keywords in context

Another useful analysis for exploring text data is concordance, which enables a closer look at the context surrounding a word (or combination of words). This approach can help clarify a word’s specific meaning or reveal underlying patterns in the data.
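As an aside, a concordance view like the tables below can be obtained with quanteda’s kwic() function; here is a minimal sketch under the same pdo_df assumption:

```r
# Keyword-in-context (concordance) with quanteda; pdo_df is the assumed
# data frame with project_id and pdo columns.
library(quanteda)

pdo_corpus <- corpus(pdo_df, docid_field = "project_id", text_field = "pdo")
kwic(tokens(pdo_corpus), pattern = phrase("eligible crisis"), window = 7)
```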

The bigram “eligible crisis” in the PDOs

For instance, among the most frequent bigrams (two-word combinations) in the PDO text (illustrated in Figure 2), the phrase “eligible crisis” stands out. This phrase appears in the PDOs of 112 projects and is often used in a similar context. Specifically, in 32% of these cases, it is paired with phrases like “respond promptly and effectively” or “immediate and effective response”. As shown in Table 4, this suggests a recurring standard phrasing.

Table 4: Context of the bigram “eligible crisis” in the PDOs
WB Project ID Excerpt of PDO Sentences with 'Eligible Crisis'
P179636 (...) and (iii) respond effectively in case of an eligible crisis or emergency.
P176982 (...) borrower’s territory; and (iii) in case of an eligible crisis or emergency, respond promptly and effectively to it.
P147827 (...) of associated institutions, and in case of an eligible crisis or emergency, respond promptly and effectively to it.
P177816 (...) in project areas, and, in case of an eligible crisis or emergency, to respond promptly and effectively to
P125961 (...) an early emergency response in the event of an eligible crisis or emergency.
P156012 (...) health services, and, in the event of an eligible crisis or emergency, to provide immediate and effective response
P171093 (...) communities and to provide immediate response to an eligible crisis or emergency as needed.
P158231 (...) communities and to provide immediate response to an eligible crisis or emergency as needed.
P147280 (...) and to respond effectively in case of an eligible crisis or emergency
P167512 (...) provide an immediate and effective response to an eligible crisis or emergency.

The bigram “climate change” in the PDOs

Another frequently occurring bigram is “climate change”, found in 92 PDOs. Table 5 displays words that commonly appear near this bigram. Notably, the word “mitigation” (which I associate with a more aspirational, long-term response) appears more frequently than “adaptation” (which I view as a more practical, short-term response). However, the ratio would flip if “resilience” is considered to convey a similar practical intent as “adaptation”. This is another interesting insight worth exploring further in the future.

Table 5: Frequent words near “climate change”
Near 'climate change' Count Percentage
vulnerability 23 33.8%
resilience 16 23.5%
mitigate 14 20.6%
adapt 9 13.2%
hazard 6 8.8%

Table 6 shows a few examples for each of the words most frequently found in the vicinity of the bigram “climate change”.

Table 6: Context of the bigram “climate change” in the PDOs
Near word (root) WB Project ID Closest Text
adapt
adapt P128137 (...) phase i of the disaster risk management and climate change adaptation project are to strengthen the capacity
adapt P091979 (...) arid and semi-arid lands to plan and implement climate change adaptation measures
adapt P120134 (...) support the gom's efforts to foster adaptation to climate change in the water sector, contributing to long-term sustainable
hazard
hazard P177124 (...) islands to the impacts of natural hazards and climate change
hazard P146768 (...) buildings and infrastructure due to natural hazards or climate change impacts; and (b) increased capacity of oecs governments
hazard P123896 (...) agencies to financial protection from losses caused by climate change and geological hazards.
mitig
mitig P077763 (...) goal of the fund is to mitigate the climate change and demonstrate the possibilities of public -private partnerships
mitig P081743 (...) to help mitigate global climate change through certified carbon emission reductions (cers) of 178,000
mitig P111940 (...) developing actions to mitigate the effects of global climate change in the atlantic rain forest, ensuring the conservation
resil
resil P114294 (...) implement measures to enhance biodiversity resilience to climate change and protect forest carbon assets.
resil P170052 (...) iii) strengthening financial resilience to natural disasters and climate change
resil P178141 (...) in the city, strengthen the city’s resilience to climate change and enhance access to basic services in the
vulnerab
vulnerab P117871 (...) at measurably reducing vulnerability to natural hazards and climate change impacts in the eastern caribbean sub-region.
vulnerab P146768 (...) at measurably reducing vulnerability to natural hazards and climate change impacts in the eastern caribbean sub-region.
vulnerab P149259 (...) at measurably reducing vulnerability to natural hazards and climate change impacts in the eastern caribbean sub-region.

DATA QUALITY ENHANCEMENT

This section shifts focus to a new area of exploration: the possibility of enhancing metadata quality by predicting missing features in the World Bank project documents. The idea is to use the Project Development Objective (PDO) words as input to predict the missing categorical descriptors (sector, environmental risk category, etc.) for some of the observations. Table 7 shows some missing features in the source dataset.

Table 7: Missing features in source dataset
Variable N obs. N distinct N missing % missing
sector1 4425 76 5 0.1%
theme1 4425 81 7 0.2%
env_cat 4425 6 1195 27%
ESrisk 4425 5 4014 90.7%

One candidate variable to predict is env_cat (“Environmental Assessment Category”). This is a categorical variable with 7 levels (A, B, C, F, H, M, U) but, to simplify, I collapsed it into a binary outcome defined as “High-Med-risk” vs. “Low-risk-Othr” (as illustrated in Table 8).

Table 8: Binary outcome obtained from the env_cat variable
High-Med-risk Low-risk-Othr
A_high risk 315 0
B_med risk 1837 0
C_low risk 0 905
F_fin expos 0 122
Other 0 51
Missing 0 1195
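A minimal sketch of this recoding (level codes as in Table 8; missing values are left as NA here, to be predicted later, although Table 8 tallies them under the Low-risk column for display):

```r
# Collapsing env_cat (A, B, C, F, H, M, U) into the binary outcome of
# Table 8; level codes are taken from the table above.
library(dplyr)

projects <- projects |>
  mutate(env_cat_f2 = factor(case_when(
    env_cat %in% c("A", "B")                ~ "High-Med-risk",
    env_cat %in% c("C", "F", "H", "M", "U") ~ "Low-risk-Othr",
    TRUE                                    ~ NA_character_   # to be predicted
  )))
```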

Using ML models to predict a missing feature

The goal at hand is text classification, that is, assigning categories to observations. To predict a missing feature based on a mix of text data and other available predictors, several machine learning (ML) algorithms can be applied. I tested a few suitable ones.

The sample splitting (necessary in ML to reserve a testing dataset for model evaluation) was done based on the availability of the env_cat variable. The sample was actually split into three groups (a sketch follows the list):

  1. Training set (with env_cat available) 2,264 observations
  2. Testing set (with env_cat available) 972 observations
  3. Validation set (with env_cat missing) 1,167 observations
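```r
# Three-way split based on env_cat availability, using rsample
# (assumes the projects data frame with the recoded env_cat_f2).
library(dplyr)
library(rsample)

to_predict <- filter(projects, is.na(env_cat_f2))   # validation set (1,167)
labelled   <- filter(projects, !is.na(env_cat_f2))

set.seed(123)
split     <- initial_split(labelled, prop = 0.7, strata = env_cat_f2)
train_set <- training(split)   # ~2,264 obs
test_set  <- testing(split)    # ~972 obs
```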

Choosing the ML algorithm

To predict the missing binary categorical outcome env_cat_f2, I tried several models, including Lasso logistic regression (with different specifications including only text, or a mix of text and other predictors) and Naive Bayes classification (here I only report the results; details can be found on this webpage). Since text data is sparse and high-dimensional, it is critical to pre-treat the features (i.e., the explanatory variables) before modeling.

  • LASSO (for logistic regression) is a regularization approach that essentially determines how much of a penalty to put on features, in order to select only the most useful of all the original candidate variables (tokens). It is a good choice when dealing with a high-dimensional dataset, like text data.

  • Naïve Bayes classification is a simple and efficient algorithm for text classification. It assumes feature independence, which may not always hold, but it’s often a good baseline, particularly with short texts.

Other supervised ML algorithms could be used in this case, such as Random Forest, Support-Vector Machines, K-Nearest Neighbors, but they were not tested here.

The steps to predict the missing feature

  1. Outcome label engineering: Define what to predict (outcome variable, \(y\)) and its functional form (binary or multiclass; log form or not, if numeric).

  2. Sample design: Select the observations to use. In ML this is typically done by splitting the sample into training and testing sets.

  3. Feature Engineering: Define the input data (predictors, \(X\)) and their format. Here, text data was combined with other predictors (e.g. sector, region, FY approved, etc.) to create a feature matrix.

    • Text preprocessing: The text data was preprocessed by tokenization, filtering of tokens by frequency, removal of stopwords, and TF-IDF (Term Frequency-Inverse Document Frequency) weighting, to make it suitable for ML algorithms (see the sketch after this list).

  4. Model selection and fitting: The models were trained on the training set.

    • Different algorithms have different adjustable parameters that can affect the performance of the model (hyperparameter tuning, typically done while training the model).

  5. Prediction: The best model was used to predict the missing env_cat_f2 and to evaluate the model’s performance on the hold-out sample (testing set).

  6. Evaluation: The predictions were evaluated on the testing set based on performance metrics:

    • accuracy, which is the proportion of correct predictions, and
    • ROC-AUC (Receiver Operating Characteristic - Area Under the Curve), which summarizes how well the model can distinguish between classes.

  7. Interpretation: The model was interpreted to understand which features were most important in predicting the outcome.

ML is an iterative process, so it is common to revise (some of) the above steps multiple times to refine the model.
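To make steps 3-5 concrete, here is a condensed sketch in the tidymodels/textrecipes style of Hvitfeldt and Silge (2022); the model formula mirrors the preferred specification in Table 9, while token counts and tuning settings are illustrative:

```r
# Condensed sketch of preprocessing + LASSO fitting (steps 3-5 above);
# feature names follow Table 9, other settings are illustrative.
library(tidymodels)
library(textrecipes)

pdo_rec <- recipe(env_cat_f2 ~ pdo + sector_f + regionname + FYapprov,
                  data = train_set) |>
  step_tokenize(pdo) |>
  step_stopwords(pdo) |>
  step_tokenfilter(pdo, max_tokens = 500) |>   # keep the most frequent tokens
  step_tfidf(pdo) |>                           # TF-IDF weighting
  step_dummy(all_nominal_predictors())

lasso_wf <- workflow() |>
  add_recipe(pdo_rec) |>
  add_model(logistic_reg(penalty = tune(), mixture = 1) |>
              set_engine("glmnet"))

# Tune the penalty via cross-validation, then fit and predict
folds <- vfold_cv(train_set, v = 5, strata = env_cat_f2)
tuned <- tune_grid(lasso_wf, resamples = folds,
                   metrics = metric_set(accuracy, roc_auc))

final_fit <- lasso_wf |>
  finalize_workflow(select_best(tuned, metric = "roc_auc")) |>
  fit(train_set)

preds <- augment(final_fit, test_set)  # adds .pred_class and probabilities
```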

Models and Results

Table 9 reports the specifications of the models and their performance.

Table 9: Comparison of models and results for binary outcome
Algorithm Features Specification Accuracy ROC_auc
LASSO logistic regression Text only env_cat_f2 ~ pdo 0.750 0.777
LASSO logistic regression (more preprocessing) Text only env_cat_f2 ~ pdo 0.762 0.807
LASSO logistic regression (more preprocessing) Text + other predictors env_cat_f2 ~ pdo + sector_f + regionname + FYapprov 0.790 0.850
Naïve Bayes classification Text + other predictors env_cat_f2 ~ pdo + sector_f + regionname + FYapprov 0.691 0.784

The best model performance was achieved by the LASSO logistic regression model that combined both the PDOs’ text and some available metadata to predict the missing env_cat_f2 in the testing set. The model achieved an accuracy of 0.79 and an ROC-AUC of 0.85 (a computation sketch follows the definitions), where:

  • accuracy is the proportion of correct predictions made by the model out of all predictions or, in other words, how often the model is correct overall.
  • ROC-AUC (Receiver Operating Characteristic - Area Under the Curve) goes further by evaluating the model’s ability to distinguish between classes across various thresholds. It summarizes how well the model separates the classes, providing a more nuanced view of its performance, especially useful when the class distribution is uneven.
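With the predictions from the sketch above, the two metrics can be computed with yardstick (the probability column name follows yardstick’s default naming for the assumed factor levels):

```r
# Performance metrics on the held-out test set; the probability column
# name depends on the outcome factor's first level.
library(yardstick)

accuracy(preds, truth = env_cat_f2, estimate = .pred_class)
roc_auc(preds, truth = env_cat_f2, `.pred_High-Med-risk`)
```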

Performance of the preferred ML model

Figure 13 presents the confusion matrix for the preferred ML model used to predict the missing environment risk category assigned to World Bank projects. This matrix shows the distribution of true and predicted classifications. Ideally, a high-performing model would have most observations (or darker shading) along the diagonal, indicating correct classifications—specifically, true positives in the top-left quadrant and true negatives in the bottom-right quadrant.

In this case, the model performs well in predicting the environment risk category for the High-Med group but struggles with the Low & Other group. Many of these cases are incorrectly classified as High-Med Risk (false positives). This result is understandable, as the Low & Other category is more loosely defined and even includes Missing observations (which, in hindsight, could have been excluded from the prediction).

Figure 13

Most important features for prediction

It’s also insightful to examine which coefficients are most influential in the model. This can be done visually through the feature importance plot (see Figure 14).

The feature importance plot displays the top 50 predictors of the environmental risk (binary) category, ranked by their impact in a LASSO logistic regression model. For clarity, predictors are divided according to the risk level they predict. As expected, given the structure of the data, words from the PDO text (those variables starting with pdo_*) are among the most important predictors. However, other predictors also play a significant role, such as sector_f_TRANSPORT (left panel), regionname, and sector_f_FINANCIAL (right panel).

Figure 14: Top 50 most important features in the preferred ML model
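A rough equivalent of Figure 14 can be obtained by pulling the non-zero coefficients out of the fitted workflow (a sketch, reusing final_fit from the earlier code):

```r
# Extracting the most influential (non-zero) LASSO coefficients from the
# fitted workflow, ranked by absolute size.
library(dplyr)
library(tidymodels)

final_fit |>
  extract_fit_parsnip() |>
  tidy() |>                                      # coefficients at the chosen penalty
  filter(term != "(Intercept)", estimate != 0) |>
  slice_max(abs(estimate), n = 50)
```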

Prediction and Interpretation

While the model’s prediction performance is not particularly remarkable, it is sufficient to illustrate the potential of this analysis to enhance the quality of incomplete datasets. With further improvements in preprocessing, feature engineering, algorithm selection, and hyperparameter tuning, there is significant potential to optimize a similar ML model.

Although not reported here, I also explored predicting a multiclass outcome (sector, grouped into 7 levels). However, the results were less favorable compared to the binary classification. This outcome is expected, as multiclass classification is inherently more challenging, particularly with imbalanced data or limited sample sizes.

CONCLUSIONS

  • This project was primarily a proof-of-concept for learning purposes, so optimizing ML performance and conducting in-depth data analysis were not priorities. Nevertheless, it showcased the potential of applying NLP techniques to unstructured text data, uncovering insights such as:

    • identifying trends in sector-specific language and topics over time,
    • revealing unexpected patterns and relationships, like recurring phrases or topics,
    • enhancing text classification and metadata tagging with ML models,
    • sparking additional text-based questions that could guide further research.
  • Future steps could include exploring explanations for observed patterns by combining this NLP analysis with other data sources (e.g., World Bank official statements or project data) and experimenting with advanced NLP techniques for topic modeling.

  • One pain point with this type of work is accessing document data efficiently. Even with the World Bank’s “Access to Information” policy, getting programmatic access to their text data is still tricky (no dedicated API, outdated pages, broken links). This could benefit from an approach similar to the accessible, well-maintained World Development Indicators (WDI) data.

  • With all the buzz around AI and Large Language Models (LLMs), this kind of analysis might seem like yesterday’s news. But I think there’s still huge, untapped potential for using NLP in development studies, policy analysis, and beyond—especially when it’s backed by domain expertise.

Acknowledgements

Below are some great resources—especially geared toward programmers—to learn and implement NLP techniques.

References

Engel, Claudia, and Scott Bailey. 2022. Text Analysis with R. https://cengel.github.io/R-text-analysis/.
Francom, Jerid. 2024. An Introduction to Quantitative Text Analysis for Linguistics: Reproducible Research Using R. 1st ed. London: Routledge. https://doi.org/10.4324/9781003393764.
Future Mojo, dir. 2022. Natural Language Processing Demystified - YouTube. https://www.youtube.com/playlist?list=PLw3N0OFSAYSEC_XokEcX8uzJmEZSoNGuS.
Heiss, Andrew. 2022. “Text.” Data Visualization Course. 2022. https://datavizs22.classes.andrewheiss.com/example/13-example/#sentiment-analysis.
Hvitfeldt, Emil, and Julia Silge. 2022. Supervised Machine Learning for Text Analysis in R. First edition. Data Science Series. Boca Raton London New York: CRC Press. https://smltar.com/.
Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. O’Reilly. https://www.tidytextmining.com/.

Citation

BibTeX citation:
@report{mimmi2024,
  author = {Mimmi, Luisa M.},
  title = {Tracing Policy Signals in Text},
  series = {Working paper},
  date = {2024-10-29},
  url = {https://policylexicon.com/posts/PDO_eda.html/},
  langid = {en}
}
For attribution, please cite this work as:
Mimmi, Luisa M. 2024. “Tracing Policy Signals in Text.” NLP Study of World Bank Operational Documents with R. Working Paper. https://policylexicon.com/posts/PDO_eda.html/.