Data

Where do these files come from?

The raw and derived data files are available in the GitHub repository for this project.

WB Projects & Operations

World Bank Projects & Operations data were obtained from the World Bank website (the WBG Projects page).

The Accessibility Classification is Public, under the Creative Commons Attribution 4.0 license.

Process to ingest & preprocess raw PDO text data

  1. Retrieve all WB projects listed (22,571 projects with approval obtained or requested between FY 1947 and FY 2026, as of 31/08/2024) using the Excel button on the WBG Projects page
  2. Split the dataset and keep only projs_train (~50% of the projects with PDO text, i.e. 5,637 PDOs)
  3. Clean the projs_train dataset
  4. Further process the pdo column
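
As an illustration of step 2, here is a minimal sketch that shuffles a list of project ids and splits it roughly 50/25/25 into train, test and validation sets. It is written in Python for brevity and is not the project's code, which lives in analysis/01a_WB_project_pdo_prep.qmd. The function name `split_projects`, the seed, the even division of the remaining half and the toy ids are all assumptions made for this example, so it will not necessarily reproduce the exact counts reported in this document.

```python
import random


def split_projects(project_ids, train_share=0.5, seed=42):
    """Shuffle project ids and split them into train / test / validation sets.

    Hypothetical helper: the seed and the even test/validation division
    are assumptions, not the project's documented settings.
    """
    ids = list(project_ids)
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(ids)

    n_train = round(len(ids) * train_share)
    rest = ids[n_train:]
    n_test = (len(rest) + 1) // 2  # test takes the extra id if the remainder is odd

    return {
        "train": ids[:n_train],
        "test": rest[:n_test],
        "val": rest[n_test:],
    }


# Toy ids standing in for World Bank project ids (not real data)
toy_ids = [f"P{i:06d}" for i in range(1, 21)]
parts = split_projects(toy_ids)
sizes = {name: len(ids) for name, ids in parts.items()}
print(sizes)
```

Only the training part would then be carried forward to the cleaning and text-processing steps, with the test and validation parts set aside.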

Input data files

The files in the data/raw_data/ folder were downloaded from the World Bank website.

List of Source Files and Retrieval Dates

| Source File Name | Details | Retrieved |
|---|---|---|
| project2/all_projects_as_of29ago2024.xls | 22,571 obs (projects) | 29 August 2024 |
| project3/all_projects_as_of31mar2025.xlsx (Sheet Projects) | 22,210 obs (projects) | 31 March 2025 |
| project3/all_projects_as_of31mar2025.xlsx (Sheet Themes) | 22,210 obs (projects) | 31 March 2025 |
| project3/all_projects_as_of31mar2025.xlsx (Sheet Sectors) | 22,210 obs (projects) | 31 March 2025 |
| project3/all_projects_as_of31mar2025.xlsx (Sheet GEOLocations) | 22,210 obs (projects) | 31 March 2025 |
| project3/all_projects_as_of31mar2025.xlsx (Sheet Financers) | 22,210 obs (projects) | 31 March 2025 |
| wdr.rds | 45 obs (WDRs) | from 2022, then completed manually |

Output data files

The files in the data/derived_data/ folder are created by various scripts and saved there for reuse in other scripts.

Key `.rds` Files, Their Sources, and Contents

| File name (`*.rds`) | Source File Name | Details |
|---|---|---|
| wdr.rds | [from OLD repo ~/Github/slogan_old/]; ...OLD/_my_stuff/WDR-data-ingestion.Rmd (result as ...OLD/data/raw_data/WDR.rds); text processing on WDR abstracts: ...OLD/01b_WDR_data-exploration_abstracts.Rmd (result as ...OLD/data/raw_data/wdr.rds) | as df (44); problem: API changed, not reproducible; ~ like text processing on PDOs |
| all_proj_t.rds | analysis/01a_WB_project_pdo_prep.qmd | 11,279 obs (projects) |
| projs_train.rds | analysis/01a_WB_project_pdo_prep.qmd | 5,637 obs (projects); 4,425 if < 2001 FY |
| projs_test.rds | analysis/01a_WB_project_pdo_prep.qmd | 2,821 obs (projects) |
| projs_val.rds | analysis/01a_WB_project_pdo_prep.qmd | 2,820 obs (projects) |
| pdo_train_to_tag.rds | analysis/01a_WB_project_pdo_prep.qmd | 5,637 obs (input); post split |
| pdo_train_tagged.rds | analysis/01a_WB_project_pdo_prep.qmd | LARGE `cnlp` object; intermediate step (output) |
| pdo_train_t.rds | analysis/01a_WB_project_pdo_prep.qmd | 314,821 obs (tokens); 248,256 if < 2001 FY |
| projs_train2.rds | analysis/01b_WB_project_pdo_EDA.qmd | 4,425 obs (projects); changed |
| pdo_train2_t.rds | analysis/01b_WB_project_pdo_EDA.qmd | 252,705 obs (tokens); changed |
| custom_stop_words.rds | analysis/01b_WB_project_pdo_EDA.qmd | as vector |
| custom_stop_words_df.rds | analysis/01b_WB_project_pdo_EDA.qmd | as df |
| wdr2.rds | analysis/01b_WB_project_pdo_EDA.qmd | as df (46); [added WDR 2023/2024 manually] |