PIR Pipeline Workflow

The PIR pipeline consists of three main steps:

  1. Raw data upload - data are uploaded to an S3 bucket manually.

  2. Data ingestion - raw data are processed and inserted into the PIR database.

  3. Question linking - an attempt is made to link any questions that are missing uqid algorithmically.

A diagram showing the PIR Pipeline as well as implementation details for the ingestion and linking processes are provided below.

Ingesting Data

Data ingestion is handled by an AWS Lambda function. When a new file is added to the S3 bucket associated with the PIRIngestor Lambda, it triggers the ingestion process. The response, program, and question tables are updated with new data. Logs are captured in AWS CloudWatch. The same process can be used to re-ingest a year that has already been processed, but it is important to note that re-ingesting a year can only update existing records or add new records. Records will never be removed from the database during the ingestion process.

Extraction

Files uploaded to S3 should be Excel workbooks (.xls or .xlsx file extension). The extraction process expects to find the following sheets in the target workbook:

  1. Sheets containing the word “Section”

    1. These sheets should contain the responses provided by programs to a given section of the PIR survey. The sheet should begin with the word “Section”, followed by a space and a single letter (e.g. “Section A”). Anything can follow the single letter. The first row should contain the question name. The second row should contain the question number or column name. All other rows should contain program-level data.

  2. One sheet called “Reference”

    1. This sheet should contain meta-data about the questions in each section. Crucial fields are category, section, subsection, question order, question number, question name, type, and question text.

  3. One sheet containing the word “Program” (i.e. “Program Details”)

    1. This sheet should contain meta-data about the respondents (programs) to the PIR survey. The data should be unique by grant number, program number, and program type.

Transformation

The following transformations are applied to the raw PIR data to prepare it for insertion into the PIR database.

Program

  1. Region is converted to numeric by extracting the region number from text of the form “Region \d”

Question

  1. When a question is duplicated, keep the first occurrence (by question order).

  2. Merge to the response data to ensure alignment.

Response

  1. The following steps are applied to each “Section” sheet:

    1. Set the second row of the sheet (where column names/question numbers are stored) to the column names.

    2. Extract question names from the first row of the sheet

    3. Reshape the data long so that each row corresponds to a program’s response on a particular question.

    4. Merge the question name onto the reshaped dataset.

  2. All sections are appended resulting in the response data.

  3. Question data is merged to the response data to ensure alignment.

Universal

The following transformations are applied universally:

  1. Columns are renamed. In particular, this involves converting the existing column name to snake case.

  2. ID variables are generated by hashing the appropriate columns:

    1. uid is generated by hashing grant number, program number, and program type.

    2. question_id is generated by hasing question number, and question name.

  3. Duplicate records are dropped

  4. Year is added

  5. Missing variables are added

  6. The data are subset to include only columns present in PIR database tables.

Loading

The data are validated using a pydantic model to ensure proper type alignment. Then data are inserted into the corresponding PIR database table.

Wrap-up

After ingestion, the successfully ingested workbooks are moved from the input folder to the processed folder.

Linking Questions

Once daily the PIRLinker Lambda will run. This function checks for questions missing a uqid and attempts to match them with other questions.

Methodology

The algorithm for matching questions across years looks at the three components of a question in the PIR data, question name, question number, and question text. It first assigns the question a unique ID value (which we call a question id) by hashing the combined values of question number and question name. This happens for all questions in the data as it is loaded into the database.

Once the data has been loaded and the questions have had their unique ID values assigned, the algorithm checks each question against each question from previous years in the database. If two questions have the same question id they are considered matched. Matched questions are then assigned a unique question id, separate from the question’s question id and meant to uniquely identify questions across time.

If a question has no exact matches on question id, then the algorithm performs a second search. First, the pool of potential matches is limited to questions that share the same type and are in the same section of the survey. Next, the algorithm checks whether any two of question name, question number, or question text, are identical for the base question and any questions in the match pool. For example, let’s compare the following two questions (NOTE: These are hypothetical questions, and don’t reflect the actual question data.):

Year

2021

2022

Question name

Cumulativ ely Enrolled Children

Cumulativ ely Enrolled Children

Question number

A.10

A.9

Question text

Total number of children enrolled in the program throughou t the year.

Total number of children enrolled in the program throughou t the year.

These questions, from 2021 and 2022, would be linked because they share the same values for their question name and question text, despite them not having the same question number.

If, however, these questions only shared one value (NOTE: The value of question text now differs.):

Year

2021

2022

Question name

Cumulativ ely Enrolled Children

Cumulativ ely Enrolled Children

Question number

A.10

A.9

Question text

Total children enrolled in the program throughou t the year.

Total number of children enrolled in the program throughou t the year.

The algorithm would not link these questions, even though they are likely to reference the same thing.

Reviewing and fixing bad links or linking questions that were missed in the linking process is accomplished via the PIR QA dashboard, a tool developed for analysts to address linking issues in the database.