pir_pipeline.linking.PIRLinker.PIRLinker

class pir_pipeline.linking.PIRLinker.PIRLinker(records: list[dict] | DataFrame, sql: SQLAlchemyUtils)

Bases: object

Class for linking PIR data

__init__(records: list[dict] | DataFrame, sql: SQLAlchemyUtils)

Instantiate a PIRLinker object

Args:

records (list[tuple] | list[dict]): Records to link

sql (SQLAlchemyUtils): A SQLAlchemyUtils object for interacting with the database

Methods

__init__(records, sql)

Instantiate a PIRLinker object

consolidate_uqids()

Get the modal uqid for each question name, text, type and section combination

direct_link()

Make a direct link on question_id

fuzzy_link([num_matches])

Link questions using a Levenshtein algorithm

gen_uqid(row)

Generate a uqid

get_question_data([which])

Get data from the question table

join_on_type_and_section(which)

Execute many-to-many join on type and section

link()

Attempt to link records provided to records in the database.

prepare_for_insertion()

Prepare data for insertion

question_data_check()

Check for question data and get it if not present

remove_duplicates()

update_unlinked()

Update the uqids in the database

consolidate_uqids() → Self

Get the modal uqid for each question name, text, type and section combination

Returns:

Self: A PIRLinker object
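As a rough illustration of what "modal uqid" means here, the sketch below assigns every record the most frequent uqid within its name, text, type and section group. It uses plain dictionaries rather than the class itself. The field names, the sample records and the helper `consolidate_uqids_sketch` are assumptions made for the example, not part of the documented API.

```python
from collections import Counter, defaultdict

# Assumed field names for the grouping key; the real column names may differ.
KEY_FIELDS = ("question_name", "question_text", "question_type", "section")


def consolidate_uqids_sketch(records):
    """Return copies of records with uqid replaced by the modal uqid of their group."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[tuple(rec[f] for f in KEY_FIELDS)][rec["uqid"]] += 1
    # most_common(1) yields [(value, count)] for the most frequent uqid in the group
    modal = {key: counter.most_common(1)[0][0] for key, counter in counts.items()}
    return [{**rec, "uqid": modal[tuple(rec[f] for f in KEY_FIELDS)]} for rec in records]


# Hypothetical sample data: three copies of one question, one of another.
enrollment = {
    "question_name": "Total enrollment",
    "question_text": "Total cumulative enrollment",
    "question_type": "Number",
    "section": "A",
}
staff = {
    "question_name": "Staff count",
    "question_text": "Total number of staff",
    "question_type": "Number",
    "section": "B",
}
records = [
    {**enrollment, "uqid": "u1"},
    {**enrollment, "uqid": "u1"},
    {**enrollment, "uqid": "u2"},
    {**staff, "uqid": "u3"},
]

consolidated = consolidate_uqids_sketch(records)
print([rec["uqid"] for rec in consolidated])  # ['u1', 'u1', 'u1', 'u3']
```

The stray "u2" is overwritten because "u1" occurs more often in the same group. How the real method breaks ties is not stated in this reference.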

direct_link()

Make a direct link on question_id

Returns:

Self: PIRLinker object
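A direct link on question_id can be pictured as an exact-key lookup: an incoming record whose question_id already exists in the database inherits that row's uqid. The sketch below shows the idea with plain dictionaries. The sample rows and the helper `direct_link_sketch` are hypothetical, and the real method may handle unmatched records differently.

```python
# Hypothetical rows: the database side already carries uqids, the incoming side does not.
db_questions = [
    {"question_id": "q1", "uqid": "u1"},
    {"question_id": "q2", "uqid": "u2"},
]
incoming = [
    {"question_id": "q2", "question_name": "Staff count"},
    {"question_id": "q9", "question_name": "New question"},
]


def direct_link_sketch(records, known):
    """Split records into those linked by an exact question_id match and the rest."""
    uqid_by_id = {row["question_id"]: row["uqid"] for row in known}
    linked, unlinked = [], []
    for rec in records:
        uqid = uqid_by_id.get(rec["question_id"])
        if uqid is None:
            unlinked.append(rec)  # no exact match; left for another linking strategy
        else:
            linked.append({**rec, "uqid": uqid})
    return linked, unlinked


linked, unlinked = direct_link_sketch(incoming, db_questions)
print(linked)
print(unlinked)
```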

fuzzy_link([num_matches])

Link questions using a Levenshtein algorithm

Args:

num_matches (int, optional): Number of potential matches to return. Defaults to None.

Returns:

Self | pd.DataFrame: PIRLinker object or a dataframe containing potential matches.
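For readers unfamiliar with Levenshtein matching, the following is a minimal standard-library sketch of the edit distance and of ranking candidate question texts by it. It is not the library's implementation. The helper names and sample strings are assumptions, and `num_matches` here only truncates the ranked list, whereas the documented method returns either the linker or a dataframe.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def closest_matches(text, candidates, num_matches=None):
    """Rank candidates by edit distance to text, optionally keeping the top few."""
    ranked = sorted(candidates, key=lambda candidate: levenshtein(text, candidate))
    return ranked if num_matches is None else ranked[:num_matches]


# Hypothetical question texts
candidates = ["Total enrolment", "Staff count", "Total funded enrollment"]
best = closest_matches("Total enrollment", candidates, num_matches=1)
print(best)  # ['Total enrolment']
```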

gen_uqid(row: Series) → str | float

Generate a uqid

Args:

row (pd.Series): A pandas series containing question data

Returns:

str | float: Unique question ID (uqid)
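This reference does not say how the uqid is computed. Purely as an illustration, the sketch below assumes it is a hash of the question's identifying fields. It also assumes the float in the return type is a NaN placeholder for rows that cannot be identified. Both are guesses, and the field names and `gen_uqid_sketch` are hypothetical.

```python
import hashlib
import math

# Assumed identifying fields; not confirmed by the reference above.
ID_FIELDS = ("question_name", "question_text", "question_type", "section")


def gen_uqid_sketch(row: dict):
    """Return a hex digest for a complete row, or NaN if a field is missing."""
    values = [row.get(field) for field in ID_FIELDS]
    if any(value is None for value in values):
        return math.nan  # assumed meaning of the float return type
    joined = "|".join(str(value).strip().lower() for value in values)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


complete = {
    "question_name": "Total enrollment",
    "question_text": "Total cumulative enrollment",
    "question_type": "Number",
    "section": "A",
}
incomplete = {"question_name": "Total enrollment"}

uqid = gen_uqid_sketch(complete)
missing = gen_uqid_sketch(incomplete)
print(uqid, missing)
```

A deterministic hash has the convenient property that identical questions receive identical ids across years, but whether the package relies on that is not confirmed here.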

get_question_data(which: str = 'all') → Self

Get data from the question table

Args:

which (str): Specifies what data is returned from the question table. Options include ‘all’, ‘linked’, and ‘unlinked’, which return data from the full question table, the linked view, and the unlinked view, respectively. A custom query can also be passed as which.

Returns:

Self: PIRLinker object
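One way to picture the which argument is as a small dispatch: the three preset names map to fixed queries, and anything else is treated as a custom query. The sketch below assumes exactly that. The table and view names (`question`, `linked`, `unlinked`) and the helper `resolve_query` are hypothetical, and the real method may build or validate its queries differently.

```python
# Assumed names for the question table and its two views.
PRESET_QUERIES = {
    "all": "SELECT * FROM question",
    "linked": "SELECT * FROM linked",
    "unlinked": "SELECT * FROM unlinked",
}


def resolve_query(which: str = "all") -> str:
    """Map a preset name to its query; pass any other string through as a custom query."""
    return PRESET_QUERIES.get(which, which)


custom = "SELECT question_id, uqid FROM question WHERE section = 'A'"
print(resolve_query())
print(resolve_query("unlinked"))
print(resolve_query(custom))
```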

join_on_type_and_section(which: str)

Execute many-to-many join on type and section

Args:

which (str): The dataset to use as the left-hand side of the join. Options include ‘unlinked’ and ‘data’.

Returns:

Self: PIRLinker object
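A many-to-many join on type and section pairs every left-hand record with every right-hand record that shares both values, so the output can be larger than either input. The sketch below shows this with plain dictionaries. The field names, sample rows and `join_on_type_and_section_sketch` are assumptions for illustration only.

```python
from collections import defaultdict


def join_on_type_and_section_sketch(left, right):
    """Pair each left record with all right records sharing question_type and section."""
    index = defaultdict(list)
    for right_rec in right:
        index[(right_rec["question_type"], right_rec["section"])].append(right_rec)
    pairs = []
    for left_rec in left:
        key = (left_rec["question_type"], left_rec["section"])
        for right_rec in index.get(key, []):
            pairs.append((left_rec, right_rec))
    return pairs


# Hypothetical records: two on the left, three on the right
left = [
    {"question_id": "a1", "question_type": "Number", "section": "A"},
    {"question_id": "a2", "question_type": "Number", "section": "A"},
]
right = [
    {"question_id": "b1", "question_type": "Number", "section": "A"},
    {"question_id": "b2", "question_type": "Number", "section": "A"},
    {"question_id": "b3", "question_type": "Text", "section": "B"},
]

pairs = join_on_type_and_section_sketch(left, right)
print(len(pairs))  # 4
```

Restricting candidate pairs to those with matching type and section keeps the number of text comparisons far below a full cross join, which is presumably why a join like this precedes fuzzy matching, though the reference does not say so.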

link()

Attempt to link records provided to records in the database.

Returns:

Self: PIRLinker object

prepare_for_insertion() → Self

Prepare data for insertion

Confirm that the data have the appropriate shape and update uqids.

Returns:

Self: PIRLinker object

question_data_check()

Check for question data and get it if not present

update_unlinked() → Self

Update the uqids in the database

Returns:

Self: PIRLinker object
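Because most methods on this page return Self, calls can be chained. The stand-in class below only records which steps were called, to show the pattern without a database. `FluentStub` is hypothetical, and the call order shown is for illustration rather than a documented workflow.

```python
class FluentStub:
    """Stand-in that mimics the return-self style of the methods documented above."""

    def __init__(self):
        self.steps = []

    def get_question_data(self, which: str = "all") -> "FluentStub":
        self.steps.append(f"get_question_data({which!r})")
        return self  # returning self is what makes chaining possible

    def link(self) -> "FluentStub":
        self.steps.append("link()")
        return self

    def prepare_for_insertion(self) -> "FluentStub":
        self.steps.append("prepare_for_insertion()")
        return self


stub = FluentStub().get_question_data("unlinked").link().prepare_for_insertion()
print(stub.steps)
```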