pir_pipeline.linking.PIRLinker.PIRLinker
- class pir_pipeline.linking.PIRLinker.PIRLinker(records: list[dict] | DataFrame, sql: SQLAlchemyUtils)
Bases: object
Class for linking PIR data
- __init__(records: list[dict] | DataFrame, sql: SQLAlchemyUtils)
Instantiate a PIRLinker object
- Args:
records (list[dict] | DataFrame): Records to link
sql (SQLAlchemyUtils): A SQLAlchemyUtils object for interacting with the database
Methods
__init__(records, sql): Instantiate a PIRLinker object
consolidate_uqids(): Get the modal uqid for each question name, text, type and section combination
direct_link(): Make a direct link on question_id
fuzzy_link([num_matches]): Link questions using a Levenshtein algorithm
gen_uqid(row): Generate a uqid
get_question_data([which]): Get data from the question table
join_on_type_and_section(which): Execute many-to-many join on type and section
link(): Attempt to link records provided to records in the database.
prepare_for_insertion(): Prepare data for insertion
question_data_check(): Check for question data and get it if not present
remove_duplicates(): Update the uqids in the database
- consolidate_uqids() -> Self
Get the modal uqid for each question name, text, type and section combination
- Returns:
Self: A PIRLinker object
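The idea of a "modal uqid" per question combination can be illustrated with a minimal standard-library sketch. This is not the PIRLinker implementation: the function name `modal_uqid_by_group` and the column names `question_name`, `question_text`, `question_type`, and `section` are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def modal_uqid_by_group(rows):
    """Return the most frequent uqid for each question combination.

    `rows` is a list of dicts. The key columns used here are assumed
    names, not necessarily those of the real question table.
    """
    counts = defaultdict(Counter)
    for row in rows:
        key = (
            row["question_name"],
            row["question_text"],
            row["question_type"],
            row["section"],
        )
        # Rows without a uqid do not vote for the mode.
        if row.get("uqid") is not None:
            counts[key][row["uqid"]] += 1
    # most_common(1) yields the single most frequent uqid per group.
    return {key: counter.most_common(1)[0][0] for key, counter in counts.items()}


rows = [
    {"question_name": "A.1", "question_text": "Total enrollment",
     "question_type": "Number", "section": "A", "uqid": "u1"},
    {"question_name": "A.1", "question_text": "Total enrollment",
     "question_type": "Number", "section": "A", "uqid": "u1"},
    {"question_name": "A.1", "question_text": "Total enrollment",
     "question_type": "Number", "section": "A", "uqid": "u2"},
]
modal = modal_uqid_by_group(rows)
```

In this sketch, groups whose rows all lack a uqid are simply omitted from the result.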
- direct_link() -> Self
Make a direct link on question_id
- Returns:
Self: PIRLinker object
- fuzzy_link(num_matches: int = None) -> Self | DataFrame
Link questions using a Levenshtein algorithm
- Args:
num_matches (int, optional): Number of potential matches to return. Defaults to None.
- Returns:
Self | pd.DataFrame: PIRLinker object or a dataframe containing potential matches.
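For readers unfamiliar with Levenshtein matching, the following is a minimal, self-contained sketch of the general technique. It is an illustration only and does not reproduce how fuzzy_link scores or filters candidates: the helper names `levenshtein` and `top_matches` are hypothetical, and ranking by raw edit distance is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Compute the Levenshtein (edit) distance between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def top_matches(query, candidates, num_matches=None):
    """Rank candidates by edit distance to `query`, closest first.

    Returns every candidate when `num_matches` is None.
    """
    ranked = sorted(candidates, key=lambda c: levenshtein(query, c))
    return ranked if num_matches is None else ranked[:num_matches]


best = top_matches("Total enrollment", ["Staff count", "Total enrolment"], 1)
```

A smaller distance means the two question texts differ by fewer single-character edits, so the closest candidates are the most likely links.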
- gen_uqid(row: Series) -> str | float
Generate a uqid
- Args:
row (pd.Series): A pandas series containing question data
- Returns:
str | float: Unique question ID (uqid)
- get_question_data(which: str = 'all') -> Self
Get data from the question table
- Args:
which (str): Specifies what data is returned from the question table. Options are 'all', 'linked', and 'unlinked', which return data from the full question table, the linked view, and the unlinked view, respectively. A custom query can also be passed. Defaults to 'all'.
- Returns:
Self: PIRLinker object
- join_on_type_and_section(which: str)
Execute many-to-many join on type and section
- Args:
which (str): Dataset to use as the left-hand side of the join. Options include 'unlinked' and 'data'.
- Returns:
Self: PIRLinker object
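A many-to-many join pairs every left-hand row with every right-hand row that shares the same key, so the result can be larger than either input. The sketch below shows that behaviour with plain dicts. It is not the method's implementation: the function name `many_to_many_join` and the key columns `question_type` and `section` are assumed names.

```python
from collections import defaultdict


def many_to_many_join(left, right, keys=("question_type", "section")):
    """Pair each left row with every right row that shares the join keys.

    `left` and `right` are lists of dicts. The default key names are
    assumptions made for this illustration.
    """
    # Index the right-hand rows by their key so each lookup is O(1).
    index = defaultdict(list)
    for r in right:
        index[tuple(r[k] for k in keys)].append(r)

    pairs = []
    for l in left:
        for r in index.get(tuple(l[k] for k in keys), []):
            pairs.append((l, r))
    return pairs


left = [
    {"question_id": 1, "question_type": "Number", "section": "A"},
    {"question_id": 2, "question_type": "Number", "section": "A"},
]
right = [
    {"question_id": 10, "question_type": "Number", "section": "A"},
    {"question_id": 11, "question_type": "Number", "section": "A"},
    {"question_id": 12, "question_type": "Text", "section": "B"},
]
pairs = many_to_many_join(left, right)
```

Here two left rows and two matching right rows produce four candidate pairs, which is why a join like this is typically followed by a scoring or filtering step.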
- link() -> Self
Attempt to link records provided to records in the database.
- Returns:
Self: PIRLinker object
- prepare_for_insertion() -> Self
Prepare data for insertion
Confirm that the data have the appropriate shape and update uqids.
- Returns:
Self: PIRLinker object
- question_data_check()
Check for question data and get it if not present
- update_unlinked() -> Self
Update the uqids in the database
- Returns:
Self: PIRLinker object