# Customization
This chapter guides you through the process of adding custom components to the `readnext` package, tailored to your needs. While certain parts of the codebase are intentionally closed to modification to keep the code lean (e.g., the features of the Citation Recommender), others are designed for easy extension.

We use the Strategy Pattern to adhere to the open-closed principle: new components can be added without modifying any of the existing code.

In this chapter, we present three hands-on examples where customizing the `readnext` package is particularly straightforward. We will:

- Add a new tokenizer for the Language Recommender.
- Add a new language model for the Language Recommender.
- Add a new evaluation metric.

By implementing a predefined interface, the new components serve as drop-in replacements for the existing ones throughout the codebase.
## Adding Evaluation Metrics
All evaluation metrics for a list of recommendations inherit from the abstract `EvaluationMetric` class.

Source Code

Explore the source code of the `EvaluationMetric` class here.
To comply with the interface of the parent class, all evaluation metrics must implement the `score()` and `from_df()` methods:

- The `score()` method accepts any generic sequence type or polars `Series` of integers or strings as input and computes a scalar value from it.
- The `from_df()` method takes the original documents dataframe with the label columns `arxiv_labels` and `integer_label` as input and applies the `score()` method to one of these columns.
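To make this contract concrete, the following simplified sketch shows what the interface looks like. Treat it as an approximation: the exact signatures and return types live in the linked source.

```python
from abc import ABC, abstractmethod
from collections.abc import Sequence

import polars as pl

from readnext.utils.aliases import DocumentsFrame


class EvaluationMetric(ABC):
    """Simplified sketch of the common evaluation metric interface."""

    @classmethod
    @abstractmethod
    def score(cls, label_list: Sequence[int] | Sequence[str] | pl.Series) -> float:
        """Compute a scalar score from a sequence of labels."""

    @classmethod
    @abstractmethod
    def from_df(cls, df: DocumentsFrame) -> float:
        """Apply `score()` to one of the label columns of the documents dataframe."""
```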
### Example
As an illustrative example, we implement a new `FirstRelevant` evaluation metric that takes a list of 0/1 labels for irrelevant/relevant recommendations as input and returns the position of the first relevant recommendation.

Thus, lower (positive) values are better: if the top recommendation is relevant, the score is 1; if the second recommendation is relevant but the first is not, the score is 2; and so on. If none of the recommendations are relevant, the score is set to 0, which acts as a sentinel for "no relevant recommendation found" rather than as the best value.

The following code snippet shows the implementation of the `FirstRelevant` class:
```python
from dataclasses import dataclass

from readnext.evaluation.metrics import EvaluationMetric

# `IntegerLabelList` is assumed to live in the same aliases module as `DocumentsFrame`
from readnext.utils.aliases import DocumentsFrame, IntegerLabelList


@dataclass
class FirstRelevant(EvaluationMetric):
    """
    Computes the position of the first relevant item in a list of integer
    recommendations.
    """

    @classmethod
    def score(cls, label_list: IntegerLabelList) -> int:
        # an empty input cannot contain a relevant recommendation
        if not len(label_list):
            return 0

        # positions are 1-based: the top recommendation has position 1
        for k, label in enumerate(label_list, 1):
            if label == 1:
                return k

        # no label equals 1, i.e. no recommendation is relevant
        return 0

    @classmethod
    def from_df(cls, df: DocumentsFrame) -> int:
        return cls.score(df["integer_label"])
```
The `score()` method first checks whether the provided sequence is empty (in which case it returns early) and then uses the built-in `enumerate()` function to check whether any label equals 1. If so, the position of that label is returned; if not, the score is set to 0.
Info

The `score()` method is one of the rare cases where we could have used the `else` clause of a `for` loop: if the loop is not exited early, i.e. if no label equals 1, the `else` block is executed and returns 0.
The `from_df()` method simply calls the `score()` method on the `integer_label` column of the documents dataframe.
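A quick sanity check of the scoring behavior:

```python
FirstRelevant.score([0, 0, 1, 1])  # -> 3: first relevant item at position 3
FirstRelevant.score([1, 0, 0])     # -> 1: top recommendation is relevant
FirstRelevant.score([0, 0, 0])     # -> 0: no relevant recommendation
```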
Our new `FirstRelevant` metric can now be used anywhere in the codebase in place of the existing `AveragePrecision` and `CountUniqueLabels` evaluation metrics.
## Adding Tokenizers
In contrast to evaluation metrics, tokenizers can inherit from two different abstract base classes, depending on whether they tokenize text into string tokens (such as `spacy`) or into integer token ids (such as all tokenizers from the `transformers` library):

- If they tokenize text into string tokens, they inherit from the `Tokenizer` class. To comply with the interface of the parent class, they must implement the `tokenize_single_document()` method, which takes a single string as input and returns a list of string tokens. They then inherit the `tokenize()` method, which is used to construct a polars dataframe with a `tokens` column that contains the tokenized documents.
- If they tokenize text into integer token ids, they inherit from the `TorchTokenizer` class. Similar to the case above, all child classes must implement a `tokenize_into_ids()` method that takes either a single string as input and returns a list of integer token ids, or a list of strings as input and returns a list of lists of integer token ids. They then inherit the `tokenize()` method, which is used to construct a polars dataframe with a `token_ids` column that contains the tokenized documents.
Source Code

Take a look at the source code of the `Tokenizer` class here and the source code of the `TorchTokenizer` class here.
### Example
As an example, we add a `NaiveTokenizer` that simply converts all characters to lowercase and splits the text at whitespace characters.

Since the `NaiveTokenizer` tokenizes text into string tokens, it inherits from the `Tokenizer` class and must implement the `tokenize_single_document()` method:
```python
from dataclasses import dataclass

from readnext.modeling.language_models import Tokenizer
from readnext.utils.aliases import Tokens


@dataclass
class NaiveTokenizer(Tokenizer):
    """
    Naive tokenizer that splits text at whitespace characters and converts all
    characters to lowercase.
    """

    def tokenize_single_document(self, document: str) -> Tokens:
        return document.lower().split()
```
The new `NaiveTokenizer` can now be used anywhere in the codebase in place of the existing `SpacyTokenizer`.
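The `TorchTokenizer` branch works analogously. Purely for illustration, a toy child class might look like the following sketch; the class name and the dummy id scheme are made up, and only the `TorchTokenizer` base class and the `tokenize_into_ids()` method name come from the actual interface:

```python
from dataclasses import dataclass

from readnext.modeling.language_models import TorchTokenizer


@dataclass
class TokenLengthTokenizer(TorchTokenizer):
    """Toy tokenizer that uses each token's character length as its 'id'."""

    def tokenize_into_ids(self, documents: str | list[str]) -> list[int] | list[list[int]]:
        if isinstance(documents, str):
            # single document -> flat list of integer token ids
            return [len(token) for token in documents.split()]

        # list of documents -> list of lists of integer token ids
        return [[len(token) for token in document.split()] for document in documents]
```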
## Adding Language Models
Similar to tokenizers, language models can inherit from two different abstract base classes, depending on whether they are a `transformers` model or not:

- If they are not a `transformers` model, they inherit from the `Embedder` class. In this case, they must implement the `compute_embedding_single_document()` and `compute_embeddings_frame()` methods. The former takes a `Tokens` object (a list of strings) as input and returns an `Embedding` (a list of floats). The latter uses the `tokens_frame` from the instance initialization and returns a polars dataframe with the two columns `d3_document_id` and `embedding`.
- If they are a `transformers` model, they inherit from the `TorchEmbedder` class. In this case, they do not have to implement any methods in the child class since the interface for all `transformers` models is identical. For the purpose of correct typing, a child class must only define the `torch_model` attribute with the correct type annotation, as sketched right after this list.
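For instance, a hypothetical `TorchEmbedder` child class wrapping a BERT model could be as small as this (the class name is illustrative; `BertModel` is the standard `transformers` class):

```python
from dataclasses import dataclass

from transformers import BertModel

from readnext.modeling.language_models import TorchEmbedder


@dataclass
class MyBertEmbedder(TorchEmbedder):
    """Embedder backed by a pretrained BERT model; no extra methods needed."""

    torch_model: BertModel
```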
Source Code

Take a look at the source code of the `Embedder` class here and the source code of the `TorchEmbedder` class here.
### Example
As an example, we add a `DummyEmbedder` that simply returns a list of zeros for each document. To simulate a 300-dimensional embedding, we set the length of the list to 300.

Since the `DummyEmbedder` is not a `transformers` model, it inherits from the `Embedder` class and must implement the `compute_embedding_single_document()` and `compute_embeddings_frame()` methods:
```python
from dataclasses import dataclass

import polars as pl

from readnext.modeling.language_models import Embedder

# `Embedding` is assumed to live in the same aliases module as `EmbeddingsFrame`
from readnext.utils.aliases import Embedding, EmbeddingsFrame, Tokens


@dataclass
class DummyEmbedder(Embedder):
    """
    Dummy embedder that returns a list of zeros for each document.
    """

    def compute_embedding_single_document(self, document_tokens: Tokens) -> Embedding:
        # although the `document_tokens` argument is not used, it is required
        # to implement the interface properly
        return [0.0] * 300

    def compute_embeddings_frame(self) -> EmbeddingsFrame:
        # attach a constant 300-dimensional zero embedding to every document
        # and drop the `tokens` column to match the expected output schema
        return self.tokens_frame.with_columns(
            embedding=pl.Series([[0.0] * 300] * self.tokens_frame.height)
        ).drop("tokens")
```
The new `DummyEmbedder` can now be used anywhere in the codebase in place of any existing embedder that is not based on a `transformers` model. This includes the `TFIDFEmbedder`, `BM25Embedder`, `Word2VecEmbedder`, `GloveEmbedder`, and `FastTextEmbedder` classes.
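To see the expected output schema, here is a minimal usage sketch. It assumes the `Embedder` base class is a dataclass whose initializer accepts the `tokens_frame` described above; check the linked source for the actual constructor.

```python
import polars as pl

# toy tokens frame with the columns described above
tokens_frame = pl.DataFrame(
    {
        "d3_document_id": [1, 2],
        "tokens": [["deep", "learning"], ["citation", "analysis"]],
    }
)

embedder = DummyEmbedder(tokens_frame=tokens_frame)  # hypothetical init signature
embeddings_frame = embedder.compute_embeddings_frame()
# shape: (2, 2) with columns `d3_document_id` and `embedding`
```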