CNLearn Schemas - Creating Some Pydantic Models and Testing Them

You won’t skip class today, because classes are exactly what we’ll discuss. This is (somewhat) a rewrite of the og post here, but many things have changed since. For one, I am using Pydantic for my vocabulary structures. Why? Well, it provides validation and it also integrates nicely with the Web CNLearn version. Pydantic models have nice JSON exporting and a few other things. I want the backend code to be as decoupled as possible from the GUI. That way, if I decide to switch from Kivy at some point in the future (I have no intention to, but who knows what I will want to learn next), hopefully it’s easier…hopefully.

(Yes, I know there’s been some controversy and discussion regarding Pydantic’s type annotations and PEP 563 in the upcoming Python 3.10. I won’t get into politics here; I’m sure everyone will be able to work it out. One day I’d like to contribute to that as well: the development I mean, not the arguing and insulting.)

Vocabulary Structures

Ok, so what kind of structures will I have? I will have a Common structure which inherits from BaseModel and ABC. A Pydantic model is simply a class that inherits from BaseModel. That provides validation for the fields we declare, as well as many useful methods. If you’re curious about what inheriting from BaseModel actually does, have a look at the source code. It does a lot of work when creating a model and when setting attributes. We could have written our own similar version, but there’s no need.

So we are inheriting from BaseModel. Why are we also inheriting from ABC? And what is an ABC? You don’t know your alphabet? It goes like this: A for Abstract, B for Base, C for Class, D for duck-typing, etc. So what are abstract base classes? There’s some information here and here. Essentially, they contain abstract methods that must be implemented by the classes inheriting from them. If you’re familiar with other OOP languages, like Java, it’s somewhat like an interface. At the beginning I won’t make heavy use of them, but at some point I will.
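To make the abstract base class idea concrete, here is a minimal sketch using only the standard library. The names Entry, Incomplete, SimpleEntry and can_instantiate are hypothetical and are not part of CNLearn; the point is only that Python refuses to instantiate a class that still has unimplemented abstract methods.

```python
from abc import ABC, abstractmethod


class Entry(ABC):
    """Hypothetical abstract base class, not part of CNLearn."""

    @abstractmethod
    def get_simplified(self) -> str:
        """Child classes must provide this method."""


class Incomplete(Entry):
    """Does not override get_simplified, so it stays abstract."""


class SimpleEntry(Entry):
    """Overrides the abstract method, so it can be instantiated."""

    def __init__(self, simplified: str) -> None:
        self.simplified = simplified

    def get_simplified(self) -> str:
        return self.simplified


def can_instantiate(cls, *args) -> bool:
    """Return True if cls(*args) succeeds, False if Python raises TypeError."""
    try:
        cls(*args)
    except TypeError:
        return False
    return True
```

Here can_instantiate(Entry) and can_instantiate(Incomplete) are both False, while can_instantiate(SimpleEntry, "不") is True.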

The Common class inherits from BaseModel and ABC: BaseModel for the validation of attributes, ABC for the abstract methods that child classes will have to implement. The Character and Word classes will then inherit from Common. The Radical class, however, will not. Let’s have a look at the Common class:

class Common(BaseModel, ABC):
    """
    Common class.
    The Character and Word classes will derive from it.
    Its methods are implemented as ABC methods that its children will have
    to define.
    """

    id: Optional[int]
    definitions: str
    stroke_diagram: Optional[str] # not yet implemented, will likely be a reference
    # to a SVG file (e.g. 37683.svg)
    simplified: str
    traditional: str
    pinyin_num: str
    pinyin_accent: str
    pinyin_clean: str
    also_pronounced: Optional[str]
    also_written: Optional[str]
    classifiers: Optional[str]
    frequency: int


    class Config:
        orm_mode = True

    @abstractmethod
    def list_components(self):
        pass

    @abstractmethod
    def list_words(self):
        pass

    @abstractmethod
    def list_sentences(self):
        pass

    @abstractmethod
    def get_traditional(self):
        pass

    @abstractmethod
    def get_simplified(self):
        pass

    @abstractmethod
    def get_pinyin(self, pinyin_type):
        pass

So what are the required fields? definitions (string), simplified, traditional, pinyin_num, pinyin_accent, pinyin_clean and frequency. The id (an integer from the database) is declared as Optional, so it can be left out, as can the other Optional fields. Now you might be thinking: the SQLAlchemy model we wrote for the Characters table didn’t have all of those. What are you doing??? Well, when creating a Character object, it will extract stuff from the Words table as well. That’s why we’ve been implementing some of those CRUD methods.
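To see the difference between required and optional fields in action, here is a small sketch. MiniEntry and is_valid are hypothetical names for illustration only, and the optional field is given an explicit None default so that the example should behave the same regardless of which Pydantic version is installed.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class MiniEntry(BaseModel):
    """Hypothetical cut-down model, only for illustration."""

    # Explicit default, so this field may be omitted.
    id: Optional[int] = None
    # No defaults: these two must be supplied.
    simplified: str
    frequency: int


def is_valid(**fields) -> bool:
    """Return True if MiniEntry accepts the fields, False on ValidationError."""
    try:
        MiniEntry(**fields)
    except ValidationError:
        return False
    return True
```

With this sketch, is_valid(simplified="不", frequency=459467) is True, whereas leaving out frequency or passing a non-numeric string for it is rejected.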

Let’s also look at the Character and Word structures. In the Character class we have:

class Character(Common):
    character_type: Optional[CharacterType] # optional for now
    radical: Optional[str]
    decomposition: Optional[str]
    etymology: Optional[Dict]

It also implements all the methods required by our ABC (omitted here), but they all currently return None. What about the Word structure?

class Word(Common):
    pinyin_no_spaces: str
    components: Optional[List[Character]]
    radical: Optional[Radical] # a one-character word will have one
    hsk: Optional[HSKLevel] # some words won't have this

So what is the flow of the programme? A Chinese string is entered -> it gets segmented into Words. Some of them are multiple-character words and some are one-character words. Let’s think of the one-character words first. They will have a Character structure, but with information also taken from the Words table. A word with multiple characters will be a Word structure. It will, however, contain one-character components, which are defined as previously mentioned.
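The flow described above could be sketched roughly as follows. This is only an illustration under a few assumptions: segmentation has already happened (the input is a list of segments), and CharacterStub, WordStub and build_structures are hypothetical stand-ins built on dataclasses rather than the real Pydantic schemas or database lookups.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class CharacterStub:
    """Hypothetical stand-in for the Character schema."""

    simplified: str


@dataclass
class WordStub:
    """Hypothetical stand-in for the Word schema."""

    simplified: str
    components: List[CharacterStub] = field(default_factory=list)


def build_structures(segments: List[str]) -> List[Union[CharacterStub, WordStub]]:
    """Map already-segmented text to Character-like or Word-like structures."""
    structures: List[Union[CharacterStub, WordStub]] = []
    for segment in segments:
        if len(segment) == 1:
            # One-character word: represented as a Character structure.
            structures.append(CharacterStub(simplified=segment))
        else:
            # Multi-character word: a Word whose components are Characters.
            components = [CharacterStub(simplified=char) for char in segment]
            structures.append(WordStub(simplified=segment, components=components))
    return structures
```

For example, build_structures(["不", "不满"]) gives one CharacterStub followed by one WordStub with two components.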

OK so we have these structures now. Are you saying you want to test them? I agree! Let’s create a test_schemas.py file in the tests directory.

import pytest
from sqlalchemy.orm.session import Session
from src.schemas.structures import Character, Word
from src.db.models import Word as Word_model, Character as Character_model
from src.db.crud import (
    get_simplified_word,
    get_word_and_character,
)
from src.db.settings import SessionLocal


@pytest.fixture
def db():
    """
    Yields a reusable database session and closes it after the test.
    """
    session: Session = SessionLocal()
    try:
        yield session
    finally:
        # release the connection once the test has finished
        session.close()


@pytest.fixture
def my_character_1() -> Character:
    return Character(
        simplified="不",
        traditional="不",
        pinyin_num="bu4",
        pinyin_accent="bù",
        pinyin_clean="bu",
        definitions="(negative prefix); not; no",
        decomposition="⿱一?",
        etymology={
            "type": "ideographic",
            "hint": "A bird flying toward the sky\u00a0\u4e00",
        },
        radical="一",
        frequency=459467,
    )


@pytest.fixture
def my_character_2() -> Character:
    return Character(
        simplified="满",
        traditional="滿",
        definitions="to fill; full; filled; packed; fully; completely; quite; to reach the limit; to satisfy; satisfied; contented",
        pinyin_num="man3",
        pinyin_accent="mǎn",
        pinyin_clean="man",
        decomposition="⿰氵⿱艹两",
        etymology={
            "type": "pictophonetic",
            "phonetic": "\u34bc",
            "semantic": "\u6c35",
            "hint": "water",
        },
        radical="氵",
        frequency=10702,
    )


@pytest.fixture
def my_word_1(my_character_1, my_character_2) -> Word:
    return Word(
        simplified="不满",
        traditional="不滿",
        definitions="resentful; discontented; dissatisfied",
        pinyin_num="bu4 man3",
        pinyin_accent="bù mǎn",
        pinyin_clean="bu man",
        pinyin_no_spaces="buman",
        # for now I am manually specifying what the components are
        # later they will be created automatically
        components=[my_character_1, my_character_2],
        frequency=3157,
    )


# let's test some of the dictionaries created


def test_character1_dictionary(my_character_1: Character):
    """
    Tests the fields from Character 1.
    """
    character: Character = my_character_1
    assert character.definitions == "(negative prefix); not; no"
    assert character.simplified == character.traditional == "不"
    assert character.pinyin_accent == "bù"
    assert character.pinyin_num == "bu4"
    assert character.pinyin_clean == "bu"


def test_word_1_components(
    my_word_1: Word, my_character_1: Character, my_character_2: Character
):
    """
    Tests the component characters of a Word schema.
    """
    word: Word = my_word_1
    assert my_character_1 in word.components and my_character_2 in word.components


def test_bu_character_database(db, my_character_1):
    """
    Tests the results for the 不 character from the database through the Character schema
    """
    bu_word, bu_character = get_word_and_character(db, simplified="不")
    bu_character_schema: Character = Character.from_orm(bu_word)
    bu_character_schema.decomposition = bu_character.decomposition
    bu_character_schema.etymology = bu_character.etymology
    bu_character_schema.radical = bu_character.radical
    assert bu_character_schema.traditional == my_character_1.traditional
    assert bu_character_schema.simplified == my_character_1.simplified
    assert bu_character_schema.pinyin_num == my_character_1.pinyin_num
    assert bu_character_schema.pinyin_accent == my_character_1.pinyin_accent
    assert bu_character_schema.pinyin_clean == my_character_1.pinyin_clean
    assert bu_character_schema.definitions == my_character_1.definitions
    assert bu_character_schema.decomposition == my_character_1.decomposition
    assert bu_character_schema.etymology == my_character_1.etymology
    assert bu_character_schema.radical == my_character_1.radical


def test_man_character_database(db, my_character_2):
    """
    Tests the results for the 满 character from the database through the Character schema
    """
    man_word, man_character = get_word_and_character(
        db, simplified="满", pinyin_clean="man"
    )
    man_character_schema: Character = Character.from_orm(man_word)
    man_character_schema.decomposition = man_character.decomposition
    man_character_schema.etymology = man_character.etymology
    man_character_schema.radical = man_character.radical
    assert man_character_schema.traditional == my_character_2.traditional
    assert man_character_schema.simplified == my_character_2.simplified
    assert man_character_schema.pinyin_num == my_character_2.pinyin_num
    assert man_character_schema.pinyin_accent == my_character_2.pinyin_accent
    assert man_character_schema.pinyin_clean == my_character_2.pinyin_clean
    assert man_character_schema.definitions == my_character_2.definitions
    assert man_character_schema.decomposition == my_character_2.decomposition
    assert man_character_schema.etymology == my_character_2.etymology
    assert man_character_schema.radical == my_character_2.radical


def test_buman_word_database(db, my_word_1):
    """
    Tests the results for the 不满 word from the database through the Word schema
    """
    bu_man_word_list: Word_model = get_simplified_word(db, simplified="不满")
    bu_man_word_schema = Word.from_orm(bu_man_word_list[0].Word)
    assert bu_man_word_schema.traditional == my_word_1.traditional
    assert bu_man_word_schema.simplified == my_word_1.simplified
    assert bu_man_word_schema.pinyin_num == my_word_1.pinyin_num
    assert bu_man_word_schema.pinyin_accent == my_word_1.pinyin_accent
    assert bu_man_word_schema.pinyin_clean == my_word_1.pinyin_clean
    assert bu_man_word_schema.definitions == my_word_1.definitions

I won’t go through the details of the tests; they are similar to the ones from previous posts using pytest. I will, however, parametrise them at some point :)
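As a rough idea of what parametrising could look like, here is a sketch that replaces a run of near-identical asserts with one test that pytest runs once per field name. The expected and from_database objects are hypothetical stand-ins (plain SimpleNamespace instances) for a fixture and a schema built from the database; this is not the code in the commit.

```python
from types import SimpleNamespace

import pytest

# Hypothetical stand-ins for a Word fixture and a Word schema from the database.
expected = SimpleNamespace(
    simplified="不满",
    traditional="不滿",
    pinyin_num="bu4 man3",
    pinyin_clean="bu man",
)
from_database = SimpleNamespace(
    simplified="不满",
    traditional="不滿",
    pinyin_num="bu4 man3",
    pinyin_clean="bu man",
)

FIELDS = ["simplified", "traditional", "pinyin_num", "pinyin_clean"]


@pytest.mark.parametrize("field_name", FIELDS)
def test_field_matches(field_name: str) -> None:
    """pytest calls this once for every entry in FIELDS."""
    assert getattr(from_database, field_name) == getattr(expected, field_name)
```

A failure is then reported per field, which makes it easier to see exactly which attribute differs.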

Finally, the commit for this post is here.