CNLearn - Vocab CRUD

This relates to this commit from a few days ago. What did I do? I removed all the old code that I’ve since changed. Will be added one thing at a time with lots of tests for each part. The GUI is currently gone but I have added the functionality for connecting to the DB as well as the Pydantic schemas (but that will be in the next post). In this post let’s quickly talk about connecting to the database. Let’s have a look in the src/db directory. We have the following files:

  • crud.py (where we will implement the various functions using the db)
  • dictionary.db - the actual dictionary database
  • models.py - the database models
  • settings.py - information related to connecting to the database

The models file I won’t go through since they are the models we defined in here. If you want a remidner please look there to see what fields the models have. In the settings.py file we have:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.orm.session import Session


SQLALCHEMY_DATABASE = "sqlite+pysqlite:///db/dictionary.db"
engine = create_engine(
    SQLALCHEMY_DATABASE, echo=False, future=True
)
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine, future=True)

I might use Pydantic in the future to keep the settings but this works well for now. When we want to connect to the database, we simply need to import that SessionLocal and call it in order to return a session with the right settings.

What about the crud.py file? Yes I know it’s called CRUD but it only has reading functions in there for now. I am not sure whether DDD is something I want to pursue in this project, it’s probably overkill. So I won’t do DAL or repository stuff.

from sqlalchemy.orm import Session
from sqlalchemy import and_
from sqlalchemy import select
from sqlalchemy.sql.selectable import Select
from sqlalchemy.engine import ChunkedIteratorResult, Row
from src.db.models import Word, Character
from typing import List


def get_simplified_word(
    db: Session, simplified: str, pinyin_clean: str = None
) -> List[Row]:
    word_selection: Select = (
        select(Word).where(Word.simplified == simplified).order_by(Word.frequency)
    )
    if pinyin_clean:
        word_selection = word_selection.where(Word.pinyin_clean == pinyin_clean)
    result: ChunkedIteratorResult = db.execute(word_selection)
    words: List[Row] = result.all()
    return words


def get_simplified_word_containing_char(
    db: Session, simplified: str, pinyin_clean: str = None
) -> List[Row]:
    word_selection: Select = (
        select(Word)
        .where(Word.simplified.contains(simplified))
        .order_by(Word.frequency)
    )
    if pinyin_clean:
        word_selection = word_selection.where(Word.pinyin_clean.contains(pinyin_clean))
    result: ChunkedIteratorResult = db.execute(word_selection)
    words: List[Row] = result.all()
    return words


def get_simplified_character(db: Session, simplified: str) -> Row:
    character: Row = db.execute(
        select(Character).where(Character.character == simplified)
    ).first()
    return character


def get_word_and_character(
    db: Session, simplified: str, pinyin_clean: str = None, pinyin_accent: str = None
) -> Row:
    word_and_character_select: Select = (
        select(Word, Character)
        .where(
            and_(
                Word.simplified == simplified,
                Word.simplified == Character.character,
            )
        )
        .order_by(Word.frequency)
    )
    if pinyin_clean:
        word_and_character_select = word_and_character_select.where(
            Word.pinyin_clean == pinyin_clean
        )
    if pinyin_accent:
        word_and_character_select = word_and_character_select.where(
            Word.pinyin_accent == pinyin_accent
        )
    word_and_character: Row = db.execute(word_and_character_select).first()
    return word_and_character

Let’s take it function by function. We will start with the get_simplified_word() function.

def get_simplified_word(
    db: Session, simplified: str, pinyin_clean: str = None
) -> List[Row]:
    word_selection: Select = (
        select(Word).where(Word.simplified == simplified).order_by(Word.frequency)
    )
    if pinyin_clean:
        word_selection = word_selection.where(Word.pinyin_clean == pinyin_clean)
    result: ChunkedIteratorResult = db.execute(word_selection)
    words: List[Row] = result.all()
    return words

This function takes in a db session (that we get from sessionlocal), a chinese word (with simplified characters) and an optional pinyin_clean string. It first creates a Select statement. Then, if a pinyin_clean is also provided, the select statement gets modified. Finally, we use the session to execute that statement and use the .all method on the method to get a list of Rows (i.e. a list of of database results). The function then returns that.

What about the next function get_simplified_word_containing_char?

def get_simplified_word_containing_char(
    db: Session, simplified: str, pinyin_clean: str = None
) -> List[Row]:
    word_selection: Select = (
        select(Word)
        .where(Word.simplified.contains(simplified))
        .order_by(Word.frequency)
    )
    if pinyin_clean:
        word_selection = word_selection.where(Word.pinyin_clean.contains(pinyin_clean))
    result: ChunkedIteratorResult = db.execute(word_selection)
    words: List[Row] = result.all()
    return words

It takes in a db session, a string with a simplified character/word and returns the words that contain that character. If an optional pinyin_clean is passed, it only returns the words that contain that pinyin. Why is the latter necessary? For characters that have multiple pronunciations.

What does get_simplified_character do?

def get_simplified_character(db: Session, simplified: str) -> Row:
    character: Row = db.execute(
        select(Character).where(Character.character == simplified)
    ).first()
    return character

It takes in a simplified character (looking at it I should put in some restriction on the length since it’s only supposed to be one character long), and returns the database result from the “characters” table for that character. I use first() here because each character is unique in the table.

Finally, let’s see get_word_and_character.

def get_word_and_character(
    db: Session, simplified: str, pinyin_clean: str = None, pinyin_accent: str = None
) -> Row:
    word_and_character_select: Select = (
        select(Word, Character)
        .where(
            and_(
                Word.simplified == simplified,
                Word.simplified == Character.character,
            )
        )
        .order_by(Word.frequency)
    )
    if pinyin_clean:
        word_and_character_select = word_and_character_select.where(
            Word.pinyin_clean == pinyin_clean
        )
    if pinyin_accent:
        word_and_character_select = word_and_character_select.where(
            Word.pinyin_accent == pinyin_accent
        )
    word_and_character: Row = db.execute(word_and_character_select).first()
    return word_and_character

Again, I will put some length restrictions here too since it’s only meant to be used for single character words. It takes in that, and an optional pinyin_clean and/or pinyin_accent (for reasons that I will discuss in an upcoming post). It then creates a select statement but selects both the Word and Character models at the same time. If pinyins are provided, modifies/narrows down the selection. Then, it only returns the first result. Again, we will see why only the first in an upcoming post. The Row that the function returns, however, has both a Word and a Character parts to it. Let’s see how we can extract each by looking at the tests I wrote for these functions. In the tests directory, open test_crud.py. I won’t go through the entire file since it’s similar to the tests written for the DB extraction process. But let’s look at one of them

def test_get_word_and_character_with_clean_pinyin(db: Session):
    word_character_result: Row = get_word_and_character(
        db, simplified="好", pinyin_clean="hao"
    )
    assert len(word_character_result) == 2
    word: Word = word_character_result.Word
    character: Character = word_character_result.Character
    assert isinstance(word, Word)
    assert isinstance(character, Character)

This test uses the get_word_and_character function. Remember I said it returns a Row that contains both the Word and Character models? Well, the way to get either is to do row.Word or row.Character. You could use indexes of 0 and 1 too, but that makes it less clear and if the order then changes it will introduce mistakes into it.

That’s it for this post. In the next one we will go through the Pydantic schemas I am using and a nice (remade) UML diagram for them. Until then!