CNLearn FastAPI - Upgrades, DB Improvements, Adding Word and Character Models

Today we will do a few database-related things. We will change our Alembic env file to add a few useful features, add some naming conventions for indices to the MetaData object, upgrade to SQLAlchemy 2 and, FINALLY, add the words and characters tables. Also, I’ve upgraded it all to Python 3.11 (what would a side project be if I wasn’t always playing with the latest and shiniest?).

Why are we doing all of this? A few reasons:

  • there was a small (naming) mistake in the original DDL for the users table (oops)
  • by adding naming conventions for indices (as recommended by Alembic/SQLAlchemy), it will be easier to keep things consistent :) (who doesn’t love a standard? well, now we have 2 standards. you want a better standard? well, now we have 3 standards…)
  • by autogenerating migrations from the models registered on the Base metadata, we get “better”, or at least easier, migrations
  • SQLAlchemy 2 adds a lot of static typing conventions that will allow us to get rid of the sqlalchemy stubs package

Finally, we add the words and character tables (and add some data to them).

Small Naming Mistakes and Adding Naming Conventions to Constraints

Remember (obviously not, no one is reading this) how we had a Base class with:

class BaseWithID:
    @declared_attr
    def __tablename__(cls):
        return cls.__name__.lower()

but then we manually set the name on the User class because “user” clashed with a reserved keyword in PostgreSQL? Well, the naming convention is still there (in a new base class shape, we’ll get to that shortly) but our constraints could have used a bit of work:

    op.create_index(op.f("ix_user_email"), "users", ["email"], unique=True)
    op.create_index(op.f("ix_user_full_name"), "users", ["full_name"], unique=False)
    op.create_index(op.f("ix_user_id"), "users", ["id"], unique=False)

While they were added to the correct “users” table, I prefer the index names (and those of other constraints) to match the table name (i.e. ix_users_email instead of ix_user_email). So we have fixed those, but also added some naming conventions (as recommended here):

meta = MetaData(naming_convention={
        "ix": "ix_%(column_0_label)s",
        "uq": "uq_%(table_name)s_%(column_0_name)s",
        "ck": "ck_%(table_name)s_`%(constraint_name)s`",
        "fk": "fk_%(table_name)s_%(column_0_name)s_%(referred_table_name)s",
        "pk": "pk_%(table_name)s"
      })
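
Just to see what the convention produces, here is a quick sketch with a throwaway Table (demo_meta and scratch_users are scratch names for illustration, not part of the app); it prints the DDL for an index created via index=True, and only the “ix” part of the convention matters here:

from sqlalchemy import Column, Integer, MetaData, String, Table
from sqlalchemy.schema import CreateIndex

demo_meta = MetaData(naming_convention={"ix": "ix_%(column_0_label)s"})

scratch_users = Table(
    "users",
    demo_meta,
    Column("id", Integer, primary_key=True),
    Column("email", String, index=True),
)

for index in scratch_users.indexes:
    print(CreateIndex(index))
# CREATE INDEX ix_users_email ON users (email)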

Autogenerated Migrations

Remember we are using Alembic for our migrations? Remember how we wrote the migration script ourselves? Well, from now on we will autogenerate them. Why? Because we can, it makes things simpler and easier, and it wouldn’t really be a tutorial series if we didn’t always add a new thing, would it? Keep in mind that this doesn’t mean you will never have to write migrations yourself; there are sometimes complicated things we need to do with databases, but for our simple project this will work well.

How do we do it? Funnily enough, the env.py file tells you right away:

# add your model's MetaData object here
# for 'autogenerate' support
# from myapp import mymodel
# target_metadata = mymodel.Base.metadata

I guess we’re going to do that. One important thing to note is that it’s not enough to import Base.metadata; you also need to import the models! I forgot to import them and kept wondering why I had empty autogenerated migrations (speaking of empty migrations, we’ll get to that soon as well).

We therefore add:

...
from app.db.base import Base
# the models must be imported (even if unused here) so that autogenerate sees their tables
from app.models.user import User  # noqa: F401
from app.models.word import Word  # noqa: F401
from app.models.character import Character  # noqa: F401
...
target_metadata = Base.metadata
...

Yes, those are two new models that we’ll get to soon. Before we do that, we need to look at some of the new SQLAlchemy 2.0 notation that works well with type checkers (such as mypy, which we use) without any additional stubs packages.

Upgrading to SQLAlchemy 2.0 notation

Have a look here if you want to read more about the 2.0 notation that works well with type checkers.

Let’s start by having a look at our slightly changed Base model.

It’s going from the old notation of

from sqlalchemy import Column, Integer
from sqlalchemy.orm import declarative_base, declared_attr


class BaseWithID:
    @declared_attr
    def __tablename__(cls):
        return cls.__name__.lower()

    id: int = Column(Integer, primary_key=True, index=True)


# declarative base class
Base = declarative_base(cls=BaseWithID)

to the new notation of:

from sqlalchemy import Integer, MetaData
from sqlalchemy.orm import DeclarativeBase, Mapped, declared_attr, mapped_column

custom_metadata = ...  # what we discussed earlier

class Base(DeclarativeBase):

    @declared_attr.directive
    def __tablename__(cls) -> str:
        return cls.__name__.lower() + "s"

    id: Mapped[int] = mapped_column(Integer, primary_key=True, index=True)

    metadata = custom_metadata

It’s not actually very different. The big difference is that rather than calling the declarative_base function with BaseWithID as its cls argument, we now inherit from DeclarativeBase.

Then, in order to make the various attributes work well with type checkers, we use the generic Mapped class parametrised with our attribute’s type, in this case an integer.

Instead of Column, we use mapped_column and that’s basically it.
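
To make the mechanical change concrete, here is the same kind of column written both ways side by side (StyleDemo is a hypothetical model, only here for illustration, not part of CNLearn):

from typing import Optional

from sqlalchemy import Column, String
from sqlalchemy.orm import Mapped, mapped_column

from app.db.base import Base


class StyleDemo(Base):  # hypothetical model, only here to compare the two spellings
    # 1.x style: the Python type is implied by the Column type
    nickname = Column(String, index=True, nullable=True)
    # 2.0 style: the Python type lives in the annotation (Optional[...] implies nullable=True),
    # while the SQL-level options stay in mapped_column()
    full_name: Mapped[Optional[str]] = mapped_column(String, index=True)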

We do something similar in the existing User model:



from typing import Optional

from sqlalchemy import Boolean, String
from sqlalchemy.orm import Mapped, mapped_column

from app.db.base import Base


class User(Base):

    __tablename__ = "users"

    full_name: Mapped[Optional[str]] = mapped_column(String, index=True)
    email: Mapped[str] = mapped_column(String, unique=True, index=True, nullable=False)
    hashed_password: Mapped[str] = mapped_column(String, nullable=False)
    is_active: Mapped[bool] = mapped_column(Boolean(), default=True, nullable=False)
    is_superuser: Mapped[bool] = mapped_column(Boolean(), default=False, nullable=False)

    def __repr__(self) -> str:
        return f"<User(full_name='{self.full_name}', email='{self.email}')>"

Before discussing that in detail, please note that for simple fields without any extra options, it’s now enough to type (get it, type?):

primary_language: Mapped[str]

Since we do have some options on each of our columns (an index, a default value), we still spell out the type (and whether it’s Optional) in the annotation, and we pass the index, default and nullability options to the mapped_column function.

Otherwise, everything is the same :) (well, it is until I look at the queries, I haven’t checked if anything changed there.)
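
And just to check that the typing actually pays off, here is a tiny sketch of what mypy now infers for us (the email and password values are placeholders; reveal_type is available from typing on Python 3.11+, and the exact wording of mypy’s notes may vary by version):

from typing import reveal_type

user = User(email="someone@example.com", hashed_password="not-a-real-hash")
reveal_type(user.id)         # mypy: Revealed type is "builtins.int"
reveal_type(user.full_name)  # mypy: Revealed type is "Union[builtins.str, None]"
reveal_type(user.is_active)  # mypy: Revealed type is "builtins.bool"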

Word and Character Models

Finally we’re getting somewhere more interesting. The models were actually first introduced [here](https://www.podgurschi.com/post/cnlearn-dictionary-search/). I won’t go into too much detail from scratch, but we will make some changes. In that post there was no relationship between the two models (I was being lazy, or I just hadn’t thought too deeply about it at the time; that change is definitely coming there as well). In this one, there will be. How so? Well, we will have a relationship between certain words and characters. In the words table, most of the words have more than one character in them, so a one-to-many, many-to-one or one-to-one relationship obviously won’t work. Instead, there will be a many-to-many relationship using an association table. Why many-to-many? Consider the following two results from the words table, from a search for simplified words containing 好 (obviously I didn’t include all the results…):

[
  {
    "id": 2083,
    "simplified": "不问好歹",
    "traditional": "不問好歹",
    "pinyin_num": "bu4 wen4 hao3 dai3",
    "pinyin_accent": "bù wèn hǎo dǎi",
    "pinyin_clean": "bu wen hao dai",
    "pinyin_no_spaces": "buwenhaodai",
    "also_written": "",
    "also_pronounced": "",
    "classifiers": "",
    "definitions": "no matter what may happen (idiom)",
    "frequency": 6639327
  },
  {
    "id": 16941,
    "simplified": "友好关系",
    "traditional": "友好關係",
    "pinyin_num": "you3 hao3 guan1 xi4",
    "pinyin_accent": "yǒu hǎo guān xì",
    "pinyin_clean": "you hao guan xi",
    "pinyin_no_spaces": "youhaoguanxi",
    "also_written": "",
    "also_pronounced": "",
    "classifiers": "",
    "definitions": "good relations",
    "frequency": 6639327
  }
]

as well as a search for the character 好 in the characters table:

{
  "id": 1595,
  "character": "好",
  "definition": "good, excellent, fine; proper, suitable; well",
  "pinyin": "hǎo",
  "decomposition": "⿰女子",
  "etymology": "{\"type\": \"ideographic\", \"hint\": \"A woman\\u00a0\\u5973 with a son\\u00a0\\u5b50\"}",
  "radical": "女",
  "matches": "[[0], [0], [0], [1], [1], [1]]",
  "frequency": 165789
}

Clearly the character 好 (I promise, I know other characters too…) appears in many words, and each word contains multiple characters. An analogy would be the usual BlogPost (think of it as the Word) and Tag (think of it as the Character) models: each blog post (word) can have multiple tags (characters), and a tag (character) can be found in multiple blog posts (words).

Now that the analogy is somewhat poorly explained (I did my best), let’s look at the two models (well, actually threeish). I need to put all three at once as they refer to each other.


from typing import Optional

from sqlalchemy import JSON, Column, ForeignKey, String, Table
from sqlalchemy.orm import Mapped, mapped_column, relationship

from app.db.base import Base


word_character_association_table = Table(
    "word_characters",
    Base.metadata,
    Column("word_id", ForeignKey("words.id"), primary_key=True),
    Column("character_id", ForeignKey("characters.id"), primary_key=True),
)


class Character(Base):

    character: Mapped[str] = mapped_column(String(1))
    definition: Mapped[Optional[str]] = mapped_column(String(150), nullable=True)
    pinyin: Mapped[str] = mapped_column(String(50))
    decomposition: Mapped[Optional[str]] = mapped_column(String(15), nullable=True)
    etymology: Mapped[Optional[dict[str, str]]] = mapped_column(JSON(), nullable=True)
    radical: Mapped[str] = mapped_column(String(1))
    matches: Mapped[str] = mapped_column(String(300))
    frequency: Mapped[int]
    words: Mapped[set["Word"]] = relationship(secondary=word_character_association_table, back_populates="characters")

    def __repr__(self) -> str:
        return f"<Character({self.character}, radical='{self.radical})>"


class Word(Base):

    simplified: Mapped[str] = mapped_column(String(50))
    traditional: Mapped[str] = mapped_column(String(50))
    pinyin_num: Mapped[str] = mapped_column(String(150))  # at least 104
    pinyin_accent: Mapped[str] = mapped_column(String(100))
    pinyin_clean: Mapped[str] = mapped_column(String(100))
    pinyin_no_spaces: Mapped[str] = mapped_column(String(100))
    also_written: Mapped[str] = mapped_column(String(75))
    also_pronounced: Mapped[str] = mapped_column(String(75))
    classifiers: Mapped[str] = mapped_column(String(25))
    definitions: Mapped[str] = mapped_column(String(500))
    frequency: Mapped[int]
    characters: Mapped[set[Character]] = relationship(
        secondary=word_character_association_table, back_populates="words"
    )

    def __repr__(self) -> str:
        return f"<Word(simplified='{self.simplified}', pinyin='{self.pinyin_accent}'>"

The Character model has a bunch of columns we expect it to have, and the same goes for Word. The important bit is the words relationship on Character and the characters relationship on Word.

class Character(Base):
    ...
    words: Mapped[set["Word"]] = relationship(secondary=word_character_association_table, back_populates="characters")
    ...

class Word(Base):
    ...
    characters: Mapped[set[Character]] = relationship(
        secondary=word_character_association_table, back_populates="words"
    )
    ...

This is basically saying that words is a set of Word objects on the Character, and characters is a set of Character objects on the Word. The actual many-to-many relationship is given by the word_character_association_table, which is called “word_characters” and has two columns relating the two tables together. We create the migration with Alembic and migrate our database. But let’s not forget to add this to our alembic env.py file first:

from app.models.vocabulary import (  # noqa: F401
    Character,
    Word,
    word_character_association_table,
)

Otherwise, how will Alembic know about them?

alembic revision --autogenerate -m "Adding Words and Characters tables"

Populating the Word and Character Tables

Well, after applying the migration (alembic upgrade head), we have three new tables in our cnlearn database. Let’s connect to it, run \dt and we should see:

              List of relations
 Schema |      Name       | Type  |  Owner
--------+-----------------+-------+----------
 public | alembic_version | table | postgres
 public | characters      | table | postgres
 public | users           | table | postgres
 public | word_characters | table | postgres
 public | words           | table | postgres

We don’t have any data in characters, words or word_characters yet. Since we already had the data in the SQLite database in the GUI version (I honestly don’t know why I stopped, but hopefully when I’m not working I’ll have time to continue, there’s so much more we can do there), we’re just going to extract it from there and add it to our PostgreSQL database. I’m going to export to CSV from SQLite and import it in PG.

Assuming we are in the folder where we have the dictionary.db SQLite file:

sqlite3 dictionary.db
.headers on
.mode csv
.output words.csv
SELECT * FROM words;
.quit

Then, on the PostgreSQL side (e.g. in psql connected to the cnlearn database), we import the CSV:

COPY words(id,simplified,traditional,pinyin_num,pinyin_accent,pinyin_clean,pinyin_no_spaces,also_written,also_pronounced,classifiers,definitions,frequency)
FROM 'FULLPATH/words.csv'
DELIMITER ','
CSV HEADER;

We do the same export for the characters table:

sqlite3 dictionary.db
.headers on
.mode csv
.output characters.csv
SELECT * FROM characters;
.quit

and then the corresponding import on the PostgreSQL side:

COPY characters(id,character,definition,pinyin,decomposition,etymology,radical,matches,frequency)
FROM 'FULLPATH/characters.csv'
DELIMITER ','
CSV HEADER;

Aaaand a minute later, or less, they should all be there. Final thing: we need to create the many-to-many associations. To be honest, I might change my mind about this as I haven’t done it yet and I don’t know how big the table will get :D so let’s try. Sadly, I don’t know of a simple way to do it in PostgreSQL. The only way I could think of was to essentially use regexp_split_to_array with a pattern like ‘(?!^)’, but then I’d still have to get rid of things I don’t want, so let’s do it in Python. It’s gonna be a long loop… I know, but let’s try and see. Gonna run this synchronously:

from sqlalchemy import create_engine, select
from sqlalchemy.exc import NoResultFound
from sqlalchemy.orm import sessionmaker

from app.models.vocabulary import Character, Word


# this is for running locally, don't mind if you see my postgres postgres
# username and password
engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost:5432/cnlearn", echo=False)
engine.connect()
Session = sessionmaker(engine)

with Session() as session:
    all_words = session.execute(select(Word)).scalars()
    for word in all_words:  # type: Word
        characters: set[str] = set(
            character for character in word.simplified
            if character.isalpha()
        )
        if not characters:
            continue
        else:
            # let's get the character object
            character_set = set()
            for character in characters:
                try:
                    character_object = session.execute(
                        select(Character).where(Character.character==character)
                    ).scalar_one()
                except NoResultFound:
                    pass
                else:
                    character_set.add(character_object)
            if character_set:
                word.characters = character_set
                session.add(word)
    session.commit()

What do we do? We get all the words and go through them one by one; for each character in a word’s simplified form, we try to find a corresponding row in the characters table. If we find it, we add it to the character_set set, and at the end we set that as the word’s characters. We add the word to the session and commit once at the end. I should say this took about 30 minutes on my laptop to run and used a fair bit of memory :D so it’s not efficient, it’s not great! But it works.
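
If I ever need to rerun this, one obvious speed-up (an untested sketch, reusing the same engine/Session setup and models as above; char_lookup is just an illustrative name) would be to load all the characters into a dictionary up front, so we query the characters table once instead of once per character per word:

with Session() as session:
    # one query for all characters, keyed by the character string itself
    char_lookup: dict[str, Character] = {
        c.character: c for c in session.execute(select(Character)).scalars()
    }
    for word in session.execute(select(Word)).scalars():
        matched = {
            char_lookup[char]
            for char in word.simplified
            if char.isalpha() and char in char_lookup
        }
        if matched:
            word.characters = matched
    session.commit()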

I think that’s it for this post. In the next one (I’m gonna force myself to write it in the next 2-3 weeks), we’ll start querying for these and add some of the endpoints for them. We’ll start with the individual word and character ones and we’ll build up from there to some kind of search functionality maybe. The commit for this post is here.