CNLearn Dictionary and Search Objects

So this post has been in draft mode for a month. Today is the day (well, probably today was a week or so ago, haha) we are going to quickly implement our dictionary and search objects and… test them! Yay.

Also, today was actually a week and a bit ago. I just finished writing some of the tests, so I'm trying to remember what I started with. OK, so it's the Dictionary object. It's responsible for "connecting" to the database (but not in the sense of doing the actual connection, you'll see), and it will implement the search, sentence segmentation and other methods too (down the line). Let's have a look at its __init__() method.

from collections import defaultdict
from typing import (
    ClassVar,
    DefaultDict,
    Generator,
    List,
    Sequence,
    Tuple,
    Union,
)

from jieba import initialize, cut
from sqlalchemy.orm import Session

from src.db.crud import get_simplified_word, get_word_and_character
from src.db.models import Word as Word_model, Character as Character_model
from src.db.settings import SessionLocal
from src.schemas.structures import Word, Character
from src.search.textutils import extract_chinese_characters

class Dictionary:
    """
    Dictionary object. Will handle connecting to the database and
    implement search, segment and other methods.
    """
    def __init__(self):
        initialize()
        self._dictionary: ClassVar[Session] = SessionLocal()
        self.search_term: str = ""
        self.segmented_words: Generator[str, None, None]
        self.dictionary_cache: DefaultDict[str, List[Union[Character, Word]]] = defaultdict(list)
        self.words_found: List[Union[Word, Character]] = []
        self.unknown_words: List[str] = []
        self.search_history: DefaultDict[str, int] = defaultdict(int)

First we import all the things we need: some typing helpers, a couple of crud functions, the DB session settings, our schemas and models, two functions from jieba (responsible for our text segmentation) and a Chinese textutils function. What happens in the Dictionary class initialisation?

  1. We initialize() the jieba text segmentation engine
  2. We create a SQLAlchemy session and attach it to our Dictionary object.
  3. We set a search_term string variable as an empty string
  4. We set the type on the segmented_words Generator
  5. We create a dictionary cache default dict with a default of list
  6. We create a words_found list
  7. We create an unknown_words list
  8. We create a search_history defaultdict object with a default of int

Some of these might seem unnecessary but we will use all of them shortly. Before we look at the methods, let’s think of how the Dictionary search will work.

a = Dictionary()
a.search_chinese("你好")

It won’t actually return anything. The results will be in a.words_found and anything that couldn’t be found will be in a.unknown_words. Why is it like this? It’s because later on we will call some of the methods separately in different places and I don’t really want to deal with the return values there.
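
So, purely as an illustration (a rough sketch, assuming the schema fields you’ll see in the fixtures later), reading the results from the snippet above would look something like this:

for entry in a.words_found:
    # each entry is a Word or Character structure
    print(entry.simplified, entry.pinyin_accent, entry.definitions)
print(a.unknown_words)  # whatever couldn't be matched

With that out of the way, let’s look at the methods. First of all, there’s a segment_words method.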

    def segment_words(self) -> None:
        """
        This method segments the string into words using Jieba.
        """
        self.segmented_words = cut(self.search_term, cut_all=False)
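
If you haven’t seen jieba before: cut returns a generator of tokens, so the segmentation is lazy. As a rough illustration (the exact split depends on jieba’s dictionary):

from jieba import cut

# wrap the generator in list() just to look at it
print(list(cut("我喜欢意大利", cut_all=False)))
# something like ['我', '喜欢', '意大利'], depending on jieba's dictionary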

Why is it a separate method? Later on, it will receive some more parameters and I don’t want to repeat myself. Then, there’s a combine_word_and_character static method. What does that do? Well, you know how we have some information that’s in the database in the words table and some that’s in the characters table? If there’s a one-character word, it needs information from both. This function essentially combines the two.

    @staticmethod
    def combine_word_and_character(
        word_result: Word_model, character_result: Character_model
    ) -> Character:
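        """
        Create a Character structure from the Word ORM result, then copy the
        radical, decomposition and etymology over from the Character ORM result.
        """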
        character: Character = Character.from_orm(word_result)
        character.radical = character_result.radical
        character.decomposition = character_result.decomposition
        character.etymology = character_result.etymology
        return character

It takes in a Word model and a Character model (remember that the models refer to the database ORM models). It then creates a Character structure with the information from the word_result and then fills in the extra fields from the character_result. OK, those are the only two methods besides the search_chinese method. Let’s have a look at it.

    def search_chinese(self, search_term: str) -> None:
        """
        This method implements the search functionality for Chinese strings.
        It is what external programs will interact with.
        """
        # replace the current search term
        self.search_term = search_term
        # clear words found from previous search (still in cache)
        self.words_found.clear()
        # segment the search term
        self.segment_words()
        # iterate through each segmented word
        for word in self.segmented_words:
            # only look for it if it's not empty space
            if word.strip():
                # increase its value in the search_history
                self.search_history[word] += 1
                # first check to see if it's not cached
                if not self.dictionary_cache.get(word):
                    # check to see if it's a multiple character word, or single character word
                    if len(word) == 1:
                        word_character_results: List[
                            Tuple[Word_model, Character_model]
                        ] = get_word_and_character(self._dictionary, word)
                        # if the list is empty the next thing won't run
                        for word_result, character_result in word_character_results:
                            # use the Character structure
                            character: Character = self.combine_word_and_character(word_result, character_result)
                            self.words_found.append(character)
                            self.dictionary_cache[character.simplified].append(character)
                    else:
                        current_words = get_simplified_word(self._dictionary, word)
                        # for each of the words found, get their component characters
                        for result in current_words:
                            word = result.Word
                            component_characters = extract_chinese_characters(
                                word.simplified
                            )
                            current_word: Word = Word.from_orm(word)
                            for character, pinyin in zip(
                                component_characters, current_word.pinyin_accent.split()
                            ):
                                word_character_results = get_word_and_character(
                                    self._dictionary, character, pinyin_accent=pinyin
                                )
                                if len(word_character_results) == 0:
                                    word_character_results = get_word_and_character(
                                        self._dictionary, character
                                    )
                                for word_result, character_result in word_character_results:
                                    character: Character = self.combine_word_and_character(word_result, character_result)
                                    current_word.components.append(character)
                            self.words_found.append(current_word)
                            self.dictionary_cache[current_word.simplified].append(current_word)
                else:
                    cached_result = self.dictionary_cache.get(word)
                    self.words_found.extend(cached_result)
        # the result is stored in words_found

So it gets a search_term string and sets self.search_term to it. (Why? It will be useful later on.) It clears the words_found list from the previous search (note that any previous results will still be in the cache). It segments the words. Nothing crazy for now. Then, we iterate through the segmented words one at a time. What happens then? I do like having a list of steps :). So:

  1. Strip any empty space and if there’s anything left, continue.
  2. Increase the search_history value of that word by 1. (this is part of an upcoming feature, might as well add it in now :) )
  3. Check if the word is not already in the dictionary cache. If it isn’t:
    1. Check if it is a single character word or not. If it is single character word:
      1. Call the get_word_and_character function to return the information from both tables at once.
      2. If there are multiple pronunciations and therefore multiple word results, combine the information from the Word result and Character results for each one and add that to the words_found list.
      3. Also add the word found to the dictionary_cache for future (but still in the same session) searches.
    2. If it isn’t a single character word, call the get_simplified_word on the word.
    3. For each result returned, create a Word structure for it. Extract the set of unique characters present in that word.
    4. Then, for each character present, call the get_word_and_character function on that character but also pass in the pinyin from the word. The reason we pass in the pinyin for that character is to get the relevant character (for the ones with multiple pronunciations).
    5. While we pass in the pinyin, there are cases when doing that will result in no results being found. Why? Well, because in certain words, certain characters change or lose their tone. Consequently, it’s not their “standard” form in the dictionary so the previous search might fail. If it does, redo the search without passing in the pinyin. (I’m still thinking whether there’s a better way for this part; see the sketch after this list.)
    6. For each character/word results found, create a Character structure and add in the information from the Character and from the Word.
    7. Append the Word structure with the Character components to the words_found and also add to the cache.
  4. If in cache, get from cache.
  5. Extend the words_found list with the results present in the cache.
  6. We have our results! yay
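
To make step 5 a bit more concrete, here is the pinyin fallback on its own (a minimal sketch reusing the same crud function; session, character and pinyin stand in for the values search_chinese has at that point):

# try the character restricted to the pinyin it has inside this word
results = get_word_and_character(session, character, pinyin_accent=pinyin)
if not results:
    # e.g. 思 is toneless ("si") inside 不好意思 but listed as "sī" on its own,
    # so the restricted lookup finds nothing; retry without the pinyin filter
    results = get_word_and_character(session, character)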

And that is our current implementation of the search_chinese method of the Dictionary object. It will definitely change in the future as we add more features and find more edge cases that I have not yet thought about. Still, it works :).

Test Me Please

YOU WANT TESTS? You’re getting tests. There are only a few for now for some of the cases that failed in my previous implementation of CNLearn but they’re a good starting point. I am constantly adding more tests so that if someone ever decides to contribute to this project (funny right?), there’s extensive testing for it. I haven’t added coverage for now but it’s on my TODO list.

(Quick aside: I also modified some of the CRUD functionality, well, the Read functionality really, there’s no CUD haha, so I also had to modify some of the previous tests. Nothing major though.)

In the test_dictionary.py file we have some fixtures first and then some tests. Quick look at the fixtures:

from typing import List

import pytest

from src.db.models import Word as Word_model, Character as Character_model
from src.schemas.structures import Character, Word
from src.search.dictionary import Dictionary

@pytest.fixture
def dictionary() -> Dictionary:
    """
    Returns a reusable database session.
    """
    d = Dictionary()
    return d


@pytest.fixture
def hao_characters() -> List[Character]:
    hao_1 = Character(
        id=27949,
        definitions="good; well; proper; good to; easy to; very; so; (suffix indicating completion or readiness); (of two people) close; on intimate terms; (after a personal pronoun) hello",
        stroke_diagram=None,
        simplified="好",
        traditional="好",
        pinyin_num="hao3",
        pinyin_accent="hǎo",
        pinyin_clean="hao",
        also_pronounced="",
        also_written="",
        classifiers="",
        frequency=165789,
        character_type=None,
        radical="女",
        decomposition="⿰女子",
        etymology={"type": "ideographic", "hint": "A woman\xa0女 with a son\xa0子"},
    )
    hao_2 = Character(
        id=27950,
        definitions="to be fond of; to have a tendency to; to be prone to",
        stroke_diagram=None,
        simplified="好",
        traditional="好",
        pinyin_num="hao4",
        pinyin_accent="hào",
        pinyin_clean="hao",
        also_pronounced="",
        also_written="",
        classifiers="",
        frequency=165789,
        character_type=None,
        radical="女",
        decomposition="⿰女子",
        etymology={"type": "ideographic", "hint": "A woman\xa0女 with a son\xa0子"},
    )
    return [hao_1, hao_2]


@pytest.fixture
def buhaoyisi_word() -> Word:
    """
    Returns the Word structure and Character components for 不好意思.
    """
    bu: Character = Character(
        id=1846,
        definitions="(negative prefix); not; no",
        stroke_diagram=None,
        simplified="不",
        traditional="不",
        pinyin_num="bu4",
        pinyin_accent="bù",
        pinyin_clean="bu",
        also_pronounced="",
        also_written="",
        classifiers="",
        frequency=459467,
        character_type=None,
        radical="一",
        decomposition="⿱一?",
        etymology={"type": "ideographic", "hint": "A bird flying toward the sky\xa0一"},
    )
    hao: Character = Character(
        id=27949,
        definitions="good; well; proper; good to; easy to; very; so; (suffix indicating completion or readiness); (of two people) close; on intimate terms; (after a personal pronoun) hello",
        stroke_diagram=None,
        simplified="好",
        traditional="好",
        pinyin_num="hao3",
        pinyin_accent="hǎo",
        pinyin_clean="hao",
        also_pronounced="",
        also_written="",
        classifiers="",
        frequency=165789,
        character_type=None,
        radical="女",
        decomposition="⿰女子",
        etymology={"type": "ideographic", "hint": "A woman\xa0女 with a son\xa0子"},
    )
    yi: Character = Character(
        id=39615,
        definitions="idea; meaning; thought; to think; wish; desire; intention; to expect; to anticipate",
        stroke_diagram=None,
        simplified="意",
        traditional="意",
        pinyin_num="yi4",
        pinyin_accent="yì",
        pinyin_clean="yi",
        also_pronounced="",
        also_written="",
        classifiers="",
        frequency=8201,
        character_type=None,
        radical="心",
        decomposition="⿱音心",
        etymology={
            "type": "pictophonetic",
            "phonetic": "音",
            "semantic": "心",
            "hint": "heart",
        },
    )
    si: Character = Character(
        id=38511,
        definitions="to think; to consider",
        stroke_diagram=None,
        simplified="思",
        traditional="思",
        pinyin_num="si1",
        pinyin_accent="sī",
        pinyin_clean="si",
        also_pronounced="",
        also_written="",
        classifiers="",
        frequency=6943,
        character_type=None,
        radical="心",
        decomposition="⿱田心",
        etymology={
            "type": "ideographic",
            "hint": "Weighing something with your mind\xa0囟 (altered) and heart\xa0心",
        },
    )
    buhaoyisi: Word = Word(
        id=2117,
        definitions="to feel embarrassed; to find it embarrassing; to be sorry (for inconveniencing sb)",
        stroke_diagram=None,
        simplified="不好意思",
        traditional="不好意思",
        pinyin_num="bu4 hao3 yi4 si5",
        pinyin_accent="bù hǎo yì si",
        pinyin_clean="bu hao yi si",
        also_pronounced="",
        also_written="",
        classifiers="",
        frequency=3667,
        pinyin_no_spaces="buhaoyisi",
        components=[bu, hao, yi, si],
        radical=None,
        hsk=None,
    )
    return buhaoyisi


@pytest.fixture
def yidali_word() -> Word:
    return Word(
        id=39630,
        definitions="Italy; Italian",
        stroke_diagram=None,
        simplified="意大利",
        traditional="意大利",
        pinyin_num="Yi4 da4 li4",
        pinyin_accent="Yì dà lì",
        pinyin_clean="Yi da li",
        also_pronounced="",
        also_written="",
        classifiers="",
        frequency=4049,
        pinyin_no_spaces="Yidali",
        components=[
            Character(
                id=39614,
                definitions="Italy; Italian; abbr. for 意大利(Yì dà lì)",
                stroke_diagram=None,
                simplified="意",
                traditional="意",
                pinyin_num="Yi4",
                pinyin_accent="Yì",
                pinyin_clean="Yi",
                also_pronounced="",
                also_written="",
                classifiers="",
                frequency=8201,
                character_type=None,
                radical="心",
                decomposition="⿱音心",
                etymology={
                    "type": "pictophonetic",
                    "phonetic": "音",
                    "semantic": "心",
                    "hint": "heart",
                },
            ),
            Character(
                id=25709,
                definitions="big; huge; large; major; great; wide; deep; older (than); oldest; eldest; greatly; very much; (dialect) father; father's elder or younger brother",
                stroke_diagram=None,
                simplified="大",
                traditional="大",
                pinyin_num="da4",
                pinyin_accent="dà",
                pinyin_clean="da",
                also_pronounced="",
                also_written="",
                classifiers="",
                frequency=176304,
                character_type=None,
                radical="大",
                decomposition="⿻一人",
                etymology={
                    "type": "ideographic",
                    "hint": "A man\xa0人 with outstretched arms",
                },
            ),
            Character(
                id=12976,
                definitions="sharp; favorable; advantage; benefit; profit; interest; to do good to; to benefit",
                stroke_diagram=None,
                simplified="利",
                traditional="利",
                pinyin_num="li4",
                pinyin_accent="lì",
                pinyin_clean="li",
                also_pronounced="",
                also_written="",
                classifiers="",
                frequency=11305,
                character_type=None,
                radical="刂",
                decomposition="⿰禾刂",
                etymology={
                    "type": "ideographic",
                    "hint": "Harvesting\xa0\xa0grain\xa0禾",
                },
            ),
        ],
        radical=None,
        hsk=None,
    )

We have a dictionary fixture (which creates a Dictionary object connected to the actual database file). Then, we have a fixture for the various 好 word/character results. Then, one for the 不好意思 word (this one specifically because the 思 loses its tone when in that word, so I am testing that when the pinyin gets passed for that character, nothing is found, and then the search happens again without the pinyin being passed). Then, one for the 意大利 word. Why that one? Because the 意 character has the Yì pinyin, not yì, so it tests that the correct pinyin gets passed. Again, only a few cases for now. More will be added later and at some point I will also move the fixtures into a separate file, I think, otherwise this one gets tooooooooooooooo long.

Testing Dictionary Object Creation

The first test,

def test_dict_is_created(dictionary):
    """
    Checks if the dictionary object was created and is usable.
    """
    assert dictionary.search_term == ""
    assert len(dictionary.dictionary_cache) == 0
    assert len(dictionary.unknown_words) == 0
    assert len(dictionary.words_found) == 0
    assert len(dictionary.search_history) == 0

where dictionary refers to the dictionary fixture, tests whether the state of the newly created dictionary object is as we’d expect: an empty search term and 0 entries in the cache, in unknown_words, in words_found and in search_history. (Did you notice how I punctuated the code? Is that a thing to do? I recently had to do that for my thesis corrections… not fun, but it looks so much prettier. Thanks Chris for suggesting I should do that!)

Testing One Character/Word with Multiple Results

In the second test,

def test_one_character_multiple_results(dictionary, hao_characters):
    """
    Checks the results for when `好` is searched for.
    """
    dictionary.search_chinese("好")
    assert dictionary.search_term == "好"
    assert len(dictionary.words_found) == 2
    assert "好" in dictionary.dictionary_cache
    assert len(dictionary.dictionary_cache) == 1
    assert len(dictionary.dictionary_cache["好"]) == 2
    assert all(character in dictionary.words_found for character in hao_characters)
    assert all(
        character in dictionary.dictionary_cache["好"] for character in hao_characters
    )

the hao_characters argument refers to the hao_characters fixture from earlier. So what are we testing? We call the search_chinese method with 好 as the search phrase. We test that the search_term got set to that. We test that two words were found and added to the words_found list. Why two? Because there are two pronunciations for it. We test that 好 is in the dictionary_cache. We check that in the dictionary cache for 好 there are two results (one for each pronunciation). Then we check that the fixtures we defined are present in both the words_found list and the dictionary_cache entry for 好. And that’s the test for 好 done.

Testing The Dictionary Cache is Actually Used

In the third test,

def test_dictionary_reuse(dictionary, mocker):
    """
    in order to see the cache work, let's mock the crud functions and
    see how many times they are called.
    """
    hao_word_1 = Word_model(
        traditional="好",
        pinyin_accent="hǎo",
        id=27949,
        pinyin_no_spaces="hao",
        also_pronounced="",
        definitions="good; well; proper; good to; easy to; very; so; (suffix indicating completion or readiness); (of two people) close; on intimate terms; (after a personal pronoun) hello",
        frequency=165789,
        pinyin_num="hao3",
        simplified="好",
        pinyin_clean="hao",
        also_written="",
        classifiers="",
    )
    hao_word_2 = Word_model(
        traditional="好",
        pinyin_accent="hào",
        id=27950,
        pinyin_no_spaces="hao",
        also_pronounced="",
        definitions="to be fond of; to have a tendency to; to be prone to",
        frequency=165789,
        pinyin_num="hao4",
        simplified="好",
        pinyin_clean="hao",
        also_written="",
        classifiers="",
    )
    hao_character = Character_model(
        character="好",
        pinyin="hǎo",
        etymology={"type": "ideographic", "hint": "A woman\xa0女 with a son\xa0子"},
        matches="[[0], [0], [0], [1], [1], [1]]",
        definition="good, excellent, fine; proper, suitable; well",
        decomposition="⿰女子",
        id=1595,
        radical="女",
        frequency=165789,
    )
    mocked_get_word_and_character = mocker.patch(
        # the function to mock is imported in the dictionary class file
        "src.search.dictionary.get_word_and_character",
        return_value=[
            (hao_word_1, hao_character),
            (hao_word_2, hao_character),
        ],
    )
    dictionary.search_chinese("好")
    assert mocked_get_word_and_character.call_count == 1

    # let's do the same search again. this time, the dictionary object
    # should get it from cache
    dictionary.search_chinese("好")
    assert mocked_get_word_and_character.call_count == 1

    # ok maybe you don't believe me? Let's look for another word
    # please note that the return won't be the correct one for
    # 我, rather it will be the same as it was for 好
    # but that's fine, I'm testing how many times the function
    # gets called and if the cache works
    dictionary.search_chinese("我")
    assert mocked_get_word_and_character.call_count == 2
    # now do you believe me?
    dictionary.search_chinese("好")
    assert mocked_get_word_and_character.call_count == 2
    # and now??? I can keep going...
    dictionary.search_chinese("好")
    assert mocked_get_word_and_character.call_count == 2
    # ok that's enough for this test.

the dictionary argument refers to the dictionary object fixture from earlier, but we have a new argument: mocker. It is a fixture that comes from the pytest-mock package, which wraps some of the API provided by unittest.mock in the Python standard library. Why do we need it? Because we want to count how many times a function gets called. We could have written a decorator for that as well, but then we’d have to decorate our function at the source, and I don’t want to do that, so I am mocking the function and its return value. Note that I first set the hao_word_1, hao_word_2 and hao_character SQLAlchemy models equal to what the database would return. They are not the Pydantic structures here. Then, note how I am mocking the function: not where it comes from, but where it’s imported and used. In other words, I am not mocking “src.db.crud.get_word_and_character” but “src.search.dictionary.get_word_and_character”, because that’s where the function is imported and used. I also set its return value to a list of tuples because that is what would get returned when 好 is searched for.
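
In other words (a throwaway sketch, not part of the test file):

# would NOT intercept the call: dictionary.py did
# `from src.db.crud import get_word_and_character`, so it holds
# its own reference to the original function
mocker.patch("src.db.crud.get_word_and_character")

# DOES intercept it: this replaces the name that
# src/search/dictionary.py actually looks up when calling the function
mocker.patch("src.search.dictionary.get_word_and_character")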

We then invoke the search_chinese method and check its call_count. It’s 1. We then do the same search. It’s still 1. But then, maybe you don’t believe me. So I call search_chinese with 我 as the argument. Note that the return value here doesn’t matter: it’s going to return the results for 好, but I don’t care. I just want to see that when I searched for 我, the function got called again, so its call_count increased to 2. Then I searched for 好 again, and again… and the call_count stays at 2. So our cache “works”. (TODO: still need to implement a character cache, not just a word one.)

Testing 不好意思 Multiple Character Word Where the Pinyin of a Character Changes From the Standard

That’s a long subheading. Luckily the fourth test,

def test_multiple_character_word(dictionary, buhaoyisi_word):
    """
    This will test a multiple character word that will find the word
    and have 4 component characters.
    """
    dictionary.search_chinese("不好意思")
    assert buhaoyisi_word in dictionary.words_found

is short. It asserts that the result is as we’d expect. I mentioned it earlier, but when looking for the 思 character, the search initially doesn’t find anything when passing in the pinyin si. So then it searches for it again without the pinyin.

Testing 意大利 Multiple Character Word Where the Pinyin of a Character is not the Standard One BUT There Is a Result for It

Even longer subheading, but the test for it,

def test_multiple_character_word_2(dictionary, yidali_word):
    """
    This will test a multiple character word with a specific 
    pinyin that is found in the database.
    """
    dictionary.search_chinese("意大利")
    assert yidali_word in dictionary.words_found

is once again short. I think it’s pretty self-explanatory: it just ensures the result is what we’d expect. The reasoning was mentioned earlier but, for your convenience, this test exists because the 意 character has the Yì pinyin, not yì, so it checks that the correct pinyin gets passed.

The commit for this post is here