CNLearn Dictionary Search

This series of posts originally appeared on the CNLearn website. However, I am planning to put a web version there (one I will keep adding features to), so I am moving the series here and leaving a simple cover page in its place. Furthermore, since writing it I have rewritten the project a few times, discovered new things, and found mistakes in my implementation, so I decided to rewrite every post to account for those changes. If you're interested in reading the original post, feel free to go to the website (temporarily), but I have also saved it in an archive link here.

The project started because there was no fully-featured Chinese learning tool on Ubuntu (that I could find). There was an old dictionary but I couldn't get it working. That's when I decided to implement my own. Also, full-text search is a pain; no wonder there are so many implementations of it. The dictionary I used was CC-CEDICT, which is licensed under CC BY-SA 3.0. Consequently, I am required to give appropriate credit, link to the license, and indicate any changes. My contributions will also be released under the same license as the original. The cross-platform version will be open source, with easy access to all the databases created from the CC-CEDICT data. Hopefully that covers all grounds. Since writing that post, I also wanted to add information on the characters themselves (their radicals, decomposition, etc.). For that, I used data from the Make Me a Hanzi project, in both the desktop app (i.e. Linux, Windows, Mac) and the web version.

Initially I thought that the format of each word in the dictionary was as follows:

交戰 交战 [jiao1 zhan4] /to fight/to wage war/

Then I realised it could also include classifier words (a.k.a. measure words) for certain nouns. Then I realised it could also have an "also written as" part. Then I realised it could also have an "also pronounced as" part. You probably see where this is going. Many realisations.
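As a rough illustration (this is my sketch, not the project's actual parser), a basic CC-CEDICT line can be pulled apart with a regular expression. The variants mentioned above (classifiers, "also written", "also pronounced") arrive as extra `/.../` definition segments and need further handling on top of this:

```python
import re

# minimal sketch of splitting a basic CC-CEDICT entry:
# "traditional simplified [pinyin] /def 1/def 2/"
line = "交戰 交战 [jiao1 zhan4] /to fight/to wage war/"
match = re.match(r"(\S+) (\S+) \[([^\]]+)\] /(.+)/", line)
if match:
    traditional, simplified, pinyin, raw_definitions = match.groups()
    # individual definitions are separated by "/"
    definitions = raw_definitions.split("/")
    print(traditional, simplified, pinyin, definitions)
```

A dedicated pass would then pick out `CL:`, "also written", and "also pronounced" entries from `definitions` and route them to their own columns.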

So I started with a small Python application that would extract that information and save it to a database. I used SQLAlchemy with SQLite. Consequently, this post also touches on and improves upon that earlier post. The two models I used for the dictionary are as follows:


class Word(Base):
    __tablename__ = "words"

    id = Column(Integer, primary_key=True)
    simplified = Column(String(50))
    traditional = Column(String(50))
    pinyin_num = Column(String(100))
    pinyin_accent = Column(String(100))
    pinyin_clean = Column(String(100))
    pinyin_no_spaces = Column(String(100))
    also_written = Column(String(100))
    also_pronounced = Column(String(100))
    classifiers = Column(String(100))
    definitions = Column(String(500))
    frequency = Column(Integer)

    def __repr__(self):
        return f"<Word(simplified='{self.simplified}', pinyin='{self.pinyin_accent}')>"



class Character(Base):
    __tablename__ = "characters"

    id = Column(Integer, primary_key=True)
    character = Column(String(1))
    definition = Column(String(150), nullable=True)
    pinyin = Column(String(50))
    decomposition = Column(String(15))
    etymology = Column(JSON(), nullable=True)
    radical = Column(String(1))
    matches = Column(String(100))
    frequency = Column(Integer)

    def __repr__(self):
        return f"<Character({self.character}, radical='{self.radical}')>"
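For completeness, here is roughly how such declarative models get wired to an SQLite database. The `Base`, engine URL, and session boilerplate below are my assumptions about the surrounding setup (shown with a trimmed-down `Word`), not code from the project:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Word(Base):
    __tablename__ = "words"
    # trimmed to a few columns for the sketch
    id = Column(Integer, primary_key=True)
    simplified = Column(String(50))
    pinyin_accent = Column(String(100))

# an in-memory database for illustration; the real app writes to a file
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Word(simplified="交战", pinyin_accent="jiāo zhàn"))
    session.commit()
    word = session.query(Word).filter_by(simplified="交战").one()
    print(word)
```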

Why didn’t I use one of the fields from the CEDICT file as a primary key? Because none were actually unique and I didn’t want to go with a composite key. Keeping in mind these fields, let’s look at the extraction process. It uses a few functions designed for pinyin manipulation, so let’s look at those first. The main function that is called is convert_pinyin.


def convert_pinyin(
    item: Union[str, List[str]], flag: str
) -> Union[str, List[str], None]:
    """
    This function converts pinyin with numbers to either pinyin with tone marks
    (accents) or clean pinyin (no numbers or accents).
    """
    if flag in ("accent", "clean"):
        if isinstance(item, str):
            if flag == "accent":
                return convert_to_pinyin_accent(item)
            # flag is clean
            return convert_to_pinyin_clean(item)
        if isinstance(item, list):
            pinyin_list: List = []
            for i in item:
                pinyin_list.append(convert_pinyin(i, flag))
            return pinyin_list
        raise ValueError("Text must be a string or list of strings.")
    raise ValueError("Flag must be `accent` or `clean`.")

Depending on whether a string or a list of strings is passed, the function either converts the item directly or recurses over each element of the list. It also accepts a flag that can be "accent" or "clean", so that it returns either pinyin with tone accents or pinyin stripped of both numbers and tone accents. If a different flag is passed, a ValueError is raised. Let's inspect the convert_to_pinyin_accent function first:

def convert_to_pinyin_accent(word: str) -> str:
    """
    This function converts a pinyin with numbers to pinyin with accents.
    """
    # the list below is a list of vowels that appear in the CEDICT pinyin
    vowels: List[str] = ["a", "e", "i", "o", "u", "u:", "A", "E", "I", "O", "U", "U:"]
    # the dictionary below will convert the vowels to vowels with accents
    # depending on their tone as specified at the end of the word
    vowel_dict: Dict[int, List[str]] = {
        1: ["ā", "ē", "ī", "ō", "ū", "ǖ", "Ā", "Ē", "Ī", "Ō", "Ū", "Ǖ"],
        2: ["á", "é", "í", "ó", "ú", "ǘ", "Á", "É", "Í", "Ó", "Ú", "Ǘ"],
        3: ["ǎ", "ě", "ǐ", "ǒ", "ǔ", "ǚ", "Ǎ", "Ě", "Ǐ", "Ǒ", "Ǔ", "Ǚ"],
        4: ["à", "è", "ì", "ò", "ù", "ǜ", "À", "È", "Ì", "Ò", "Ù", "Ǜ"],
        5: ["a", "e", "i", "o", "u", "ü", "A", "E", "I", "O", "U", "Ü"],
    }
    tone: int = ord(word[-1]) - 48
    pos: Union[int, None]
    pinyin_word: str
    if 0 < tone < 6:
        word_without_tone = word[0:-1]
        if tone < 5:
            # the following vowels/pairs always get the marker
            search_list: List[str] = ["a", "e", "ou"]
            # check if the word_without_tone has any of them
            found: List[bool] = [vowel in word_without_tone for vowel in search_list]
            if any(found):
                vowel: str = search_list[found.index(True)]
                pos = word_without_tone.find(vowel)
            else:
                pos = last_vowel(word_without_tone)

            # now we need to check whether the vowel position is
            # followed by : since we would have to consider two letters
            if pos is not None:
                to_replace: str
                try:
                    if word_without_tone[pos + 1] == ":":
                        to_replace = word_without_tone[pos : pos + 2]
                    else:
                        to_replace = word_without_tone[pos : pos + 1]
                except IndexError:
                    to_replace = word_without_tone[pos : pos + 1]
                pinyin_word = word_without_tone.replace(
                    to_replace, vowel_dict[tone][vowels.index(to_replace)]
                )
            else:
                pinyin_word = word_without_tone
        else:
            pinyin_word = word_without_tone.replace("u:", "ü")
            pinyin_word = pinyin_word.replace("U:", "Ü")
    else:
        pinyin_word = word
    return pinyin_word

It takes in a string containing pinyin with tone numbers. It looks at the last character in the string, gets its Unicode code point, and subtracts 48. Why? It saves me from doing a try: int(word[-1]) with the exception handling that would follow. Then I check whether the tone value is between 0 and 6 (both exclusive). If it is not, that means there is no pinyin tone number and the word is simply returned. If it is, more follows. We set word_without_tone to the word without its last character. Then an if condition checks whether the tone number is less than 5 (i.e. 1, 2, 3 or 4). If it's 5, the syllable is neutral, so no tone accent is needed. That said, if the word does contain "u:" or "U:", it is still replaced by "ü" or "Ü" respectively.
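The ord trick is plain code-point arithmetic: the digits "0" through "9" occupy code points 48–57, so subtracting 48 from the code point of a digit character recovers its numeric value, while any non-digit final character lands far outside the 1–5 range:

```python
# ord("1") is 49, so subtracting 48 recovers the digit: 49 - 48 == 1
tone = ord("jiao1"[-1]) - 48
print(tone)  # 1

# a string without a trailing tone number ends in a letter, whose code
# point minus 48 is well outside the 1..5 range (ord("g") - 48 == 55)
no_tone = ord("zhang"[-1]) - 48
print(no_tone)  # 55
```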

If it does have a tone mark, we first check for a few vowels or pairs of vowels that always get the marker: a, e or ou. If one of them is present, the tone goes on it. Otherwise, we look for the last vowel. We do that by using the last_vowel function.


def last_vowel(word: str) -> Union[int, None]:
    """
    This function returns the position of the last vowel in a word.
    """
    vowels: str = "aeiouAEIOU"
    # walk the word backwards; the first vowel we meet is the last one.
    # (the caller checks for a following ":" itself, so "u:" needs no
    # special handling here)
    for position in range(len(word) - 1, -1, -1):
        if word[position] in vowels:
            return position
    return None

Is it the quickest way to find it? No. Is it the best way to write it? No. Is it easier to understand like this? Yes. This is the way (my way, at least).
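A quick worked example of the selection rule (my illustration, not code from the project): for "xiong1", none of a, e or ou appears, so the fall-back scan from the end of the syllable puts the marker on the "o":

```python
word_without_tone = "xiong"  # from "xiong1", tone stripped

# a, e and ou always win if present
search_list = ["a", "e", "ou"]
found = [vowel in word_without_tone for vowel in search_list]
print(any(found))  # False: none of them occurs in "xiong"

# otherwise, scan backwards for the last vowel
pos = next(
    i
    for i in range(len(word_without_tone) - 1, -1, -1)
    if word_without_tone[i] in "aeiou"
)
print(pos, word_without_tone[pos])  # 2 o -> the tone lands there: xiōng
```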

So we get the position of the last vowel (if it exists). Then we still have to check whether the vowel is followed by a ":", as we would then have to add the two dots on top of the letter (I can never remember what those two dots are called). Then we replace, and get the correct word. Et voilà. We have our pinyin. If, however, we want pinyin without numbers or tone marks, we use:
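To make the ":" handling concrete, here is the replacement step by hand for "nu:3" (my worked example): the last vowel is the "u" at position 1, the character after it is ":", so both characters are replaced together with the third-tone "ǚ":

```python
word_without_tone = "nu:"  # from "nu:3", tone 3 stripped
pos = 1                    # position of the last vowel, "u"

# the vowel is followed by ":", so replace the two-character "u:"
to_replace = word_without_tone[pos : pos + 2]
pinyin_word = word_without_tone.replace(to_replace, "ǚ")
print(pinyin_word)  # nǚ
```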


def convert_to_pinyin_clean(word: str) -> str:
    """
    This functions converts from pinyin with numbers to pinyin
    without numbers or accents.
    """
    tone: int = ord(word[-1]) - 48
    pinyin_clean: str
    if 0 < tone < 6:
        pinyin_clean = word[0:-1]
    else:
        pinyin_clean = word
    return pinyin_clean

Where are these functions used? I use them in a script that extracts the information from the CEDICT file, combines it with data from a word-frequency file, and then adds the words to the database in the "words" table. But that will be the next post!