Many ChatGPT users report that they suddenly see Chinese characters instead of Persian or English. This technical glitch disrupts the user experience and has raised important questions about how language processing works in artificial-intelligence models. Below, we explain the main causes of this problem and some temporary workarounds, in simple terms and with examples.
Suppose you type the message "Hello" into ChatGPT, but the output comes back as a string of Chinese characters!
Causes
1. Tokenization
Large language models such as ChatGPT and Claude are trained on huge multilingual corpora. In the tokenization step (breaking text into processing units), Persian or English can quickly be displaced by Chinese, because Chinese is abundant in the training data and each of its characters carries meaning on its own.
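To make "breaking text into processing units" concrete, here is a minimal sketch using OpenAI's open-source tiktoken library (an illustration we have added; it is not from any vendor's documentation):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-4-era OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

text = "Hello, world"
token_ids = enc.encode(text)                       # text -> integer token IDs
pieces = [enc.decode([tid]) for tid in token_ids]  # each ID back to its text piece

print(token_ids)  # a short list of integers
print(pieces)     # e.g. ['Hello', ',', ' world']
```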
2. The compact structure of Chinese
Unlike Persian and English, Chinese does not put spaces between words, and each character can be meaningful on its own. Most tokenizers are designed around spaces; so once the model slips into producing Chinese text, it can quickly generate a chain of similar characters.
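The density difference is easy to see by counting tokens per character across languages; a small sketch, again using tiktoken as an assumed illustration:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Hello, how are you?",
    "Persian": "سلام، حال شما چطور است؟",
    "Chinese": "你好，你好吗？",
}

# Chinese packs much more meaning into each character, so a single
# token often stands for a whole word.
for lang, text in samples.items():
    print(f"{lang}: {len(text)} characters -> {len(enc.encode(text))} tokens")
```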
3. Text encoding problems (mojibake)
Sometimes the software or browser itself displays the text with the wrong encoding (e.g., saved as UTF-8 but read as GBK). In that case, instead of Persian or English letters, "meaningless characters" (often Chinese) appear.
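The following sketch reproduces that mojibake effect: a Persian string is stored as UTF-8 bytes and then wrongly decoded as GBK, a Chinese encoding:

```python
# A Persian greeting, correctly stored as UTF-8 bytes.
original = "سلام دنیا"
utf8_bytes = original.encode("utf-8")

# Software that wrongly assumes the bytes are GBK (a Chinese encoding)
# pairs them up into Chinese-looking characters instead.
garbled = utf8_bytes.decode("gbk", errors="replace")

print(original)  # سلام دنیا
print(garbled)   # a string of Chinese-looking characters (mojibake)
```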
4. Bias in training data
A significant portion of LLM training data is in English and Chinese. Compression methods such as byte-pair encoding (BPE) or WordPiece may produce extra or incorrect tokens for Chinese. When the input trips the model up, it drifts toward a language whose tokens appeared more often in training.
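As a toy sketch of why frequency matters in BPE-style compression (our illustration, not a real tokenizer): merges are learned from pair counts, so whichever language dominates the corpus dominates the vocabulary:

```python
from collections import Counter

# Toy corpus: mostly English, a little Chinese.
corpus = ["low", "lower", "lowest", "你好", "你好"]

# Count adjacent character pairs; real BPE repeatedly merges the top pair.
pairs = Counter()
for word in corpus:
    for a, b in zip(word, word[1:]):
        pairs[(a, b)] += 1

print(pairs.most_common(3))
# English pairs like ('l', 'o') and ('o', 'w') win, so they get merged
# first and the dominant language ends up with the most compact tokens.
```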
5. Instability under uncertainty
Research shows that when they err or hesitate, multilingual models can switch into high-frequency, well-represented languages such as Chinese and produce unrelated output.
Current solutions
1. Tokenizer updates
Using newer tokenizer versions (e.g., cl100k_base) that represent repeated Chinese characters as a single token and prevent sudden runaway character chains.
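A rough way to see the effect of the tokenizer version (our sketch, comparing tiktoken's older and newer encodings on the same string):

```python
import tiktoken

text = "你好你好你好"  # repeated Chinese characters

# r50k_base is the GPT-3-era encoding; cl100k_base is the newer one.
for name in ("r50k_base", "cl100k_base"):
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(text))} tokens")
# The newer encoding typically needs far fewer tokens for the same
# Chinese text, leaving less room for runaway character chains.
```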
2. Disabling the slow tokenizer mode
In some open-source models, poorly trained subword tokens are removed from the generation loop by deactivating the "slow tokenizer" mode.
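In Hugging Face Transformers, for instance, the fast (Rust-based) tokenizer can be requested explicitly; the model name below is a placeholder, not a real checkpoint:

```python
from transformers import AutoTokenizer

# "some-org/some-model" is a hypothetical name, used only for illustration.
tokenizer = AutoTokenizer.from_pretrained(
    "some-org/some-model",
    use_fast=True,  # prefer the fast tokenizer over the legacy "slow" one
)
print(type(tokenizer).__name__)  # ends in "TokenizerFast" when fast is used
```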
3. Fixing the dialogue template
By tuning the formatting parameters precisely and removing extra whitespace between roles (e.g., user and model), the probability of unwanted language switching is reduced.
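With Hugging Face-style chat models, this means letting the tokenizer render the conversation itself, so role markers and spacing match exactly what the model saw in training (the model name is again a placeholder):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("some-org/some-chat-model")  # placeholder

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "سلام"},
]

# The model's own template inserts the exact role markers and whitespace
# it was trained with, avoiding stray gaps between roles.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```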
4. A prompt prefix to restrict the language
Many developers prepend phrases like "Please answer in Farsi" to the user input to keep the model in a specific language.
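A minimal sketch with the OpenAI Python SDK (the model name is an assumption, for illustration only):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name, for illustration
    messages=[
        # Pin the output language before the user's message arrives.
        {"role": "system", "content": "Please answer in Farsi only."},
        {"role": "user", "content": "سلام، حالت چطوره؟"},
    ],
)
print(response.choices[0].message.content)
```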
Conclusion
Despite these local workarounds, the bug still appears in major services (including the official version of ChatGPT), and the main vendors have not announced a complete fix. Until a unified tokenizer and architecture eliminates these ambiguities, it is recommended to:
Use an explicit prompt to specify the output language whenever the problem arises.
Wait for official updates that fix the issue fundamentally at the model level.