Among the handful of long Chinese tokens in GPT-4o's vocabulary that are not spam or otherwise inappropriate, two are “socialism with Chinese characteristics” and “People’s Republic of China.” This suggests that a significant portion of the Chinese training data came from state media, where such formal expressions are common.
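Because a multi-character phrase only earns its own slot in a BPE vocabulary when it appears very frequently in the training corpus, scanning a tokenizer's vocabulary for long Chinese tokens is one way to infer what that corpus contained. Here is a minimal sketch of such a scan; the vocabulary sample below is hypothetical (in practice you would decode every ID in GPT-4o's actual vocabulary, for example via the tiktoken library's o200k_base encoding), and the length and ratio thresholds are illustrative choices, not anything OpenAI has published.

```python
import re

# Hypothetical stand-in for a tokenizer vocabulary dump. A real analysis
# would decode all token IDs from GPT-4o's vocabulary instead.
SAMPLE_VOCAB = [
    "中华人民共和国",      # "People's Republic of China"
    "中国特色社会主义",    # "socialism with Chinese characteristics"
    " the",
    "tion",
    "你好",                # "hello" -- too short to be interesting
]

# CJK Unified Ideographs block.
CJK = re.compile(r"[\u4e00-\u9fff]")

def long_chinese_tokens(vocab, min_chars=4):
    """Return tokens that are mostly CJK characters and longer than a threshold.

    Long phrase-level tokens like these hint at text that was extremely
    frequent in the training data.
    """
    hits = []
    for tok in vocab:
        cjk_count = len(CJK.findall(tok))
        if cjk_count >= min_chars and cjk_count / max(len(tok), 1) > 0.8:
            hits.append(tok)
    return hits

print(long_chinese_tokens(SAMPLE_VOCAB))
# → ['中华人民共和国', '中国特色社会主义']
```

The 0.8 ratio filter simply keeps out tokens that merely contain a stray Chinese character amid Latin text; short common words like "你好" are excluded by the length threshold.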
OpenAI has been secretive about the data used to train its models, so there is no way to know how much of its Chinese training data is state media versus spam. Companies in China’s AI industry face the same challenge: quality Chinese text data sets for LLM training are scarce, in part because big companies like Tencent and ByteDance rarely share the data they hold.
This shortage of quality training data is a bigger issue than the failure to filter inappropriate content out of GPT-4o’s token vocabulary. Even if OpenAI had little incentive to curate its Chinese data sets, given that its services are restricted in China, there is real demand from Chinese speakers outside the country for AI services that work properly in their language.
To address the shortage of good Chinese LLM training data, share your ideas at zeyi@technologyreview.com.