Creating Chatbots in African Languages

Natural language processing (NLP) has advanced furthest in the most widely used languages, such as English and Russian. But an emerging body of research is focused on training AI models using African languages.

Thanks to such efforts, the dream of an African language chatbot is edging closer to reality.

Chatbot Research Dominated by English Language

Natural language processing and the large language models that power chatbots like ChatGPT are still relatively new technologies. To date, research and development have focused on the most widely spoken languages.

For example, ChatGPT is available in English, Spanish, French, German, Portuguese, Italian, Dutch, Russian, Arabic, and Chinese. 

The tendency toward language dominance in AI research is largely driven by data availability.

It is estimated that over half of all written content available online is in English. Accordingly, of the datasets needed to train language models, the largest and most readily available are in English, followed by the other most popular languages.

African Languages Pose a Challenge for AI Researchers 

Currently, the world’s largest AI firms are battling it out to build the most advanced chatbots for a handful of languages. But another sphere of research is looking to develop AI tools for less popular languages.

For African languages, the limited availability of training data presents a significant challenge for AI developers.

The linguistic diversity of many African countries further complicates things. For example, South Africa has 11 official spoken languages, and 35 languages are indigenous to the country. With around 2,000 languages in use on the continent, amassing digital content libraries on a scale equivalent to English would be nearly impossible.

[Figure: Representation of African linguistic diversity (Source: ACL Anthology)]

Moreover, one recent study identified the lack of basic digital language tools as a factor that inhibits content creation. As the authors observed:

“Creating digital content in African languages is frustrating due to a lack of basic tooling such as dictionaries, spell checkers, and keyboards.”

Nevertheless, efforts are underway to increase the availability of African language data, for instance, by digitizing archival language repositories and making more datasets freely accessible. The work of content creators, curators, and translators is also critical.

Multilingual Models Could Make African Language Chatbots a Reality

Although a lack of training data has certainly held African language NLP research back, multilingual pre-trained language models (mPLMs) could help researchers overcome this challenge.

Pre-trained models can be thought of as the building blocks of high-functioning chatbots. However, they still require task-specific fine-tuning in order to deliver conversational outputs.

By acquiring generalizable linguistic information during pretraining, multilingual models are able to interpret the basic structure and outline of related languages without the massive training datasets normally required.

Unsurprisingly, one recent study has shown that language similarity improves model performance. Just as speakers of related languages can often understand each other, models trained on one language can interpret similar languages more accurately.
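To make the idea of language similarity more concrete, here is a minimal Python sketch that compares the word for "person" in isiZulu ("umuntu"), isiXhosa ("umntu"), and English by counting shared character pairs. It is only a toy illustration: the helper names `char_ngrams` and `jaccard_similarity` are hypothetical, and this is not how the study or any mPLM measures similarity. It simply shows the kind of surface overlap that related languages tend to have.

```python
"""Toy illustration: related languages tend to share more character n-grams."""


def char_ngrams(text: str, n: int = 2) -> set[str]:
    """Return the set of character n-grams in a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_similarity(a: str, b: str, n: int = 2) -> float:
    """Jaccard overlap between the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        # Both strings are shorter than n characters, so nothing to compare.
        return 0.0
    return len(grams_a & grams_b) / len(union)


if __name__ == "__main__":
    # The word for "person" in two closely related languages, plus English.
    zulu, xhosa, english = "umuntu", "umntu", "person"
    print(f"isiZulu vs isiXhosa: {jaccard_similarity(zulu, xhosa):.2f}")
    print(f"isiZulu vs English:  {jaccard_similarity(zulu, english):.2f}")
```

In this example, the isiZulu and isiXhosa words score higher than the isiZulu and English pair. Real multilingual models work with learned subword vocabularies and far richer representations, but shared structure of this kind is one plausible reason why training on one language can help with its relatives.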

Using this approach, researchers developed an mPLM they called SERENGETI, which covers 517 African languages and language varieties.

This represents a major technological leap forward and a significant improvement on the 31 previously covered African languages.

Disclaimer

In adherence to the Trust Project guidelines, BeInCrypto is committed to unbiased, transparent reporting. This news article aims to provide accurate, timely information. However, readers are advised to verify facts independently and consult with a professional before making any decisions based on this content.
