INSAIT releases state-of-the-art language models for Bulgarian, setting a new standard for open national LLMs

INSAIT с нови езикови модели за български, установявайки нов стандарт за големи езикови модели на национално ниво

This is the first time that open LLMs of practical size for a target language outperform much larger open models while remaining competitive in chat with paid models from providers such as OpenAI and Anthropic.

За пръв път отворени езикови модели с практичен размер за конкретен език надминават по ефективност много по-големи отворени модели, като същевременно са конкурентоспособни за чат с платени платформи като OpenAI и Anthropic.

November 19, 2024

19 ноември 2024 г

INSAIT is delighted to announce the release of three new state-of-the-art AI models targeting the Bulgarian language, collectively called BgGPT: a 27-billion-parameter model, a 9-billion-parameter model, and a small 2.6-billion-parameter model. The 27B and 9B models demonstrate unprecedented performance in Bulgarian, outpacing much larger models while retaining their English language capabilities. Beyond benchmarks, INSAIT’s 27B model significantly surpasses GPT-4o-mini and rivals GPT-4o in Bulgarian chat performance, according to GPT-4o itself used as a judge. We observe similar results against Anthropic’s paid Claude Haiku and Claude Sonnet models.

INSAIT обявява пускането на три нови AI модела на световно ниво - 27 милиарден (BgGPT 27B), 9 милиарден (BgGPT 9B) и малък модел с 2.6 милиарда параметъра (BgGPT 2.6B), специално за български език. Моделите BgGPT 27B и 9B демонстрират безпрецедентни резултати върху български тестове, надминавайки много по-големи модели, като същевременно запазват способностите си на английски език. Отвъд резултатите на тестовете за български, допълнително тествахме и способностите на моделите за чат. Според самия GPT-4o, използван като съдия, BgGPT 27B значително надминава GPT-4o-mini и се конкурира с GPT-4o при чат на български. Наблюдаваме подобни резултати с платените модели на Anthropic – Claude Haiku и Claude Sonnet.

INSAIT’s three new models are built on top of Google’s Gemma 2 family of open models and were pre-trained on around 100 billion tokens (85 billion in Bulgarian) using the Branch-and-Merge continual pre-training strategy that INSAIT invented and presented at EMNLP’24 [1]. Further, the models were instruction-fine-tuned on a novel Bulgarian instruction dataset created using real-world conversations collected from https://chat.bggpt.ai/.

Моделите на INSAIT са базирани на семейството от отворени модели на Google – Gemma 2 и са обучени на около 100 милиарда токена (85 млрд. от тях на български). Обучението е извършено с помощта иновативен метод, основан на сливане на модели, който INSAIT изобрети, описа и представи на EMNLP’24 [1]. Допълнително моделите са специално настроени, използвайки български набор от данни с инструкции от чат приложението https://chat.bggpt.ai/.

New Research: Avoiding Catastrophic Forgetting via Branch-and-Merge

Справяне с катастрофално забравяне чрез разклоняване и сливане

The key to the performance of the new line of BgGPT models is the novel Branch-and-Merge algorithm we presented at EMNLP’24 [1]. This method ensures that the model can learn new skills (e.g., Bulgarian) while retaining the skills already present in the base model (e.g., English, mathematics, few-shot capabilities), which in this case is Gemma 2.

Ключът към ефективността на новите модели BgGPT е алгоритъмът за разклоняване и сливане, който беше представен на EMNLP’24 [1]. Този метод позволява на модела да научава нови умения (например български език), като същевременно запазва старите (например английски език, математика или “few-shot” способности), налични в основния модел (в случая Gemma-2).

At a high level, the method splits the continual pre-training dataset into several splits (denoted G in the figure above), trains a model on each split, and then merges the resulting models, which mitigates the forgetting incurred by the individual split models. More generally, the BgGPT models are developed as a series of model merges, as the figure above demonstrates: it shows the continual pre-training stage of BgGPT, which is followed by an instruction fine-tuning stage. Our methodology is general and can be applied beyond Bulgarian, as shown in the paper [1].
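The split, train, and merge loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: models are plain dictionaries of float lists, the merge operator is uniform weight averaging (a stand-in, not necessarily the merge used for BgGPT), and `toy_train`, `merge_uniform`, and `branches_per_round` are hypothetical names introduced here rather than anything taken from the paper [1].

```python
from typing import Callable, Dict, List, Sequence

# A "model" here is just a mapping from parameter name to a flat list of weights.
Params = Dict[str, List[float]]


def merge_uniform(models: Sequence[Params]) -> Params:
    """Average every parameter across models (assumed stand-in merge operator)."""
    merged: Params = {}
    for name in models[0]:
        columns = zip(*(m[name] for m in models))
        merged[name] = [sum(col) / len(models) for col in columns]
    return merged


def branch_and_merge(
    base: Params,
    data_splits: Sequence[Sequence[float]],
    train: Callable[[Params, Sequence[float]], Params],
    branches_per_round: int = 2,
) -> Params:
    """Train one branch per data split, then merge the branches, round by round."""
    current = base
    for start in range(0, len(data_splits), branches_per_round):
        group = data_splits[start:start + branches_per_round]
        # Every branch in a round starts from the same merged model.
        branches = [train(current, split) for split in group]
        current = merge_uniform(branches)
    return current


def toy_train(params: Params, split: Sequence[float]) -> Params:
    """Hypothetical trainer: move each weight halfway toward the split mean."""
    target = sum(split) / len(split)
    return {k: [w + 0.5 * (target - w) for w in v] for k, v in params.items()}
```

Calling `branch_and_merge({"w": [0.0, 0.0]}, [[1.0], [3.0], [5.0], [7.0]], toy_train)` runs two rounds of two branches each. In a real setting, `train` would be a continual pre-training run on one data split and the parameters would be full model checkpoints.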

В основни линии методът работи като разделя тренировъчните данни на няколко части (обозначени с G в горната фигура) и тренира отделни модели на всеки един от тях, които след това се сливат, за да се получи крайният модел. С помощта на този метод, до голяма степен се избягва катастрофалното забравяне, което обикновено се случва при обучението на само един модел.

Процесът на разработка на моделите BgGPT се състои от поредица от сливания на модели, както е демонстрирано в горната фигура, където показваме първия етап на тренирането - непрекъснатото дообучение. Следва етап на обучение върху български набор от данни с инструкции от чат приложението. Отбелязваме, че методологията ни е общоприложима и може да бъде използвана както за български, така и за други езици, както и се описва в статията ни [1].

Benchmarks

Тестове на български и английски език

We evaluate on a set of standard English benchmarks, translated versions of them in Bulgarian, and Bulgarian-specific benchmarks that we collected:

Анализирахме способностите на моделите върху набор от стандартни тестове на английски, техни преводи на български, както и специфични тестове на български, които сме събрали:

Winogrande challenge [2]: тества общи познания и разбиране
Hellaswag [3]: тества способността за довършване на изречения
ARC Easy/Challenge [4]: тества логически разсъждения
TriviaQA [5]: тества фактологични знания
GSM-8K [6]: тества въпроси по математика на гимназиално ниво с множествен избор на отговора
Exams [7]: тества решаване на задачи на гимназиално ниво за природни и социални науки
MON: включва изпити по предмети от 4 до 12 клас

Winogrande challenge [2]: testing world knowledge and understanding
Hellaswag [3]: testing sentence completion
ARC Easy/Challenge [4]: testing logical reasoning
TriviaQA [5]: testing trivia knowledge
GSM-8K [6]: solving multiple-choice questions in high-school mathematics
Exams [7]: solving high school problems from natural and social sciences
MON: contains exams across various subjects for grades 4 to 12

These benchmarks test logical reasoning, mathematics, knowledge, language understanding and other skills of the models.

Този набор от данни тества логическото разсъждаване, цялостните математически познания, езиковото разбиране и други умения на моделите.

Evaluation Results of 9B and 27B models on Benchmarks: Bulgarian and English

Резултати на BgGPT 9B и BgGPT 27B на български и английски език

The graphs above show the performance of BgGPT 9B and BgGPT 27B compared to other large open models. The results show the excellent abilities of both 9B and 27B models in Bulgarian, which allow them to outperform much larger models, including Alibaba’s Qwen 2.5 72B and Meta’s Llama3.1 70B. Further, both BgGPT 9B and BgGPT 27B significantly improve upon the previous version of BgGPT based on Mistral-7B (BgGPT-7B-Instruct-v0.2, shown in grey in the figure). Finally, our models retain the excellent English performance inherited from the original Google Gemma 2 models upon which they are based.

Графиките показват резултатите на BgGPT 9B и BgGPT 27B в сравнение с други големи отворени модели. Резултатите на български сочат, че уменията и на двата модела превъзхождат тези на много по-големи такива като Qwen 2.5 72B на Alibaba и Llama3.1 70B на Meta. Също така BgGPT 9B и 27B значително надграждат над предишната версия на BgGPT – BgGPT-7B-Instruct-v0.2, базирана на Mistral-7B.

Benchmark evaluation results for the BgGPT-2.6B model

Резултати на BgGPT 2.6B на български и английски език

In addition to BgGPT 9B and BgGPT 27B, INSAIT is also releasing BgGPT 2.6B, a state-of-the-art small language model for Bulgarian based on the Gemma 2 2.6B model. Above, we show that on Bulgarian benchmarks, BgGPT 2.6B significantly improves over existing small language models such as Microsoft’s Phi 3.5 and Alibaba’s Qwen 2.5 3B. As with the other BgGPT models, it retains the English capabilities of the underlying Gemma 2 2.6B model.

Освен BgGPT 9B и BgGPT 27B, INSAIT пуска и BgGPT 2.6B, малък езиков модел на български, базиран на Gemma-2 2.6B. На графиката показваме, че на нашия български набор от тестове, BgGPT 2.6B се справя значително по-добре от модели със същия размер като Phi 3.5 на Microsoft и Qwen 2.5 3B на Alibaba. Отново, както и с другите BgGPT модели, BgGPT 2.6B запазва способностите на Gemma 2 2.6B на английски език.

Chat Preference Evaluation of BgGPT 27B model vs. OpenAI and Anthropic models

Чат предпочитания на български език спрямо комерсиалните модели на OpenAI и Anthropic

In addition to benchmark evaluation, we evaluated the chat performance of the BgGPT 27B model on thousands of real-world questions covering around 100 different topics. The results show that, in Bulgarian chat, our model significantly surpasses the smaller variants of paid models, such as Anthropic’s Claude Haiku and OpenAI’s GPT-4o-mini, and is on par with the best commercial models, such as Anthropic’s Claude Sonnet and OpenAI’s GPT-4o, according to GPT-4o itself used as a judge.
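To make the judge-based comparison concrete, here is a minimal sketch of how per-question verdicts from a judge model could be aggregated into preference rates. The verdict labels (`"win"`, `"tie"`, `"loss"`) and the function name are assumptions made for illustration; the post does not describe the exact protocol or aggregation that was used.

```python
from collections import Counter
from typing import Dict, Iterable

LABELS = ("win", "tie", "loss")


def preference_rates(verdicts: Iterable[str]) -> Dict[str, float]:
    """Turn per-question judge verdicts into win/tie/loss fractions.

    Each verdict is assumed to say how the evaluated model fared against
    a reference model on one question, as decided by the judge.
    """
    counts = Counter(verdicts)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verdicts to aggregate")
    return {label: counts[label] / total for label in LABELS}
```

For example, `preference_rates(["win", "win", "tie", "loss"])` reports half of the questions as wins and a quarter each as ties and losses.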

Освен българските тестове, проверихме качеството на BgGPT 27B и по отношение на отговори на чат въпроси на около 100 различни теми. Резултатите показват, че той значително надминава по-малките варианти на моделите на Anthropic (Claude Haiku) и OpenAI (GPT-4o-mini) при работа на български език, и е наравно с най-добрите комерсиални модели като Claude Sonnet и GPT-4o според самия GPT-4o.

Изтегляне

Предоставяме две версии на всички модели на HuggingFace – нормална и с квантувани тегла, заедно с подробно описание за тяхната употреба. Изтеглете ги тук.

Download

We make normal and quantized versions of the models available on HuggingFace, alongside a detailed description of how to use them for inference. Download them here.

References

Препратки

[1] Mitigating Catastrophic Forgetting in Language Transfer via Model Merging, Anton Alexandrov, Veselin Raychev, Mark Niklas Mueller, Ce Zhang, Martin Vechev, Kristina Toutanova. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 17167–17186, Miami, Florida, USA. Association for Computational Linguistics. https://aclanthology.org/2024.findings-emnlp.1000
[2] WinoGrande: An Adversarial Winograd Schema Challenge at Scale, Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Communications of the ACM, 64(9):99–106, 2021.
[3] HellaSwag: Can a Machine Really Finish Your Sentence?, Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. https://arxiv.org/abs/1905.07830
[4] Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge, Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. https://arxiv.org/abs/1803.05457
[5] TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension, Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. https://arxiv.org/abs/1705.03551
[6] Training Verifiers to Solve Math Word Problems, Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. https://arxiv.org/abs/2110.14168
[7] EXAMS: A Multi-Subject High School Examinations Dataset for Cross-Lingual and Multilingual Question Answering, Momchil Hardalov, Todor Mihaylov, Dimitrina Zlatkova, Yoan Dinkov, Ivan Koychev, and Preslav Nakov. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5427–5444, Online. Association for Computational Linguistics. https://aclanthology.org/2020.emnlp-main.438/

Моделът зад приложението за чат BgGPT вече е публикуван

3 март 2024 г

(Този текст е автоматично генериран от модела от английската версия на блога. [*])

В INSAIT сме развълнувани да пуснем BgGPT-7B-Instruct-v0.2, модела, който стои зад приложението за чат BgGPT: https://chat.bggpt.ai. Този модел, част от серията BgGPT, е подобрена версия на тази, която пуснахме преди няколко седмици. BgGPT-7B-Instruct-v0.2 все още е 7B модел, което го прави много бърз за генериране на текст и може да работи на повечето съвременни персонални компютри. Освен това идва с лиценз Apache 2.0, който е свободен и подходящ за търговски цели. Моделът се основава на Mistral-7B, но беше обучен върху значителни количества данни и комбиниран с други нововъведения (които ще бъдат публикувани в изследователски конференции), може да надмине много по-големи модели на задачи на български език. Обучението на BgGPT-7B-Instruct-v0.2 се финансира изцяло от частни средства и дарения. Моля, вижте блога ни за BgGPT-7B-Instruct-v0.1, който пуснахме по-рано.

Успешна история на BgGPT

През последните 2 седмици BgGPT-7B-Instruct-v0.1 вече е приет от различни компании, които са коментирали, че с малко часове работа и ниски разходи за изчислителни ресурси за фина настройка, той може да достигне производителността на GPT-4 на конкретна задача на български език.

Оценяване и бенчмаркове

Както при много други езикови модели, ние оценяваме на набор от стандартни превeдени на български тестове, както и английски тестове:

Winogrande предизвикателство [1]: тестване на разбиране на света
Hellaswag [2]: тестване на завършване на изречения
ARC Challenge [3]:тестване на логическо разсъждение
MMLU [4]: включва множество изборни въпроси от много области
MathQA [5]: тестване на математическо разсъждение
GSM8K [6]: решаване на задачи с множество избора в гимназиалната математика
TriviaQA [7]: тестване на знания за тривия
bgGLUE [8]: включва няколко задачи на български език

Тези тестове тестват логическото разсъждение, математическите умения, знанията, разбирането на езика и други умения на модела.

Резултати от оценката

Следните графики показват представянето на BgGPT-7B-Instruct-v0.2. Той надминава моделите със същия размер на българските бенчмаркове, включително подобрява предишната версия на BgGPT-7B (BgGPT-7B-Instruct-v0.1). Той също така надмина по-големия Mixtral-8x7B-Instruct-v0.1 на българските бенчмаркове. Той запази своите английски умения и в някои отношения е сравним или по-добър от моделите на Gemma-7B на Google, Mistral-7B, Llama-7B и др.

Изгледи

Въпреки че моделът е доста конкурентен на безплатните отворени модели и особено като се има предвид неговият размер, той все още не е на нивото на комерсиалните платени предложения. Въпреки това, дори на сегашното си ниво, той може да бъде полезен за много приложения.

[*] Преводът е извършен в 2 стъпки. Първо попитахме: “Преведи на български език следния текст:” и поставяме английската версия на текста без заглавието. След това в същия чат попитахме “Направи го да звучи по-точно”.

Препратки

[1] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. Communications of the ACM, 64(9):99–106, 2021.
[2] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a Machine Really Finish Your Sentence? https://arxiv.org/abs/1905.07830
[3] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. https://arxiv.org/abs/1803.05457
[4] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring Massive Multitask Language Understanding. https://arxiv.org/abs/2009.03300
[5] Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms. https://arxiv.org/abs/1905.13319
[6] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training Verifiers to Solve Math Word Problems. https://arxiv.org/abs/2110.14168
[7] Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. https://arxiv.org/abs/1705.03551
[8] Momchil Hardalov, Pepa Atanasova, Todor Mihaylov, Galia Angelova, Kiril Simov, Petya Osenova, Veselin Stoyanov, Ivan Koychev, Preslav Nakov, and Dragomir Radev. bgGLUE: A Bulgarian General Language Understanding Evaluation Benchmark. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8733–8759. https://bgglue.github.io/

The model behind the BgGPT chat is now published

March 3, 2024

At INSAIT we are delighted to release BgGPT-7B-Instruct-v0.2, the model behind the BgGPT chat app: https://chat.bggpt.ai. This model, part of the BgGPT series, is an improved version of the one we released a couple of weeks ago. BgGPT-7B-Instruct-v0.2 is still a 7B model, which makes it very fast at text generation and able to run on most recent personal computers. It also comes with a permissive and commercial-friendly Apache 2.0 licence. The model is based on Mistral-7B but was further trained on significant amounts of data; combined with other advances (to be published at research conferences), this allows it to outperform much larger models on Bulgarian tasks. The training of BgGPT-7B-Instruct-v0.2 was funded entirely by private funds and donations. Please see the blog post for BgGPT-7B-Instruct-v0.1, which we released earlier.

BgGPT Success Story

In only 2 weeks, BgGPT-7B-Instruct-v0.1 has already been adopted by various companies, which remarked that with only a few hours of work and modest computational and financial resources for fine-tuning, it can reach the performance of GPT-4 on a particular task in Bulgarian.

Evaluation & Benchmarks

As with many other language models, we evaluate on a set of standard benchmarks translated to Bulgarian as well as on English benchmarks:

Winogrande challenge [1]: testing world understanding
Hellaswag [2]: testing sentence completion
ARC Challenge [3]: testing logical reasoning
MMLU [4]: including multiple choice questions from many disciplines
MathQA [5]: testing math reasoning
GSM8K [6]: solving multiple-choice questions in high-school mathematics
TriviaQA [7]: testing trivia knowledge
bgGLUE [8]: includes several Bulgarian language tasks

These benchmarks test the logical reasoning, math, knowledge, language understanding and other skills of the model.

Evaluation Results

The following graphs show the performance of BgGPT-7B-Instruct-v0.2. It outperforms same-sized models on Bulgarian benchmarks and improves upon the previous version of BgGPT-7B (BgGPT-7B-Instruct-v0.1). It also outperforms the much larger Mixtral-8x7B-Instruct-v0.1 on Bulgarian benchmarks. Finally, it retains its English skills and, on some benchmarks, is comparable to or better than models such as Google’s Gemma-7B, Mistral-7B, Llama-7B, and others.

Outlook

Note that while the model is quite competitive with free open-source models, especially for its size, it is still not on the level of paid commercial offerings. Yet, even at its current level, it can be useful for many applications.

References

[1] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. Communications of the ACM, 64(9):99–106, 2021.
[2] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a Machine Really Finish Your Sentence? https://arxiv.org/abs/1905.07830
[3] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. https://arxiv.org/abs/1803.05457
[4] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring Massive Multitask Language Understanding. https://arxiv.org/abs/2009.03300
[5] Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms. https://arxiv.org/abs/1905.13319
[6] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training Verifiers to Solve Math Word Problems. https://arxiv.org/abs/2110.14168
[7] Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. https://arxiv.org/abs/1705.03551
[8] Momchil Hardalov, Pepa Atanasova, Todor Mihaylov, Galia Angelova, Kiril Simov, Petya Osenova, Veselin Stoyanov, Ivan Koychev, Preslav Nakov, and Dragomir Radev. bgGLUE: A Bulgarian General Language Understanding Evaluation Benchmark. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8733–8759. https://bgglue.github.io/

Launching the first free and open Bulgarian LLM

February 18, 2024

At INSAIT we are thrilled to launch BgGPT-7B-Instruct-v0.1, the first free and open Bulgarian Large Language Model in the BgGPT series (more models coming soon). BgGPT-7B-Instruct-v0.1 is now available for download on HuggingFace under the permissive and commercial-friendly Apache 2.0 licence. The model, which builds on Mistral-7B, already outperforms similarly sized models such as LLaMA2-7b and Mistral-7B on all Bulgarian language tasks. On many of these tasks, it also outperforms much larger models such as Mixtral-8x7B-Instruct-v0.1 (about 6.5 times larger), which has been shown to have capabilities similar to GPT-3.5.

Evaluation & Benchmarks

To systematically evaluate the Bulgarian performance of LLMs, including our model and any existing or future models, we translated a set of benchmarks to Bulgarian, including:

Winogrande challenge [1]: testing world understanding
Hellaswag [2]: testing sentence completion
ARC Challenge [3]: testing logical reasoning
MMLU [4]: including multiple choice questions from many disciplines
MathQA [5]: testing math reasoning
GSM8K [6]: solving multiple-choice questions in high-school mathematics
TriviaQA [7]: testing trivia knowledge
bgGLUE [8]: includes several Bulgarian language tasks

These benchmarks (except the last one, which already existed) were built using both machine translation and our amazing team of translators. For evaluation, we forked EleutherAI's evaluation harness. All benchmark data is publicly available in our HF repository to help others evaluate their own models.
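As a rough illustration of how multiple-choice benchmarks of this kind are commonly scored, the sketch below picks the answer choice to which the model assigns the highest log-likelihood and then computes accuracy. It is a simplified sketch, not the harness code: the `loglikelihood` callable, the `Example` tuple layout, and the toy scorer are all hypothetical stand-ins for a real model.

```python
from typing import Callable, List, Sequence, Tuple

# (question, answer choices, index of the correct choice)
Example = Tuple[str, List[str], int]
Scorer = Callable[[str, str], float]


def pick_choice(question: str, choices: Sequence[str], loglikelihood: Scorer) -> int:
    """Return the index of the choice with the highest score given the question."""
    scores = [loglikelihood(question, choice) for choice in choices]
    return max(range(len(choices)), key=scores.__getitem__)


def accuracy(examples: Sequence[Example], loglikelihood: Scorer) -> float:
    """Fraction of examples where the top-scored choice is the gold answer."""
    correct = sum(
        pick_choice(question, choices, loglikelihood) == gold
        for question, choices, gold in examples
    )
    return correct / len(examples)


def toy_scorer(question: str, choice: str) -> float:
    """Hypothetical scorer that simply prefers shorter continuations."""
    return -float(len(choice))
```

With a real model, `loglikelihood` would return the sum of token log-probabilities of the choice conditioned on the question, which is the quantity a language model naturally provides.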

Note on evaluation: great care should be taken not to contaminate training or fine-tuning datasets with the above benchmarks, as this can lead to misreported results. Such contamination is generally a form of overfitting, and the threat was recently explored in detail in [9].
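One simple way to screen for the kind of contamination mentioned above is to look for word n-grams shared between training documents and benchmark items. The sketch below is a basic heuristic written for illustration, with hypothetical function names and a whitespace tokenizer as simplifying assumptions. It is not the analysis from [9], whose point is precisely that such detection can be evaded, so a clean result here does not prove the absence of contamination.

```python
from typing import Iterable, List, Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Lowercased whitespace-token n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_contaminated(
    train_docs: Iterable[str], benchmark_items: Iterable[str], n: int = 8
) -> List[str]:
    """Return benchmark items sharing at least one n-gram with the training data."""
    train_grams: Set[Tuple[str, ...]] = set()
    for doc in train_docs:
        train_grams |= word_ngrams(doc, n)
    return [item for item in benchmark_items if word_ngrams(item, n) & train_grams]
```

Flagged items can then be removed from the training data or inspected manually. The value of `n` trades off false positives (small `n`) against missed paraphrases (large `n`).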

Evaluation Results

The following graphs show the performance of BgGPT-7B-Instruct-v0.1. It clearly outperforms same-sized models on Bulgarian benchmarks as well as on most other benchmarks. It also outperforms the much larger Mixtral-8x7B-Instruct-v0.1 on Bulgarian benchmarks. That said, the model does not excel at deep reasoning and knowledge skills. This is somewhat expected, as smaller models can learn less, which is reflected in the knowledge-testing benchmarks. We expect this to improve in the BgGPT models that will follow. Interestingly, even though the model is biased toward Bulgarian, it does retain some English skills, making it a versatile tool for cross-lingual tasks, including translation from English to Bulgarian. Here we include a gist of the benchmark results.

Outlook

While larger models will in general offer superior performance, we see that specialised, smaller 7B models can actually produce results similar to those of much larger non-specialised models, at a much lower inference cost. Further, for many business applications, smaller models may suffice. Over the next weeks, we will release improved models, so stay tuned!

Institutional use of BgGPT

If you are an institution or a business organisation interested in using BgGPT internally and have questions on how to do so, please contact us at: bggpt@insait.ai

References

[1] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. Communications of the ACM, 64(9):99–106, 2021.
[2] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a Machine Really Finish Your Sentence? https://arxiv.org/abs/1905.07830
[3] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. https://arxiv.org/abs/1803.05457
[4] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring Massive Multitask Language Understanding. https://arxiv.org/abs/2009.03300
[5] Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms. https://arxiv.org/abs/1905.13319
[6] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training Verifiers to Solve Math Word Problems. https://arxiv.org/abs/2110.14168
[7] Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. https://arxiv.org/abs/1705.03551
[8] Momchil Hardalov, Pepa Atanasova, Todor Mihaylov, Galia Angelova, Kiril Simov, Petya Osenova, Veselin Stoyanov, Ivan Koychev, Preslav Nakov, and Dragomir Radev. bgGLUE: A Bulgarian General Language Understanding Evaluation Benchmark. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8733–8759. https://bgglue.github.io/
[9] Dekoninck et al. Evading Data Contamination Detection for Language Models is (too) Easy. https://arxiv.org/abs/2402.02823