This is entirely unsurprising and in line with the finding that even small specialized models do better at information extraction and text classification, so it's no wonder fine-tuned large LMs do well too.
Personally, my PhD was on fine-grained ACE-like event and sentiment extraction, and "small" specialized fine-tuned transformers like BERT and RoBERTa-large outperformed prompting LLMs (a rough sketch of the kind of fine-tune I mean is after this comment). Would love to see small-model scores from some SOTA pipelines included.
This is great work anyway even if it replicates known results!
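To make that concrete: a minimal sketch of a "small" specialized fine-tune using the Hugging Face stack, with roberta-base and sst2 as stand-ins for a real task and dataset (the hyperparameters and data here are illustrative, not my thesis setup):

    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    # Stand-in binary sentiment data; swap in your own labeled extraction/classification set.
    dataset = load_dataset("sst2")
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")

    def tokenize(batch):
        return tokenizer(batch["sentence"], truncation=True,
                         padding="max_length", max_length=128)

    dataset = dataset.map(tokenize, batched=True)

    model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

    args = TrainingArguments(
        output_dir="roberta-sentiment",
        per_device_train_batch_size=32,
        learning_rate=2e-5,
        num_train_epochs=3,
    )

    Trainer(model=model, args=args,
            train_dataset=dataset["train"],
            eval_dataset=dataset["validation"]).train()

In practice you would run a hyperparameter search over learning rate, batch size and epochs rather than a single run like this.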
Your thesis sounds interesting! Do you have a link to it by any chance?
rovr beat me to it below. Here are more links: https://jacobsgill.es/phdobtained (fun fact: because my thesis contains published papers, I am in breach of a few journals' copyright by uploading my own thesis PDF, but fuck 'em).
The LLM approaches were evaluated on my own time but not published (I left research after obtaining my PhD).
I don't know your circumstances, but often you retain the right to distribute a "post-print", i.e. the final text as published but without the journal formatting. A dissertation should fit that definition.
This is indeed often the case; however, my university reviews each thesis and deemed that mine can only become open access in 2026 (+5 years from the defense).
I think this is the default policy here for theses based on publications.
In any case, I am not too worried.
Thank you for the link! And congratulations on obtaining your PhD.
I have skimmed through it, and it's truly amazing how good annotation of the dataset can lead to such impressive results.
I apologise in advance if the question seems ignorant: the blog post talked about fine-tuning models online. Given that BERT models can run comfortably even on iPhone hardware, were you able to fine-tune your models locally, or did you have to do it online too? If so, are there any products you would recommend?
Thanks! The fine-tunes were done in 2019-21 on a 4xV100 server with hyperparameter search, so thousands of individual fine-tuned models were trained in the end. I used Weights & Biases for experiment dashboarding of the hyperparameter search, but the hardware was our own GPU server (no cloud service used).
I doubt you can fine-tune BERT-large on a phone. A quantized, inference-optimised pipeline can be leaps and bounds more efficient, and it is not comparable with the Hugging Face training pipelines on full models that I ran at the time. For non-adapter-based training you're ideally going to need GPUs.
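For reference, adapter-style training is the lighter-weight route; a minimal sketch with the PEFT library on bert-base-uncased (illustrative only, not what I ran at the time) looks like:

    from peft import LoraConfig, TaskType, get_peft_model
    from transformers import AutoModelForSequenceClassification

    base = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    # LoRA adapters on the attention projections; only these small matrices get trained.
    lora_cfg = LoraConfig(
        task_type=TaskType.SEQ_CLS,
        r=8,
        lora_alpha=16,
        lora_dropout=0.1,
        target_modules=["query", "value"],
    )

    model = get_peft_model(base, lora_cfg)
    model.print_trainable_parameters()  # typically well under 1% of the full model

Training only the adapter weights cuts the optimizer memory substantially compared to updating all of the base model's parameters.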
Excluding the part in the middle because I don't wanna repost potential issues for you. I just wanted to comment that this is terrible. People often talk about the siloed nature of research in industry without considering that academia supports the draconian publishing system. I understand IP protection, but IP protection doesn't have to mean no access. This is such a huge issue in the bio world (biostats, genetics, etc.).
This is really cool -- thanks for posting it! I'll have to skim through it at some point, since a lot of my work is in classification models and mirrors the results you've seen.
Seconded! Any URI to your PhD?
Check https://www.researchgate.net/publication/356873749_Extractin...
The caveat here is that if you don't know how to create good specialized models, you are just wasting everyone's time and money:
https://www.threads.net/@ethan_mollick/post/C46AfItO8RS?hl=e...
Exactly: BloombergGPT performed worse on financial sentiment analysis than much smaller fine-tuned BERT-based models.
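For anyone curious what the small-model side looks like in practice, a quick sketch using a public FinBERT checkpoint (ProsusAI/finbert, a BERT fine-tune for financial sentiment; an illustrative stand-in, not the exact baselines from the BloombergGPT comparison):

    from transformers import pipeline

    # Off-the-shelf financial sentiment classifier, roughly BERT-base sized (~110M parameters).
    clf = pipeline("text-classification", model="ProsusAI/finbert")
    print(clf("Operating profit rose 11% year over year, beating guidance."))
    # e.g. [{'label': 'positive', 'score': ...}]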
For many extractive tasks BloombergGPT was quite disappointing. A 5-10% performance hit with much larger inference cost compared to smaller models is not desirable.
But for Bloomberg it makes sense to take the risk on that research investment: a do-it-all generative model can mean a significant reduction in maintenance complexity and deployment overhead.
It didn't directly pay off for many extractive tasks, but I bet they're iterating. Bloomberg has the data moat and the business needs in their core products to make it worthwhile.