Currently submitted to: Journal of Medical Internet Research
Date Submitted: Feb 2, 2026
Open Peer Review Period: Feb 3, 2026 - Mar 31, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Extracting Quality of Life Information from Forum Posts Using Open-Source Large Language Models: Feasibility Study
ABSTRACT
Background:
Quality of Life (QoL) questionnaires are established instruments designed to assess the overall wellbeing and quality of life of patients. They are important for predicting disease outcomes and understanding the needs of individual patients. However, their repeated collection imposes a substantial burden on both patients and clinical professionals. Many patients seek emotional support and mutual exchange in online peer-support communities, where they frequently share detailed descriptions of symptoms and treatment experiences that address topics covered in QoL questionnaires. The emergence of large language models (LLMs) opens up the potential for automatic extraction of relevant QoL information from patient-generated text.
Objective:
The aim of this study is to evaluate and compare various open-source LLMs and optimization approaches for automated extraction of QoL information from forum posts.
Methods:
The dataset consisted of 2,683 English-language posts from breast cancer patients recruited from the Inspire.com online communities, manually annotated with sentence-level text spans indicating whether and where posts contained information relevant to 53 QoL questions from the EORTC QLQ-C30 and QLQ-BR23 questionnaires. Eleven open-source LLMs (8B-70B parameters) were evaluated in a zero-shot setup, generating 4,452 post-question predictions per model under two input conditions: post only and post with additional context. For the best-performing model, additional experiments assessed the impact of chain-of-thought prompting, instruction optimization, few-shot prompting, and parameter-efficient fine-tuning. For correctly classified yes/no instances, the overlap between model-generated evidence and human-annotated spans was evaluated.
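As an illustration, the zero-shot step can be sketched as follows: for each post-question pair, the model receives the post and one questionnaire item and must state whether relevant information is present, quoting evidence when it is. The prompt wording and the parsing helper below are hypothetical illustrations, not the exact instructions used in the study; any open-source chat model could serve as the backend.

# Minimal sketch of the zero-shot post-question classification step.
# Prompt template and parsing logic are illustrative assumptions, not
# the study's actual instructions.
import re

def build_prompt(post: str, question: str) -> str:
    """Compose a zero-shot prompt for one post-question pair (post-only condition)."""
    return (
        "You are given a forum post written by a patient and one quality-of-life "
        "questionnaire item.\n\n"
        f"Post:\n{post}\n\n"
        f"Questionnaire item:\n{question}\n\n"
        "Does the post contain information relevant to this item? "
        "Answer 'yes' or 'no'. If yes, quote the sentence(s) from the post "
        "that serve as evidence."
    )

def parse_response(text: str) -> tuple[str, str | None]:
    """Extract the yes/no label and an optional quoted evidence span."""
    label = "yes" if re.match(r"\s*yes", text, re.IGNORECASE) else "no"
    evidence = None
    if label == "yes":
        quoted = re.search(r'"([^"]+)"', text)
        evidence = quoted.group(1) if quoted else None
    return label, evidence

# Example: one of the 4,452 post-question pairs scored per model.
prompt = build_prompt(
    post="Since the last round of chemo I have been too tired to leave the house.",
    question="Were you tired?",  # an EORTC QLQ-C30 item
)
# response = llm.generate(prompt)  # placeholder for any open-source chat model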
Results:
Across the 11 evaluated LLMs, GPT-OSS 20B achieved the highest macro F1-score (0.79) in the zero-shot, post-only setting. Providing additional context consistently reduced the performance of all models. Model size did not correlate with F1-score; several mid-sized models (14B-30B) outperformed 70B models. For GPT-OSS 20B, chain-of-thought prompting did not improve performance (0.77). Instruction optimization produced results similar to the baseline in both zero-shot and few-shot settings (0.78-0.80). Bootstrap few-shot prompting with random search achieved the highest score overall (0.81). Parameter-efficient fine-tuning decreased performance (0.71). Most classification errors involved semantically broad or ambiguous terms and the fallback question. For correctly predicted yes/no answers, model-generated evidence matched or partially matched human-annotated spans in 89% of cases.
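For context, the span comparison reported above can be approximated with a simple token-overlap rule. The thresholds below are assumptions for illustration only, since the abstract does not specify the study's matching criterion.

# Illustrative scoring of model evidence against human-annotated spans.
# The Jaccard rule and its thresholds are assumptions; the study's exact
# matching criterion is not specified in the abstract.

def overlap_category(predicted: str, annotated: str) -> str:
    """Classify a predicted evidence span as 'match', 'partial', or 'none'."""
    pred = set(predicted.lower().split())
    gold = set(annotated.lower().split())
    if not pred or not gold:
        return "none"
    jaccard = len(pred & gold) / len(pred | gold)
    if jaccard >= 0.9:   # near-identical spans
        return "match"
    if jaccard > 0.0:    # any shared tokens
        return "partial"
    return "none"

print(overlap_category(
    "too tired to leave the house",
    "I have been too tired to leave the house",
))  # -> 'partial'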
Conclusions:
Open-source LLMs are a promising tool for extracting QoL information aligned with standardized questionnaire responses from online health forums. Mid-sized models achieved the highest accuracy, particularly in the zero-shot, post-only setting, and few-shot prompting can further improve the results. Models were also able to generate evidence spans that closely matched human annotations. However, they consistently struggled with ambiguous and semantically overlapping terms. Overall, automated extraction of QoL information from patient-generated content may offer a faster, lower-cost, and lower-burden complement to traditional QoL questionnaires, provided that limitations such as symptom ambiguity are addressed in future work.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.