975 questions
-1
votes
1
answer
94
views
Unsupervised Topic Modeling for Short Event Descriptions
I have a dataset of approximately 750 lines containing quite short texts (less than 150 words each). These are all event descriptions related to a single broad topic (which I cannot specify for ...
0
votes
1
answer
105
views
MiniBatchKMeans BERTopic not returning topics for half of data
I am trying to topic model a dataset of tweets. I have around 50 million tweets. Unfortunately, such a large dataset will not fit in RAM (even 128GB) due to the embeddings. Therefore, I have been working on ...
0
votes
0
answers
43
views
Calculating Topic Correlations or Co-occurrences for keyATM
I have been playing around with the keyATM package extensively; unfortunately, there is no apparent way to calculate topic correlations and co-occurrences once the model has been fitted. I already ...
0
votes
1
answer
111
views
Correct topics from LDA Sequence Model in Gensim
Python's Gensim package offers a dynamic topic model called LdaSeqModel(). I have run into the same problem as in this issue from the Gensim mailing list (which has not been solved). The problem is ...
1
vote
1
answer
161
views
Inspect all probabilities of BERTopic model
Say I build a BERTopic model using
from bertopic import BERTopic
topic_model = BERTopic(n_gram_range=(1, 1), nr_topics=20)
topics, probs = topic_model.fit_transform(docs)
Inspecting probs gives me ...
0
votes
0
answers
41
views
importing util library failed
I am trying to install BERTopic with pip install bertopic so that I can use a BERTopic model. Here is my code:
from bertopic import BERTopic
topic_model = BERTopic.load("MaartenGr/BERTopic_Wikipedia...
0
votes
0
answers
97
views
Unhashable type when calling HuggingFace topic model `topic_labels_` function
If I try to follow the topic modeling tutorial at: https://huggingface.co/docs/hub/en/bertopic
The first few lines give me an error:
from bertopic import BERTopic
topic_model = BERTopic.load("...
0
votes
0
answers
58
views
Topic modelling outputs are gender biased?
Has anyone had this issue?
My topic modelling seems to be presenting responses that are very dominated by male respondents.
The volume of responses across three different questions is over 800 in each ...
0
votes
1
answer
67
views
Stopwords problem in text data preprocessing in Python
I want to do topic modeling in Python. For this reason, I used my own stop word list, a stop word list I found on GitHub, and nltk's stop word list to clean the stopwords. However, when I examined the ...
0
votes
0
answers
45
views
Cannot find AIC/BIC of my topic modelling after using "lda.collapsed.gibbs.sampler" in LDA package
I have used "lda.collapsed.gibbs.sampler" to do my topic modelling and LDA visualisation, and now I want to determine which number of models (K) best fits my model. Then I tried to use AIC/...
4
votes
1
answer
510
views
Topic modelling many documents with low memory overhead
I've been working on a topic modelling project using BERTopic 0.16.3, and the preliminary results were promising. However, as the project progressed and the requirements became apparent, I ran into a ...
0
votes
1
answer
47
views
How to extract terms and probabilities from tmResult$terms in topic modeling?
I would like to create separate word clouds for each of my 8 topics in an LDA model. I extracted the top 40 words for each of the 8 topics: an object of length 320 containing top words and occurrence probabilities.
I ...
0
votes
1
answer
107
views
How is coherence score calculated in Mallet?
I do understand how the diagnostics output shows the coherence values for each topic but my values range between -150 and -600 and other posts that I have seen where Mallet was used show coherence ...
0
votes
1
answer
69
views
Inconsistent Results When Running Python Mallet/Gibbs Sampling as a Soft-Clustering Method to Identify Optimal Number of Topics
Sorry, but I am inexperienced with Mallet and could use some help. I am currently trying to use Mallet as a soft-clustering technique to assign group membership for a given set of terms contained ...
0
votes
1
answer
84
views
R + quanteda + automatic detection of topics: error when running model
I have a set of many (around 20 thousand) short job descriptions in English. My purpose for now is to be able to detect their optimal number of topics.
I use an R script which worked decently on a ...
0
votes
0
answers
56
views
Errors attaching metadata to corpus
I am trying to generate a corpus with two documents: one is responses of participants characterized as "supporters" and one is responses of "non-supporters". I've entered this as ...
0
votes
0
answers
154
views
LDA Error in x$terms %||% attr(x, "terms")
Hi everyone. I can't understand why this is giving me an error. Later on, the code was working with no errors. Packages are: quanteda, quanteda.textmodels, quanteda.textstats, quanteda.textplots, newsmap, ...
2
votes
3
answers
94
views
Find matching rows in dataframes based on number of matching items
I have two topic models, topics1 and topics2. They were created from very similar but different datasets. As a result, the words representing each topic/cluster as well as the topic numbers will be ...
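One way to picture the matching step described in this question is a plain word-overlap comparison. The sketch below is only an illustration under stated assumptions: plain dictionaries stand in for the two dataframes, and the function name `match_topics`, the `min_shared` threshold, and the sample words are all hypothetical, not taken from the question.

```python
def match_topics(topics1, topics2, min_shared=2):
    """Pair each topic in topics1 with the topics2 topic sharing the most words.

    topics1 and topics2 map a topic number to its list of top words.
    A pair is kept only if at least `min_shared` words overlap.
    """
    matches = {}
    for id1, words1 in topics1.items():
        best_id, best_overlap = None, 0
        for id2, words2 in topics2.items():
            # Count the words the two topics have in common.
            overlap = len(set(words1) & set(words2))
            if overlap > best_overlap:
                best_id, best_overlap = id2, overlap
        if best_overlap >= min_shared:
            matches[id1] = (best_id, best_overlap)
    return matches


# Hypothetical sample data: topic numbers differ between the two models.
topics1 = {0: ["cat", "dog", "pet", "vet"], 1: ["stock", "market", "trade", "bank"]}
topics2 = {0: ["bank", "market", "loan", "stock"], 1: ["dog", "pet", "food", "walk"]}

result = match_topics(topics1, topics2)
print(result)
```

Each entry of `result` maps a topic number from the first model to a tuple of the best-matching topic number in the second model and the count of shared words. Ties go to whichever topic is seen first, which may or may not suit a real comparison.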
1
vote
0
answers
50
views
RStudio stm package Error in makeTopMatrix(prevalence, data)
I am receiving the following error message:
Error in makeTopMatrix(prevalence, data) : Error creating model matrix.
This could be caused by many things including
explicit calls to a namespace within ...
2
votes
0
answers
138
views
R stm package plot custom labels font size
On page 19 of the stm tutorial, Figure 6: Graphical display of topical prevalence contrast
https://cran.r-project.org/web/packages/stm/vignettes/stmVignette.pdf
How to change the font size of the ...
2
votes
0
answers
54
views
Top2Vec model gets stuck on Colab
I'm trying to implement Top2Vec on Colab. The following code is working fine with the dataset "https://raw.githubusercontent.com/wjbmattingly/bap_sent_embedding/main/data/vol7.json" ...
0
votes
1
answer
158
views
How to assign topics to individual documents/ tweets in Bi-term Topic Modeling?
I am a newbie at this, so I apologize if I am asking the obvious here. I ran a bi-term topic modeling algorithm to model short text data and discover topics among them. I am using LDAvis package to ...
0
votes
1
answer
217
views
topic modeling from quotes
Based on the following link: quotes
with the help of the following code (this site is based on JavaScript, so I first disabled it)
import selenium
from selenium import webdriver
from selenium....
0
votes
1
answer
124
views
stm Structural Topic Model - estimateEffect returns only 10 years
I ran an stm topic model and used estimateEffect:
prep <- estimateEffect(1:20 ~ Party + s(Year), model,
meta = out$meta, uncertainty = "Global")
What is shown in ...
0
votes
1
answer
134
views
ImportError: cannot import name 'remove_stopwords' from partially initialized module 'gensim.parsing.preprocessing'
I have Python 3.12.2 and gensim 4.3.2, but when I tried to use import gensim in my Python code I got the error below:
ImportError Traceback (most recent call last)
Cell In[...
2
votes
1
answer
1k
views
BERTopic: "Make sure that the iterable only contains strings"
I'm still fairly new to Python so this might be easier than it appears to me, but I'm stuck. I'm trying to use BERTopic and visualize the results with PyLDAVis. I want to compare the results with the ...
0
votes
1
answer
1k
views
Trying to transcribe audio files in R
I'm new to R and trying to use a script to transcribe audio files.
I found this terrific person, who proposes a solution for audio transcription.
https://www.bnosac.be/index.php/blog/105-...
1
vote
1
answer
1k
views
Summarization and Topic Extraction with LLMs (private) and LangChain or LlamaIndex using flan-t5-small
Has anyone used LangChain or LlamaIndex imports to deal with single documents that amount to >512 tokens? Yes, I know there are other approaches to dealing with it, but it is difficult to find ...
2
votes
1
answer
128
views
BERTopic: add legend to term score decline
I plot the term score decline for a topic model I created on Google Colab with BERTopic. Great function. Works neat! But I need to add a legend. This parameter is not specified in the topic_model....
0
votes
1
answer
49
views
Tracing terms in topic models to their full-text version in R
How does one retrieve full-text examples of the terms making up a topic model? The goal is to get to know more context of what the ngram is about, to help assign labels better. To achieve this, the ...
-1
votes
1
answer
1k
views
BERTopic classifying over a quarter of documents in outlier topic -1
I am running BERTopic with default options
import pandas as pd
from sentence_transformers import SentenceTransformer
import time
import pickle
from bertopic import BERTopic
llm_mod = "all-...
1
vote
0
answers
50
views
keyATM covariate model gives me the same predicted mean of the document-topic distribution for four different country categories
I have four different groups of countries, and I would assume that the predicted mean of the document-topic distribution differs across them. Yet I get the same results.
I run this code (pretty much in ...
0
votes
1
answer
484
views
R: stm + searchK fails to determine the optimal number of topics
Please have a look at the self-contained example at the end of the post.
I simplified the reprex and you can download the dfm (document-feature matrix) from
https://e.pcloud.link/publink/show?code=...
0
votes
1
answer
401
views
R: Quanteda+LDA, how to Visualise the Results?
Please have a look at the snippet at the end of this post.
I run a simplified tutorial example of topic modeling with quanteda, but once the model has finished running, I find it difficult to extract ...
1
vote
0
answers
223
views
BERTopic Visualization in dark
I want to change the default visualizations within BERTopic to display a dark theme rather than a white or bright theme.
Basically I'm trying to do:
import plotly.io as pio
pio.templates.default ...
0
votes
1
answer
146
views
Referring to "short texts" in topic modelling and natural language processing, what is the definition of the length of a short text?
When it comes to "short texts" in topic modelling and natural language processing, what exactly is the definition of a short text? I have not been able to find a definitive answer. Could ...
0
votes
1
answer
426
views
Long text topic modelling differences
I have some very long documents. They have overall topics that are fairly standard, but each document will emphasise the topics differently AND within those topics they will have different subtopics
I ...
0
votes
1
answer
111
views
Problem with visualizing topics with pyLDAvis
I have a problem running this code and can't get a visualisation of the topics and words from the LDA model. Does anyone know how to solve this problem? I get the following warning
"...
-1
votes
2
answers
45
views
How to assign column names?
I am writing code for topic modeling and received this error.
install.packages("tm")
install.packages("topicmodels")
library(tm)
library(topicmodels)
docs <- Corpus(...
2
votes
0
answers
127
views
How to implement TorchDrift's Drift Detection for Monitoring Separate Text Embedding Distributions Across Multiple Topics?
I'm working on a project involving text data with multiple topics, and I want to use the Kernel Maximum Mean Discrepancy (Kernel MMD) for drift detection on text embeddings for each topic separately.
...
1
vote
0
answers
294
views
Plotting a structural topic model - how to allow for discontinuity over time
I am running a structural topic model using the stm package in R. My model includes an interaction effect between faction_id and numeric_date (a measure of time). I am using the following code to ...
1
vote
0
answers
115
views
Python BERTopic 'numpy.float64' object cannot be interpreted as an integer
I am trying to replicate the Topic Modeling exercise from this article titled NLP Tutorial: Topic Modeling in Python with BerTopic. The article comes from the website HackerNoon if you'd prefer to ...
0
votes
1
answer
667
views
Clustering topics and naming the cluster in Python
I have millions of topics in my data. These topics are one to 12 words. For instance 'Cancer Biology and Genetics' could be one topic and 'Regenerative medicines' could be another. I want to create ...
4
votes
3
answers
831
views
Jupyter keeps crashing when using BERTopic's fit_transform()
topics, probs = topic_model.fit_transform(docs)
Whenever I run fit_transform like in the line above, my Jupyter notebook keeps dying, and I don't know why. I am using Python 3.9.15 on a macOS 13.4.1 ...
2
votes
1
answer
85
views
How to import excel file in mallet
I have an Excel file that contains the titles of Stack Overflow posts. My Excel sheet has more than 10,000 lines, so it is not feasible to make a separate txt file for each row.
If I copy my excel data ...
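As a rough sketch of one possible route, assuming the sheet is first exported to CSV: MALLET's import-file command reads a single text file with one document per line, conventionally in the form name, label, text. The helper name `rows_to_mallet_lines`, the `doc<N>` identifiers, and the placeholder label `none` below are hypothetical choices for illustration, not something the question specifies.

```python
import csv
import io


def rows_to_mallet_lines(csv_text, text_column=0):
    """Turn CSV rows into one-document-per-line strings: name, label, text."""
    lines = []
    reader = csv.reader(io.StringIO(csv_text))
    for i, row in enumerate(reader):
        if not row:
            continue  # skip blank rows
        # Collapse internal whitespace so each document stays on one line.
        text = " ".join(row[text_column].split())
        # Tab-separated: hypothetical document name, placeholder label, text.
        lines.append(f"doc{i}\tnone\t{text}")
    return lines


# Hypothetical sample standing in for an exported sheet of post titles.
sample = 'How to import excel\n"Topic models, in R"\n'
lines = rows_to_mallet_lines(sample)
print("\n".join(lines))
```

Writing `lines` to a single text file would give one input file for the whole sheet rather than 10,000 separate ones.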
1
vote
1
answer
782
views
AttributeError: 'TfidfVectorizer' object has no attribute 'get_feature_names' [duplicate]
I'm trying to visualize the LDA topics using the pyLDAvis library.
I'll be using sklearn.decomposition's LatentDirichletAllocation
My sklearn's version: 1.2.2
The error:
AttributeError: '...
-1
votes
1
answer
500
views
Integrate GridSearchCV with LDA Gensim
Data Source: Glassdoor reviews split into two dataframe columns "Pros" & "Cons"
- Pros refer to what the employees liked about the company
- Cons refer to what the ...
1
vote
1
answer
936
views
Getting an error from hdbscan while importing bertopic
I'm trying to import bertopic but it gives the following error. I tried different versions and recreated the environment, but the error is still the same. I'm using an Apple M2 Pro processor
lib
version
BERTopic
0....
3
votes
1
answer
143
views
Error in posterior function when running LDA
I am trying to conduct topic modelling on a dataset. I follow standard procedure, clean the data, tokenize, create a dtm and apply the LDA function (topics <- tidy(my_topic_model, matrix = "...
1
vote
0
answers
497
views
What if I have too many documents labelled in the -1 cluster in BERTopic?
I'm generating topics using BERTopic on a multilingual dataset (mainly Russian and English). I'm reducing the number of topics to 140. After generating topics, I'm analyzing its quality using the ...