2,104 questions
Best practices · 1 vote · 3 replies · 88 views
Deleting large data without stopping the active MySQL server
I'm using a table of approximately 1TB in a MySQL database. This table also has a monthly partition. We store the last two months of data in this table and regularly truncate the data from the ...
Best practices · 0 votes · 3 replies · 71 views
I want to make an "HTTPS proxy cache server" with nginx that handles large Git sources
I'm new to nginx and proxy servers.
We have a problem with googlesource returning 429 errors caused by too many requests, and because of limited bandwidth it takes a long time to fetch from googlesource.
We reviewed making AOSP ...
0 votes · 1 answer · 85 views
How to reference a second Pandas dataframe to the first one without creating any copy of the first one?
I have a large pandas dataframe df of something like a million rows and 100 columns, and I have to create a second dataframe df_n, same size as the first one. Several rows and columns of df_n will be ...
1 vote · 1 answer · 104 views
pandas read csv with MANY columns
I have a csv that has 162 rows, but around 3 million columns. I want to read it with pandas. I have enough RAM available, but pd.read_csv(file.csv, header=None, dtype=str) takes forever. The cells ...
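A minimal sketch of one option sometimes suggested for very wide files, assuming pandas with the optional pyarrow parser installed; the file name follows the question, but the engine choice is an assumption, not the asker's code:

    import pandas as pd

    # Hedged sketch: the pyarrow engine (requires the pyarrow package) parses in
    # parallel and can be faster than the default C engine on unusual shapes.
    df = pd.read_csv("file.csv", header=None, engine="pyarrow")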
0 votes · 1 answer · 133 views
Mat-Autocomplete with cdk-virtual-scroll-viewport (large data): arrow keys don't work
I am trying to use mat-autocomplete with large data. For the large-data aspect I'm using a cdk-virtual-scroll-viewport with mat-autocomplete. Everything works except the arrow-key navigation.
<mat-...
2 votes · 0 answers · 74 views
How can I optimize a PHP script that fetches 1M+ MySQL rows for reporting without exhausting memory? [duplicate]
I'm working on a PHP web application that generates reports from a MySQL table with over 1 million rows.
What I'm trying to do:
Fetch a large dataset and process it to generate a downloadable report (...
6 votes · 3 answers · 289 views
How to select elements from two complex-number lists to get the minimum magnitude of their sum
I have two Python lists (list1 and list2), each containing 51 complex numbers. At each index i, I can choose either list1[i] or list2[i]. I want to select one element per index (51 elements in total) ...
1 vote · 1 answer · 51 views
Date Range Large Index/Match Duplicate
The results area finds the top 4 largest costs in column A within the date range: =IFERROR(LARGE(IF(Sheet1!$D$5:$D$4935>=$A$2,IF(Sheet1!$D$5:$D$20<=$B$2,Sheet1!$E$5:$E$20)),1),0) and then ...
0 votes · 1 answer · 82 views
Work with large matrices to build phylogenetic trees
Following this guide, I managed to convert my IBS matrix into a phylo object. Now, everything works just fine for small test cases.
However, I'm now trying to scale to the entire dataset of 300 ...
5 votes · 1 answer · 259 views
LOESS on very large dataset
I'm working with a very large dataset containing CWD (Cumulative Water Deficit) and EVI (Enhanced Vegetation Index) measurements across different landcover types. The current code uses LOESS ...
0 votes · 1 answer · 55 views
What is the problem when I try to remove the "Don't know/refuse" level in SAS?
I am trying to remove "Don't know/refuse" for the headache and breasttenderness variables, but all the values for breasttenderness_num and headache_num are missing. Below is the code ...
-3 votes · 2 answers · 82 views
Identify data groups in irregular row groups in a dataframe in R
I want to know how many groups of data this df has:
df <- data.frame(
  stringsAsFactors = FALSE,
  V1 = c("A","-","-","-","B"...
0 votes · 0 answers · 59 views
How can I optimize and apply this same logic for large datasets?
I’m working on a system where I calculate the similarity between user vectors and product vectors using cosine similarity in Python with NumPy. The code below performs the necessary operations, but I ...
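A minimal vectorized sketch of cosine similarity between two sets of vectors; the array names and sizes below are illustrative, not taken from the question:

    import numpy as np

    # Toy data: user_vecs is (n_users, d), product_vecs is (n_products, d).
    user_vecs = np.random.rand(1_000, 64).astype(np.float32)
    product_vecs = np.random.rand(5_000, 64).astype(np.float32)

    # Normalize rows once, then a single matrix product yields all pairwise
    # cosine similarities as an (n_users, n_products) array.
    u = user_vecs / np.linalg.norm(user_vecs, axis=1, keepdims=True)
    p = product_vecs / np.linalg.norm(product_vecs, axis=1, keepdims=True)
    similarity = u @ p.T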
1 vote · 0 answers · 84 views
Larger-than-memory Survey Analysis with R+Arrow
I'm currently trying to analyze data from the National Inpatient Sample (NIS). When combining multiple years worth of data, my files are just over 8 GB after processing/selecting relevant columns. I ...
1 vote · 1 answer · 74 views
How to improve responsiveness of interactive plotly generated line plots saved as html files?
I have some very long time series data (millions of data points) and generate interactive Plotly HTML plots based on this data. I am using Scattergl from plotly's graph_objects.
When I attempt to ...
1 vote · 1 answer · 256 views
How can I efficiently handle filtering and processing large datasets in Python with limited memory?
I'm working with a large dataset (around 1 million records) represented as a list of dictionaries in Python. Each dictionary has multiple fields, and I need to filter the data based on several ...
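A minimal sketch of one memory-friendly pattern, streaming records through a generator instead of building intermediate lists; the field names and thresholds are assumptions, not taken from the question:

    # Hedged sketch: only one record is held at a time.
    def filter_records(records, min_age=18, country="US"):
        for rec in records:            # works on any iterable, e.g. rows read from disk
            if rec.get("age", 0) >= min_age and rec.get("country") == country:
                yield rec

    # Example usage: aggregate without materializing the filtered list.
    records = [{"age": 30, "country": "US", "amount": 10.0}] * 5   # toy data
    total = sum(r["amount"] for r in filter_records(records))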
0 votes · 1 answer · 316 views
How to make my frequency table into a heatmap in R
I have converted a large dataset into a two-way frequency table and want to present it in a heatmap graph, with the colours representing the frequency. I've managed to make a heatmap, but it only ...
3 votes · 4 answers · 109 views
Analysing several columns of a dataset at the same time
I work with a really large dataset, where it is very difficult to look at all columns individually.
At this time I only want to count the frequency of the information provided.
Let's say I have a ...
1 vote · 1 answer · 59 views
Strange behavior of lag() in R
I am using code to filter a smaller dataset out of a larger dataset. I am selecting children under the age of 24 months and another variable (b9) which indicates whether the child is living with the mother or not.
...
-1 votes · 1 answer · 60 views
What is the most efficient way to sort a large dataset in Python? [closed]
I have a large dataset (millions of entries) that needs to be sorted. What are the best practices or most efficient methods for sorting such a dataset in Python? Specifically:
Is Python's built-in ...
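A minimal sketch under the usual assumptions (the data fits in memory, a single sort key); the record layout is made up for illustration:

    import heapq

    data = [{"id": i, "score": (i * 37) % 1000} for i in range(1_000_000)]   # toy records

    # If only the k best items are needed, a heap avoids a full sort.
    top_100 = heapq.nsmallest(100, data, key=lambda r: r["score"])

    # Otherwise the built-in sort (Timsort) is the usual default for in-memory data.
    data.sort(key=lambda r: r["score"])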
0 votes · 0 answers · 38 views
What object should I store many strings in? [duplicate]
I have a process where I must (de)tokenize payment account information. Currently, when sending a file to a client with payment information (among other things), the actual payment account numbers are ...
1 vote · 0 answers · 30 views
3D graphs of large data sets don't render
I'm trying to create 3D graphs of large data sets (~10 million data points), but for some reason the graph won't render in plotly. To generate some sample data, I use:
import numpy as np
COUNT = ...
1 vote · 1 answer · 58 views
How to apply a condition over a vector in R
I need a particular type of substitution; in fact, I want to replace some blanks (" " characters) in a data frame with a random choice from the same column, given a certain condition (for ...
0 votes · 1 answer · 99 views
What's the smallest way to store an HTML canvas image?
I'm trying to make a pixel-art animator where you can make pixel art but also animate it. The problem is that I want the canvas to take up most of my screen, with just a little bit of space below ...
1 vote · 0 answers · 590 views
GPU running out of memory when trying to load a large pretrained model
I am using Hugging Face to load some pretrained models to do some testing on some data.
My code looks like this:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0' #Tried to mitigate out of memory ...
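A minimal sketch of two knobs that often reduce load-time GPU memory when loading a transformers model; the model name is a placeholder and neither option comes from the question:

    import torch
    from transformers import AutoModelForCausalLM

    # Half precision roughly halves the weights' memory; device_map="auto"
    # (requires the accelerate package) spills layers to CPU if the GPU fills up.
    model = AutoModelForCausalLM.from_pretrained(
        "some/pretrained-model",      # placeholder model name
        torch_dtype=torch.float16,
        device_map="auto",
    )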
1 vote · 2 answers · 128 views
How can I efficiently filter and aggregate data in a Pandas DataFrame with multiple conditions?
I have a large Pandas DataFrame with multiple columns, including Category, SubCategory, Value, and Date. I need to filter this DataFrame based on multiple conditions and then aggregate the filtered ...
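A minimal sketch of the usual boolean-mask plus groupby pattern, using the column names from the question on a toy frame; the actual conditions and aggregations are assumptions:

    import pandas as pd

    df = pd.DataFrame({
        "Category": ["A", "A", "B"],
        "SubCategory": ["x", "y", "x"],
        "Value": [120, 80, 300],
        "Date": pd.to_datetime(["2024-01-05", "2023-12-01", "2024-02-10"]),
    })

    # Combine the conditions into one boolean mask, filter once, then aggregate.
    mask = (df["Category"] == "A") & (df["Value"] > 100) & (df["Date"] >= "2024-01-01")
    result = (
        df.loc[mask]
          .groupby(["Category", "SubCategory"])["Value"]
          .agg(["sum", "mean", "count"])
    )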
2 votes · 0 answers · 60 views
How to handle operations on extremely large matrices?
I need to handle extremely large matrices (larger than (150k, 150k)) and use these matrices for matrix operations (mainly matrix multiplication and computing the matrix inverse). This process ...
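A minimal sketch of one direction, assuming the matrices are mostly zeros so a sparse format applies, and replacing the explicit inverse with a linear solve; size and density are scaled down for illustration:

    import numpy as np
    from scipy import sparse
    from scipy.sparse.linalg import spsolve

    n = 10_000                       # toy size; 150k works the same way if sparse enough
    A = sparse.random(n, n, density=1e-4, format="csr") + sparse.eye(n, format="csr")
    b = np.ones(n)

    # Solving A x = b avoids ever forming A's (dense) inverse.
    x = spsolve(A.tocsc(), b)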
2 votes · 4 answers · 148 views
How can I improve this for loop to index specific lines of a vector with a large dataset
I apologize if this is formatted incorrectly or if I am missing any information that would be helpful. I am attempting to run a for loop with a nested if statement for a couple of large datasets. The ...
0 votes · 0 answers · 84 views
Memory Saturation and Kernel Restart when Converting to np.array in Python
I'm encountering a memory issue when converting my processed data to a numpy array. I have 57GB of RAM, but the RAM saturates quickly and the kernel restarts at np.array(processed_X). Here is my code:
...
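A minimal sketch of one pattern that sometimes avoids the memory spike, pre-allocating the target array and filling it in place instead of converting a large Python list at the end; the shape, dtype, and per-row computation are placeholders:

    import numpy as np

    n_rows, n_cols = 1_000_000, 128                      # placeholder shape
    X = np.empty((n_rows, n_cols), dtype=np.float32)     # float32 halves memory vs float64

    for i in range(n_rows):
        # stand-in for producing one processed row
        X[i] = np.zeros(n_cols, dtype=np.float32)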
1 vote · 0 answers · 122 views
Evaluation speed is too low and takes a lot of time using the HF Trainer
I'm training a huge self-supervised model. When I tried to train on the complete dataset, it threw CUDA OOM errors; to fix that I decreased the batch size and added gradient accumulation along with eval ...
3 votes · 1 answer · 787 views
How to use {fmt} with large data
I'm starting to play with {fmt} and wrote a little program to see how it processes large containers. It would seem that fmt::print() (which ultimately sends output to stdout) internally first ...
-1 votes · 1 answer · 50 views
Sum Function in MS Excel [closed]
[=IF(I74<=50000,"150",IF(I74<=100000,"200",IF(I74<=150000,"250",IF(I74<=200000,"300",IF(I74<=250000,"350",IF(I74<=300000,"400...
2 votes · 3 answers · 188 views
Python Multiprocessing: when I launch many processes on a huge pandas data frame, the program gets stuck
I am trying to reduce execution time with Python's multiprocessing library (Pool.starmap) on code that executes the same task in parallel on the same Pandas DataFrame, but with different call ...
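A minimal sketch of one common workaround, giving each worker only its slice of the frame rather than the whole DataFrame; the column name and the per-chunk work are placeholders:

    import multiprocessing as mp
    import numpy as np
    import pandas as pd

    def work(chunk: pd.DataFrame) -> float:
        return chunk["value"].sum()                      # placeholder computation

    if __name__ == "__main__":
        df = pd.DataFrame({"value": np.random.rand(1_000_000)})
        n_workers = mp.cpu_count()
        size = -(-len(df) // n_workers)                  # ceiling division
        chunks = [df.iloc[i:i + size] for i in range(0, len(df), size)]
        with mp.Pool(n_workers) as pool:                 # each worker pickles only its chunk
            results = pool.map(work, chunks)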
0 votes · 1 answer · 200 views
Converting a very large (250GB+) json file into csv using ijson in Python
I'm trying to convert an extremely large (over 250GB) json file into a csv; the json file looks like this:
{
  "BuildingSiteList": [
    {
      "ID": "00001"
      (34 more ...
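A minimal streaming sketch with ijson, assuming the structure shown above (a top-level "BuildingSiteList" array of flat objects); the file names are placeholders:

    import csv
    import ijson

    # Stream one building-site object at a time so the 250 GB file is never
    # loaded whole; the CSV header is taken from the first object's keys.
    with open("input.json", "rb") as src, open("out.csv", "w", newline="") as dst:
        writer = None
        for site in ijson.items(src, "BuildingSiteList.item"):
            if writer is None:
                writer = csv.DictWriter(dst, fieldnames=list(site.keys()))
                writer.writeheader()
            writer.writerow(site)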
0 votes · 0 answers · 323 views
Fuzzy match on large dataset - speed up and keep similar matches
I have two beer datasets; one is ~3 million entries, and another is 175 thousand entries. Doing a fuzzy match on these two will take way too long. I've run a few tests on the same 1000 random sample ...
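A minimal sketch of one common speed-up, swapping in rapidfuzz for the pairwise scoring; the beer names and the similarity threshold are illustrative assumptions:

    from rapidfuzz import fuzz, process

    small = ["Pliny the Elder", "Heady Topper"]            # stands in for the 175k list
    large = ["pliny elder", "heady topper dipa", "other"]  # stands in for the 3M list

    matches = []
    for name in small:
        best, score, idx = process.extractOne(name, large, scorer=fuzz.token_sort_ratio)
        if score >= 90:                                    # keep only sufficiently similar matches
            matches.append((name, best, score))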
-3 votes · 1 answer · 64 views
What is the quickest way to group and analyze large data (~150MM+ rows)?
I have a large dataset of historical power prices (151mm+). There are 18,065 individual nodes where prices settle, each with hourly observations (8760/yr).
Data schema: Node ID (int64), Datetime (...
2 votes · 1 answer · 121 views
How to filter a huge CSV file with pandas
I have a 10 GB CSV file, data/history_{date_to_be_searched}.csv. It has more than 27,000 zip codes. I have to filter the CSV file by zip code, and then upload each filtered file to ...
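A minimal sketch of a chunked approach, assuming the zip code lives in a column named zip_code (an assumption) and appending one output file per zip code:

    import os
    import pandas as pd

    date_to_be_searched = "2024-01-01"                     # placeholder value
    path = f"data/history_{date_to_be_searched}.csv"

    # Read the 10 GB file in pieces; within each piece, split by zip code and
    # append the rows to that zip code's output file.
    for chunk in pd.read_csv(path, chunksize=500_000):
        for zip_code, part in chunk.groupby("zip_code"):
            out = f"filtered_{zip_code}.csv"
            part.to_csv(out, mode="a", header=not os.path.exists(out), index=False)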
1 vote · 2 answers · 215 views
Memory efficient parallel repeated rarefaction with subsequent matrix addition of large data set
I am trying to speed up repeatedly rarefying a data frame and the subsequent addition of generated matrices. Some background information: The data set I want to repeatedly rarefy is very large (about ...
0 votes · 1 answer · 124 views
How should very large but highly symmetric arrays be handled in Python? [closed]
I am trying to populate and store a NumPy array with ~1 trillion entries with data to be retrieved later. The array has ~50 dimensions with ~7 indices, i.e. it is a rank-7 tensor in 50 dimensions or ...
0 votes · 2 answers · 423 views
Powershell Script to Replace Text in Text File, but not save to new file
I am trying to replace text in a large text file (5 GB). I found the script below. It outputs to a new file.
powershell -Command "(gc myFile.txt) -replace 'foo', 'bar' | Out-File -encoding ASCII ...
0 votes · 1 answer · 100 views
Logistic Lasso on large gene dataset specifically through the Knockoff package in R
This question is perhaps in an uncanny valley between CrossValidated and StackOverflow, as I'm trying to understand the methodology of functions in an R package, in the context of executing them ...
0 votes · 0 answers · 93 views
R: efficient and fast splitting large data files in a directory by a variable and write out the files
I have run into a problem with how to quickly and efficiently read and split a list of very large transaction data files by a column called SecurityID. Inside each transaction data file, there can be ...
0 votes · 1 answer · 207 views
Trying to stream my (very large) json file with ijson - is it formatted wrong?
I'm trying to stream through a large json file using ijson in python. This is my first time trying this.
My code is really simple right now:
with open('file.json', 'rb') as f:
    j = ijson.items(f, 'item'...
-1 votes · 1 answer · 225 views
How to Efficiently Manage Large Datasets in Select2 with AJAX and Laravel
I'm working on a Laravel application that requires dynamic loading of select options in the UI, potentially dealing with large datasets. The goal is to implement autocomplete functionality where ...
-1 votes · 2 answers · 102 views
High-performing dataframe join in Python
I have two data frames: one has a start date and an end date, the second has just a date. Basically, one frame holds the group data and the other holds the child data. So I want to join all the dates which come ...
2 votes · 1 answer · 3k views
How to randomly sample very large pyArrow dataset
I have a very large arrow dataset (181GB, 30m rows) from the huggingface framework I've been using. I want to randomly sample with replacement 100 rows (20 times), but after looking around, I cannot ...
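A minimal sketch assuming the Hugging Face datasets library; the dataset path is a placeholder, and select() only materializes the requested rows:

    import numpy as np
    from datasets import load_from_disk

    ds = load_from_disk("path/to/arrow_dataset")           # placeholder path

    # 20 independent samples of 100 rows each, drawn with replacement.
    samples = []
    for _ in range(20):
        idx = np.random.randint(0, len(ds), size=100)
        samples.append(ds.select(idx))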
1 vote · 1 answer · 102 views
Bayesian Network [Variable Elimination]: merge and groupby memory crash using pandas
I tried to speed up my functions and make them more memory-efficient for the variable elimination algorithm on a Bayesian network, but it still crashes once the dataframe gets too big.
I have created a ...
0 votes · 0 answers · 458 views
How to merge large PDF files without running out of memory in Node.js
I am using Node.js and Puppeteer to generate a lot of PDF files with high-resolution images on them. After that, I store all generated PDF files in an array. Then I merge them one by one using pdf-merger-...
0 votes · 1 answer · 344 views
creating unique ID column in a large dataset
How to create a column for unique IDs replacing the old unique IDs in a large dataset, as large as around 26000 observations?
I have a dataset with 26000 observations and need to create a unique ID ...
0 votes · 1 answer · 498 views
Getting "No space left on device (28)" for resource-intensive PHP scripts
I have two PHP scripts; the parent script has a loop of 1,000,000 iterations, and in each iteration it calls a child script with the help of shell_exec(). The child script performs 10,000 insertions into a MySQL table ...