2,104 questions
Best practices · 1 vote · 3 replies · 88 views
Deleting large data without stopping the active MySQL server
I'm using a table of approximately 1TB in a MySQL database. This table also has a monthly partition. We store the last two months of data in this table and regularly truncate the data from the ...
Best practices · 0 votes · 3 replies · 71 views
I want to make an "HTTPS proxy cache server" with nginx that handles large Git sources
I'm new to nginx and proxy servers.
We have a problem with googlesource returning 429 errors caused by too many requests, and because of limited bandwidth it takes a long time to fetch from googlesource.
We reviewed making AOSP ...
0 votes · 1 answer · 85 views
How to reference a second Pandas dataframe to the first one without creating any copy of the first one?
I have a large pandas dataframe df of something like a million rows and 100 columns, and I have to create a second dataframe df_n, same size as the first one. Several rows and columns of df_n will be ...
1 vote · 1 answer · 104 views
pandas read csv with MANY columns
I have a csv that has 162 rows, but around 3 million columns. I want to read it with pandas. I have enough RAM available, but pd.read_csv(file.csv, header=None, dtype=str) takes forever. The cells ...
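A minimal sketch of one option sometimes suggested for very wide files, assuming pandas with the optional pyarrow parser installed; the file name follows the question, but the engine choice is an assumption, not the asker's code:

    import pandas as pd

    # Hedged sketch: the pyarrow engine (requires the pyarrow package) parses in
    # parallel and can be faster than the default C engine on unusual shapes.
    df = pd.read_csv("file.csv", header=None, engine="pyarrow")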
0 votes · 1 answer · 133 views
Mat-Autocomplete with cdk-virtual-scroll-viewport (large data): arrow keys don't work
I am trying to use mat-autocomplete with large data. For the large-data aspect I'm using a cdk-virtual-scroll-viewport with mat-autocomplete. Everything works except the arrow-key navigation.
<mat-...
2 votes · 0 answers · 74 views
How can I optimize a PHP script that fetches 1M+ MySQL rows for reporting without exhausting memory? [duplicate]
I'm working on a PHP web application that generates reports from a MySQL table with over 1 million rows.
What I'm trying to do:
Fetch a large dataset and process it to generate a downloadable report (...
6 votes · 3 answers · 289 views
How to select elements from two complex-number lists to get the minimum magnitude of their sum
I have two Python lists (list1 and list2), each containing 51 complex numbers. At each index i, I can choose either list1[i] or list2[i]. I want to select one element per index (51 elements in total) ...
1 vote · 1 answer · 51 views
Date Range Large Index/Match Duplicate
The results area finds the top 4 largest costs in column A within the date range: =IFERROR(LARGE(IF(Sheet1!$D$5:$D$4935>=$A$2,IF(Sheet1!$D$5:$D$20<=$B$2,Sheet1!$E$5:$E$20)),1),0) and then ...
0 votes · 1 answer · 82 views
Work with large matrices to build phylogenetic trees
Following this guide, I managed to convert my IBS matrix into a phylo object. Now, everything works just fine for small test cases.
However, I'm now trying to scale to the entire dataset of 300 ...
5 votes · 1 answer · 259 views
LOESS on very large dataset
I'm working with a very large dataset containing CWD (Cumulative Water Deficit) and EVI (Enhanced Vegetation Index) measurements across different landcover types. The current code uses LOESS ...
0 votes · 1 answer · 55 views
What is the problem when I try to remove the "Don't know/refuse" level in SAS?
I am trying to remove "Don't know/refuse" for the headache and breasttenderness variables, but all the values for breasttenderness_num and headache_num are missing. Below is the code ...
-3 votes · 2 answers · 82 views
Identify data groups in irregular row groups in a dataframe in R
I want to know how many groups of data this df has:
df <- data.frame(
  stringsAsFactors = FALSE,
  V1 = c("A","-","-","-","B"...
0 votes · 0 answers · 59 views
How can I optimize and apply this same logic for large datasets?
I’m working on a system where I calculate the similarity between user vectors and product vectors using cosine similarity in Python with NumPy. The code below performs the necessary operations, but I ...
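A minimal vectorized sketch of cosine similarity between two sets of vectors; the array names and sizes below are illustrative, not taken from the question:

    import numpy as np

    # Toy data: user_vecs is (n_users, d), product_vecs is (n_products, d).
    user_vecs = np.random.rand(1_000, 64).astype(np.float32)
    product_vecs = np.random.rand(5_000, 64).astype(np.float32)

    # Normalize rows once, then a single matrix product yields all pairwise
    # cosine similarities as an (n_users, n_products) array.
    u = user_vecs / np.linalg.norm(user_vecs, axis=1, keepdims=True)
    p = product_vecs / np.linalg.norm(product_vecs, axis=1, keepdims=True)
    similarity = u @ p.T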
1 vote · 0 answers · 84 views
Larger-than-memory Survey Analysis with R+Arrow
I'm currently trying to analyze data from the National Inpatient Sample (NIS). When combining multiple years worth of data, my files are just over 8 GB after processing/selecting relevant columns. I ...
1 vote · 1 answer · 74 views
How to improve responsiveness of interactive plotly generated line plots saved as html files?
I have some very long time series data (millions of data points) and generate interactive Plotly HTML plots based on this data. I am using Scattergl from plotly's graph_objects.
When I attempt to ...
1 vote · 1 answer · 256 views
How can I efficiently handle filtering and processing large datasets in Python with limited memory?
I'm working with a large dataset (around 1 million records) represented as a list of dictionaries in Python. Each dictionary has multiple fields, and I need to filter the data based on several ...
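A minimal sketch of one memory-friendly pattern, streaming records through a generator instead of building intermediate lists; the field names and thresholds are assumptions, not taken from the question:

    # Hedged sketch: only one record is held at a time.
    def filter_records(records, min_age=18, country="US"):
        for rec in records:            # works on any iterable, e.g. rows read from disk
            if rec.get("age", 0) >= min_age and rec.get("country") == country:
                yield rec

    # Example usage: aggregate without materializing the filtered list.
    records = [{"age": 30, "country": "US", "amount": 10.0}] * 5   # toy data
    total = sum(r["amount"] for r in filter_records(records))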
0 votes · 1 answer · 316 views
How to make my frequency table into a heatmap in R
I have converted a large dataset into a two-way frequency table and want to present it in a heatmap graph, with the colours representing the frequency. I've managed to make a heatmap, but it only ...
3 votes · 4 answers · 109 views
Analysing several columns of a dataset at the same time
I work with a really large dataset, where it is very difficult to look at all columns individually.
At this time I only want to count the frequency of the information provided.
Let's say I have a ...
1 vote · 1 answer · 59 views
Strange behavior of lag() in R
I am using code to filter a smaller dataset out of a larger dataset. I am selecting children under the age of 24 months and another variable (b9) which indicates whether the child is living with the mother or not.
...
-1 votes · 1 answer · 60 views
What is the most efficient way to sort a large dataset in Python? [closed]
I have a large dataset (millions of entries) that needs to be sorted. What are the best practices or most efficient methods for sorting such a dataset in Python? Specifically:
Is Python's built-in ...
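A minimal sketch under the usual assumptions (the data fits in memory, a single sort key); the record layout is made up for illustration:

    import heapq

    data = [{"id": i, "score": (i * 37) % 1000} for i in range(1_000_000)]   # toy records

    # If only the k best items are needed, a heap avoids a full sort.
    top_100 = heapq.nsmallest(100, data, key=lambda r: r["score"])

    # Otherwise the built-in sort (Timsort) is the usual default for in-memory data.
    data.sort(key=lambda r: r["score"])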
0 votes · 0 answers · 38 views
What object should I store many strings in? [duplicate]
I have a process where I must (de)tokenize payment account information. Currently, when sending a file to a client with payment information (among other things), the actual payment account numbers are ...
1 vote · 0 answers · 30 views
3D graphs of large data sets don't render
I'm trying to create 3D graphs of large data sets (~10 million data points), but for some reason the graph won't render in plotly. To generate some sample data, I use:
import numpy as np
COUNT = ...
1 vote · 1 answer · 58 views
How to apply a condition over a vector in R
I need a particular type of substitution; in fact, I want to replace some blanks (" " characters) in a data frame with a random choice from the same column, given a certain condition (for ...
0 votes · 1 answer · 99 views
What's the smallest way to store an HTML canvas image?
I'm trying to make a pixel-art animator where you can make pixel art but also animate it. The problem is that I want the canvas to take up most of my screen, with just a little bit of space below ...
1 vote · 0 answers · 590 views
GPU running out of memory when trying to load a large pretrained model
I am using Hugging Face to load some pretrained models to do some testing on some data.
My code looks like this:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0' #Tried to mitigate out of memory ...
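A minimal sketch of two knobs that often reduce load-time GPU memory when loading a transformers model; the model name is a placeholder and neither option comes from the question:

    import torch
    from transformers import AutoModelForCausalLM

    # Half precision roughly halves the weights' memory; device_map="auto"
    # (requires the accelerate package) spills layers to CPU if the GPU fills up.
    model = AutoModelForCausalLM.from_pretrained(
        "some/pretrained-model",      # placeholder model name
        torch_dtype=torch.float16,
        device_map="auto",
    )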
1 vote · 2 answers · 128 views
How can I efficiently filter and aggregate data in a Pandas DataFrame with multiple conditions?
I have a large Pandas DataFrame with multiple columns, including Category, SubCategory, Value, and Date. I need to filter this DataFrame based on multiple conditions and then aggregate the filtered ...
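A minimal sketch of the usual boolean-mask plus groupby pattern, using the column names from the question on a toy frame; the actual conditions and aggregations are assumptions:

    import pandas as pd

    df = pd.DataFrame({
        "Category": ["A", "A", "B"],
        "SubCategory": ["x", "y", "x"],
        "Value": [120, 80, 300],
        "Date": pd.to_datetime(["2024-01-05", "2023-12-01", "2024-02-10"]),
    })

    # Combine the conditions into one boolean mask, filter once, then aggregate.
    mask = (df["Category"] == "A") & (df["Value"] > 100) & (df["Date"] >= "2024-01-01")
    result = (
        df.loc[mask]
          .groupby(["Category", "SubCategory"])["Value"]
          .agg(["sum", "mean", "count"])
    )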
2 votes · 0 answers · 60 views
How to handle operations on extremely large matrices?
I need to handle extremely large matrices (larger than (150k, 150k)) and use these matrices for matrix operations (mainly matrix multiplication and computing the matrix inverse). This process ...
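A minimal sketch of one direction, assuming the matrices are mostly zeros so a sparse format applies, and replacing the explicit inverse with a linear solve; size and density are scaled down for illustration:

    import numpy as np
    from scipy import sparse
    from scipy.sparse.linalg import spsolve

    n = 10_000                       # toy size; 150k works the same way if sparse enough
    A = sparse.random(n, n, density=1e-4, format="csr") + sparse.eye(n, format="csr")
    b = np.ones(n)

    # Solving A x = b avoids ever forming A's (dense) inverse.
    x = spsolve(A.tocsc(), b)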
2 votes · 4 answers · 148 views
How can I improve this for loop to index specific lines of a vector with a large dataset
I apologize if this is formatted incorrectly or if I am missing any information that would be helpful. I am attempting to run a for loop with a nested if statement for a couple of large datasets. The ...
0 votes · 0 answers · 84 views
Memory Saturation and Kernel Restart when Converting to np.array in Python
I'm encountering a memory issue when converting my processed data to a numpy array. I have 57GB of RAM, but the RAM saturates quickly and the kernel restarts at np.array(processed_X). Here is my code:
...
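A minimal sketch of one pattern that sometimes avoids the memory spike, pre-allocating the target array and filling it in place instead of converting a large Python list at the end; the shape, dtype, and per-row computation are placeholders:

    import numpy as np

    n_rows, n_cols = 1_000_000, 128                      # placeholder shape
    X = np.empty((n_rows, n_cols), dtype=np.float32)     # float32 halves memory vs float64

    for i in range(n_rows):
        # stand-in for producing one processed row
        X[i] = np.zeros(n_cols, dtype=np.float32)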
1 vote · 0 answers · 122 views
Evaluation speed is too low and takes a lot of time using the HF Trainer
I'm training a huge self-supervised model. When I tried to train on the complete dataset, it threw CUDA OOM errors; to fix that I decreased the batch size and added gradient accumulation along with eval ...
3 votes · 1 answer · 787 views
How to use {fmt} with large data
I'm starting to play with {fmt} and wrote a little program to see how it processes large containers. It would seem that fmt::print() (which ultimately sends output to stdout) internally first ...
-1 votes · 1 answer · 50 views
Sum Function in MS Excel [closed]
[=IF(I74<=50000,"150",IF(I74<=100000,"200",IF(I74<=150000,"250",IF(I74<=200000,"300",IF(I74<=250000,"350",IF(I74<=300000,"400...
2 votes · 3 answers · 188 views
Python Multiprocessing: when I launch many processes on a huge pandas data frame, the program gets stuck
I am trying to reduce execution time with Python's multiprocessing library (Pool.starmap) on code that executes the same task in parallel on the same Pandas DataFrame, but with different call ...
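A minimal sketch of one common workaround, giving each worker only its slice of the frame rather than the whole DataFrame; the column name and the per-chunk work are placeholders:

    import multiprocessing as mp
    import numpy as np
    import pandas as pd

    def work(chunk: pd.DataFrame) -> float:
        return chunk["value"].sum()                      # placeholder computation

    if __name__ == "__main__":
        df = pd.DataFrame({"value": np.random.rand(1_000_000)})
        n_workers = mp.cpu_count()
        size = -(-len(df) // n_workers)                  # ceiling division
        chunks = [df.iloc[i:i + size] for i in range(0, len(df), size)]
        with mp.Pool(n_workers) as pool:                 # each worker pickles only its chunk
            results = pool.map(work, chunks)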
0 votes · 1 answer · 200 views
Converting a very large (250GB+) json file into csv using ijson in Python
I'm trying to convert an extremely large (over 250GB) json file into a csv; the json file looks like this:
{
  "BuildingSiteList": [
    {
      "ID": "00001"
      (34 more ...
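A minimal streaming sketch with ijson, assuming the structure shown above (a top-level "BuildingSiteList" array of flat objects); the file names are placeholders:

    import csv
    import ijson

    # Stream one building-site object at a time so the 250 GB file is never
    # loaded whole; the CSV header is taken from the first object's keys.
    with open("input.json", "rb") as src, open("out.csv", "w", newline="") as dst:
        writer = None
        for site in ijson.items(src, "BuildingSiteList.item"):
            if writer is None:
                writer = csv.DictWriter(dst, fieldnames=list(site.keys()))
                writer.writeheader()
            writer.writerow(site)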
0 votes · 0 answers · 323 views
Fuzzy match on large dataset - speed up and keep similar matches
I have two beer datasets; one is ~3 million entries, and another is 175 thousand entries. Doing a fuzzy match on these two will take way too long. I've run a few tests on the same 1000 random sample ...
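A minimal sketch of one common speed-up, swapping in rapidfuzz for the pairwise scoring; the beer names and the similarity threshold are illustrative assumptions:

    from rapidfuzz import fuzz, process

    small = ["Pliny the Elder", "Heady Topper"]            # stands in for the 175k list
    large = ["pliny elder", "heady topper dipa", "other"]  # stands in for the 3M list

    matches = []
    for name in small:
        best, score, idx = process.extractOne(name, large, scorer=fuzz.token_sort_ratio)
        if score >= 90:                                    # keep only sufficiently similar matches
            matches.append((name, best, score))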
-3 votes · 1 answer · 64 views
What is the quickest way to group and analyze large data (~150MM+ rows)?
I have a large dataset of historical power prices (151mm+). There are 18,065 individual nodes where prices settle, each with hourly observations (8760/yr).
Data schema: Node ID (int64), Datetime (...
2 votes · 1 answer · 121 views
How to filter a huge CSV file with pandas
I have a 10 GB CSV file, data/history_{date_to_be_searched}.csv. It has more than 27,000 zip codes. I have to filter the CSV file by zip code, and then upload each filtered file to ...
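A minimal sketch of a chunked approach, assuming the zip code lives in a column named zip_code (an assumption) and appending one output file per zip code:

    import os
    import pandas as pd

    date_to_be_searched = "2024-01-01"                     # placeholder value
    path = f"data/history_{date_to_be_searched}.csv"

    # Read the 10 GB file in pieces; within each piece, split by zip code and
    # append the rows to that zip code's output file.
    for chunk in pd.read_csv(path, chunksize=500_000):
        for zip_code, part in chunk.groupby("zip_code"):
            out = f"filtered_{zip_code}.csv"
            part.to_csv(out, mode="a", header=not os.path.exists(out), index=False)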
1 vote · 2 answers · 215 views
Memory efficient parallel repeated rarefaction with subsequent matrix addition of large data set
I am trying to speed up repeatedly rarefying a data frame and the subsequent addition of generated matrices. Some background information: The data set I want to repeatedly rarefy is very large (about ...
0 votes · 1 answer · 124 views
How should very large but highly symmetric arrays be handled in Python? [closed]
I am trying to populate and store a NumPy array with ~1 trillion entries with data to be retrieved later. The array has ~50 dimensions with ~7 indices, i.e. it is a rank-7 tensor in 50 dimensions or ...
0 votes · 2 answers · 423 views
Powershell Script to Replace Text in Text File, but not save to new file
I am trying to replace text in a large text file (5 GB). I found the script below. It outputs to a new file.
powershell -Command "(gc myFile.txt) -replace 'foo', 'bar' | Out-File -encoding ASCII ...
0 votes · 1 answer · 100 views
Logistic Lasso on large gene dataset specifically through the Knockoff package in R
This question is perhaps in an uncanny valley between CrossValidated and StackOverflow, as I'm trying to understand the methodology of functions in an R package, in the context of executing them ...
0 votes · 0 answers · 93 views
R: efficient and fast splitting large data files in a directory by a variable and write out the files
I have run into a problem with how to quickly and efficiently read and split a list of very large transaction data files by a column called SecurityID. Inside each transaction data file, there can be ...
0 votes · 1 answer · 207 views
Trying to stream my (very large) json file with ijson - is it formatted wrong?
I'm trying to stream through a large json file using ijson in python. This is my first time trying this.
My code is really simple right now:
with open('file.json', 'rb') as f:
    j = ijson.items(f, 'item'...
-1 votes · 1 answer · 225 views
How to Efficiently Manage Large Datasets in Select2 with AJAX and Laravel
I'm working on a Laravel application that requires dynamic loading of select options in the UI, potentially dealing with large datasets. The goal is to implement autocomplete functionality where ...
-1 votes · 2 answers · 102 views
High-performing dataframe join in Python
I have two data frames: one has a start date and an end date, the second has just a date. Basically, one frame holds the group data and the other holds the child data. So I want to join all the dates which come ...
2 votes · 1 answer · 3k views
How to randomly sample very large pyArrow dataset
I have a very large arrow dataset (181GB, 30m rows) from the huggingface framework I've been using. I want to randomly sample with replacement 100 rows (20 times), but after looking around, I cannot ...
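A minimal sketch assuming the Hugging Face datasets library; the dataset path is a placeholder, and select() only materializes the requested rows:

    import numpy as np
    from datasets import load_from_disk

    ds = load_from_disk("path/to/arrow_dataset")           # placeholder path

    # 20 independent samples of 100 rows each, drawn with replacement.
    samples = []
    for _ in range(20):
        idx = np.random.randint(0, len(ds), size=100)
        samples.append(ds.select(idx))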
1 vote · 1 answer · 102 views
Bayesian Network [Variable Elimination]: merge and groupby memory crash using pandas
I tried to speed up my functions and make them more memory-efficient for the variable elimination algorithm on a Bayesian network, but it still crashes once the dataframe gets too big.
I have created a ...
0 votes · 0 answers · 458 views
How to merge large PDF files without running out of memory in Node.js
I am using Node.js and Puppeteer to generate a lot of PDF files with high-resolution images on them. After that, I store all generated PDF files in an array. Then I merge them one by one using pdf-merger-...
0 votes · 1 answer · 344 views
creating unique ID column in a large dataset
How to create a column for unique IDs replacing the old unique IDs in a large dataset, as large as around 26000 observations?
I have a dataset with 26000 observations and need to create a unique ID ...
0 votes · 1 answer · 498 views
Getting "No space left on device (28)" for resource-intensive PHP scripts
I have two PHP scripts; the parent script has a loop of 1,000,000 iterations, and in each iteration it calls a child script with the help of shell_exec(). The child script performs 10,000 insertions into a MySQL table ...