Newest 'data-processing' Questions

0 votes

0 answers

84 views

How do I process CSV files from S3 using pandas without loading the entire file?

My current data processing flow looks like this: load the CSV; pivot the data; filter the original data based on some results from the previous step; repeat several times. I have this working on several CSV files in ...

BBloggsbott

430

asked Jul 3 at 9:10

0 votes

1 answer

39 views

How can I parse array-style string data that isn't in standard JSON format?

I have data in an array-style string format that I need to parse and insert into a DolphinDB table, with each nested array element becoming a separate record. Here's an example of the data: '[["...

Stella.W

27

asked Jun 30 at 2:51

0 votes

1 answer

30 views

Achieving Grouping to Make the Sum of Each Group's Data Equal

I have a dataset with two columns: sID and sum_count. Now, I need to divide the sID into 5 groups with the requirement that: The sum of the sum_count column in each group should be as equal as ...

Huang WeiFeng

95

asked Jun 3 at 9:23
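One common heuristic for the balanced-grouping question above is the greedy "largest first" rule: sort by sum_count in descending order and always place the next sID into the group with the smallest running total. The sketch below is only an illustration in Python. The function name `balance_groups` and the sample rows are hypothetical, and the heuristic does not guarantee an optimal split.

```python
import heapq


def balance_groups(rows, n_groups=5):
    """Greedy heuristic: biggest items first, each into the lightest group.

    rows is an iterable of (sID, sum_count) pairs (hypothetical sample format).
    Returns (groups, totals), where groups[i] lists the sIDs in group i.
    """
    # Min-heap of (running_total, group_index) so the lightest group pops first.
    heap = [(0, g) for g in range(n_groups)]
    heapq.heapify(heap)
    groups = [[] for _ in range(n_groups)]

    for sid, count in sorted(rows, key=lambda r: r[1], reverse=True):
        total, g = heapq.heappop(heap)
        groups[g].append(sid)
        heapq.heappush(heap, (total + count, g))

    # Read the final running totals back out of the heap.
    totals = [0] * n_groups
    for total, g in heap:
        totals[g] = total
    return groups, totals


# Made-up sample data, not taken from the question.
sample_rows = [("a", 10), ("b", 9), ("c", 8), ("d", 7), ("e", 6),
               ("f", 5), ("g", 4), ("h", 3), ("i", 2), ("j", 1)]
groups, totals = balance_groups(sample_rows, n_groups=5)
```

If an exact optimum is required, this becomes a multiway number partitioning problem, and an exhaustive or solver-based approach would be needed instead.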

1 vote

1 answer

91 views

Can I split names in a Google Sheet every time there is a new submission?

I am using the following script to split names with a Google Sheet that is receiving submissions from a Squarespace RSVP form. function split() { const DELIMITER = " "; var ss = ...

Christy Perez

11

asked May 15 at 18:00

0 votes

1 answer

56 views

Azure Blob CSV Appending Data Instead of Overwriting for Each Patient Processing using python

I have a Python script that processes person data and appends the results to an Azure Blob Storage CSV file. However, the issue is that for each new patient the generated csv is appending to the ...

krishna sai

1

asked Mar 25 at 16:54

3 votes

2 answers

94 views

How can I split this data into rows in a data frame with column names with pandas?

Each row of my data looks something like this: 8,0 0 1 0.000000000 8082 A WS 24664872 + 8 <- (8,2) 23604576 I'd like to split the data into columns like this: col1 col2 col3 ...

kai

31

asked Nov 8, 2024 at 15:25

0 votes

2 answers

88 views

How to Extract data from rows data collected from logs

Seeking help on how to extract data from rows of data similar to this Raw Data and convert the data placement to this Process Data. I'm having a problem extracting "Process X" and populating the ...

noobita

1

asked Sep 24, 2024 at 10:48

0 votes

1 answer

236 views

np.load fails with ValueError: cannot reshape array of size (838715,) into shape (838710,)

I'm trying to save the scaling parameters of a dataset into a .npy file on the disk, so I avoid having to recalculate them every time I re-run the code. For now, I'm using MaxAbsScaler() from sklearn ...

geani

11

asked Aug 15, 2024 at 17:36

0 votes

1 answer

68 views

How to monitor a combination of io sensors in a stateful manner

My data source emits IOT data with the following structure - io_id,value,timestamp 232,1223,1718191205 321,671,1718191254 54,2313,1718191275 232,432,1718191315 321,983,1718191394 ........ There are 2 ...

GrozaFry

51

asked Jun 14, 2024 at 10:03

0 votes

1 answer

160 views

JSON Data Stored as Null Values in Delta Lake Table Using PySpark

I encountered an issue while trying to store JSON data as a Delta Lake table using PySpark and Delta Lake. Here's my code: from pyspark.sql import SparkSession from pyspark.sql.types import StructType,...

NO2 SIIZEXL

29

asked Jun 7, 2024 at 5:28

-4 votes

1 answer

64 views

Is there an Excel function to assign a binary result to a predefined data cell?

Sorry for the title, I know it might be pretty broad and not very informative. I am facing a problem regarding the analysis of a data set. The participants of my experiments were randomly assigned ...

taboulet

1

asked May 20, 2024 at 14:53

1 vote

1 answer

168 views

VBA code for processing data in an Excel file crashes after processing about 400-500 rows

I have coded a VBA macro to process downloaded data. The data has some junk rows that need to be deleted, but also has some rows where the data is off by a couple of columns and a couple of rows. The ...

czw

872

asked May 19, 2024 at 16:59

1 vote

0 answers

59 views

pre-processing sequences for LSTM model (sign language recognition)

I've been working on sign language recognition. I extracted landmarks with MediaPipe, saved them as .parquet files, then padded the data to create a uniform length. Each row of landmarks has 21 nodes with x,y,z ...

karesosis

11

asked May 17, 2024 at 8:47

0 votes

0 answers

206 views

Fanout use case in Apache Flink

This is an opinion based question: My use case is real-time and needs to be able to process everything in a sub-second speed. I have an external mongo DB which holds information about all of the users ...

Or Keren

138

asked May 10, 2024 at 7:22

0 votes

1 answer

77 views

In Node.js, data is not received as chunks (stream) in the consumer

I have two services, a producer and a consumer. The producer has a big JSON file on the server. I want to serve it over the network through a REST API, and I used the Node.js stream technique to load bytes in memory ...

Faizul Ahemed

66

asked May 4, 2024 at 5:42

0 votes

1 answer

56 views

Need Excel Macro to duplicate row per "X" marked in column (VBA)

Need a macro that will help me process data where they add an X to mark which group the row belongs to. For example: The data comes with many more columns but that's just the gist of it. They mark ...

user16201107

5

asked Apr 23, 2024 at 17:53

-1 votes

1 answer

47 views

Is there a function in R to reduce the number of zeros of a percentage in the y axis of a bar chart? [closed]

I wrote the code ggplot(data = summary_datas) + geom_bar(mapping = aes(x=member_casual,fill=member_casual)) + labs(title = "Rider Membership data", subtitle= "Difference in the ...

Shasha

1

asked Apr 11, 2024 at 20:48

1 vote

1 answer

675 views

Writing Waveform data into CSV file in LabVIEW

I have a LabVIEW program which contains voltage, current and power data in the same waveform. I am planning to extract each of them one by one and put them into an array. Currently, I have extracted ...

Nh K

13

asked Mar 31, 2024 at 1:09

1 vote

1 answer

88 views

How do I ensure that the user continues where they left off in the application in Flutter?

I made a book reading page with the Page_Flip widget, but when the user leaves the application and re-enters the book page, I want it to continue from where it left off. How can I keep the user's ...

Muhammed Halil Demirci

15

asked Mar 8, 2024 at 5:39

1 vote

1 answer

131 views

How to synchronize datastreams in broadcastProcessFunction flink?

I'm using the Flink 1.81.1 API on Java 11 and I'm trying to use a BroadcastProcessFunction to filter a Products DataStream with an authorized-brand DataStream as the broadcast. So my first products DataStream ...

Nabil Hadji

9

asked Mar 7, 2024 at 12:31

-1 votes

1 answer

63 views

Messy CSV auto header extractor [closed]

I have a bunch (100+) of CSV files. Each of them can have blank rows, or rows I don't need (some fuzz info like "Congrats, you all bla bla"). When reading in pandas I need to specify which row ...

Yewgen_Dom

100

asked Mar 5, 2024 at 22:20

0 votes

1 answer

129 views

Read all rows with pd.read_csv

I'm using Python to read a file of 5,000,000 rows but currently it only reads 1,000,000 rows. The file is around 125mb. I'm using the pd.read_csv function but this only leads to reading 1,000,000 rows ...

Nanhe Zou

1

asked Mar 4, 2024 at 9:11

1 vote

1 answer

168 views

VBA Vlookup Function from Multiple Workbooks

I am using a VLOOKUP function to bring over data from 4 different files into one sheet. I want to place the VLOOKUP results from the 1st file in Column 4, then the results of the 2nd file in Column 6 ...

Eriknme

37

asked Feb 21, 2024 at 5:48

-1 votes

2 answers

102 views

High performing dataframe join in Python

I have two data frames: one has a start date and an end date, and the second has just a date. Basically, one frame has the groups and the other has the child data. So I want to join all the dates which come ...

Pijush

31

asked Feb 16, 2024 at 7:34

0 votes

0 answers

34 views

Solution for processing hierarchical structure with large number of leaf nodes in SQL

I'm working on a project which stores data for tree-structured models such as file systems. In many cases the tree has a large number of leaves and an unknown depth. My project is ...

pooriya

3

asked Feb 14, 2024 at 15:56

1 vote

1 answer

64 views

A lightweight approach to processing Django Queryset data

I am looking for an optimal way to perform simple data processing on a Django QuerySet. I would rather not install heavyweight libraries like pandas or NumPy. The number of rows in ...

Jacek

73

asked Feb 13, 2024 at 8:11

0 votes

1 answer

919 views

How to Optimize Memory Usage When Processing Large CSV Files in Python?

I am working on a Python script to process large CSV files (ranging from 2GB to 10GB) and am encountering significant memory usage issues. The script reads a CSV file, performs various transformations ...

Shahnoor

3

asked Feb 2, 2024 at 7:01
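For the memory question above, one general way to keep usage flat is to stream the file row by row and keep only a running aggregate, rather than materializing the whole CSV. The sketch below uses Python's standard `csv` module. The column names `region` and `amount`, the helper `sum_by_key`, and the in-memory sample are hypothetical stand-ins, since the excerpt does not show the actual transformations.

```python
import csv
import io
from collections import defaultdict


def sum_by_key(lines, key_col, value_col):
    """Stream CSV rows and keep only one running total per key in memory."""
    totals = defaultdict(float)
    # DictReader pulls one row at a time from the underlying iterator,
    # so memory grows with the number of distinct keys, not with file size.
    for row in csv.DictReader(lines):
        totals[row[key_col]] += float(row[value_col])
    return dict(totals)


# Small in-memory stand-in for a large file (hypothetical data).
sample = io.StringIO("region,amount\nnorth,1.5\nsouth,2.0\nnorth,3.0\n")
result = sum_by_key(sample, "region", "amount")
```

With a real file, the same function can be passed an object opened with `open(path, newline="")`. Transformations that need the whole dataset at once (a global sort, for example) would not fit this pattern.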

-1 votes

1 answer

124 views

Pandas read_fwf doesn't read the last digit of each row [closed]

I have a .rpt file that has two columns, like this: A column B column 990.E-03 -2.73654E-03 995.E-03 -2.75347E-03 1. ...

hz z

1

asked Feb 1, 2024 at 19:05

0 votes

1 answer

58 views

How to convert an Excel sheet to data processing using pandas?

How to convert this Excel sheet to data processing using pandas? import pandas as pd df = pd.read_excel(r"c:/Users/vpullabh/Desktop/Meraci.Ec-NGIOSD.xlsx", sheet_name=...

Vaishnavi Pullabhatla

1

asked Jan 22, 2024 at 5:22

4 votes

1 answer

295 views

What's the time complexity of forward filling and backward filling in spark?

My question: Need to understand the time complexity of dynamic forward filling and back filling in spark Hello, I have a scala job that reads Delta Table A, transforms Data Frame and writes to Delta ...

Yun Xing

85

asked Dec 13, 2023 at 18:01

-1 votes

1 answer

2k views

Excel in Large-Scale Data Processing with GPT-3.5 and Embeddings

I'm working on integrating OpenAI functionalities, specifically GPT3.5 and embeddings, into a large system of Excel workbooks used for almost anything in my office. Our goal is having GPT3.5 taking ...

Pakoco

41

asked Nov 25, 2023 at 19:07

1 vote

0 answers

89 views

Merging two files and expanding metadata efficiently

I'm dealing with a large file with each row with CHR and POS values (which are positional coordinates). I process this file using a tool, but it outputs only a subset of these positional coordinates ...

binf-er

11

asked Nov 9, 2023 at 22:38

1 vote

1 answer

661 views

Google Cloud Dataflow Job failed: Found unexpected parameters

FAILED NOTE When I set up a Dataflow pipeline and created a job from a template ('Text Files on Cloud Storage to BigQuery'), I ran into this problem. Job creation failed: The workflow could not be created. ...

MING

11

asked Nov 7, 2023 at 14:59

1 vote

0 answers

344 views

Contour detection based on 4-connectivity using `findContours()` from OpenCV

The findContours() function from the OpenCV library does not allow you to customize the selection of contours based on 4-connectivity. I checked on a test image: all the modes of this function that ...

Walrus

23

asked Oct 17, 2023 at 19:39

1 vote

0 answers

75 views

Parsing nested JSON into R List

I have a pretty straight forward JSON object that I am trying to parse into a list of objects for downstream processing and use. The JSON structure is dynamic but here is an example of the structure I ...

James Peruggia

327

asked Oct 10, 2023 at 18:23

1 vote

2 answers

101 views

Process python dictionary based on previous, current and next value

I have a python dictionary as follows: ip_dict = {'GLArch': {'GLArch-0.png': ['OTHER', 'Figure 28 TAC '], 'GLArch-1.png': ['DCDFP', 'This insurance '], '...

spectre

787

asked Oct 10, 2023 at 7:55

0 votes

1 answer

168 views

I don't quite understand a concept in Kimball's dimensional modeling

I have read through the idea "Behavior Tag Time Series" several times but couldn't understand it. Here is the explanation in the book, which still doesn't make sense to me: "Almost all text in a data ...

cloudscomputes

1,514

asked Oct 3, 2023 at 6:25

0 votes

2 answers

49 views

How to find two threes in a zip code using R

I need help with this task: Print data for locations that have two threes in the address.zip code. I tried: filtered_data <- df %>% filter(grepl("\\d{3}.*\\d{3}", address.zip)) ...

Rokas

13

asked Sep 11, 2023 at 18:30

-2 votes

1 answer

239 views

How can I correct my Time Series LSTM RNN for Binary Classification favoring Class 0?

I am attempting to predict a binary outcome based on 15 continuous sequences (except one which isn't a continuous line, but still a sequence). The dataset contains 933k datapoints for all 15 features ...

Didlex

1

asked Sep 3, 2023 at 12:05

0 votes

0 answers

43 views

Why does my valid data keep outputting onto the wrong switch statement case?

My valid data (Records.txt) keeps outputting onto the wrong case statement. Records.txt: AB12MP349 Fusion5 20 17000.00 33435KMOP324 BMW 40 25000.00 AB12MP349 Audi 100 4000.00 AB12MP349 Pagni 1 2000000....

Zximy

1

asked Sep 2, 2023 at 19:35

2 votes

3 answers

68 views

Process the python dictionary to remove undesired elements and retain desired ones

I have a python dictionary as given below: ip = { "doc1.pdf": { "img1.png": ("FP", "text1"), "img2.png": ("NP", "...

lowkey

140

asked Aug 16, 2023 at 15:10

1 vote

1 answer

56 views

dividing each sample by its maximum feature value separately, or dividing all samples by the maximum value across the entire dataset

I am trying to reproduce a paper that uses the tf-idf method. During the data preprocessing, there is a step that involves feature scaling. In the original paper, it says, "We restrict the words ...

yi zhu

11

asked Aug 2, 2023 at 6:43

1 vote

2 answers

836 views

Delta table partition folder name is getting changed

I am facing an issue where the expected date partition folder should be named in the format date=yyyymmdd, but instead it is written as - Sometimes for each parquet file created in delta path, it's creating a ...

Arindam Bhattacharjee

21

asked Aug 1, 2023 at 4:15

0 votes

0 answers

318 views

AWS IAM role chaining: session timeout needs to be more than 2 hours to run job

I am working on data processing in which I have an EKS cluster in one account and do the processing in a second AWS account, so we are assuming an IAM role from one account to another and performing ...

Rutik Lohade

1

asked Jul 15, 2023 at 18:26

0 votes

1 answer

62 views

How to convert data to a regular tabular dataset after Run Length Encoding (RLE) transform

I have observations that are formed using Run Length Encoding transform as Example set.seed(1) make_data <- function() { series <- rnorm(sample(10:50,1)) |> cumsum() |> sign() ...

mr.T

634

asked Jul 14, 2023 at 7:19

1 vote

1 answer

31 views

how to convert tabular data correctly for objects with different lengths

I have data as objects like this set.seed(1) make_rle <- function() rnorm(10) |> cumsum() |> sign() |> accelerometry::rle2(indices = T) X <- lapply(1:10, \(x) make_rle()) X [[1]] ...

mr.T

634

asked Jul 11, 2023 at 14:56

1 vote

3 answers

75 views

Remove rows containing duplicated values in NumPy

I'm trying to find an efficient way to remove rows of a NumPy array that contain duplicated elements. For example, the array below: [[1,2,3], [1,2,2], [2,2,2]] should keep [[1,2,3]] only. I know pandas ...

Xe-

25

asked Jul 11, 2023 at 12:23

-1 votes

2 answers

838 views

How can I read and write CSV files and process the data into arrays in Java?

I am working on a Java project where I need to handle CSV files. Specifically, I need to read and write CSV files and process the data into arrays for further manipulation. I have researched different ...

DamianBautista

1

asked Jul 8, 2023 at 20:56

0 votes

0 answers

107 views

In python, can I define a polynomial function with a user-defined power and coefficients, that I can reference for future calculations?

Some preface, I have been teaching myself python for the past few days for a project, with almost no history of coding beyond some dabbling with MATLAB, so I apologize if there is something very ...

Luke M

13

asked Jul 5, 2023 at 18:55
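One way to approach the polynomial question above is a small factory function that captures the coefficients in a closure and returns a callable that can be reused in later calculations. This is a minimal sketch, assuming coefficients are supplied highest degree first (so the degree is implied by the list length). The name `make_polynomial` is hypothetical, and evaluation uses Horner's method.

```python
def make_polynomial(coefficients):
    """Return a function p(x) for the given coefficients, highest degree first.

    Example: [2, 0, -1] represents 2*x**2 + 0*x - 1.
    """
    coeffs = tuple(coefficients)  # freeze a copy so later list edits don't leak in

    def poly(x):
        # Horner's method: ((c0 * x + c1) * x + c2) ...
        result = 0
        for c in coeffs:
            result = result * x + c
        return result

    return poly


# Define once, then reuse wherever the polynomial is needed.
p = make_polynomial([2, 0, -1])
value = p(3)
```

In an interactive script, the coefficient list could be built from user input before calling `make_polynomial`. Libraries such as NumPy also provide polynomial classes for the same purpose.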

1 vote

1 answer

96 views

How can a data processor pass the latest caching time to the FLUIDTEMPLATE?

The results from a data processor in the fluid template are cached. My data processor determines a list of images and a maximum time until which the list can be cached. How do I forward this ...

Dr. Dieter Porth

21

asked Jul 4, 2023 at 5:02
