904 questions
0
votes
0
answers
84
views
How do I process CSV files from S3 using pandas without loading the entire file?
My current data processing flow looks like this
Load CSV
Pivot Data
Filter the original data based on some results from previous step
Repeat several times
I have this working on several CSV files in ...
0
votes
1
answer
39
views
How can I parse array-style string data that isn't in standard JSON format?
I have data in an array-style string format that I need to parse and insert into a DolphinDB table, with each nested array element becoming a separate record. Here's an example of the data:
'[["...
0
votes
1
answer
30
views
Achieving Grouping to Make the Sum of Each Group's Data Equal
I have a dataset with two columns: sID and sum_count. Now, I need to divide the sID into 5 groups with the requirement that:
The sum of the sum_count column in each group should be as equal as ...
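One common heuristic for this kind of balancing is a greedy "largest first" assignment: sort the rows by `sum_count` in descending order and always give the next row to the group with the smallest running total. The sketch below is only an illustration under assumptions: the function name `balance_groups` and the sample values are made up, the input is taken to be a plain mapping of sID to sum_count, and the result is a good approximation rather than a guaranteed optimum.

```python
import heapq

def balance_groups(counts, k=5):
    """Greedy largest-first split of {sID: sum_count} into k groups (heuristic)."""
    # Min-heap of (running total, group index): the lightest group is always on top.
    heap = [(0, g) for g in range(k)]
    groups = [[] for _ in range(k)]
    # Place the largest values first so small ones can even out the totals later.
    for sid, value in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        total, g = heapq.heappop(heap)
        groups[g].append(sid)
        heapq.heappush(heap, (total + value, g))
    totals = [0] * k
    for total, g in heap:
        totals[g] = total
    return groups, totals

# Hypothetical sample data, not taken from the question.
sample = {f"s{i}": v for i, v in enumerate([50, 40, 30, 30, 20, 20, 10, 10, 5, 5])}
groups, totals = balance_groups(sample, k=5)
print(groups)
print(totals)
```

With this heuristic the gap between the heaviest and lightest group never exceeds the largest single `sum_count`, which is often close enough when the values are small relative to the group totals.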
1
vote
1
answer
91
views
Can I split names in a Google Sheet every time there is a new submission?
I am using the following script to split names with a Google Sheet that is receiving submissions from a Squarespace RSVP form.
function split() {
const DELIMITER = " ";
var ss = ...
0
votes
1
answer
56
views
Azure Blob CSV Appending Data Instead of Overwriting for Each Patient Processing using python
I have a Python script that processes person data and appends the results to an Azure Blob Storage CSV file. However, the issue is that for each new patient the generated csv is appending to the ...
3
votes
2
answers
94
views
How can I split this data into rows in a data frame with column names with pandas?
Each row of my data looks something like this:
8,0 0 1 0.000000000 8082 A WS 24664872 + 8 <- (8,2) 23604576
I'd like to split the data into columns like this:
col1 col2 col3 ...
0
votes
2
answers
88
views
How to extract data from rows collected from logs
Seeking help on how to extract data from rows of data similar to this
Raw Data
and convert data placement to this
Process Data
I'm having a problem extracting "Process X" and populating the ...
0
votes
1
answer
236
views
np.load fails with ValueError: cannot reshape array of size (838715,) into shape (838710,)
I'm trying to save the scaling parameters of a dataset into a .npy file on the disk, so I avoid having to recalculate them every time I re-run the code.
For now, I'm using MaxAbsScaler() from sklearn ...
0
votes
1
answer
68
views
How to monitor a combination of io sensors in a stateful manner
My data source emits IoT data with the following structure:
io_id,value,timestamp
232,1223,1718191205
321,671,1718191254
54,2313,1718191275
232,432,1718191315
321,983,1718191394
........
There are 2 ...
0
votes
1
answer
160
views
JSON Data Stored as Null Values in Delta Lake Table Using PySpark
I encountered an issue while trying to store JSON data as a Delta Lake table using PySpark and Delta Lake.
Here's my code:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,...
-4
votes
1
answer
64
views
Is there an Excel function to assign a binary result to a predefined data cell?
Sorry for the title, I know it is quite broad and not very informative. I am facing a problem regarding the analysis of a data set. The participants of my experiments were randomly assigned ...
1
vote
1
answer
168
views
VBA code for processing data in an Excel file crashes after processing about 400-500 rows
I have coded a VBA macro to process downloaded data. The data has some junk rows that need to be deleted, but also has some rows where the data is off by a couple of columns and a couple of rows. The ...
1
vote
0
answers
59
views
Pre-processing sequences for an LSTM model (sign language recognition)
I've been working on sign language recognition. I extracted landmarks with MediaPipe, saved them as .parquet files, then padded the data to create a uniform length. Each row of landmarks has 21 nodes with x, y, z ...
0
votes
0
answers
206
views
Fanout use case in Apache Flink
This is an opinion-based question:
My use case is real-time and needs to be able to process everything at sub-second speed.
I have an external MongoDB database which holds information about all of the users ...
0
votes
1
answer
77
views
In Node.js, data is not received as chunks (stream) in the consumer
I have two services, a producer and a consumer.
The producer has a big JSON file on the server. I want to serve it over the network through a REST API, and I used the Node.js stream technique to load bytes in memory ...
0
votes
1
answer
56
views
Need Excel Macro to duplicate row per "X" marked in column (VBA)
Need a macro that will help me process data where they add an X to mark which group the row belongs to. For example:
The data comes with many more columns but that's just the gist of it. They mark ...
-1
votes
1
answer
47
views
Is there a function in R to reduce the number of zeros of a percentage in the y axis of a bar chart? [closed]
I wrote the code
ggplot(data = summary_datas) +
geom_bar(mapping = aes(x=member_casual,fill=member_casual)) +
labs(title = "Rider Membership data", subtitle= "Difference in the ...
1
vote
1
answer
675
views
Writing Waveform data into CSV file in LabVIEW
I have a LabVIEW program which contains voltage, current and power data in the same waveform. I am planning to extract each of them one by one and put them into an array. Currently, I have extracted ...
1
vote
1
answer
88
views
How do I ensure that the user continues where they left off in the application in Flutter?
I made a book reading page with the Page_Flip widget, but when the user leaves the application and re-enters the book page, I want it to continue from where it left off. How can I keep the user's ...
1
vote
1
answer
131
views
How to synchronize datastreams in broadcastProcessFunction flink?
I'm using the Flink 1.81.1 API on Java 11 and I'm trying to use a BroadcastProcessFunction to filter a Products DataStream with an authorized-brand DataStream as broadcast.
So my first products Datastream ...
-1
votes
1
answer
63
views
Messy CSV auto header extractor [closed]
I have a bunch (100+) of CSV files. Each of them can have blank rows, or rows I don't need (some fuzz info like "Congrats, you all bla bla"). When reading in Pandas I need to specify which row ...
0
votes
1
answer
129
views
Read all rows with pd.read_csv
I'm using Python to read a file of 5,000,000 rows but currently it only reads 1,000,000 rows. The file is around 125 MB.
I'm using the pd.read_csv function but this only leads to reading 1,000,000 rows ...
1
vote
1
answer
168
views
VBA Vlookup Function from Multiple Workbooks
I am using a VLOOKUP function to bring over data from 4 different files into one sheet.
I want to place the vlookup results from the 1st file in Column 4, then the results of the 2nd file in Column 6 ...
-1
votes
2
answers
102
views
High performing dataframe join in Python
I have two data frames: one has a start date and an end date, and the second has just a date. Basically, one frame holds the groups and the other holds the child data. So I want to join all the dates which come ...
0
votes
0
answers
34
views
Solution for processing hierarchical structure with large number of leaf nodes in SQL
I'm working on a project which stores data for tree-structured models such as file systems.
In many cases the tree has a large number of leaves and an unknown depth.
My project is ...
1
vote
1
answer
64
views
A lightweight approach to processing Django Queryset data
I am looking for an optimal way to perform simple data processing on a Django QuerySet. I would like to avoid installing heavyweight libraries like Pandas or NumPy. The number of rows in ...
0
votes
1
answer
919
views
How to Optimize Memory Usage When Processing Large CSV Files in Python?
I am working on a Python script to process large CSV files (ranging from 2GB to 10GB) and am encountering significant memory usage issues. The script reads a CSV file, performs various transformations ...
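A frequent way to keep memory flat for files of this size is to stream rows and keep only running aggregates, rather than holding the whole file in a list or DataFrame. The snippet below is a minimal sketch using only the standard library; the helper names (`stream_rows`, `total_by_key`) and the `region`/`amount` columns are hypothetical and stand in for whatever transformations the real script performs.

```python
import csv
import io

def stream_rows(handle):
    """Yield one parsed row (as a dict) at a time instead of building a list."""
    yield from csv.DictReader(handle)

def total_by_key(handle, key, value):
    """Sum the `value` column per distinct `key`, holding only the totals in memory."""
    totals = {}
    for row in stream_rows(handle):
        totals[row[key]] = totals.get(row[key], 0.0) + float(row[value])
    return totals

# Small in-memory stand-in for a large file; a real run would pass open(path, newline="").
demo = io.StringIO("region,amount\nnorth,10\nsouth,5\nnorth,2.5\n")
result = total_by_key(demo, "region", "amount")
print(result)
```

Memory use here grows with the number of distinct keys, not with the number of rows, which is the property that matters for multi-gigabyte inputs.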
-1
votes
1
answer
124
views
Pandas read_fwf doesn't read the last digit of each row [closed]
I have a .rpt file that has two columns, like this:
A column B column
990.E-03 -2.73654E-03
995.E-03 -2.75347E-03
1. ...
0
votes
1
answer
58
views
How to convert an Excel sheet for data processing using pandas?
How can I convert this Excel sheet for data processing using pandas?
import pandas as pd
df = pd.read_excel(r"c:/Users/vpullabh/Desktop/Meraci.Ec-NGIOSD.xlsx", sheet_name=&...
4
votes
1
answer
295
views
What's the time complexity of forward filling and backward filling in spark?
My question: I need to understand the time complexity of dynamic forward filling and backward filling in Spark.
Hello, I have a Scala job that reads Delta Table A, transforms the DataFrame and writes to Delta ...
-1
votes
1
answer
2k
views
Excel in Large-Scale Data Processing with GPT-3.5 and Embeddings
I'm working on integrating OpenAI functionalities, specifically GPT3.5 and embeddings, into a large system of Excel workbooks used for almost anything in my office. Our goal is to have GPT3.5 take ...
1
vote
0
answers
89
views
Merging two files and expanding metadata efficiently
I'm dealing with a large file in which each row has CHR and POS values (which are positional coordinates).
I process this file using a tool, but it outputs only a subset of these positional coordinates ...
1
vote
1
answer
661
views
Google Cloud Dataflow Job failed: Found unexpected parameters
FAILED NOTE
When I set up a Dataflow Pipeline and created a Job from a template ('Text Files on Cloud Storage to BigQuery'), I ran into this problem.
Job creation failed: The workflow could not be created. ...
1
vote
0
answers
344
views
Contour detection based on 4-connectivity using `findContours()` from OpenCV
The findContours() function from the OpenCV library does not allow you to customize the selection of contours based on 4-connectivity. I checked on a test image: all the modes of this function that ...
1
vote
0
answers
75
views
Parsing nested JSON into R List
I have a pretty straightforward JSON object that I am trying to parse into a list of objects for downstream processing and use. The JSON structure is dynamic but here is an example of the structure I ...
1
vote
2
answers
101
views
Process python dictionary based on previous, current and next value
I have a python dictionary as follows:
ip_dict = {'GLArch': {'GLArch-0.png': ['OTHER', 'Figure 28 TAC '],
'GLArch-1.png': ['DCDFP', 'This insurance '],
'...
0
votes
1
answer
168
views
I don't quite understand a concept in Kimball's dimensional modeling
I have read through the idea of "Behavior Tag Time Series" several times but couldn't understand it.
Here is the explanation in the book, but it still doesn't make sense:
"Almost all text in a data ...
0
votes
2
answers
49
views
How to find two threes in a zip code using R
I need help with this task:
Print data for locations that have two threes in the address.zip code.
I tried:
filtered_data <- df %>%
filter(grepl("\\d{3}.*\\d{3}", address.zip))
...
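Note that the pattern in the excerpt, `\\d{3}.*\\d{3}`, looks for two runs of three digits, not for the digit 3 appearing twice. A minimal sketch of the distinction, written in Python with made-up zip codes and under the assumption that "two threes" means the character `3` occurring twice (either at least twice or exactly twice, since the task wording allows both readings):

```python
import re

# At least two occurrences of the digit 3 anywhere in the string.
AT_LEAST_TWO = re.compile(r"3.*3")
# Exactly two occurrences of the digit 3 (one possible reading of the task).
EXACTLY_TWO = re.compile(r"[^3]*3[^3]*3[^3]*")

# Hypothetical zip codes, not taken from the question's data.
zips = ["33101", "90210", "13034", "33333"]

at_least = [z for z in zips if AT_LEAST_TWO.search(z)]
exactly = [z for z in zips if EXACTLY_TWO.fullmatch(z)]
print(at_least)
print(exactly)
```

The same pattern strings should carry over to `grepl()` in R, with the exactly-two variant anchored as `^[^3]*3[^3]*3[^3]*$`.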
-2
votes
1
answer
239
views
How can I correct my Time Series LSTM RNN for Binary Classification favoring Class 0?
I am attempting to predict a binary outcome based on 15 continuous sequences (except one which isn't a continuous line, but still a sequence). The dataset contains 933k datapoints for all 15 features ...
0
votes
0
answers
43
views
Why does my valid data keep outputting to the wrong switch case?
My valid data (Records.txt) keeps outputting to the wrong case statement.
Records.txt:
AB12MP349 Fusion5 20 17000.00
33435KMOP324 BMW 40 25000.00
AB12MP349 Audi 100 4000.00
AB12MP349 Pagni 1 2000000....
2
votes
3
answers
68
views
Process the python dictionary to remove undesired elements and retain desired ones
I have a python dictionary as given below:
ip = {
"doc1.pdf": {
"img1.png": ("FP", "text1"),
"img2.png": ("NP", "...
1
vote
1
answer
56
views
dividing each sample by its maximum feature value separately, or dividing all samples by the maximum value across the entire dataset
I am trying to reproduce a paper that uses the tf-idf method. During the data preprocessing, there is a step that involves feature scaling. In the original paper, it says, "We restrict the words ...
1
vote
2
answers
836
views
Delta table partition folder name is getting changed
I am facing an issue where the expected date partition folder should be named in the format date=yyyymmdd, but instead it is written as -
Sometimes for each parquet file created in delta path, it's creating a ...
0
votes
0
answers
318
views
AWS IAM role chaining: session timeout needs to be more than 2 hours to run a job
I am working on data processing in which I have an EKS cluster in one account and do the processing in a second AWS account, so we are assuming an IAM role from one account to another and performing ...
0
votes
1
answer
62
views
How to convert data to a regular tabular dataset after Run Length Encoding (RLE) transform
I have observations that are formed using a Run Length Encoding transform,
for example:
set.seed(1)
make_data <- function() {
series <- rnorm(sample(10:50,1)) |> cumsum() |> sign()
...
1
vote
1
answer
31
views
How to convert tabular data correctly for objects with different lengths
I have data as objects like this
set.seed(1)
make_rle <- function() rnorm(10) |> cumsum() |> sign() |> accelerometry::rle2(indices = T)
X <- lapply(1:10, \(x) make_rle())
X
[[1]]
...
1
vote
3
answers
75
views
Remove rows that contain duplicated values in NumPy
I'm trying to find an efficient way to remove rows of a NumPy array that contain duplicated elements. For example, the array below:
[[1,2,3], [1,2,2], [2,2,2]]
should keep [[1,2,3]] only.
I know pandas ...
-1
votes
2
answers
838
views
How can I read and write CSV files and process the data into arrays in Java?
I am working on a Java project where I need to handle CSV files. Specifically, I need to read and write CSV files and process the data into arrays for further manipulation. I have researched different ...
0
votes
0
answers
107
views
In python, can I define a polynomial function with a user-defined power and coefficients, that I can reference for future calculations?
Some preface, I have been teaching myself python for the past few days for a project, with almost no history of coding beyond some dabbling with MATLAB, so I apologize if there is something very ...
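One plain-Python way to get a reusable polynomial is a factory function that captures the coefficients in a closure and returns a callable. This is a sketch only: the name `make_polynomial` is made up, and it assumes the coefficients are listed from the constant term upward, so the power is implied by the length of the list.

```python
def make_polynomial(coefficients):
    """Return p(x) for coefficients listed from the constant term upward."""
    coeffs = tuple(coefficients)  # freeze a copy so later edits to the list don't change p

    def p(x):
        # Horner's method: evaluate from the highest power down.
        result = 0
        for c in reversed(coeffs):
            result = result * x + c
        return result

    return p

# 1 + 0*x - 2*x**2 + 3*x**3
cubic = make_polynomial([1, 0, -2, 3])
print(cubic(2))
```

The returned function can be stored in a variable and called anywhere later, which covers the "reference for future calculations" part; user input would only need to be converted into the list of numbers first.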
1
vote
1
answer
96
views
How can a data processor pass the latest caching time to the FLUIDTEMPLATE?
The results from a data processor in the fluid template are cached.
My data processor determines a list of images and a maximum time until which the list can be cached. How do I forward this ...