0 votes
1 answer
252 views

I am facing the error below while running the given piece of Spark code in my local PyCharm Community Edition, and the Spark session is not getting created. I have set up all my local environment ...
asked by Node98 (27)
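A minimal sketch of what usually unblocks local session creation in PyCharm, assuming the common culprit of Java/Hadoop paths not being visible to the IDE; all paths below are hypothetical and machine-specific:

```python
import os
from pyspark.sql import SparkSession

os.environ.setdefault("JAVA_HOME", r"C:\Program Files\Java\jdk-11")  # hypothetical JDK path
os.environ.setdefault("HADOOP_HOME", r"C:\hadoop")  # winutils location, needed on Windows

spark = (
    SparkSession.builder
    .master("local[*]")   # run locally on all cores; no cluster needed
    .appName("local-test")
    .getOrCreate()
)
print(spark.version)      # confirms the session actually started
```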
0 votes
2 answers
72 views

I'm unable to send a PySpark DataFrame as an Excel attachment. I'm able to do it easily with a CSV file using the following: email.add_attachment(df.toPandas().to_csv(index=False).encode('utf-8'), maintype='...
asked by Jim Macaulay (5,251)
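A hedged sketch mirroring the CSV approach from the question: write the frame to an in-memory .xlsx (requires openpyxl) and attach the bytes. `df` and `email` are the objects from the question; the filename is a stand-in:

```python
import io

buf = io.BytesIO()
df.toPandas().to_excel(buf, index=False, engine="openpyxl")  # df is the Spark DataFrame
buf.seek(0)

email.add_attachment(
    buf.read(),
    maintype="application",
    subtype="vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    filename="report.xlsx",  # hypothetical name
)
```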
0 votes
2 answers
122 views

I'm working with a third-party dataset that includes location data. I'm trying to extract the longitude and latitude coordinates from the location column. As stated in their docs: The location column ...
asked by MyNameHere
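The excerpt cuts off before the documented format, so here is a sketch assuming WKT-style text such as POINT(-73.98 40.75); adjust the pattern to the dataset's actual format:

```python
from pyspark.sql import functions as F

pattern_lon = r"POINT\(([-0-9.]+) [-0-9.]+\)"   # group 1 = first coordinate
pattern_lat = r"POINT\([-0-9.]+ ([-0-9.]+)\)"   # group 1 = second coordinate

df = (
    df.withColumn("longitude", F.regexp_extract("location", pattern_lon, 1).cast("double"))
      .withColumn("latitude", F.regexp_extract("location", pattern_lat, 1).cast("double"))
)
```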
0 votes
1 answer
53 views

I have two PySpark DataFrames and need to compare them column-wise, appending the result next to them. DF1: Claim_number Claim_Status 1001 Closed 1002 In Progress 1003 open Df2: ...
asked by Srinivasan
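A minimal sketch, assuming Claim_number is the join key and a case-insensitive equality check on Claim_Status (given the "Closed"/"open" mix in the sample):

```python
from pyspark.sql import functions as F

result = (
    df1.alias("a")
       .join(df2.alias("b"), "Claim_number")
       .withColumn(
           "status_match",
           F.lower(F.col("a.Claim_Status")) == F.lower(F.col("b.Claim_Status")),
       )
)
```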
0 votes
1 answer
64 views

I want to open Excel files in Azure Databricks that reside in ADLS2 with this code: #%pip install openpyxl pandas import pandas as pd display(dbutils.fs.ls("/mnt/myMnt")) path = "/mnt/...
asked by Prefect73 (341)
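A sketch of the usual fix: plain pandas cannot open dbfs:/ or /mnt/ paths directly, but Databricks exposes mounts under the local /dbfs filesystem. The filename is hypothetical:

```python
import pandas as pd

path = "/dbfs/mnt/myMnt/myfile.xlsx"  # note the /dbfs prefix; the filename is a stand-in
pdf = pd.read_excel(path, engine="openpyxl")
```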
1 vote
0 answers
65 views

I have multiple UDFs in a codebase I inherited. Is there any way to remove them and reimplement the logic without UDFs? I'm running on 1.3B rows, so every bit helps. I considered using apply on a function, but ...
asked by user3480774
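The general technique, sketched on a hypothetical UDF: re-express the row-at-a-time Python logic as built-in column expressions, which stay in the JVM and skip per-row Python serialization:

```python
from pyspark.sql import functions as F

# Hypothetical inherited UDF:
# @F.udf("string")
# def clean(s):
#     return s.strip().upper() if s else None
# df = df.withColumn("clean", clean("raw"))

# Equivalent built-in version -- no Python round-trip per row:
df = df.withColumn("clean", F.upper(F.trim(F.col("raw"))))
```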
1 vote
0 answers
69 views

I'm trying to do resampling using the pandas API on Spark (Databricks Runtime 15.4, Spark 3.5.0): import numpy as np from datetime import datetime import pandas as pd import pyspark.pandas as ps dates ...
asked by Mariusz Jarczak
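For reference, a minimal resample that should work on the pandas API on Spark (resample support landed in Spark 3.4, so it is present on that runtime); the data below is synthetic:

```python
import pandas as pd
import pyspark.pandas as ps

dates = pd.date_range("2024-01-01", periods=120, freq="30s")
psdf = ps.DataFrame({"value": list(range(120))}, index=dates)

per_minute = psdf.resample("1min").mean()  # requires a DatetimeIndex
print(per_minute.head())
```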
1 vote
0 answers
43 views

Here is its source code. PySpark has a wrapper layer that allows using the pandas API on Spark. Pandas computes on a single processing unit and relies on indices. Spark, however, does not use the concept ...
asked by klenium (2,637)
0 votes
0 answers
34 views

I am using the pandas API on Spark for some data preprocessing files that were initially in pandas. I am seeing that date operations are very slow and some are not compatible at all. For example, I cannot ...
asked by Chaitanya Kulkarni
0 votes
1 answer
147 views

I am getting the following error when using pyspark.pandas: PandasNotImplementedError: The method pd.Series.__iter__() is not implemented. If you want to collect your data as an NumPy array, use '...
asked by marjun (736)
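That error is raised whenever a pandas-on-Spark Series is iterated (a for-loop, list(s), or passing it to plain-Python code). A sketch of the usual workarounds:

```python
import pyspark.pandas as ps

s = ps.Series([1, 2, 3])

arr = s.to_numpy()  # collects to the driver as a NumPy array
lst = s.tolist()    # or a plain list -- only safe for driver-sized data
```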
0 votes
0 answers
178 views

I'm trying to do something straightforward in pandas: take some time series data and resample it to one-minute intervals. But I'm running into a variety of issues from Spark (I'm new to PySpark, so be kind ;)) ...
asked by Zach (1,421)
1 vote
0 answers
36 views

So basically I have created a PySpark function named segmentation that performs cumulative sum calculation, handles outliers, finds the maximum cumulative sum, calculates a decile, and updates a ...
asked by DEVEN MALI
0 votes
1 answer
324 views

I am trying to extract the number between the string "line_number:" and a hyphen. I am struggling to write a regex/substring for this in PySpark. Below is my input data in a column called "...
asked by Rohit Kadam
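A sketch using regexp_extract, assuming text like "... line_number: 123 - ..."; the source column name is a stand-in:

```python
from pyspark.sql import functions as F

df = df.withColumn(
    "line_number",
    F.regexp_extract(F.col("text_col"), r"line_number:\s*(\d+)\s*-", 1),  # group 1 = the digits
)
```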
0 votes
2 answers
84 views

I want to convert a numeric column resembling a timedelta in seconds to a ps.TimedeltaIndex (for the purpose of later resampling the dataset): import pyspark.pandas as ps df = ps.DataFrame({"...
asked by ascripter (6,315)
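A minimal sketch with ps.to_timedelta (the pandas-on-Spark counterpart of pandas.to_timedelta, available in recent Spark releases); the column name "seconds" is illustrative:

```python
import pyspark.pandas as ps

df = ps.DataFrame({"seconds": [30, 90, 150]})
df["delta"] = ps.to_timedelta(df["seconds"], unit="s")
df = df.set_index("delta")  # yields a TimedeltaIndex, ready for resampling
```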
0 votes
1 answer
997 views

from pyspark.sql import SparkSession from pyspark import SparkContext, SparkConf from pyspark.storagelevel import StorageLevel spark = SparkSession.builder.appName('TEST').config('spark.ui.port','...
asked by Anand Reddy
1 vote
1 answer
565 views

I asked a fairly similar yet different question and got a good response here: Groupby and percentage distributions pyspark equivalent of given pandas code. I am not sure how to tailor the modification ...
asked by bernando_vialli
0 votes
1 answer
848 views

I was able to read the Excel file data using the Crealytics library spark-excel_2.12-3.4.1_0.19.0, but was not able to execute the same code using the latest version spark-excel_2.12-3.5.0_0.20.1. I ...
asked by Ramesh Bathini
0 votes
1 answer
123 views

data_2['col1'] = np.where((df1.year.astype(int) == 2021) & (df1.col1_y.notna()), df1.col1_y, data_2.col1) This is my original working code in Gen1, but it is giving the following error in Gen2. ...
asked by DigiLearner
0 votes
0 answers
126 views

I'm trying to read this file: abfss://<container>@<account>.dfs.core.windows.net/perils_database/Bushfire AUSTRALIA - PERILS Industry Exposure Database 2023.xlsx using this Python code: import pyspark....
asked by TheRealJimShady
0 votes
1 answer
2k views

I have about 30 Excel files that I want to read into Spark dataframes, probably using pyspark.pandas. I am trying to read them like this: import pyspark.pandas as ps my_files = ps.DataFrame(dbutils.fs....
asked by TheRealJimShady
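A hedged sketch of one way to do it: ps.read_excel reads a single workbook, so list the directory, read each file, and concatenate. The mount path is an assumption:

```python
import pyspark.pandas as ps

paths = [f.path for f in dbutils.fs.ls("/mnt/myMnt/excel/") if f.path.endswith(".xlsx")]
psdf = ps.concat([ps.read_excel(p) for p in paths], ignore_index=True)
```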
1 vote
1 answer
84 views

I am using pandas on Spark. I need to group by A and B and then aggregate to return a list of maps where the keys are C and the values are D. Sample input: A B C D 0 7 ...
asked by WorkInProgress
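A sketch in plain PySpark (a pandas-on-Spark frame converts via to_spark()): collect (C, D) pairs per group and turn them into a map:

```python
from pyspark.sql import functions as F

result = (
    df.groupBy("A", "B")
      .agg(F.map_from_entries(F.collect_list(F.struct("C", "D"))).alias("c_to_d"))
)
```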
2 votes
1 answer
53 views

I have a dataframe which contains Company name, EmpId, Bonus & Salary. COMPANY EMPID BONUS SALARY APPLE 1234 No 5 APPLE 1235 No 7 GOOGLE 6786 Yes 6 GOOGLE 6787 No 5 GOOGLE 6788 No 6 TARGET 9091 ...
asked by sanju (49)
-1 votes
1 answer
122 views

I have DataFrames like the ones below. I tried the join and isin functions but am not getting the expected output shown below; not sure what I am missing. I'd appreciate it if someone could help. Thanks. DF1: Name Grade Tom A ...
asked by sanju (49)
1 vote
0 answers
88 views

To solve equations of the nature below with pandas-on-Spark, I created a class named "DiffEqns": x[k+1] = v[k] * y[k] + x[k], y[k+1] = 0.01 * z[k] + y[k], where k ranges from 0 to N-1, x[...
asked by lord_mendonca
0 votes
1 answer
75 views

We have code that handles customer orders from a retail website; the volume of data is pretty large and is getting bigger as the days go by. The tricky part is that this data is in a different ...
asked by Trodenn (17)
0 votes
1 answer
52 views

I have a Spark DataFrame df: vehicle_coalesce vehicleNumber productionNumber pin checkDate V123 V123 P123 null 27/08/2023 01:03 P123 null ...
asked by karthik kk
0 votes
1 answer
59 views

I have a Spark DataFrame: vehicle_coalesce vehicleNumber productionNumber pin checkDate V123 V123 P123 null 27/08/2023 01:03 P123 null ...
asked by karthik kk
0 votes
2 answers
364 views

I have an Excel sheet of formulas that I need to convert into PySpark code, considering columns A, B, C, D, E, F, G, H and I, where columns F, G, H and I have fixed random numeric values. Column A has ...
asked by prince13i
0 votes
1 answer
39 views

All values are strings: first_name_l first_name_r last_name_l last_name_r dob_l dob_r city_l city_r average_score matched_columns lookup_l_list lookup_r_list robert robert null allen 1971-06-24 1971-...
asked by mahak tirole
0 votes
1 answer
316 views

Any suggestions, help, or references are most welcome for the problem statement below. I am performing big data analysis on data that is currently stored on Azure. The actual implementation is ...
asked by lord_mendonca
0 votes
1 answer
414 views

I have a question on the best way to implement the following problem. I have an LGBM model on my driver. I need to run this model against a very large dataset distributed over the executors. In ...
asked by Vinícius Matheus Olivieri
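One common pattern for this, sketched with hedging: broadcast the driver-side model and score each partition with mapInPandas, so the executors predict in parallel. The feature column names (f1, f2, f3) and the variable names model/df are assumptions:

```python
from pyspark.sql.types import DoubleType, StructField, StructType

bc_model = spark.sparkContext.broadcast(model)  # `model`: the fitted LGBM on the driver

def predict(batches):
    for pdf in batches:  # pdf: a pandas DataFrame chunk of one partition
        pdf["prediction"] = bc_model.value.predict(pdf[["f1", "f2", "f3"]])
        yield pdf

out_schema = StructType(df.schema.fields + [StructField("prediction", DoubleType())])
scored = df.mapInPandas(predict, schema=out_schema)
```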
0 votes
2 answers
46 views

This is the PySpark DataFrame, and the schema of the DataFrame; just two rows. Then I want to convert it to a pandas DataFrame, but it gets stuck at stage 3: no result, and no information about the ...
asked by Sparrow Jack
0 votes
2 answers
613 views

I'm having a little difficulty further manipulating a PySpark pivot table to give me a reduced result. My data is a little more complex than the example below, but it's the best example I can come ...
asked by zenith7 (201)
1 vote
1 answer
67 views

I have a data frame like this: col1 col2 100 3 200 2 300 4 400 1. Now I want the median of col1 such that the col2 values act as the weights for each col1 value, like ...
asked by Kallol (2,189)
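A hedged sketch of a weighted median: repeat each col1 value col2 times, then take the 50th percentile. This is only reasonable when the weights are small integers, as in the sample:

```python
from pyspark.sql import functions as F

expanded = df.select(F.explode(F.expr("array_repeat(col1, int(col2))")).alias("col1"))
weighted_median = expanded.agg(F.percentile_approx("col1", 0.5)).first()[0]
```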
1 vote
1 answer
2k views

df = spark.table("data").limit(100) df = df.toPandas() This conversion using .toPandas works just fine, as df.limit is just a few rows. If I get rid of the limit and do toPandas on the whole df, ...
asked by dhk02 (11)
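For context, a sketch of the usual mitigations: toPandas() always materializes the full table on the driver, so the options are to speed up the transfer with Arrow or to avoid the full collect entirely. The output path is hypothetical:

```python
# Arrow speeds up the conversion but does not remove the driver-memory limit.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

pdf = spark.table("data").toPandas()  # still collects every row onto the driver

# If the table simply doesn't fit in driver memory, keep it distributed instead:
spark.table("data").write.mode("overwrite").parquet("/tmp/data_parquet")  # hypothetical path
```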
1 vote
1 answer
579 views

I'm trying to deploy Spark models (sparkxgbregressor, rfregressor) in Databricks. Is model inferencing available ONLY for scikit-learn models? If yes, is there any other way to deploy Spark models ...
asked by Rayzee (13)
0 votes
1 answer
547 views

My input is a parquet file that I need to recast as below: df=spark.read.parquet("input.parquet") psdf=df.to_pandas_on_spark() psdf['reCasted'] = psdf['col1'].astype('float64') psdf['reCasted'...
asked by user2531569
0 votes
1 answer
237 views

I have a PySpark DataFrame that goes through multiple groupby- and pivot-style transformations, after which I get a final DataFrame. Writing the df back as parquet by ...
asked by Raja Sabarish PV
0 votes
2 answers
350 views

Suppose a given dataframe: Model Color Car Red Car Red Car Blue Truck Red Truck Blue Truck Yellow SUV Blue SUV Blue Car Blue Car Yellow I want to add color columns that keep a count of each color ...
asked by jay-elliot
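A minimal sketch with groupBy + pivot, which yields one count column per color:

```python
counts = (
    df.groupBy("Model")
      .pivot("Color", ["Red", "Blue", "Yellow"])  # naming the values skips a discovery pass
      .count()
      .na.fill(0)  # models with no rows of a color get 0 instead of null
)
```

If the counts need to sit next to the original rows rather than one row per model, join `counts` back to the original frame on Model.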
0 votes
1 answer
894 views

I am trying to merge multiple part files into a single file. In the staging folder it iterates over all the files; the schema is the same. We convert the part files to .Tab files. Files are generated based on ...
asked by KIRAN KUMAR
0 votes
1 answer
194 views

I'm trying to write a filter_words function as a pandas_udf. Here are the functions I am using: @udf_annotator(returnType=ArrayType(StructType([StructField("position", IntegerType(), True), ...
asked by Rory (383)
0 votes
1 answer
162 views

I've got this function: def ead(lista): ind_mmff, isdebala, isfubala, k1, k2, ead = lista try: isdebala = float(isdebala) isfubala = float(isfubala) k1 = float(k1) ...
asked by JMP (38)
-1 votes
1 answer
785 views

What is the best way to store streaming data from different stock exchanges in order to minimise the data footprint? Right now I'm using the CCXT library in Python, and in order to get the current order book ...
asked by Ruslan Kirsanov
0 votes
1 answer
370 views

I have a PySpark DataFrame whose column names are unique IDs generated by the UUID library, so I cannot query using column names. Each row in this PySpark DataFrame has one "non-null value...
asked by pscodes (11)
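A sketch that sidesteps the unknown column names entirely, assuming exactly one non-null value per row and compatible column types:

```python
from pyspark.sql import functions as F

# coalesce returns the first non-null among its arguments, so feeding it every
# column surfaces the single populated value per row.
df = df.withColumn("value", F.coalesce(*[F.col(c) for c in df.columns]))
```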
2 votes
1 answer
337 views

I am using pandas-on-Spark in combination with regex to remove some abbreviations from a column in a DataFrame. In pandas this all works fine, but I have been tasked with migrating this code to a production ...
asked by Psychotechnopath
0 votes
1 answer
89 views

Can anybody explain the following behavior? import pyspark.pandas as ps loan_information = ps.read_sql_query([blah]) loan_information.shape #748834, 84 loan_information.apply(lambda col: col.shape) ...
asked by Cody Dance
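A plausible explanation, offered with hedging: pandas-on-Spark evaluates apply on a small sampled prefix first to infer the return type, governed by the compute.shortcut_limit option (default 1000), so shapes observed inside the lambda may reflect that sample rather than all 748,834 rows:

```python
import pyspark.pandas as ps

# The sampling threshold used for schema inference in apply()-style operations.
print(ps.get_option("compute.shortcut_limit"))  # 1000 by default
ps.set_option("compute.shortcut_limit", 5000)   # enlarge the inference sample if needed
```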
1 vote
1 answer
52 views

Please help me solve this issue, as I am still new to Python/PySpark. I want to write a loop that adds dates in multiples of 7 days, 13 times, in the same column. I have a master table like this: id ...
asked by rezha nanda
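A hedged sketch of the loop, assuming a date column named start_date: build one frame per weekly offset with date_add and union them:

```python
from functools import reduce
from pyspark.sql import functions as F

frames = [
    df.withColumn("new_date", F.date_add(F.col("start_date"), 7 * i))  # +7, +14, ... +91 days
      .withColumn("week", F.lit(i))
    for i in range(1, 14)  # 13 iterations
]
result = reduce(lambda a, b: a.unionByName(b), frames)
```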
4 votes
1 answer
1k views

I am using pandas-on-Spark in combination with regex to remove some abbreviations from a column in a DataFrame. In pandas this all works fine, but I have been tasked with migrating this code to a production ...
asked by Psychotechnopath
0 votes
1 answer
319 views

I am facing a weird issue with pandas-on-Spark. I am trying to use regex to replace abbreviations with their full counterparts. The function I am using is the following (simplified a bit): def ...
asked by Psychotechnopath
0 votes
1 answer
1k views

I have a CSV file in which I am getting double quotes in a column. While reading and writing, I have to remove those quotes. Please guide me on how I can do it. Example df: col1 "xyznm""...
asked by alka (81)
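A sketch of the common fix: declare '"' as both the quote and the escape character when reading, then strip any leftovers with regexp_replace. The path and column name follow the example:

```python
from pyspark.sql import functions as F

df = (
    spark.read.option("header", "true")
         .option("quote", '"')
         .option("escape", '"')  # handles doubled quotes like "xyznm""
         .csv("input.csv")       # hypothetical path
)
df = df.withColumn("col1", F.regexp_replace("col1", '"', ""))
```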