
I am trying to extract the number that appears between the string "line_number:" and the first hyphen after it. I am struggling to write a regex/substring expression for this in PySpark. Below is my input data, in a column called "whole_text". Every row contains the string "line_number:" followed by the number and a hyphen. Is there a way to find the text "line_number:", locate the first hyphen after it, and extract the number in between?

The output should be 121, 3112, and so on, in a new column.

Please help.

text:ABC12637-XYZ  line_number:121-ABC:JJ11
header:3AXYZ166-LMN  line_number:3112-GHI:3A1

1 Answer

Some minimal example code would be useful to replicate your problem.

Here is how I'd solve this:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("""
text:ABC12637-XYZ  line_number:121-ABC:JJ11
header:3AXYZ166-LMN  line_number:3112-GHI:3A1
""",)], ['str'])

df.select("str", F.expr(r"regexp_extract_all(str, r'line_number:(\d+)-', 1)").alias('extracted')).show()

Which produces:

+--------------------+-----------+
|                 str|  extracted|
+--------------------+-----------+
|\ntext:ABC12637-X...|[121, 3112]|
+--------------------+-----------+

Update:

df.withColumn('extracted_regex', F.expr(r"regexp_extract_all(str, r'line_number:(\d+)-', 1)")).show()
+--------------------+---------------+
|                 str|extracted_regex|
+--------------------+---------------+
|\ntext:ABC12637-X...|    [121, 3112]|
+--------------------+---------------+

Using Python 3.12 and Spark 3.5

>>> spark.version
'3.5.0'

2 Comments

I am getting the below error when trying to run it:

Literals of type 'R' are currently not supported.(line 1, pos 32)

== SQL ==
regexp_extract_all(detail_text, r'line_number:(\d+)-', 1)
--------------------------------^^^

This is my code: df.withColumn('extracted_regex', F.expr(r"regexp_extract_all(detail_text, r'line_number:(\d+)-', 1)"))
It still works for me even when I use df.withColumn (I updated the answer to include this).
