
I am trying to extract the number that appears between the string "line_number:" and the first hyphen after it. I am struggling to write a regex/substring expression for this in PySpark. Below is my input data, in a column called "whole_text". Every row contains the string "line_number:" followed by the number and a hyphen. Is there a way to find the text "line_number:", locate the first hyphen after it, and extract the number in between?

The output should be 121, 3112, and so on, in a new column.

Please help.

text:ABC12637-XYZ  line_number:121-ABC:JJ11
header:3AXYZ166-LMN  line_number:3112-GHI:3A1

1 Answer

Some minimal example code would be useful to replicate your problem.

Here is how I'd solve this:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("""
text:ABC12637-XYZ  line_number:121-ABC:JJ11
header:3AXYZ166-LMN  line_number:3112-GHI:3A1
""",)], ['str'])

df.select("str", F.expr(r"regexp_extract_all(str, r'line_number:(\d+)-', 1)").alias('extracted')).show()

Which produces:

+--------------------+-----------+
|                 str|  extracted|
+--------------------+-----------+
|\ntext:ABC12637-X...|[121, 3112]|
+--------------------+-----------+

Update:

df.withColumn('extracted_regex', F.expr(r"regexp_extract_all(str, r'line_number:(\d+)-', 1)")).show()
+--------------------+---------------+
|                 str|extracted_regex|
+--------------------+---------------+
|\ntext:ABC12637-X...|    [121, 3112]|
+--------------------+---------------+

Using Python 3.12 and Spark 3.5

>>> spark.version
'3.5.0'

2 Comments

I am getting the below error when trying to run it:

Literals of type 'R' are currently not supported.(line 1, pos 32)

== SQL ==
regexp_extract_all(detail_text, r'line_number:(\d+)-', 1)
--------------------------------^^^

This is my code: df.withColumn('extracted_regex', F.expr(r"regexp_extract_all(detail_text, r'line_number:(\d+)-', 1)"))
It still works for me even when I use df.withColumn (I updated the answer to include this).
