I'm fairly new to coding and can't work out what I'm doing wrong. This function takes text extracted from PDFs and starts collecting text only after a certain "search phrase" is found. It then collects two-digit numbers followed by a period, a comma, or neither, and uses the first one to determine the rest of the sequence.
Raw text example:
"blah blah search phrase 40. blah blah \nblah \n41. blah blah 44. blah \nblah blah stop keyword blah blah blah"
And I would want the output to be:
"special feature 40.": "blah blah blah"
"special feature 41.": "blah blah"
"special feature 44.": "blah blah blah"
And since I would expect the last special feature to contain the rest of the text, I've set up stop keywords that cut off the last special feature once one of them is reached (or cuts it off at 150 characters if it doesn't find one).
The stop keywords work well for the last special feature, but I can't figure out how to handle multi-line special features before it: I only get the first line. Most of the two-digit numbers begin on a new line, but I'm also struggling with those that appear in the middle of a line, particularly when they come after a different number that isn't in the sequence. For example:
\n19 blah blah 44. blah
Hopefully this makes sense. It's a mess but I'm desperate. Thank you!
def organize(text, search_phrase, text_found):
    features = {}
    if text:
        lines = text.splitlines()
        collecting = False
        last_number = None
        buffer = []
        feature_sequence_ended = False
        pattern = re.compile(r'(?:^|\s)(\d{2})[.,]?\s+(.*)', re.IGNORECASE)
        stop_keywords = [not, going, to, list, them, all ,out]
        def find_stop_keyword(line):
            line_lower = line.lower()
            positions = [
                line_lower.find(keyword)
                for keyword in stop_keywords
                if keyword in line_lower
            ]
            return min(positions) if positions else None
        def is_feature_line(line):
            return bool(pattern.search(line.strip()))
        i = 0
        while i < len(lines):
            line = lines[i].strip()
            if not collecting:
                for phrase in search_phrase:
                    if phrase in line.lower():
                        collecting = True
                        text_found = True
                        print(f'Search phrase "{phrase}" found')
                        break
                i += 1
                continue
            matches = list(pattern.finditer(line))
            if matches:
                for match in matches:
                    current_number = int(match.group(1))
                    content = match.group(2).strip()
                    if last_number is None or (current_number > last_number and current_number <= last_number + 5):
                        if last_number is not None and buffer:
                            full_text = " ".join(buffer).strip()
                            features[f"special feature {last_number:02}"] = full_text[:150]
                        last_number = current_number
                        buffer = [content]
                    else:
                        buffer.append(content)
            else:
                if last_number is not None and not feature_sequence_ended and not is_feature_line(line):
                    lookahead_range = 6
                    found_stop = False
                    found_future_feature = False
                    stop_index = None
                    for j in range(1, lookahead_range + 1):
                        if i + j < len(lines):
                            future_line = lines[i+j].strip()
                            if is_feature_line(future_line):
                                found_future_feature = True
                                break
                            if find_stop_keyword(future_line):
                                found_stop = True
                                stop_index = i+j
                                break
                    if found_stop and not found_future_feature:
                        if stop_index is not None and stop_index < len(lines):
                            stop_line = lines[stop_index].strip()
                            print(f"Stop keyword found after last feature: '{stop_line}'")
                        if buffer and last_number is not None:
                            full_text = ' '.join(buffer).strip()
                            features[f"special feature {last_number:02}"] = full_text[:150]
                        feature_sequence_ended = True
                        break
                    else:
                        buffer.append(line)
            i += 1
        if last_number is not None and buffer:
            full_text = " ".join(buffer).strip()
            features[f"special feature {last_number:02}"] = full_text[:150]
    return features, text_found
stop_keywords = [not, going, to, list, them, all ,out] should have each stop keyword wrapped in quotes, like 'keyword'. Without the quotes Python reads them as names and keywords rather than strings, so the line will not run. Also, when working with regular expressions (your re.compile(r'(?:^|\s)(\d{2})[.,]?\s+(.*)', re.IGNORECASE) call is compiling a regex, or regular expression), I recommend playing around on regexr.com. On that site you can paste in some test data and see what your expressions match.
Use print() (and print(type(...)), print(len(...)), etc.) to see which part of the code is executed and what you really have in your variables. This is called "print debugging", and it helps you see what the code is really doing.
It would also help to post minimal working code with example data, so we could simply copy and test it.
40., 41., 44., etc.
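One thing worth checking with print debugging: the (.*) at the end of the pattern is greedy, so a match that starts at "19" swallows the rest of the line and "44." is never seen as a separate match. A possible way around both the mid-line and the multi-line problem is to stop working line by line and scan the whole text for numbers instead. The following is only a rough sketch under some assumptions, not a drop-in replacement: split_features, NUMBER, max_gap and max_len are made-up names, search_phrase is taken to be a single string rather than a list, and the keys have no trailing period (as in your organize function).

```python
import re

# A two-digit number standing on its own, optionally followed by "." or ",".
# There is no trailing (.*), so every number on a line is found separately.
NUMBER = re.compile(r"(?<!\S)(\d{2})[.,]?(?=\s)")


def split_features(text, search_phrase, stop_keywords, max_gap=5, max_len=150):
    """Hypothetical helper: map "special feature NN" to the text after number NN."""
    found = re.search(re.escape(search_phrase), text, re.IGNORECASE)
    if not found:
        return {}

    # Keep only numbers that continue the sequence started by the first one.
    markers, last = [], None
    for m in NUMBER.finditer(text, found.end()):
        n = int(m.group(1))
        if last is None or last < n <= last + max_gap:
            markers.append(m)
            last = n

    stop = None
    if stop_keywords:
        stop = re.compile("|".join(re.escape(k) for k in stop_keywords), re.IGNORECASE)

    features = {}
    for idx, m in enumerate(markers):
        if idx + 1 < len(markers):
            # A feature runs until the next accepted number, across line breaks.
            body = text[m.end():markers[idx + 1].start()]
            limit = None
        else:
            # The last feature runs to the first stop keyword, else max_len chars.
            hit = stop.search(text, m.end()) if stop else None
            body = text[m.end():hit.start()] if hit else text[m.end():]
            limit = None if hit else max_len
        # Collapse newlines and repeated spaces into single spaces.
        features[f"special feature {m.group(1)}"] = " ".join(body.split())[:limit]
    return features


sample = ("blah blah search phrase 40. blah blah \nblah \n41. blah blah 44. blah "
          "\nblah blah stop keyword blah blah blah")
for key, value in split_features(sample, "search phrase", ["stop keyword"]).items():
    print(f"{key!r}: {value!r}")
```

Because a number is only accepted when it is greater than the previous one and at most max_gap above it, the stray "19" in your second example stays inside the text of the feature before it. The same rule is also the weak spot: any ordinary two-digit number in the text that happens to fall in that range would be mistaken for the next feature.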