I'm fairly new to coding and can't work out what I'm doing wrong. This function takes text extracted from PDFs and starts collecting text only after a certain "search phrase" is found. It then collects instances of two-digit numbers followed by a period, a comma, or neither, and uses the first one to determine the rest of the sequence.

raw text example:

"blah blah search phrase 40. blah blah \nblah \n41. blah blah 44. blah \nblah blah stop keyword blah blah blah"

And I would want the output to be:

"special feature 40.": "blah blah blah"
"special feature 41.": "blah blah"
"special feature 44.": "blah blah blah"

Since the last special feature would otherwise contain the rest of the text, I've set up stop keywords that cut it off once one of them is reached (or at 150 characters if none is found).

The stop keywords are working well for the last special feature, but I can't figure out how to handle multi-line special features before then. I can only get the first line. Most of the two-digit numbers begin on a new line, but I'm also struggling with those that appear in the middle of a line, particularly when they come after a different number that's not in the sequence. For example:

\n19 blah blah 44. blah

Hopefully this makes sense. It's a mess but I'm desperate. Thank you!

import re

def organize(text, search_phrase, text_found):
    features = {}
    if text:
        lines = text.splitlines()
        collecting = False
        last_number = None
        buffer = []
        feature_sequence_ended = False
        pattern = re.compile(r'(?:^|\s)(\d{2})[.,]?\s+(.*)', re.IGNORECASE)
        stop_keywords = [not, going, to, list, them, all ,out]

        def find_stop_keyword(line):
            line_lower = line.lower()
            positions = [
                line_lower.find(keyword)
                for keyword in stop_keywords
                if keyword in line_lower
            ]
            return min(positions) if positions else None

        def is_feature_line(line):
            return bool(pattern.search(line.strip()))

        i = 0
        while i < len(lines):
            line = lines[i].strip()

            if not collecting:
                for phrase in search_phrase:
                    if phrase in line.lower():
                        collecting = True
                        text_found = True
                        print(f'Search phrase "{phrase}" found')
                        break
                i += 1
                continue

            matches = list(pattern.finditer(line))
            if matches:
                for match in matches:
                    current_number = int(match.group(1))
                    content = match.group(2).strip()

                    if last_number is None or (current_number > last_number and current_number <= last_number + 5):
                        if last_number is not None and buffer:
                            full_text = " ".join(buffer).strip()
                            features[f"special feature {last_number:02}"] = full_text[:150]

                        last_number = current_number
                        buffer = [content]

                    else:
                        buffer.append(content)
            else:
                if last_number is not None and not feature_sequence_ended and not is_feature_line(line):
                    lookahead_range = 6
                    found_stop = False
                    found_future_feature = False
                    stop_index = None

                    for j in range(1, lookahead_range + 1):
                        if i + j < len(lines):
                            future_line = lines[i+j].strip()
                            if is_feature_line(future_line):
                                found_future_feature = True
                                break
                            if find_stop_keyword(future_line):
                                found_stop = True
                                stop_index = i+j
                                break

                    if found_stop and not found_future_feature:
                        if stop_index is not None and stop_index < len(lines):
                            stop_line = lines[stop_index].strip()
                            print(f"Stop keyword found after last feature: '{stop_line}'")
                        if buffer and last_number is not None:
                            full_text = ' '.join(buffer).strip()
                            features[f"special feature {last_number:02}"] = full_text[:150]
                        feature_sequence_ended = True
                        break
    
                else:
                    buffer.append(line)
            i += 1
                    
        if last_number is not None and buffer:
            full_text = " ".join(buffer).strip()
            features[f"special feature {last_number:02}"] = full_text[:150]

    return features, text_found
  • I haven't worked through the full code yet, but I'm pretty sure the keywords in your line stop_keywords = [not, going, to, list, them, all ,out] need to be quoted as strings, like 'keyword'. Also, when working with regular expressions (your re.compile(r'(?:^|\s)(\d{2})[.,]?\s+(.*)', re.IGNORECASE) prepares a regex, or regular expression), I recommend experimenting on regexr.com. There you can paste in some test data and see what your expressions do. Commented Sep 30 at 6:08
  • Maybe first use print() (and print(type(...)), print(len(...)), etc.) to see which parts of the code are executed and what you really have in your variables. This is called "print debugging" and it helps you see what the code is really doing. Commented Sep 30 at 15:04
  • You could create minimal working code with example data, so we could simply copy and test it. Commented Sep 30 at 15:04
  • Maybe you shouldn't split into lines but split on 40. 41. 44., etc. Commented Sep 30 at 15:10
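
The last suggestion (splitting on the numbers rather than on lines) could look roughly like the sketch below. It assumes a feature marker is a standalone two-digit number with an optional trailing period or comma. The name split_on_numbers and the sample string are made up for illustration. It does not check that the numbers form a sequence, so a stray "19" would still be treated as a marker.

```python
import re

# Assumption: a marker is a standalone two-digit number,
# optionally followed by "." or ",".
MARKER = re.compile(r'(?<!\S)(\d{2})[.,]?(?=\s|$)')

def split_on_numbers(text):
    """Return {number: text that follows it, up to the next marker}."""
    # Collapse line breaks and repeated spaces so multi-line features join up.
    flat = " ".join(text.split())
    # With one capture group, re.split returns
    # [text before first marker, num1, body1, num2, body2, ...]
    parts = MARKER.split(flat)
    return {int(num): body.strip() for num, body in zip(parts[1::2], parts[2::2])}

sample = "40. blah blah \nblah \n41. blah blah 44. blah \nblah"
print(split_on_numbers(sample))
```

Because the text is flattened first, it no longer matters whether a number starts a line or sits in the middle of one.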

1 Answer

I don't expect this example to be the implementation you will use, as your question does not clarify some requirements. I only hope that it will serve as a basis for you to achieve your goal.

I understand that the search for features only starts after a keyword has been found. If that is not the case, you can skip that check.

Using the following string as an example:

text = "blah blah searchWord 40. blah blah \nblah \n41. blah blah 42. blah \nblah 19 blah going blah blah blah"

Our keyword would be "searchWord". With that as a basis, we can start writing our function.

def organize( originalText, searchWord ):
    features = {}
    stop_keywords = [ "not", "going", "to", "list", "them", "all" ,"out" ]

    # split() with no arguments splits on any whitespace, line breaks
    # included, so this divides the text into words in one step.
    text = originalText.split()

    # determines whether we are still searching for the initial
    # word or already creating dictionary entries
    isFound = False

    # the dictionary key
    featureId = ""

    # the dictionary value
    phrase = ""

Our first goal is to find the keyword to start with.

for word in text:               
    if word == searchWord:
        isFound = True   

Until we find this keyword, we don't want anything else to happen, so we put the rest of the code inside the elif branch:

    elif isFound: 

Once this stage is complete, we first check whether the current word is contained in stop_keywords. If so, we update the dictionary and exit the for loop.

    elif isFound:      
        if word in stop_keywords:
            features[ featureId ] = phrase
            break

If it is not a stop word, we check whether it identifies a feature. To do this, we use the helper function isFeatureId(), which we will look at later. If it returns True, we check whether phrase has content; if so, we create a new entry in the dictionary and empty phrase. We then set featureId to the fixed text of each feature plus the content of word, which forms the key for the dictionary. Any other word is appended to phrase.

    elif isFound:      
        if word in stop_keywords:
            features[ featureId ] = phrase
            break
        elif isFeatureId( word ):
            if phrase != "":
                features[ featureId ] = phrase
                phrase = ""
            featureId = "special feature " + word
        else:
            phrase += word + " "
return features

Now let's look at the isFeatureId() function. First, though, we need to decide how to tell an arbitrary number apart from one that is part of a dictionary key. My example assumes a known initial value for these numbers and checks that the following ones are consecutive, but only you know the real rule; it could simply be that the number is greater than 40.

This function is very simple. First, it checks that the word has the right length (two or three characters), then that the first two characters match the expected number for featureId, and finally that any remaining character is "," or ".". If any of these conditions is not met, it returns False. Otherwise, initialFeatureIdNumber is incremented and True is returned.

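def isFeatureId( word ):
    global initialFeatureIdNumber
    # Reject anything that does not start with two digits; without this,
    # int() would raise ValueError on a short word such as "the".
    if len( word ) < 2 or len( word ) > 3 or not word[ 0 : 2 ].isdecimal():
        return False
    if int( word[ 0 : 2 ] ) != initialFeatureIdNumber or word[ 2 : 3 ] not in ",.":
        return False
    initialFeatureIdNumber += 1
    return True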
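
If your numbers are not strictly consecutive (the sample in the question jumps from 41 to 44), a check that avoids the global counter might look like the sketch below. The window of 5 mirrors the last_number + 5 test in the question's code. The names parse_feature_number and is_next_feature are hypothetical, and the rule itself is only my guess at what you need.

```python
import re

# Two digits with an optional trailing "." or ",", and nothing else.
FEATURE_TOKEN = re.compile(r'(\d{2})[.,]?')

def parse_feature_number(word):
    """Return the number if word looks like '40', '40.' or '40,', else None."""
    match = FEATURE_TOKEN.fullmatch(word)
    return int(match.group(1)) if match else None

def is_next_feature(word, last_number, max_gap=5):
    """True if word is a feature number that continues the sequence."""
    number = parse_feature_number(word)
    if number is None:
        return False
    if last_number is None:
        return True  # the first number found starts the sequence
    # Assumption: the next feature is larger, but by at most max_gap.
    return last_number < number <= last_number + max_gap

print(is_next_feature("44.", 41))   # True
print(is_next_feature("19", 41))    # False
print(is_next_feature("blah", 41))  # False
```

The caller keeps track of last_number itself, so the check can be reused on several texts without resetting any global state.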

All together

  # Here I have used "searchPhrase" as the word from which to start the
  # search, I have added one of the words that end the search, and I
  # have changed the last key from "44" to "42" so that all the keys
  # are consecutive.
text = "blah blah searchPhrase 40. blah blah \nblah \n41. blah blah 42. blah \nblah 19 blah going blah blah blah"

stop_keywords = [ "not", "going", "to", "list", "them", "all" ,"out" ]
initialFeatureIdNumber = 40

def isFeatureId( word ):
    global initialFeatureIdNumber
    # Reject anything that does not start with two digits; without this,
    # int() would raise ValueError on a short word such as "the".
    if len( word ) < 2 or len( word ) > 3 or not word[ 0 : 2 ].isdecimal():
        return False
    if int( word[ 0 : 2 ] ) != initialFeatureIdNumber or word[ 2 : 3 ] not in ",.":
        return False
    initialFeatureIdNumber += 1
    return True

def organize( originalText, searchWord ):
    # split() with no arguments splits on any whitespace, line breaks included
    text = originalText.split()
    isFound = False
    featureId = ""
    features = {}
    phrase = ""

    for word in text:
        if word == searchWord:
            isFound = True
        elif isFound:
            if word in stop_keywords:
                features[ featureId ] = phrase
                break
            elif isFeatureId( word ):
                if phrase != "":
                    features[ featureId ] = phrase
                    phrase = ""
                featureId = "special feature " + word
            else:
                phrase += word + " "

    # Also keep the last feature when the text ends without a stop keyword.
    if featureId != "" and phrase != "":
        features[ featureId ] = phrase
    return features

print( organize( text, "searchPhrase" ))
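
The question also asks for the last feature to be cut at a stop keyword or, failing that, at 150 characters. My code above only handles single-word stop keywords, so here is a small sketch of a helper that works on the collected text instead. The name cut_at_stop is hypothetical, and it matches keywords as plain substrings, so a short keyword like "to" would also match inside "stop".

```python
def cut_at_stop(text, stop_keywords, limit=150):
    """Cut text at the earliest stop keyword, or at limit characters."""
    lowered = text.lower()
    # str.find returns -1 when the keyword is absent, so filter those out.
    positions = [lowered.find(keyword.lower()) for keyword in stop_keywords]
    positions = [pos for pos in positions if pos != -1]
    if positions:
        return text[:min(positions)].strip()
    return text[:limit].strip()

print(cut_at_stop("blah blah blah stop keyword blah", ["stop keyword"]))
```

One related thing I noticed in the question's code: find_stop_keyword() returns a position, and a keyword at the very start of a line gives 0, which a plain "if find_stop_keyword(...)" treats as not found. Comparing the result with "is not None" avoids that.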