I have a piece of text that already has some formatting. I need to convert it into a format that I can send to WordPress through the WordPress API.

This is an example of my text:

'**H1: Some text**\n\nSome text as paragraph.\n\n**H2: A subheader**\n\nText from the subheader.\n\nA line break with some more text.\n\n**H2: Another sub hearder**\n\n**H3: A sub sub header

I tried this:

test = myFullText
header1 = re.findall('H1.*?ph.', test)  

And

 test = myFullText
 header1 = re.findall('H1.*?\n\n.', test)  

Both give me an empty "header1".

A more general question: I assume the findall function is the best approach for my use case, or is there another way to achieve this? As I mentioned, my ultimate goal is to create a WordPress blog post from this text.

Yes, that is fine. Better still, you can match the headers with a regular expression:
header_pattern = r"\*\*(H\d): (.*?)\*\*"
headers = re.findall(header_pattern, test)
Commented Jul 7 at 9:21

1 Answer

Yes, that is fine. Better still, you can use a regular expression that matches each header and captures its level and its text:

import re
header_pattern = r"\*\*(H\d): (.*?)\*\*"  # group 1: level (e.g. H1), group 2: header text
headers = re.findall(header_pattern, test)
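As a quick check, here is a minimal sketch that runs this pattern against a shortened version of the sample text from the question. The `sample` string and the `extract_headers` helper are assumptions made for illustration; they do not appear in the original post.

```python
import re

# Shortened stand-in for the question's text (assumed for illustration).
sample = (
    "**H1: Some text**\n\nSome text as paragraph.\n\n"
    "**H2: A subheader**\n\nText from the subheader.\n\n"
    "**H2: Another sub header**\n\n**H3: A sub sub header**"
)

# The asterisks are escaped because a bare * is a regex quantifier.
HEADER_PATTERN = re.compile(r"\*\*(H\d): (.*?)\*\*")


def extract_headers(text):
    """Return (level, title) tuples such as ("H1", "Some text")."""
    return HEADER_PATTERN.findall(text)


for level, title in extract_headers(sample):
    print(f"{level}: {title}")
```

Because the pattern has two capture groups, `re.findall` returns a list of tuples rather than a list of strings, which is why the loop unpacks `level` and `title`.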

4 Comments

I use ChatGPT to generate my initial text. It now returns # for H1, ## for H2, and so on. Using your example I can get the individual texts, but they include the # characters. How can I exclude them?
pattern = r"^(#{1,6})\s+(.*)$"
matches = re.findall(pattern, text, re.MULTILINE)
for level, content in matches:
    print(f"H{len(level)}: {content}")

Markdown defines up to six header levels, so #{1,6} means "match between 1 and 6 # characters".
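To see how the `#{1,6}` quantifier and the separate capture groups keep the `#` characters out of the result, here is a small sketch built on the same pattern. The `markdown_headers` helper, the `demo` string, and the choice to print HTML heading tags are assumptions for illustration only.

```python
import re
from html import escape

# Same pattern as in the comment above, precompiled with MULTILINE so that
# ^ and $ match at the start and end of every line.
MD_HEADER = re.compile(r"^(#{1,6})\s+(.*)$", re.MULTILINE)


def markdown_headers(text):
    """Return (level, title) pairs; level is the number of leading '#'."""
    return [(len(hashes), title.strip()) for hashes, title in MD_HEADER.findall(text)]


demo = (
    "# Title\n"
    "Intro text.\n"
    "## Section\n"
    "####### Seven hashes is not a header\n"
    "#NoSpace is not a header either\n"
)

for level, title in markdown_headers(demo):
    # The hashes live in group 1 only, so the title comes back clean.
    print(f"<h{level}>{escape(title)}</h{level}>")
```

The line with seven hashes is skipped because the pattern requires whitespace right after at most six `#` characters, and the `#NoSpace` line is skipped because `\s+` needs at least one whitespace character.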
This is great, I do get all the headers now. One last ;-) question: how can I also include the paragraphs? At the moment I get this:
'#' - so this is an H1
'Some text' - the H1 content
And now I would like to add:
'Some text as paragraph' - the paragraph/content belonging to the H1
import re

text = """
# Header 1
This is the first paragraph under header 1.

## Header 2
This is some text under header 2.

Another paragraph under the same header.

### Header 3
More content here.
"""

# Pattern: Match headers and their following content
pattern = r"^(#{1,6})\s+(.*?)\n(.*?)(?=\n#{1,6}\s+|\Z)"  # \Z = end of text
matches = re.findall(pattern, text, re.DOTALL | re.MULTILINE)

for level, header, content in matches:
    header_level = len(level)
    clean_content = content.strip()
    print(f"H{header_level}: {header}")
    print (f"Paragraph:\n{clean_co
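Since the snippet above is cut off, here is a self-contained sketch of the same idea: grouping each header with the paragraphs that follow it. Note that it swaps the single multi-line regex for a line-by-line scan, which is a different technique from the one in the comment; it is used here because it also copes with a header that is immediately followed by another header. The names `split_sections` and `to_html`, and the decision to emit `<h1>`..`<h6>` and `<p>` tags, are assumptions for illustration.

```python
import re
from html import escape

# A header line: 1 to 6 '#' characters, then spaces or tabs, then the title.
HEADER_LINE = re.compile(r"^(#{1,6})[ \t]+(.*)$")


def split_sections(text):
    """Return a list of (level, title, paragraphs) tuples.

    Text that appears before the first header is ignored in this sketch.
    """
    sections = []
    buffer = []  # lines of the paragraph currently being collected

    def flush():
        # Attach the collected paragraph to the most recent header, if any.
        if buffer and sections:
            sections[-1][2].append(" ".join(buffer))
        buffer.clear()

    for line in text.splitlines():
        match = HEADER_LINE.match(line)
        if match:
            flush()
            sections.append((len(match.group(1)), match.group(2).strip(), []))
        elif line.strip():
            buffer.append(line.strip())
        else:
            flush()  # a blank line ends the current paragraph
    flush()
    return sections


def to_html(sections):
    """Render the sections as heading and paragraph tags."""
    parts = []
    for level, title, paragraphs in sections:
        parts.append(f"<h{level}>{escape(title)}</h{level}>")
        parts.extend(f"<p>{escape(p)}</p>" for p in paragraphs)
    return "\n".join(parts)


if __name__ == "__main__":
    example = "# Header 1\nFirst paragraph.\n\n## Header 2\nSome text.\n\nMore text.\n"
    print(to_html(split_sections(example)))
```

The string returned by `to_html` is plain HTML, which is one plausible form for the body of a blog post; sending it to WordPress is left out here because it depends on how the site and its authentication are set up.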