Day 2: String Manipulation using RegEx

So, after successfully understanding, writing and compiling our 1st program we are all set to move further.

Today, on Day 2, we will do some interesting operations on strings using RegEx. Our 1st program replaced any digits present in a string with an underscore, so let’s do the opposite operation now. That is, replace (or substitute) non-digits with an underscore sign.

Task 3: Write a program that will replace all the non-digits in a string with a ‘_’ sign.

Referring to Table 1 again, similar to \d which specifies digits, we have \D which specifies non-digits. It means that if we replace \d with \D in our 1st program, our objective should be achieved. Let’s check it out.
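The original listing is not reproduced here, so the following is only a minimal sketch of the idea. The sample string and the variable names are assumptions made for illustration.

```python
import re

# Hypothetical sample string, chosen only for illustration.
text = "Agent 007"

# \D matches any single character that is not a digit
# (letters, spaces, punctuation and so on).
result = re.sub(r"\D", "_", text)

print(result)  # ______007
```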

It’s working cool. The non-digits in the string are substituted by underscores.

What about using Quantifiers?

The quantifier ‘+’ also works in the same manner as it did in the 1st program: each run of consecutive non-digits is now replaced by a single underscore.
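To see the effect of the quantifier side by side, here is a small sketch. The sample string is an assumption, not the one used in the original program.

```python
import re

# Hypothetical sample string.
text = "Agent 007 reporting"

# Without a quantifier, every non-digit gets its own underscore.
one_by_one = re.sub(r"\D", "_", text)

# With '+', a whole run of non-digits is replaced by one underscore.
grouped = re.sub(r"\D+", "_", text)

print(one_by_one)
print(grouped)  # _007_
```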

We should try 1 more thing. Do non-digits mean everything except digits, or only letters of the alphabet? What about special characters?

As we can see, the special character @ is also substituted by an underscore. That’s great. That’s what non-digit actually means.

Just as \d is equivalent to [0-9], \D is equivalent to [^0-9], where the ^ sign can be ‘treated’ as negation here. So, replacing \D with [^0-9] will give the same result.
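A quick way to convince ourselves is to run both patterns on the same input and compare. This is a sketch with an assumed sample string containing a special character; the equivalence holds for plain ASCII input like this (in Python 3, \d on ordinary strings also covers non-ASCII digit characters, which [0-9] does not).

```python
import re

# Hypothetical sample string with a special character in it.
text = "iron@man42"

with_shorthand = re.sub(r"\D", "_", text)
with_char_class = re.sub(r"[^0-9]", "_", text)

print(with_shorthand)                     # ________42
print(with_shorthand == with_char_class)  # True
```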

Now, we are familiar with \d and \D used with the sub function. What if someone asks us to find all the digits in the input string? We know that for digits we will use \d. To find all digits, RegEx gives a function called findall.

What is this 3rd line of code doing? It’s actually finding all the digits of the given input string and printing the result in the form of a list. Also, I have made 1 more change, for demonstration only. I have not written that 2nd line which I stressed to be mandatory. And as said earlier, everything will work smoothly. It’s simply a good practice.
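As a rough sketch of what such a findall call looks like (the sample string is an assumption, and the line layout may differ from the original listing):

```python
import re

# Hypothetical sample string.
text = "Agent 007"

# findall returns a list with every non-overlapping match of the pattern.
digits = re.findall(r"\d", text)

print(digits)  # ['0', '0', '7']
```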

What about Quantifier? Will it affect in some way?

So, the ‘+’ has merged the 3 single-digit elements of the list into 1 element, because consecutive digits now form one match. We still know all the digits present in the string.
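The same comparison can be sketched in a few lines. Again, the sample string is only an assumption.

```python
import re

# Hypothetical sample string.
text = "Agent 007"

single_digits = re.findall(r"\d", text)   # one element per digit
digit_runs = re.findall(r"\d+", text)     # one element per run of digits

print(single_digits)  # ['0', '0', '7']
print(digit_runs)     # ['007']
```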

^ and $

^ and $ are boundaries. ^ marks the start, while $ marks the end of the string being matched (not of the regular expression itself). These are very handy syntax to play with.

Wait a while…

The ^ used earlier denoted negation, yet here it is described as the starting point of the string. Both are right: when used as the first character inside square brackets, [^ …], it means not; outside square brackets it anchors the match to the start. Let’s write some programs to justify our statements.

Task 4 Write a program to verify the 1st letter of input string is correct as it was entered.

The input string entered had Bond as its 1st word. ^Bond checks whether the string starts with Bond or not. The check succeeded, so the if block was executed and the result was printed.

What if 1st word was not Bond and something else?

Say, the 1st word is James. Then, accordingly, the else block should be executed.
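Both cases can be sketched together. The helper name starts_with_bond and the two sample strings are assumptions made for this illustration, not the original program.

```python
import re

def starts_with_bond(text):
    """Return True when the string begins with the characters 'Bond'."""
    # re.search returns a Match object on success and None otherwise.
    return re.search(r"^Bond", text) is not None

# Hypothetical inputs: one starting with Bond, one starting with James.
for text in ("Bond, James Bond", "James Bond"):
    if starts_with_bond(text):
        print("Match:", text)
    else:
        print("No match:", text)
```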

Nice!!

Is it checking the 1st word or the 1st letter? Actually, it checks that the string starts with exactly the characters written after ^. ^Bond matches only when the string begins with B, o, n, d in that order, and ^James only when it begins with James. If we change the word but keep the 1st letter the same, ^Bond will not match.

But if we check for the 1st letter only (for example, ^B), then it will give the desired answer.
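Here is a small sketch of that difference, using an assumed string that shares only its 1st letter with Bond.

```python
import re

# Hypothetical string: starts with B, but not with the word Bond.
text = "Bold and brave"

word_check = re.search(r"^Bond", text)   # whole word required at the start
letter_check = re.search(r"^B", text)    # only the 1st letter required

print(word_check)            # None
print(letter_check.group())  # B
```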

Let’s clarify this with another example, this time checking the last letter, i.e. using the $ sign.

Task 5 Write a program to verify the last letter of input string is correct as it was entered.

For checking the end of the string, the $ sign is used. For a string ending in 007, both 007$ and 7$ give a match, although 7$ only checks the last character. But 807$ will execute the else block.

Moving on, again referring to Table 1, let’s check out another special sequence, \A.
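The three patterns can be tried in one go. This is a sketch; the sample string and the results dictionary are assumptions for illustration.

```python
import re

# Hypothetical sample string that ends in 007.
text = "My code name is 007"

# Map each pattern to True/False depending on whether it matches.
results = {}
for pattern in (r"007$", r"7$", r"807$"):
    results[pattern] = re.search(pattern, text) is not None

for pattern, matched in results.items():
    print(pattern, "->", "match" if matched else "no match")
```

Note that 7$ would also match any other string whose last character is 7, so it is a weaker check than 007$.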

Task 6 Write a program to verify the 1st letter of input string is correct as it was entered, without using ^ operator.

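So, \A checks whether the string starts with the given word and gives the result accordingly. It is similar to using the ^ sign; the two differ only in multi-line mode, where ^ also matches at the start of every line while \A still matches only at the very start of the string.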
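A minimal sketch of \A, with assumed sample strings. The second half uses the re.MULTILINE flag only to show the one situation where \A and ^ behave differently.

```python
import re

# Hypothetical single-line string.
text = "Bond, James Bond"
starts_with_word = re.search(r"\ABond", text) is not None
print(starts_with_word)  # True

# Hypothetical two-line string: Bond starts the 2nd line, not the string.
lines = "James\nBond"
caret_match = re.search(r"^Bond", lines, re.MULTILINE)     # matches on line 2
anchor_match = re.search(r"\ABond", lines, re.MULTILINE)   # no match
print(caret_match is not None)   # True
print(anchor_match is not None)  # False
```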

Task 6 Write a program to search any given word of input string and also verify its position.

Say the string entered is “The truth is…I am Iron Man”. I have to find whether the word Iron exists in this string and, if so, what its position is.

So, there is a match. Here, we introduced a new function, search (see line 3 of the above code). It’s searching for the given word in the input string.

Also, look at the output.

If you count, you will see that the word Iron starts after 20 positions (including spaces) and runs up to the 24th position. In other words, the span (20, 24) gives the index where the match starts and the index just after where it ends.

The search() function searches the string for a match, and returns a Match object if there is a match.
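A sketch of such a search follows. It assumes the pause in the sentence is typed as three ordinary dots, which is what makes the positions come out as 20 and 24.

```python
import re

# Assumption: the ellipsis is written as three separate '.' characters.
text = "The truth is...I am Iron Man"

match = re.search(r"Iron", text)

if match:
    print("Found a match")
    print(match.span())  # (20, 24)
else:
    print("No match")
```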

If there is more than one match, only the first occurrence of the match will be returned:

The word The occurs twice in the input string. But the search function gives the result of only the 1st match it encounters.

At first glance, findall and search seem to be doing the same thing. So, what’s the difference? Writing the above code using findall will give…what? Let’s check:

Wow!! So findall returns every occurrence of the word. But it’s not giving the span, while search gives only the 1st matched occurrence, with its span.
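The contrast can be sketched like this. The sample sentence is an assumption; it only needs to contain the word The twice.

```python
import re

# Hypothetical sentence containing "The" twice.
text = "The truth is that The Avengers assembled"

first = re.search(r"The", text)    # Match object for the 1st occurrence only
every = re.findall(r"The", text)   # plain list of all matched strings

print(first.span())  # (0, 3)
print(every)         # ['The', 'The']
```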

Moving on to another important function, split.

The split() function returns a list where the string has been split at each match.

So, split() simply creates a list. \s splits the string at each white-space occurrence.
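A minimal sketch of split, using an assumed sample string:

```python
import re

# Hypothetical sample string.
text = "I am Iron Man"

# Split wherever a single white-space character is found.
words = re.split(r"\s", text)

print(words)  # ['I', 'am', 'Iron', 'Man']
```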

There is a special case with the split() function which allows us to limit the number of splits. Suppose I want to split off the 1st two words of my string and leave the remainder as it is. Then, it goes like this:

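Look at the 3rd line of code. There is a number ‘2’, which is the maximum number of splits to perform, so the 1st two words are split off and the rest of the string stays together. What if it becomes 1?

Only 1 word is split off and the remaining words are kept together as 1 element in the output list.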
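Both variants can be sketched with the maxsplit argument of re.split. The sample string is an assumption.

```python
import re

# Hypothetical sample string.
text = "I am Iron Man"

# At most 2 splits: two words are split off, the rest stays together.
two_splits = re.split(r"\s", text, maxsplit=2)

# At most 1 split: only the 1st word is split off.
one_split = re.split(r"\s", text, maxsplit=1)

print(two_splits)  # ['I', 'am', 'Iron Man']
print(one_split)   # ['I', 'am Iron Man']
```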

Match Object

A Match Object is an object containing information about the search and the result. It has some properties and methods for retrieving that information:

.span() returns a tuple containing the start and end positions of the match.

.string returns the string passed into the function.

.group() returns the part of the string where there was a match.

Before ending Day 2, let’s do some tasks using these Match object properties and methods.

Task 7 Write a program to search for an upper case character in the beginning of a word, and print its position.

For example, the string is “I am Iron Man”. The task is to find the position of, say, the letter M of the word Man. What should be the approach?

1. Check the 1st character of the string. Here, the 1st character is not M. It’s I. So, I can’t use ^ or \A.

2. Refer to Table 1 again. There are two special sequences, \b and \w.

3. \b matches a word boundary, so it returns a match where the specified characters are at the beginning or at the end of a word.

4. \w matches any word character (letters, digits and the underscore).

5. Finally, using .span() will do the job. The final code will look like this:

3rd line of code: \b makes the match start at the beginning of a word, which must begin with M (in this example), and \w matches the word characters that follow it. .span() returns a tuple containing the start and end positions of the match, i.e. (10, 13), in our example.
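One way to put these pieces together is sketched below. The exact pattern \bM\w+ is an assumption consistent with the description above; the original listing may have been written differently.

```python
import re

text = "I am Iron Man"

# \b  : word boundary, so M must start a word
# M   : the upper case letter we are looking for
# \w+ : the remaining word characters of that word
match = re.search(r"\bM\w+", text)

if match:
    print(match.span())  # (10, 13)
```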

Task 8 Write a program to search for an upper case character in the beginning of a word, and print that complete word.

For example, the string is “I am Iron Man”. The task is to find the complete word, Man, given its 1st letter, say M.

Here, I will use .group() instead of .span().

.string will simply return the string given.
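A short sketch showing .group() and .string on the same match. As before, the pattern \bM\w+ is an assumption and not necessarily the one in the original program.

```python
import re

text = "I am Iron Man"

# Assumed pattern: a word that starts with an upper case M.
match = re.search(r"\bM\w+", text)

if match:
    print(match.group())  # Man
    print(match.string)   # I am Iron Man
```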

That concludes our Day 2.
