89DEVs

Python: How to remove non ascii characters?

How to remove non-ascii characters in Python?

To remove non-ascii characters in Python, use the encode() method of strings. Set the encoding argument to 'ascii' and the errors argument to 'ignore'. An example is shown further below.

A second and more flexible solution is the re module, which provides regex functionality. It can remove specific characters and strings or replace them with other characters, so non-ascii characters can be replaced with ascii characters. An example of this is also shown below.

Regex also provides functionality to find specific characters and strings and to split a string at them. The re.split() function returns a list in which the string has been split at each match of a given pattern. This can be used, for example, to split a string at each space character. The returned list then contains each word as a separate element.
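The splitting behaviour described above can be sketched as follows. This is a minimal illustration, assuming a single space as the pattern; the sample sentence and the variable names are chosen only for this example.

```python
import re

# Sample sentence used only for illustration.
sentence = "das ist großartig"

# Split the string at each space character.
# Every word becomes a separate element of the returned list.
words = re.split(' ', sentence)

print(words)  # ['das', 'ist', 'großartig']
```

Note that splitting does not remove non-ascii characters by itself. The ß is still present in the last list element.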

remove non-ascii characters using encode() method

The encode() method of strings is used to remove non-ascii characters as shown below.

    # remove non-ascii characters using encode() method
    myString = "das ist großartig"
    myString.encode('ascii', errors='ignore')

In this example, a variable myString is defined and the value is set to "das ist großartig". This string contains the non-ascii character ß. Then, the encode() method of strings is used. The encoding argument is set to 'ascii'. The errors argument is set to 'ignore', so characters that cannot be encoded as ASCII are dropped from the result instead of raising an error. Finally, the encode() method returns the encoded version of the string as a bytes object, which is why the result is shown with a b prefix.

    b'das ist groartig'
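Because encode() returns a bytes object rather than a string (note the b prefix in the result above), it is often useful to decode the result back to a regular string. The following is a minimal sketch of that extra step; the variable names are chosen for illustration only.

```python
# Start from the same example string as above.
my_string = "das ist großartig"

# encode() drops the characters that cannot be represented in ASCII
# and returns a bytes object.
ascii_bytes = my_string.encode('ascii', errors='ignore')

# decode() turns the bytes back into a regular str.
ascii_text = ascii_bytes.decode('ascii')

print(ascii_text)  # das ist groartig
```

After decoding, the result can be used like any other string, for example with string methods or with the re module.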

remove non-ascii characters using regex

It is also possible to remove specific characters with regex, which provides greater flexibility. For example, it's possible to remove only selected characters, or to replace them as shown further below.

    # remove non-ascii characters using regex
    import re
    myString = "das ist großartig"
    myString = re.sub('ß', '', myString)
    print(myString)

In this example, first the re module is imported, which includes regex functionality to find and replace (substitute) characters and strings. The variable myString is defined and the value "das ist großartig" is assigned. Then, the sub() function is used to replace the character ß with an empty string. This removes the ß completely, without putting anything in its place, so the word is no longer spelled correctly. Note that the pattern 'ß' matches only this one character; other non-ascii characters would remain in the string. Finally, the string is printed and the result is shown below.

    das ist groartig
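The example above removes only the ß character. To strip every non-ascii character at once, one option is a negated character class that matches anything outside the ASCII range. The following is a minimal sketch of that approach; the helper name remove_non_ascii and the constant NON_ASCII are hypothetical and not part of the re module.

```python
import re

# Matches any single character outside the ASCII range (code points 0 to 127).
NON_ASCII = re.compile(r'[^\x00-\x7F]')

def remove_non_ascii(text):
    """Return text with every non-ascii character removed (illustrative helper)."""
    return NON_ASCII.sub('', text)

print(remove_non_ascii("das ist großartig"))  # das ist groartig
```

Unlike the encode() approach, this returns a regular string directly, so no decoding step is needed.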

replace non-ascii characters using regex

Regex can also be used to replace characters instead of just removing them. For example, the ß character can be replaced by two s characters to keep the meaning of the sentence the same.

    # replace non-ascii characters using regex
    import re
    myString = "das ist großartig"
    myString = re.sub('ß', 'ss', myString)
    print(myString)

In this example, first the re module is imported, which includes regex functionality. The variable myString is defined and the value "das ist großartig" is assigned. Then, the sub() function is used to replace the character ß with ss. This removes the non-ascii character while preserving the meaning of the sentence. Finally, the string is printed and the result is shown below.

    das ist grossartig
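When more than one character needs an ascii replacement, calling sub() once per character becomes repetitive. One possible approach is to keep the replacements in a dictionary and pass a function to sub(). The sketch below assumes a small, hypothetical replacement table; REPLACEMENTS and replace_non_ascii are illustrative names, and the table would need to be extended for real text.

```python
import re

# Hypothetical replacement table, for illustration only.
REPLACEMENTS = {'ß': 'ss', '€': 'EUR'}

# Build one pattern that matches any key of the table.
pattern = re.compile('|'.join(re.escape(key) for key in REPLACEMENTS))

def replace_non_ascii(text):
    """Replace each character listed in REPLACEMENTS with its ascii substitute."""
    # sub() calls the function for each match and inserts the returned string.
    return pattern.sub(lambda match: REPLACEMENTS[match.group(0)], text)

print(replace_non_ascii("das ist großartig und kostet 5 €"))
# das ist grossartig und kostet 5 EUR
```

Characters that are not listed in the table are left unchanged, so this can be combined with one of the removal approaches above if the result must be pure ascii.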

                
        
