Text normalization

Check whether a string contains non-ASCII characters


import (
	"unicode"
)

// isASCII reports whether every rune in s is within the ASCII range.
func isASCII(s string) bool {
	for _, c := range s {
		if c > unicode.MaxASCII {
			return false
		}
	}
	return true
}

// isASCII("홍길동")       -> false
// isASCII("Hong Gildong") -> true

Use case: when two records share the same email address but carry different names, isASCII makes it easy to decide which name to keep. For example, to prefer the native-script spelling over the ASCII one:

  • González vs Gonzalez - pick González
  • Yamada vs 山田 - pick 山田
  • 홍길동 vs Hong Gildong - pick 홍길동
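The selection rule illustrated by the list above can be sketched as a small helper. This is only a minimal sketch, not necessarily how the author implemented it: `preferNative` is a hypothetical name, and falling back to the first argument when both names are ASCII (or both are non-ASCII) is an assumption.

```go
package main

import (
	"fmt"
	"unicode"
)

// isASCII reports whether every rune in s is within the ASCII range.
func isASCII(s string) bool {
	for _, c := range s {
		if c > unicode.MaxASCII {
			return false
		}
	}
	return true
}

// preferNative is a hypothetical helper: given two names recorded for the
// same email address, it returns the one containing non-ASCII characters.
// Assumption: if both or neither name is ASCII, the first name wins.
func preferNative(a, b string) string {
	if isASCII(a) && !isASCII(b) {
		return b
	}
	return a
}

func main() {
	fmt.Println(preferNative("Gonzalez", "González"))
	fmt.Println(preferNative("Yamada", "山田"))
	fmt.Println(preferNative("홍길동", "Hong Gildong"))
}
```

The argument order does not matter for the three pairs in the list: the non-ASCII spelling is returned either way.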

Normalize non-ASCII strings to standard alphanumeric characters


import (
	"unicode"

	"golang.org/x/text/unicode/norm"
)

// isASCII reports whether every rune in s is within the ASCII range.
func isASCII(s string) bool {
	for _, c := range s {
		if c > unicode.MaxASCII {
			return false
		}
	}
	return true
}

// Normalize applies NFKD (compatibility decomposition) to non-ASCII
// strings and returns ASCII strings unchanged.
func Normalize(s string) string {
	if !isASCII(s) {
		return norm.NFKD.String(s)
	}
	return s
}

// Normalize("James") -> "James"

Use case: when a string is more commonly written with plain ASCII characters (for example, a name typed in fullwidth letters), prefer the ASCII version.
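One case where the ASCII spelling is the usual one is fullwidth Latin text, which NFKD maps to its ASCII equivalents. The sketch below reproduces only that single mapping with the standard library, to make the effect visible without the x/text dependency. `foldFullwidth` is a hypothetical name, and the function is a narrow illustration, not a substitute for NFKD normalization.

```go
package main

import (
	"fmt"
	"strings"
)

// foldFullwidth is a hypothetical helper that maps the fullwidth ASCII
// variants (U+FF01..U+FF5E) and the ideographic space (U+3000) to their
// plain ASCII counterparts. All other runes are left unchanged.
func foldFullwidth(s string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r >= 0xFF01 && r <= 0xFF5E:
			// The fullwidth block mirrors ASCII at a fixed offset.
			return r - 0xFEE0
		case r == 0x3000:
			return ' '
		}
		return r
	}, s)
}

func main() {
	// "\uFF2A\uFF41\uFF4D\uFF45\uFF53" is "James" written in fullwidth letters.
	fmt.Println(foldFullwidth("\uFF2A\uFF41\uFF4D\uFF45\uFF53")) // James
	fmt.Println(foldFullwidth("James"))                          // James
	fmt.Println(foldFullwidth("山田"))                             // 山田
}
```

Native-script names such as 山田 pass through untouched, so this folding can run before the isASCII check without affecting them.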