Text normalization

January 14, 2023

Moch Lutfi
Twitter
@kaptenupi

Check whether a string uses only standard ASCII characters


import (
	"unicode"
)

// isASCII reports whether every character in s is a standard ASCII character.
func isASCII(s string) bool {
	for _, c := range s {
		if c > unicode.MaxASCII {
			return false
		}
	}
	return true
}

// isASCII("홍길동") -> false
// isASCII("Hong Gildong") -> true

Use case: when two records share the same email address but have different names, this check helps decide which name to keep. If we prefer the native spelling over the ASCII one:

  • González vs Gonzalez - pick González
  • Yamada vs 山田 - pick 山田
  • 홍길동 vs Hong Gildong - pick 홍길동
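
The selection rule above can be sketched as follows. This is only a minimal illustration: `preferNative` is a hypothetical helper name, not something from a library, and it assumes that "native" simply means "contains at least one non-ASCII character".

```go
package main

import (
	"fmt"
	"unicode"
)

// isASCII reports whether every character in s is a standard ASCII character.
func isASCII(s string) bool {
	for _, c := range s {
		if c > unicode.MaxASCII {
			return false
		}
	}
	return true
}

// preferNative is a hypothetical helper. Given two candidate names for the
// same email address, it returns b when a is ASCII and b is not; in every
// other case it keeps a.
func preferNative(a, b string) string {
	if isASCII(a) && !isASCII(b) {
		return b
	}
	return a
}

func main() {
	fmt.Println(preferNative("Gonzalez", "González")) // González
	fmt.Println(preferNative("山田", "Yamada"))         // 山田
	fmt.Println(preferNative("Hong Gildong", "홍길동")) // 홍길동
}
```

When both names are ASCII, or both are non-ASCII, the sketch keeps the first one, so the caller decides the tie-break by argument order.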

Normalize alphanumeric characters when the string is not ASCII


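import (
	"unicode"

	"golang.org/x/text/unicode/norm"
)

// isASCII reports whether every character in s is a standard ASCII character.
func isASCII(s string) bool {
	for _, c := range s {
		if c > unicode.MaxASCII {
			return false
		}
	}
	return true
}

// Normalize returns ASCII strings unchanged and applies Unicode
// compatibility decomposition (NFKD) to everything else.
func Normalize(s string) string {
	if !isASCII(s) {
		return norm.NFKD.String(s)
	}
	return s
}

// Normalize("James") -> James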

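Use case: if a string is usually written with the common ASCII character set, prefer the ASCII version. NFKD maps compatibility variants such as full-width letters to their ASCII equivalents; accented letters are only split into a base letter plus a combining mark, so the result is not always pure ASCII.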