Text normalization


import "unicode"

// isASCII reports whether every rune in s falls within the ASCII range.
func isASCII(s string) bool {
	for _, c := range s {
		if c > unicode.MaxASCII {
			return false
		}
	}
	return true
}

// isASCII("홍길동") -> false
// isASCII("Hong Gildong") -> true

Use case: when two records share the same email address but carry different names, checking which name is pure ASCII makes it easy to decide which one to keep. Here the native spelling is preferred over the ASCII one:

  • González vs Gonzalez - pick González
  • Yamada vs 山田 - pick 山田
  • 홍길동 vs Hong Gildong - pick 홍길동
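
The selection rule in the list above could be sketched roughly as follows. `preferNative` is a hypothetical helper, not part of any library, and the tie-break (keep the first name when both or neither are ASCII) is an assumption made only for illustration.

```go
package main

import "unicode"

// isASCII reports whether every rune in s falls within the ASCII range.
func isASCII(s string) bool {
	for _, c := range s {
		if c > unicode.MaxASCII {
			return false
		}
	}
	return true
}

// preferNative is a hypothetical helper: given two names recorded for the
// same email, it returns the one that contains non-ASCII characters.
// When both or neither are pure ASCII it keeps the first argument, which
// is an arbitrary tie-break assumed for this sketch.
func preferNative(a, b string) string {
	if isASCII(a) && !isASCII(b) {
		return b
	}
	return a
}

// preferNative("Gonzalez", "González") -> "González"
// preferNative("Yamada", "山田") -> "山田"
// preferNative("홍길동", "Hong Gildong") -> "홍길동"
```

Argument order only matters for the tie-break, so a real deduplication step would probably need an extra rule (most recent record, longest name, and so on) for pairs that are both ASCII or both non-ASCII.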

Normalize non-ASCII variants of ordinary alphanumeric characters

import (
	"unicode"

	"golang.org/x/text/unicode/norm"
)

func isASCII(s string) bool {
	for _, c := range s {
		if c > unicode.MaxASCII {
			return false
		}
	}
	return true
}

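// Normalize applies Unicode compatibility decomposition (NFKD) to strings
// that contain non-ASCII characters; pure-ASCII strings are returned as-is.
func Normalize(s string) string {
	if !isASCII(s) {
		return norm.NFKD.String(s)
	}
	return s
}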

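// Normalize("Ｊａｍｅｓ") -> James (fullwidth letters fold to ASCII)
// Normalize("James") -> James (already ASCII, returned unchanged)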

Use case: if a string is normally written with plain ASCII characters but arrives in a compatibility form (for example, fullwidth letters), prefer the ASCII version.
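
To make the fullwidth case concrete without pulling in `golang.org/x/text`, here is a minimal standard-library sketch. It is not a replacement for NFKD, which handles far more than this (ligatures, accents, and so on). It only folds the fullwidth block U+FF01..U+FF5E onto ASCII and the ideographic space U+3000 onto a plain space. `foldFullwidth` is a hypothetical name chosen for the example.

```go
package main

import "strings"

// foldFullwidth is a hypothetical, standard-library-only helper. It maps
// the fullwidth forms U+FF01..U+FF5E to their ASCII counterparts
// U+0021..U+007E (a fixed offset of 0xFEE0) and the ideographic space
// U+3000 to a plain space. All other runes pass through unchanged.
func foldFullwidth(s string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r >= 0xFF01 && r <= 0xFF5E:
			return r - 0xFEE0
		case r == 0x3000:
			return ' '
		}
		return r
	}, s)
}

// foldFullwidth("Ｊａｍｅｓ") -> "James"
// foldFullwidth("山田") -> "山田"
```

Because native-script text such as 山田 passes through untouched, a fold like this can run before the ASCII check without disturbing names that are meant to stay non-ASCII.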
