Natural Language Processing (NLP) Functions

detectCharset

Introduced in: v22.2.0

Detects the character set of a non-UTF8-encoded input string.

This function is experimental and may change in unpredictable backwards-incompatible ways in future releases. Set allow_experimental_nlp_functions = 1 to enable it.

Syntax

detectCharset(s)

Arguments

s — The text to analyze. String

Returned value

Returns a string containing the code of the detected character set. String

Examples

Basic usage

Query

SELECT detectCharset('Ich bleibe für ein paar Tage.')

Response

WINDOWS-1252
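The functions on this page that are marked experimental are gated behind the allow_experimental_nlp_functions setting. As a minimal sketch, the setting can be applied either to the whole session or to a single query; which scope fits best depends on your deployment.

```sql
-- Enable the experimental NLP functions for the current session.
SET allow_experimental_nlp_functions = 1;

SELECT detectCharset('Ich bleibe für ein paar Tage.') AS charset;

-- Alternatively, enable the setting for a single query only.
SELECT detectCharset('Ich bleibe für ein paar Tage.') AS charset
SETTINGS allow_experimental_nlp_functions = 1;
```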

detectLanguage

Introduced in: v22.2.0

Detects the language of the UTF8-encoded input string. The function uses the CLD2 library for detection and returns the 2-letter ISO language code. The longer the input, the more precise the language detection will be.

This function is experimental and may change in unpredictable backwards-incompatible ways in future releases. Set allow_experimental_nlp_functions = 1 to enable it.

Syntax

detectLanguage(s)

Arguments

s — The text to analyze. String

Returned value

Returns the 2-letter ISO code of the detected language. Other possible results: un = unknown, no language could be detected; other = the detected language has no 2-letter code. String

Examples

Mixed language text

Query

SELECT detectLanguage('Je pense que je ne parviendrai jamais à parler français comme un natif. Where there\'s a will, there\'s a way.')

Response

fr
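Because the function can return the special values un and other instead of a language code, downstream logic may want to flag those rows. The following is a sketch only: the is_undetermined column name and the sample texts are illustrative assumptions, not part of the function's interface.

```sql
SELECT
    text,
    detectLanguage(text) AS lang,
    -- 1 when no usable 2-letter code was returned (un or other), else 0
    lang IN ('un', 'other') AS is_undetermined
FROM
(
    SELECT arrayJoin([
        'Ich bleibe für ein paar Tage.',
        'Je pense que je ne parviendrai jamais à parler français comme un natif.'
    ]) AS text
)
SETTINGS allow_experimental_nlp_functions = 1;
```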

detectLanguageMixed

Introduced in: v22.2.0

Similar to the detectLanguage function, but detectLanguageMixed returns a Map of 2-letter language codes, each mapped to the share of the text detected in that language.

This function is experimental and may change in unpredictable backwards-incompatible ways in future releases. Set allow_experimental_nlp_functions = 1 to enable it.

Syntax

detectLanguageMixed(s)

Arguments

s — The text to analyze. String

Returned value

Returns a map whose keys are 2-letter ISO codes and whose values are the share of the text detected in that language. Map(String, Float32)

Examples

Mixed languages

Query

SELECT detectLanguageMixed('二兎を追う者は一兎をも得ず二兎を追う者は一兎をも得ず A vaincre sans peril, on triomphe sans gloire.')

Response

{'ja':0.62,'fr':0.36}
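Since the result is a Map, individual shares can be read by key and the detected codes listed with mapKeys. This sketch reuses the example text above; the alias names (shares, fr_share, detected_languages) are illustrative.

```sql
WITH detectLanguageMixed('二兎を追う者は一兎をも得ず二兎を追う者は一兎をも得ず A vaincre sans peril, on triomphe sans gloire.') AS shares
SELECT
    shares['fr'] AS fr_share,               -- share of the text detected as French
    mapKeys(shares) AS detected_languages   -- all language codes present in the map
SETTINGS allow_experimental_nlp_functions = 1;
```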

detectLanguageUnknown

Introduced in: v22.2.0

Similar to the detectLanguage function, except the detectLanguageUnknown function works with non-UTF8-encoded strings. Prefer this version when your character set is UTF-16 or UTF-32.

This function is experimental and may change in unpredictable backwards-incompatible ways in future releases. Set allow_experimental_nlp_functions = 1 to enable it.

Syntax

detectLanguageUnknown(s)

Arguments

s — The text to analyze. String

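Returned value

Returns the 2-letter ISO code of the detected language, with the same special results as detectLanguage (un, other). String

Examples

Basic usage

Query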

SELECT detectLanguageUnknown('Ich bleibe für ein paar Tage.')

Response

de

detectTonality

Introduced in: v22.2.0

Determines the sentiment of the provided text data.

Limitation: In its current form, this function relies on an embedded emotional dictionary and works only for the Russian language.

This function is experimental and may change in unpredictable backwards-incompatible ways in future releases. Set allow_experimental_nlp_functions = 1 to enable it.

Syntax

detectTonality(s)

Arguments

s — The text to be analyzed. String

Returned value

Returns the average sentiment value of the words in the text. Float32

Examples

Russian sentiment analysis

Query

SELECT
    detectTonality('Шарик - хороший пёс'),
    detectTonality('Шарик - пёс'),
    detectTonality('Шарик - плохой пёс')

Response

0.44445, 0, -0.3
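The numeric score can be turned into a coarse label if that is easier to consume. The sketch below treats the sign of the score as the label; that threshold is an illustrative assumption rather than anything the function defines, and the label column name is hypothetical.

```sql
SELECT
    text,
    detectTonality(text) AS score,
    -- Assumed thresholds: sign of the average sentiment decides the label.
    multiIf(score > 0, 'positive', score < 0, 'negative', 'neutral') AS label
FROM
(
    SELECT arrayJoin(['Шарик - хороший пёс', 'Шарик - пёс', 'Шарик - плохой пёс']) AS text
)
SETTINGS allow_experimental_nlp_functions = 1;
```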

lemmatize

Introduced in: v21.9.0

Performs lemmatization on a given word. This function needs dictionaries to operate, which can be obtained from GitHub. For more details on loading a dictionary from a local file, see the page "Defining Dictionaries".

This function is experimental and may change in unpredictable backwards-incompatible ways in future releases. Set allow_experimental_nlp_functions = 1 to enable it.

Syntax

lemmatize(lang, word)

Arguments

lang — Language whose rules will be applied. String
word — Lowercase word that needs to be lemmatized. String

Returned value

Returns the lemmatized form of the word. String

Examples

English lemmatization

Query

SELECT lemmatize('en', 'wolves')

Response

wolf
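The word argument is documented as lowercase, so mixed-case input can be normalized first. This is a sketch that assumes an 'en' lemmatizer dictionary is already configured, as in the example above; lower handles ASCII letters, which is sufficient for this English word.

```sql
-- Normalize the case before lemmatizing, since the function expects a lowercase word.
SELECT lemmatize('en', lower('Wolves')) AS lemma
SETTINGS allow_experimental_nlp_functions = 1;
```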

stem

Introduced in: v21.9.0

Performs stemming on a word or an array of words using the Snowball algorithms. Each input string must be a single, lowercase word; strings containing whitespace cause an exception, and uppercase characters produce undefined results.

Returns String for scalar inputs (including FixedString) and Array(String) for array inputs. Nullable and LowCardinality variants of String and FixedString are supported. The list of supported language identifiers is available in system.stemmers.

Syntax

stem(word, language)

Arguments

word — A single lowercase word, or an array of such words, to stem. Uppercase characters produce undefined results. String, FixedString, Array(String), Array(FixedString), Array(Nullable(String)), or Array(Nullable(FixedString))
language — Language whose stemming rules will be applied. The canonical identifiers are listed in system.stemmers (e.g. 'english', 'german', 'porter'). Snowball also accepts 2- or 3-letter ISO 639 codes (e.g. 'en', 'eng') as aliases where defined, but coverage varies by language, so prefer the names from system.stemmers for portability. String

Returned value

The stemmed form of the word (String), or an array of stemmed words (Array(String)). String or Array(String)

Examples

Stemming a single word

Query

SELECT stem('blessing', 'en') AS res

Response

bless

Stemming an array of words

Query

SELECT stem(['blessing', 'disguise'], 'en') AS res

Response

['bless','disguis']

Stemming a FixedString

Query

SELECT stem(toFixedString('blessing', 10), 'en') AS res

Response

bless

Stemming a Nullable word

Query

SELECT stem(toNullable('blessing'), 'en') AS res

Response

bless
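Because each input must be a single lowercase word, free text has to be lowercased and split before stemming. The following sketch shows one way to do that with lower and splitByWhitespace, and how to look up the available language identifiers; the sample phrase is illustrative.

```sql
-- List the language identifiers the server supports.
SELECT * FROM system.stemmers;

-- Lowercase the text, split it into single words, then stem the resulting array.
SELECT stem(splitByWhitespace(lower('Blessing in Disguise')), 'english') AS res;
```

For text containing non-ASCII letters, lowerUTF8 is likely the better choice than lower.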

synonyms

Introduced in: v21.9.0

Finds synonyms of a given word. There are two types of synonym extensions:

plain
wordnet

With the plain extension type, you provide a path to a simple text file in which each line is one synonym set. Words on a line must be separated by space or tab characters. With the wordnet extension type, you provide a path to a directory containing the WordNet thesaurus. The thesaurus must contain a WordNet sense index.

This function is experimental and may change in unpredictable backwards-incompatible ways in future releases. Set allow_experimental_nlp_functions = 1 to enable it.

Syntax

synonyms(ext_name, word)

Arguments

ext_name — Name of the extension to search. String
word — Word to look up in the extension. String

Returned value

Returns an array of synonyms for the given word. Array(String)

Examples

Find synonyms

Query

SELECT synonyms('list', 'important')

Response

['important','big','critical','crucial']
