Natural Language Processing (NLP) Functions

Date: 2016-01-01

The detectCharset function detects the character set of a non-UTF8-encoded input string.

Syntax

Arguments

text_to_be_analyzed — The text to analyze, typically one or more sentences. String.

Returned value

A String containing the code of the detected character set.

Examples

Query:

Result:

The detectLanguage function detects the language of a UTF8-encoded input string. It uses the CLD2 library for detection and returns the 2-letter ISO language code.

Detection works best when the input string contains more than 200 characters.

Syntax

Arguments

text_to_be_analyzed — The text to analyze, typically one or more sentences. String.

Returned value

The 2-letter ISO code of the detected language. String.

Other possible results:

un = unknown; no language could be detected.

Examples

Query:

Result:

Similar to the detectLanguage function, but detectLanguageMixed returns a Map of 2-letter language codes, each mapped to the percentage of the text identified as that language.

Syntax

Arguments

text_to_be_analyzed — The text to analyze, typically one or more sentences. String.

Returned value

Map(String, Float32): the keys are 2-letter ISO codes and the values are the percentage of the text detected as that language.

Examples

Query:

Result:

Determines the programming language of a piece of source code. The function computes all unigrams and bigrams of commands in the source code, scores each programming language using a marked-up dictionary of unigram and bigram weights, and returns the language with the highest total weight.
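The scoring idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the function's actual implementation: the `WEIGHTS` dictionary, the whitespace tokenizer, and the `detect` helper are all hypothetical, and the weights are invented for the example.

```python
from collections import Counter

# Hypothetical weights, invented for illustration only.
# String keys are unigrams; tuple keys are bigrams.
WEIGHTS = {
    "Python": {"def": 2.0, "import": 1.5, ("import", "os"): 0.5},
    "C": {"#include": 2.0, "int": 1.0, ("int", "main()"): 1.5},
}


def detect(source_code):
    """Return the language with the highest total weight, or None if nothing matches."""
    # Assumption: "commands" are approximated by whitespace-separated tokens.
    tokens = source_code.split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    scores = {}
    for language, weights in WEIGHTS.items():
        total = 0.0
        for key, weight in weights.items():
            count = bigrams[key] if isinstance(key, tuple) else unigrams[key]
            total += weight * count
        scores[language] = total

    best = max(scores, key=scores.get)
    # Assumption: report no result when no dictionary entry matched.
    return best if scores[best] > 0 else None
```

A real dictionary would hold many more entries per language; the sketch only shows how unigram and bigram counts combine into one score per language.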

Syntax

Arguments

source_code — String representation of the source code to analyze. String.

Returned value

Programming language. String.

Examples

Query:

Result:

Similar to the detectLanguage function, except the detectLanguageUnknown function works with non-UTF8-encoded strings. Prefer this version when your character set is UTF-16 or UTF-32.

Syntax

Arguments

text_to_be_analyzed — The text to analyze, typically one or more sentences. String.

Returned value

The 2-letter ISO code of the detected language. String.

Other possible results:

un = unknown; no language could be detected.

Examples

Query:

Result:

Determines the sentiment of text data. Uses a marked-up sentiment dictionary in which each word has a tonality ranging from -12 to 6. For each text, the function calculates the average sentiment value of its words and returns it in the range [-1, 1].

Note: This function is limited in its current form. It currently uses the embedded emotional dictionary at /contrib/nlp-data/tonality_ru.zst and only works for the Russian language.
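To make the averaging step concrete, here is a minimal Python sketch. It is not the function's implementation. The `TONALITY` entries are hypothetical English stand-ins (the real dictionary is Russian), and two details the text does not spell out are assumptions: words missing from the dictionary are skipped, and the average is scaled into [-1, 1] by dividing positive averages by 6 and negative averages by 12.

```python
# Hypothetical dictionary entries with tonality in the range -12..6.
TONALITY = {"good": 4, "great": 6, "bad": -6, "awful": -12}


def detect_tonality(text):
    """Average the tonality of known words and scale the result into [-1, 1]."""
    # Assumption: words that are not in the dictionary do not count.
    scores = [TONALITY[word] for word in text.lower().split() if word in TONALITY]
    if not scores:
        return 0.0
    mean = sum(scores) / len(scores)
    # Assumption: positive and negative averages are scaled by their own bounds.
    return mean / 6 if mean > 0 else mean / 12
```

Scaling each side by its own bound keeps the result inside [-1, 1] even though the dictionary range is asymmetric.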

Syntax

Arguments

text — The text to be analyzed. String.

Returned value

The average sentiment value of the words in text. Float32.

Examples

Query:

Result:

Performs lemmatization on a given word. Needs dictionaries to operate, which can be obtained here.

Syntax

Arguments

language — Language whose rules will be applied. String.

word — Word that needs to be lemmatized. Must be lowercase. String.

Examples

Query:

Result:

Configuration

This configuration specifies that the dictionary en.bin should be used for lemmatization of English (en) words. The .bin files can be downloaded from here.

Performs stemming on a given word.

Syntax

Arguments

language — Language whose rules will be applied. Use the two-letter ISO 639-1 code.

word — Word that needs to be stemmed. Must be lowercase. String.

Examples

Query:

Result:

Supported languages for stem()

Note: The stem() function uses the Snowball stemming library. See the Snowball website for the current list of supported languages.

Arabic

Armenian

Basque

Catalan

Danish

Dutch

English

Finnish

French

German

Greek

Hindi

Hungarian

Indonesian

Irish

Italian

Lithuanian

Nepali

Norwegian

Porter

Portuguese

Romanian

Russian

Serbian

Spanish

Swedish

Tamil

Turkish

Yiddish

Finds synonyms of a given word. There are two types of synonym extensions: plain and wordnet.

With the plain extension type, you provide a path to a simple text file in which each line corresponds to one synonym set. Words on a line must be separated by space or tab characters.

With the wordnet extension type, you provide a path to a directory containing a WordNet thesaurus. The thesaurus must contain a WordNet sense index.
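The plain file format is simple enough to illustrate with a short Python sketch. This is an assumption-laden illustration rather than the function's actual code: `load_plain_synonyms` is a hypothetical helper, it takes the file contents as a string instead of a path, and it assumes that a word appearing on several lines belongs to the first set that mentions it.

```python
def load_plain_synonyms(text):
    """Map every word to the synonym set (one line of the file) that contains it."""
    table = {}
    for line in text.splitlines():
        words = line.split()  # with no argument, split() handles spaces and tabs
        for word in words:
            # Assumption: the first line mentioning a word wins.
            table.setdefault(word, words)
    return table


# Two synonym sets, one per line, mixing space and tab separators.
sample = "important big critical\tcrucial\nsmall tiny\n"
synonyms = load_plain_synonyms(sample)
```

Looking up `synonyms.get("crucial")` returns the whole first line as a list, while a word that is not in the file yields `None`.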

Syntax

Arguments

extension_name — Name of the extension in which the search will be performed. String.

word — Word to search for in the extension. String.

Examples

Query:

Result:

Configuration