com.carrotsearch.lingo3g
Class Lingo3GAttributesDescriptor.AttributeBuilder

java.lang.Object
  extended by com.carrotsearch.lingo3g.Lingo3GAttributesDescriptor.AttributeBuilder
Enclosing class:
Lingo3GAttributesDescriptor

public static class Lingo3GAttributesDescriptor.AttributeBuilder
extends Object

Attribute map builder for the Lingo3GAttributes component. You can use this builder as a type-safe alternative to populating the attribute map using attribute keys.
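
The idea of a map-backed, type-safe builder can be illustrated with a small stand-in class. This is only a sketch and does not use the Lingo3G API: the class name SketchAttributeBuilder and the key strings are hypothetical, and only the method names, parameter types and chained return style mirror what this page lists.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a map-backed builder; not the Lingo3G class. */
final class SketchAttributeBuilder {
    /** The attribute map populated by this builder. */
    public final Map<String, Object> map;

    SketchAttributeBuilder(Map<String, Object> map) {
        this.map = map;
    }

    // The key strings below are illustrative assumptions, not real attribute keys.
    SketchAttributeBuilder maxHierarchyDepth(Integer value) {
        map.put("sketch.maxHierarchyDepth", value);
        return this;
    }

    SketchAttributeBuilder hierarchicalMerging(Boolean value) {
        map.put("sketch.hierarchicalMerging", value);
        return this;
    }

    /** Example: a flat clustering setup (depth 1, hierarchical merging off). */
    static Map<String, Object> flatClusteringAttributes() {
        Map<String, Object> attributes = new HashMap<>();
        new SketchAttributeBuilder(attributes)
            .maxHierarchyDepth(1)
            .hierarchicalMerging(false);
        return attributes;
    }
}
```

Because each setter accepts a typed argument, a call such as maxHierarchyDepth("one") would fail at compile time, whereas putting a wrongly typed value directly into the map would only surface at run time.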


Field Summary
 Map<String,Object> map
          The attribute map populated by this builder.
 
Constructor Summary
protected Lingo3GAttributesDescriptor.AttributeBuilder(Map<String,Object> map)
          Creates a builder backed by the provided map.
 
Method Summary
 Lingo3GAttributesDescriptor.AttributeBuilder accentFolding(Boolean value)
          Converts national characters to ASCII counterparts.
 Lingo3GAttributesDescriptor.AttributeBuilder aggressiveCloningControl(Boolean value)
          Aggressive cluster cloning control switch.
 Lingo3GAttributesDescriptor.AttributeBuilder allowNumbersInLabels(Boolean value)
          Allow numbers in labels switch.
 Lingo3GAttributesDescriptor.AttributeBuilder allowOneDocumentClusters(Boolean value)
          When enabled, the algorithm will not prune clusters containing only one document.
 Lingo3GAttributesDescriptor.AttributeBuilder capitalizedWordLabelScorerWeight(Double value)
          Assigns higher scores to labels that contain capitalized words.
 Lingo3GAttributesDescriptor.AttributeBuilder capitalizeNonFunctionWords(Boolean value)
          Capitalize non-function words in labels.
 Lingo3GAttributesDescriptor.AttributeBuilder carrot2StemmerFactory(Class<? extends IStemmerFactory> clazz)
          Stemmer factory.
 Lingo3GAttributesDescriptor.AttributeBuilder carrot2StemmerFactory(IStemmerFactory value)
          Stemmer factory.
 Lingo3GAttributesDescriptor.AttributeBuilder carrot2TokenizerFactory(Class<? extends ITokenizerFactory> clazz)
          Tokenizer factory.
 Lingo3GAttributesDescriptor.AttributeBuilder carrot2TokenizerFactory(ITokenizerFactory value)
          Tokenizer factory.
 Lingo3GAttributesDescriptor.AttributeBuilder cloningControl(Boolean value)
          Cluster cloning control switch.
 Lingo3GAttributesDescriptor.AttributeBuilder clusterCountBase(Integer value)
          The number of clusters discovered in each clustering pass.
 Lingo3GAttributesDescriptor.AttributeBuilder clusterScoringFields(Class<? extends Lingo3GAttributes.ClusterScoringFields> clazz)
          Extra fields to use for cluster scoring.
 Lingo3GAttributesDescriptor.AttributeBuilder clusterScoringFields(Lingo3GAttributes.ClusterScoringFields value)
          Extra fields to use for cluster scoring.
 Lingo3GAttributesDescriptor.AttributeBuilder clusterSetDocumentOverlapLabelScorerWeight(Double value)
          Assigns higher scores to labels that contain documents not present in the current cluster set.
 Lingo3GAttributesDescriptor.AttributeBuilder combinedClusterScoreBalance(Double value)
          Decides whether document count or cluster label score should have larger impact on the cluster score.
 Lingo3GAttributesDescriptor.AttributeBuilder contentFields(List<String> value)
          Content fields to use for clustering.
 Lingo3GAttributesDescriptor.AttributeBuilder dashedWordsLabelFilter(Boolean value)
          Filters out labels containing words starting or ending in a dash character ('-').
 Lingo3GAttributesDescriptor.AttributeBuilder dashedWordsSynonymMarkerEnabled(Boolean value)
          When switched on, the clustering engine will treat words separated by a space (' '), period ('.'), slash ('/') or a dash ('-') or written together and the corresponding phrases as synonymous, e.g.
 Lingo3GAttributesDescriptor.AttributeBuilder dictionaryLabelFilter(Boolean value)
          Removes or boosts labels based on a predefined dictionary of words, phrases and regular expressions.
 Lingo3GAttributesDescriptor.AttributeBuilder dictionarySynonymMarkerEnabled(Boolean value)
          When switched on, the clustering engine will apply synonyms defined in the synonyms.[lang].xml file.
 Lingo3GAttributesDescriptor.AttributeBuilder dictionaryWeightLabelScorerWeight(Double value)
          Boosts label scores by a factor specified in the label dictionary file.
 Lingo3GAttributesDescriptor.AttributeBuilder documentCountLabelScorerWeight(Double value)
          Assigns higher scores to clusters whose number of documents in relation to the total number of documents is equal to or smaller than specified by the 'Maximum cluster size' parameter.
 Lingo3GAttributesDescriptor.AttributeBuilder documentCoverageTarget(Double value)
          The percentage of input documents to be put in clusters.
 Lingo3GAttributesDescriptor.AttributeBuilder extraktSynonymMarkerEnabled(Boolean value)
          When switched on, the clustering engine will apply synonyms obtained from the Extrakt linguistic engine.
 Lingo3GAttributesDescriptor.AttributeBuilder flatMerging(Boolean value)
          Flat merging switch.
 Lingo3GAttributesDescriptor.AttributeBuilder hierarchicalMerging(Boolean value)
          Hierarchical merging switch.
 Lingo3GAttributesDescriptor.AttributeBuilder hierarchicalMergingWithLabels(Boolean value)
          Label merging switch.
 Lingo3GAttributesDescriptor.AttributeBuilder labelOverrideThreshold(Double value)
          Determines the strength of the truncated label filters.
 Lingo3GAttributesDescriptor.AttributeBuilder languageRecognition(Boolean value)
          Language recognition switch.
 Lingo3GAttributesDescriptor.AttributeBuilder leftCompleteLabelFilter(Boolean value)
          Truncated labels filter.
 Lingo3GAttributesDescriptor.AttributeBuilder license(Class<? extends IResource> clazz)
          An explicit program license resource.
 Lingo3GAttributesDescriptor.AttributeBuilder license(IResource value)
          An explicit program license resource.
 Lingo3GAttributesDescriptor.AttributeBuilder lowercaseFunctionWords(Boolean value)
          Use lower case for function words in labels.
 Lingo3GAttributesDescriptor.AttributeBuilder maxClusteringPassesSub(Integer value)
          Maximum number of clustering passes to perform on subclusters.
 Lingo3GAttributesDescriptor.AttributeBuilder maxClusteringPassesTop(Integer value)
          Maximum number of clustering passes to perform on top hierarchy level.
 Lingo3GAttributesDescriptor.AttributeBuilder maxClusterSize(Double value)
          Determines the maximum allowed size of a cluster in relation to the parent cluster size.
 Lingo3GAttributesDescriptor.AttributeBuilder maxHierarchyDepth(Integer value)
          The maximum number of cluster levels to create.
 Lingo3GAttributesDescriptor.AttributeBuilder maxImprovementIterations(Integer value)
          The number of clustering improvement iterations to perform.
 Lingo3GAttributesDescriptor.AttributeBuilder maxLabelWords(Integer value)
          Determines the maximum label length in words.
 Lingo3GAttributesDescriptor.AttributeBuilder maxTokensPerDocument(Integer value)
          Maximum tokens per document to read.
 Lingo3GAttributesDescriptor.AttributeBuilder maxWordDf(Double value)
          Maximum word document frequency.
 Lingo3GAttributesDescriptor.AttributeBuilder mergeThreshold(Double value)
          Cluster merge threshold.
 Lingo3GAttributesDescriptor.AttributeBuilder minClusterSize(Double value)
          Determines the minimum allowed size of a cluster in relation to the parent cluster size.
 Lingo3GAttributesDescriptor.AttributeBuilder minClusterSizeForSubclusters(Integer value)
          The minimum number of documents that must be assigned to a cluster before the clustering engine attempts to create subclusters for that cluster.
 Lingo3GAttributesDescriptor.AttributeBuilder minLabelWords(Integer value)
          Determines the minimum label length in words.
 Lingo3GAttributesDescriptor.AttributeBuilder minLengthLabelFilter(Boolean value)
          Filters out labels whose string representation (excluding spaces) is shorter than 3 characters.
 Lingo3GAttributesDescriptor.AttributeBuilder neighborhoodSize(Integer value)
          Maximum similar clusterings to examine.
 Lingo3GAttributesDescriptor.AttributeBuilder normalizeScores(Boolean value)
          Cluster and label score normalization switch.
 Lingo3GAttributesDescriptor.AttributeBuilder numberOnlyLabelFilter(Boolean value)
          Filters out labels that consist only of numeric tokens.
 Lingo3GAttributesDescriptor.AttributeBuilder oneLetterWordLabelFilter(Boolean value)
          Filters out labels containing only one-letter words, e.g.
 Lingo3GAttributesDescriptor.AttributeBuilder phraseDfThesholdScalingFactor(Double value)
          Phrase-level Document Frequency (DF) cut-off scaling factor.
 Lingo3GAttributesDescriptor.AttributeBuilder preciseDocumentAssignment(Boolean value)
          When precise document assignment is switched off, clusters with multi-word labels will contain all documents that contain the label's words in any order and at any position.
 Lingo3GAttributesDescriptor.AttributeBuilder preferredLabelLength(double value)
          Instructs the clustering engine to prefer cluster labels consisting of the specified number of words.
 Lingo3GAttributesDescriptor.AttributeBuilder preferredLabelLengthDeviation(double value)
          Allowed deviation from the preferred label length.
 Lingo3GAttributesDescriptor.AttributeBuilder putPromotedLabelsAtHierarchyRoot(Boolean value)
          Put promoted labels at hierarchy root.
 Lingo3GAttributesDescriptor.AttributeBuilder queryWordLabelScorerWeight(Double value)
          Penalizes labels that contain query words.
 Lingo3GAttributesDescriptor.AttributeBuilder queryWordLabelWeight(Double value)
          Determines the weight of labels containing query words.
 Lingo3GAttributesDescriptor.AttributeBuilder reloadResources(Boolean value)
          Forced resources reload switch.
 Lingo3GAttributesDescriptor.AttributeBuilder removeRepeatedSynonymsFromLabels(Boolean value)
          Remove repeated synonyms from labels.
 Lingo3GAttributesDescriptor.AttributeBuilder repeatedWordsLabelFilter(Boolean value)
          Filters out labels containing repeated words (e.g. "New York York").
 Lingo3GAttributesDescriptor.AttributeBuilder resourceLookup(Class<? extends ResourceLookup> clazz)
           
 Lingo3GAttributesDescriptor.AttributeBuilder resourceLookup(ResourceLookup value)
           
 Lingo3GAttributesDescriptor.AttributeBuilder rightCompleteLabelFilter(Boolean value)
          Truncated labels filter.
 Lingo3GAttributesDescriptor.AttributeBuilder singleWordLabelWeight(Double value)
          Determines how willing the clustering engine will be to select single words as cluster labels.
 Lingo3GAttributesDescriptor.AttributeBuilder tfDfRatioLabelScorerWeight(Double value)
          Assigns higher score to more general/shorter labels.
 Lingo3GAttributesDescriptor.AttributeBuilder tfLabelScorerWeight(Double value)
          Assigns higher scores to labels with higher Term Frequency (TF).
 Lingo3GAttributesDescriptor.AttributeBuilder titleFields(List<String> value)
          Title fields to use for clustering.
 Lingo3GAttributesDescriptor.AttributeBuilder titleWordLabelScorerWeight(Double value)
          Assigns higher scores to labels that contain words that appeared in input documents' titles.
 Lingo3GAttributesDescriptor.AttributeBuilder trailingGenitiveLabelFilter(Boolean value)
          Filters out phrases ending in Saxon genitive of an English noun, e.g.
 Lingo3GAttributesDescriptor.AttributeBuilder unindexedWordLabelScorerWeight(Double value)
          Penalizes labels that contain too many function words.
 Lingo3GAttributesDescriptor.AttributeBuilder unknownWordHandlingStrategy(Class<? extends Lingo3GAttributes.UnknownWordHandlingStrategy> clazz)
          Handling of unknown words in persistent clusters.
 Lingo3GAttributesDescriptor.AttributeBuilder unknownWordHandlingStrategy(Lingo3GAttributes.UnknownWordHandlingStrategy value)
          Handling of unknown words in persistent clusters.
 Lingo3GAttributesDescriptor.AttributeBuilder useBuiltInWordDatabaseForLabelFiltering(boolean value)
          Use built-in word database for label filtering.
 Lingo3GAttributesDescriptor.AttributeBuilder useBuiltInWordDatabaseForStemming(boolean value)
          Use built-in word database for stemming.
 Lingo3GAttributesDescriptor.AttributeBuilder wordCountLabelScorerWeight(Double value)
          Assigns higher scores to labels that consist of 2, 3 or 4 words.
 Lingo3GAttributesDescriptor.AttributeBuilder wordDfThesholdScalingFactor(Double value)
          Word-level Document Frequency (DF) cut-off scaling factor.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

map

public final Map<String,Object> map
The attribute map populated by this builder.

Constructor Detail

Lingo3GAttributesDescriptor.AttributeBuilder

protected Lingo3GAttributesDescriptor.AttributeBuilder(Map<String,Object> map)
Creates a builder backed by the provided map.

Method Detail

reloadResources

public Lingo3GAttributesDescriptor.AttributeBuilder reloadResources(Boolean value)
Forced resources reload switch. Causes the clustering engine to reload lexical resources (stopwords, label dictionaries, synonyms etc.) on every clustering request. This is a debug-only switch, particularly useful when tuning lexical resources.

When running Lingo3G within Lingo3G Workbench, the lexical resources are loaded from the workspace subdirectory of the Lingo3G Workbench installation directory. If resource reloading is enabled, all changes made to the lexical resources will take effect immediately and will not require restarting Lingo3G Workbench.

Performance impact: very high. Make sure resource reloading is switched off in production settings.

See Also:
Lingo3GAttributes.reloadResources

resourceLookup

public Lingo3GAttributesDescriptor.AttributeBuilder resourceLookup(ResourceLookup value)
See Also:
Lingo3GAttributes.resourceLookup

resourceLookup

public Lingo3GAttributesDescriptor.AttributeBuilder resourceLookup(Class<? extends ResourceLookup> clazz)
See Also:
Lingo3GAttributes.resourceLookup

license

public Lingo3GAttributesDescriptor.AttributeBuilder license(IResource value)
An explicit program license resource. By default, the license is sought in a set of default locations. This attribute provides an explicit license to be used. If this attribute has a non-null value, default locations are not scanned.

See Also:
Lingo3GAttributes.license

license

public Lingo3GAttributesDescriptor.AttributeBuilder license(Class<? extends IResource> clazz)
An explicit program license resource. By default, the license is sought in a set of default locations. This attribute provides an explicit license to be used. If this attribute has a non-null value, default locations are not scanned.

See Also:
Lingo3GAttributes.license

maxHierarchyDepth

public Lingo3GAttributesDescriptor.AttributeBuilder maxHierarchyDepth(Integer value)
The maximum number of cluster levels to create. Setting this parameter to 1 will disable hierarchical clustering. In such a case it is also recommended to disable hierarchical merging, which will preserve smaller clusters.

Performance impact: high

See Also:
Lingo3GAttributes.maxHierarchyDepth

clusterCountBase

public Lingo3GAttributesDescriptor.AttributeBuilder clusterCountBase(Integer value)
The number of clusters discovered in each clustering pass. The higher the value of this parameter, the larger the total number of clusters.

Performance impact: medium

See Also:
Lingo3GAttributes.clusterCountBase

maxClusteringPassesTop

public Lingo3GAttributesDescriptor.AttributeBuilder maxClusteringPassesTop(Integer value)
Maximum number of clustering passes to perform on top hierarchy level. Determines the maximum number of cluster discovery passes the clustering engine should perform to discover the top-level clusters. The first clustering pass discovers large/more general clusters, while further passes find smaller/more specific clusters. Setting the maximum number of passes to 0 will force the algorithm to stop clustering only when no more clusters can be created or the 'Document coverage target' has been reached.

Performance impact: high

Results impact: With the lowest value of this parameter, the clustering engine will discover only the largest clusters, while with higher values, smaller and more specific clusters will also be created. Setting this parameter to 0 will cause the clustering algorithm to create the maximum possible number of clusters.

See Also:
Lingo3GAttributes.maxClusteringPassesTop

maxClusteringPassesSub

public Lingo3GAttributesDescriptor.AttributeBuilder maxClusteringPassesSub(Integer value)
Maximum number of clustering passes to perform on subclusters. Determines the maximum number of cluster discovery passes the clustering engine should perform to discover subclusters. The first clustering pass discovers large/more general clusters, while further passes find smaller/more specific clusters. Setting the maximum number of passes to 0 will force the algorithm to stop clustering only when no more subclusters can be created or the 'Document coverage target' has been reached for the parent cluster.

Performance impact: high

Results impact: With the lowest value of this parameter, the clustering engine will discover only the largest clusters, while with higher values, smaller and more specific clusters will also be created. Setting this parameter to 0 will cause the clustering algorithm to create the maximum possible number of subclusters for each cluster.

See Also:
Lingo3GAttributes.maxClusteringPassesSub

documentCoverageTarget

public Lingo3GAttributesDescriptor.AttributeBuilder documentCoverageTarget(Double value)
The percentage of input documents to be put in clusters. Determines the percentage of documents the clustering engine should assign to clusters. After each clustering pass, the clustering engine will check if the required document coverage has been achieved. If so, it will not perform further clustering passes. The required document coverage may not always be achieved, especially if the maximum number of clustering passes is set to a low value. To cause the clustering engine to always perform the maximum number of clustering passes, set the value of this parameter to 1.0.

Performance impact: high
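
The interplay between the coverage target and the maximum number of passes can be sketched as a simple stopping rule. This is a hypothetical model rather than Lingo3G code: the class, the method and the idea of feeding in a precomputed coverage value per pass are assumptions made for illustration.

```java
/** Hypothetical model of the pass stopping rule; not Lingo3G code. */
final class PassLoopSketch {
    /**
     * @param coverageAfterPass cumulative fraction of documents clustered after
     *        each pass the engine could perform; the array length stands for
     *        the point at which no more clusters can be created
     * @param maxPasses maximum number of passes, 0 meaning "no limit"
     * @param coverageTarget required document coverage in the 0.0 to 1.0 range
     * @return the number of passes actually performed
     */
    static int passesPerformed(double[] coverageAfterPass, int maxPasses, double coverageTarget) {
        int passes = 0;
        while (passes < coverageAfterPass.length) {
            if (maxPasses > 0 && passes >= maxPasses) {
                break; // pass limit reached before the target was achieved
            }
            passes++;
            if (coverageAfterPass[passes - 1] >= coverageTarget) {
                break; // coverage target achieved, no further passes needed
            }
        }
        return passes;
    }
}
```

With coverage values of 0.4, 0.6, 0.75 and 0.8, a limit of 2 passes and a target of 0.7, the model stops after 2 passes without reaching the target, which matches the remark that a low pass limit may prevent the required coverage from being achieved.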

See Also:
Lingo3GAttributes.documentCoverageTarget

maxImprovementIterations

public Lingo3GAttributesDescriptor.AttributeBuilder maxImprovementIterations(Integer value)
The number of clustering improvement iterations to perform. Determines the maximum number of clustering improvement cycles the clustering engine should perform. During each cycle, it will examine clusterings similar to the current one, and if any of them is better, the current cluster arrangement will be replaced.

Performance impact: very high

See Also:
Lingo3GAttributes.maxImprovementIterations

neighborhoodSize

public Lingo3GAttributesDescriptor.AttributeBuilder neighborhoodSize(Integer value)
Maximum similar clusterings to examine. Determines the maximum number of similar clusterings the clustering engine should examine during each improvement cycle. This parameter is meaningful only when 'Maximum improvement iterations' is greater than 0.

Performance impact: very high

See Also:
Lingo3GAttributes.neighborhoodSize

mergeThreshold

public Lingo3GAttributesDescriptor.AttributeBuilder mergeThreshold(Double value)
Cluster merge threshold. If the overlap between clusters is larger than the value of this parameter, these clusters will be merged.

Performance impact: none

Results impact: Low values of this parameter will cause the clustering engine to eagerly merge clusters, which will create larger clusters in which some documents may be irrelevant. High values of this parameter will cause it to merge clusters rarely, which will result in large numbers of small clusters with more relevant documents.
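
The threshold comparison can be sketched as follows. The overlap measure used here (shared documents divided by the size of the smaller cluster) is an assumption chosen for illustration; this page does not state how Lingo3G computes overlap, and the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical merge test; the overlap formula is an assumption. */
final class MergeSketch {
    /** Assumed overlap: shared documents divided by the smaller cluster's size. */
    static double overlap(Set<Integer> a, Set<Integer> b) {
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0;
        }
        Set<Integer> shared = new HashSet<>(a);
        shared.retainAll(b); // keep only document ids present in both clusters
        return (double) shared.size() / Math.min(a.size(), b.size());
    }

    /** Clusters are merged when their overlap is larger than the threshold. */
    static boolean shouldMerge(Set<Integer> a, Set<Integer> b, double mergeThreshold) {
        return overlap(a, b) > mergeThreshold;
    }
}
```

Under this assumed measure, two clusters sharing 2 of the smaller cluster's 3 documents would be merged at a threshold of 0.5 but kept separate at 0.8.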

See Also:
Lingo3GAttributes.mergeThreshold

flatMerging

public Lingo3GAttributesDescriptor.AttributeBuilder flatMerging(Boolean value)
Flat merging switch. When switched on, the clustering engine will perform cluster merging using a strategy specific for flat (non-hierarchical) clusters. With this strategy the clustering engine will merge only clusters of similar size.

Performance impact: low

See Also:
Lingo3GAttributes.flatMerging

hierarchicalMerging

public Lingo3GAttributesDescriptor.AttributeBuilder hierarchicalMerging(Boolean value)
Hierarchical merging switch. When switched on, the clustering engine will use a cluster merging strategy specially designed for hierarchical clustering, and will be more eager to move clusters from the top level positions to subclusters. If the algorithm is set to perform flat clustering (max-hierarchy-depth = 1), disabling hierarchical merging is recommended to preserve smaller clusters.

Performance impact: low

See Also:
Lingo3GAttributes.hierarchicalMerging

hierarchicalMergingWithLabels

public Lingo3GAttributesDescriptor.AttributeBuilder hierarchicalMergingWithLabels(Boolean value)
Label merging switch. When switched on, the clustering engine will take cluster labels into account during hierarchical merging of clusters. This parameter is meaningful only when 'Hierarchical merging' is switched on.

Performance impact: low

Results impact: With label merging switched on, the clustering engine may move some additional clusters from the top level to subclusters.

See Also:
Lingo3GAttributes.hierarchicalMergingWithLabels

cloningControl

public Lingo3GAttributesDescriptor.AttributeBuilder cloningControl(Boolean value)
Cluster cloning control switch. When switched on, the clustering engine will not allow the same cluster label to appear both at the top- and subcluster-level of the hierarchy.

Performance impact: low

See Also:
Lingo3GAttributes.cloningControl

aggressiveCloningControl

public Lingo3GAttributesDescriptor.AttributeBuilder aggressiveCloningControl(Boolean value)
Aggressive cluster cloning control switch. When switched on, the clustering engine will not allow the same label to appear at any level of the hierarchy. This parameter is meaningful only if 'Cluster cloning control' is switched on.

Performance impact: low

See Also:
Lingo3GAttributes.aggressiveCloningControl

minClusterSize

public Lingo3GAttributesDescriptor.AttributeBuilder minClusterSize(Double value)
Determines the minimum allowed size of a cluster in relation to the parent cluster size. E.g. a value of 0.4 means that clusters must not contain fewer than 40% of the parent cluster's documents (or of all documents in the case of top-level clusters). This parameter is meaningful only if 'Document count label scorer weight' is greater than 0.

Performance impact: none

See Also:
Lingo3GAttributes.minClusterSize

maxClusterSize

public Lingo3GAttributesDescriptor.AttributeBuilder maxClusterSize(Double value)
Determines the maximum allowed size of a cluster in relation to the parent cluster size. E.g. a value of 0.4 means that clusters must not contain more than 40% of the parent cluster's documents (or of all documents in the case of top-level clusters). This parameter is meaningful only if 'Document count label scorer weight' is greater than 0.

Performance impact: none
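
The relative-size arithmetic behind the minimum and maximum cluster size can be sketched as below. The class and method names are hypothetical, and the sketch shows only the size comparison; how the engine feeds the result into label scoring is not modeled.

```java
/** Hypothetical helper for the relative cluster size check; not Lingo3G code. */
final class ClusterSizeSketch {
    /**
     * @param clusterDocs number of documents in the cluster
     * @param parentDocs number of documents in the parent cluster (or in the
     *        whole input for top-level clusters)
     * @return true if the cluster's relative size lies within both bounds
     */
    static boolean withinRelativeSize(int clusterDocs, int parentDocs,
                                      double minClusterSize, double maxClusterSize) {
        if (parentDocs <= 0) {
            throw new IllegalArgumentException("parentDocs must be positive");
        }
        double relative = (double) clusterDocs / parentDocs;
        return relative >= minClusterSize && relative <= maxClusterSize;
    }
}
```

For a parent of 100 documents with bounds 0.1 and 0.4, a 30-document cluster is within range, while clusters of 5 or 41 documents fall outside it.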

See Also:
Lingo3GAttributes.maxClusterSize

combinedClusterScoreBalance

public Lingo3GAttributesDescriptor.AttributeBuilder combinedClusterScoreBalance(Double value)
Decides whether document count or cluster label score should have larger impact on the cluster score. Setting this parameter to 0.5 will cause the clustering engine to assign equal weight to document count and cluster label score during cluster score calculation. A value equal to 1.0 will cause the clustering engine to use only document count for cluster scoring. Similarly, with the 0.0 value, only the cluster label score will be used.

Performance impact: none
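
One way to read the balance parameter is as a linear blend of the two scores. The formula below is an assumption that agrees with the three cases described above (0.0, 0.5 and 1.0); the actual scoring function used by the engine is not given on this page, and the names are hypothetical.

```java
/** Hypothetical linear blend consistent with the described end points. */
final class ScoreBalanceSketch {
    /**
     * @param balance 1.0 uses only the document count score, 0.0 uses only
     *        the label score, 0.5 weights both equally
     */
    static double combinedScore(double documentCountScore, double labelScore, double balance) {
        return balance * documentCountScore + (1.0 - balance) * labelScore;
    }
}
```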

See Also:
Lingo3GAttributes.combinedClusterScoreBalance

minClusterSizeForSubclusters

public Lingo3GAttributesDescriptor.AttributeBuilder minClusterSizeForSubclusters(Integer value)
The minimum number of documents that must be assigned to a cluster before the clustering engine attempts to create subclusters for that cluster.

Performance impact: high

See Also:
Lingo3GAttributes.minClusterSizeForSubclusters

preciseDocumentAssignment

public Lingo3GAttributesDescriptor.AttributeBuilder preciseDocumentAssignment(Boolean value)
When precise document assignment is switched off, clusters with multi-word labels will contain all documents that contain the label's words in any order and at any position. When precise document assignment is switched on, only documents containing all of the cluster label's words close to each other will be placed in the cluster.

Performance impact: high

See Also:
Lingo3GAttributes.preciseDocumentAssignment

normalizeScores

public Lingo3GAttributesDescriptor.AttributeBuilder normalizeScores(Boolean value)
Cluster and label score normalization switch. When switched on, the clustering engine will normalize cluster and label scores so that they fall in the 0.0 to 1.0 range.

Performance impact: none

Results impact: As the value of this parameter does not have any impact on the order and structure of clusters generated by the clustering engine, this switch will be useful only for applications that depend on absolute values of cluster or label scores.
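
A minimal sketch of score normalization, assuming non-negative scores that are divided by the largest score. The normalization method actually used by Lingo3G is not described here, so this is only one scheme that maps scores into the 0.0 to 1.0 range while preserving their order; the names are hypothetical.

```java
/** Hypothetical normalization by the maximum score; assumes scores >= 0. */
final class NormalizeSketch {
    static double[] normalize(double[] scores) {
        double max = 0.0;
        for (double s : scores) {
            max = Math.max(max, s);
        }
        double[] normalized = new double[scores.length];
        if (max == 0.0) {
            return normalized; // all scores are zero, nothing to scale
        }
        for (int i = 0; i < scores.length; i++) {
            normalized[i] = scores[i] / max;
        }
        return normalized;
    }
}
```

Dividing by a positive constant keeps the relative order of scores, which is consistent with the statement that the switch does not change the order or structure of clusters.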

See Also:
Lingo3GAttributes.normalizeScores

allowOneDocumentClusters

public Lingo3GAttributesDescriptor.AttributeBuilder allowOneDocumentClusters(Boolean value)
When enabled, the algorithm will not prune clusters containing only one document.

Tip: For collections larger than 100 documents, to get one-document clusters, you also need to set Lingo3GAttributes.wordDfThesholdScalingFactor and Lingo3GAttributes.phraseDfThesholdScalingFactor to 0.0.

Tip: When one-document clusters are allowed, the number of larger clusters may decrease. To obtain more large clusters while keeping the one-document ones, increase Lingo3GAttributes.maxClusteringPassesTop and Lingo3GAttributes.maxClusteringPassesSub or set them to 0.

Performance impact: medium.

See Also:
Lingo3GAttributes.allowOneDocumentClusters

singleWordLabelWeight

public Lingo3GAttributesDescriptor.AttributeBuilder singleWordLabelWeight(Double value)
Determines how willing the clustering engine will be to select single words as cluster labels. The higher the value of this parameter, the more clusters described with single-word labels will be produced.

Performance impact: none

See Also:
Lingo3GAttributes.singleWordLabelWeight

minLabelWords

public Lingo3GAttributesDescriptor.AttributeBuilder minLabelWords(Integer value)
Determines the minimum label length in words. Labels consisting of fewer words will not be generated.

Performance impact: none

Results impact: Setting the minimum label length to some higher value (e.g. 4 or 5) may create more specific clusters.

See Also:
Lingo3GAttributes.minLabelWords

maxLabelWords

public Lingo3GAttributesDescriptor.AttributeBuilder maxLabelWords(Integer value)
Determines the maximum label length in words. Labels consisting of more words will not be generated.

Performance impact: none

Results impact: Setting the maximum label length to some lower value (e.g. 2 or 3) may create more general clusters.

This setting can also be useful when the input collection contains duplicate documents. In such cases, Lingo3G may create overlong cluster labels taken directly from the duplicate documents. While the best solution to this problem would be eliminating duplicate documents from input, lowering the maximum label length can serve as a simple workaround.

See Also:
Lingo3GAttributes.maxLabelWords

preferredLabelLength

public Lingo3GAttributesDescriptor.AttributeBuilder preferredLabelLength(double value)
Instructs the clustering engine to prefer cluster labels consisting of the specified number of words. The strength of the preference is determined by the Lingo3GAttributes.preferredLabelLengthDeviation attribute.

Fractional preferred label lengths are also allowed. For example, a preferred label length of 2.5 will result in labels of length 2 and 3 being equally preferred; a value of 2.2 will prefer two-word labels more than three-word ones.

Performance impact: none

See Also:
Lingo3GAttributes.preferredLabelLength

preferredLabelLengthDeviation

public Lingo3GAttributesDescriptor.AttributeBuilder preferredLabelLengthDeviation(double value)
Allowed deviation from the preferred label length. Determines how far the clustering engine is allowed to deviate from the Lingo3GAttributes.preferredLabelLength. A value of 0.0 allows no deviation: all labels must have the preferred length. Larger values allow more and more deviation, with the value of 20.0 meaning almost no preference at all.

When the preferred label length deviation is 0.0 and the fractional part of the preferred label length is 0.5, then the only allowed label lengths will be the two integers closest to the preferred label length value. For example, if preferred label length deviation is 0.0 and preferred label length is 2.5, the clustering engine will create only labels consisting of 2 or 3 words. If the fractional part of the preferred label length is other than 0.5, only the closest integer label length will be preferred.

Performance impact: none
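
The zero-deviation rule can be written down directly. The helper below is a hypothetical illustration of the rule as worded above, not part of the Lingo3G API, and it covers only the case where the deviation is 0.0.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: label lengths allowed when the deviation is 0.0. */
final class LabelLengthSketch {
    static List<Integer> allowedLengthsAtZeroDeviation(double preferredLength) {
        List<Integer> allowed = new ArrayList<>();
        double floor = Math.floor(preferredLength);
        if (preferredLength - floor == 0.5) {
            // Fractional part of exactly 0.5: both neighboring integers are allowed.
            allowed.add((int) floor);
            allowed.add((int) floor + 1);
        } else {
            // Otherwise only the closest integer length is allowed.
            allowed.add((int) Math.round(preferredLength));
        }
        return allowed;
    }
}
```

For a preferred length of 2.5 the helper returns the lengths 2 and 3, and for 2.2 it returns only 2.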

See Also:
Lingo3GAttributes.preferredLabelLengthDeviation

labelOverrideThreshold

public Lingo3GAttributesDescriptor.AttributeBuilder labelOverrideThreshold(Double value)
Determines the strength of the truncated label filters. The lowest value means strongest truncated labels elimination, which may lead to overlong cluster labels and many unclustered documents. The highest value effectively disables the filter, which may result in short or truncated labels.

Performance impact: low

See Also:
Lingo3GAttributes.labelOverrideThreshold

queryWordLabelWeight

public Lingo3GAttributesDescriptor.AttributeBuilder queryWordLabelWeight(Double value)
Determines the weight of labels containing query words. Lower values mean that phrases containing query words are less likely to appear as cluster labels. In particular, the value of 0.0 will totally eliminate query words from cluster labels. The value of 1.0, on the other hand, will cause the clustering engine to treat equally labels with and without query words.

Performance impact: low
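
The effect of the weight can be sketched as a multiplier applied to labels that contain a query word. The multiplicative form is an assumption that matches the two end points described above (0.0 eliminates such labels, 1.0 treats all labels equally); the names and the lower-case matching are hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical multiplicative weighting of labels that contain query words. */
final class QueryWordWeightSketch {
    /**
     * @param queryWords query words in lower case
     * @return the label score, scaled by the weight if any label word is a query word
     */
    static double weightedScore(double labelScore, List<String> labelWords,
                                Set<String> queryWords, double queryWordLabelWeight) {
        for (String word : labelWords) {
            if (queryWords.contains(word.toLowerCase(Locale.ROOT))) {
                return labelScore * queryWordLabelWeight;
            }
        }
        return labelScore; // no query word in the label: score is unchanged
    }
}
```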

See Also:
Lingo3GAttributes.queryWordLabelWeight

allowNumbersInLabels

public Lingo3GAttributesDescriptor.AttributeBuilder allowNumbersInLabels(Boolean value)
Allow numbers in labels switch. When switched on, the clustering engine will allow numbers to appear in cluster labels.

Performance impact: low

See Also:
Lingo3GAttributes.allowNumbersInLabels

lowercaseFunctionWords

public Lingo3GAttributesDescriptor.AttributeBuilder lowercaseFunctionWords(Boolean value)
Use lower case for function words in labels. When switched on, the clustering engine will convert all function words in labels into lower case. When switched off, particular function words will appear in labels in the case they appeared in the majority of input documents.

Performance impact: low

See Also:
Lingo3GAttributes.lowercaseFunctionWords

capitalizeNonFunctionWords

public Lingo3GAttributesDescriptor.AttributeBuilder capitalizeNonFunctionWords(Boolean value)
Capitalize non-function words in labels. When switched on, the clustering engine will capitalize all non-function words in labels. When switched off, particular words will appear in labels in the case they appeared in the majority of input documents.

Performance impact: low

See Also:
Lingo3GAttributes.capitalizeNonFunctionWords

removeRepeatedSynonymsFromLabels

public Lingo3GAttributesDescriptor.AttributeBuilder removeRepeatedSynonymsFromLabels(Boolean value)
Remove repeated synonyms from labels. When switched on, no synonymous words will appear in a single label. For example, if "photos" and "pictures" are declared synonyms, labels such as "Tiger Photos Pictures" or "Photos and Pictures" will not be generated.
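The check described above can be pictured with a small standalone sketch. This is not Lingo3G's implementation: the SynonymLabelCheck class, its method and the hard-coded synonym table are hypothetical and only illustrate rejecting a label in which two words belong to the same synonym group.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of the Lingo3G API.
final class SynonymLabelCheck {
    // Maps each word to the identifier of its synonym group (assumed sample data).
    private static final Map<String, String> GROUPS = new HashMap<>();
    static {
        GROUPS.put("photos", "photo-group");
        GROUPS.put("pictures", "photo-group");
    }

    // Returns true when two or more words of the label share a synonym group.
    static boolean hasRepeatedSynonyms(String label) {
        Set<String> seenGroups = new HashSet<>();
        for (String word : label.toLowerCase(Locale.ROOT).split("\\s+")) {
            String group = GROUPS.get(word);
            // Set.add returns false if the group was already seen in this label.
            if (group != null && !seenGroups.add(group)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasRepeatedSynonyms("Tiger Photos Pictures"));
        System.out.println(hasRepeatedSynonyms("Tiger Photos"));
    }
}
```

A label for which hasRepeatedSynonyms returns true would be the kind of label this switch suppresses.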

Performance impact: low

See Also:
Lingo3GAttributes.removeRepeatedSynonymsFromLabels

putPromotedLabelsAtHierarchyRoot

public Lingo3GAttributesDescriptor.AttributeBuilder putPromotedLabelsAtHierarchyRoot(Boolean value)
Put promoted labels at hierarchy root. When switched on, labels promoted using the label dictionary will always be put at the top level of the cluster hierarchy. When switched off, promoted labels will not be forced to appear at the hierarchy root and will be placed where they naturally belong, e.g. as subclusters of larger clusters.

Results impact: many labels can get promoted as a result of boosting, e.g. proper nouns defined in the built-in POS database. With this option enabled, all such labels will be put at the root of the cluster hierarchy, which may result in a clearly visible cluster overlap. For example, clusters Bill Clinton, President Bill Clinton and U.S. President Bill Clinton will all show at the root of the cluster tree, while with this option disabled, only the Bill Clinton cluster would be placed at the root of the hierarchy.

Performance impact: low

See Also:
Lingo3GAttributes.putPromotedLabelsAtHierarchyRoot

maxTokensPerDocument

public Lingo3GAttributesDescriptor.AttributeBuilder maxTokensPerDocument(Integer value)
Maximum tokens per document to read. Determines the maximum number of tokens (words) the clustering engine will read from each input document. When this parameter is set to 0, all tokens will be read.
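As a rough illustration of the limit semantics, the sketch below truncates a token list and treats 0 as "no limit". The TokenLimitSketch class and its method are hypothetical helpers written for this example; they are not Lingo3G code and say nothing about how the engine tokenizes documents.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of the Lingo3G API.
final class TokenLimitSketch {
    // Returns at most maxTokens leading tokens; 0 means "read all tokens".
    static List<String> limit(List<String> tokens, int maxTokens) {
        if (maxTokens == 0 || tokens.size() <= maxTokens) {
            return tokens;
        }
        return tokens.subList(0, maxTokens);
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("canon", "5d", "21mp", "camera");
        System.out.println(limit(tokens, 2));
        System.out.println(limit(tokens, 0));
    }
}
```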

Performance impact: high

See Also:
Lingo3GAttributes.maxTokensPerDocument

wordDfThesholdScalingFactor

public Lingo3GAttributesDescriptor.AttributeBuilder wordDfThesholdScalingFactor(Double value)
Word-level Document Frequency (DF) cut-off scaling factor. Determines how fast the word DF cut-off should grow with the increase of the number of documents. A value of 1.0 means that the word DF cut-off will increase by 1.0 per every 100 documents. Thus, for 100 documents the word DF cut-off will be 1.0, for 200 documents it will be 2.0, for 350 documents it will be 3.5 etc.

Performance impact: very high

Results impact: Setting low values for this parameter will preserve infrequent words, which can result in more accurate clustering (especially at subcluster level), at the cost of slower processing. Setting high values of this parameter will increase performance at the cost of lower clustering accuracy.

See Also:
Lingo3GAttributes.wordDfThesholdScalingFactor

phraseDfThesholdScalingFactor

public Lingo3GAttributesDescriptor.AttributeBuilder phraseDfThesholdScalingFactor(Double value)
Phrase-level Document Frequency (DF) cut-off scaling factor. Determines how fast the phrase DF cut-off should grow with the increase of the number of documents. A value of 0.2 means that the phrase DF cut-off will increase by 0.2 per every 100 documents. Thus, for 100 documents the phrase DF cut-off will be 1.0, for 200 documents it will be 1.2, for 600 documents it will be 2.0 etc.
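The numeric examples given for both the word-level and the phrase-level factor are consistent with a cut-off that equals 1.0 at 100 documents and grows by the scaling factor for every further 100 documents. The sketch below encodes that reading. The formula is inferred from the examples in this documentation, not taken from the Lingo3G source, and the DfCutoffSketch class is hypothetical; behavior for fewer than 100 documents is not described here and is not modeled.

```java
// Hypothetical illustration only; the formula is inferred from the documented examples.
final class DfCutoffSketch {
    // Cut-off is 1.0 at 100 documents and grows by scalingFactor per additional 100 documents.
    static double cutoff(double scalingFactor, int documentCount) {
        return 1.0 + scalingFactor * (documentCount / 100.0 - 1.0);
    }

    public static void main(String[] args) {
        // Word-level example values from the documentation (factor 1.0).
        System.out.println(cutoff(1.0, 100) + " " + cutoff(1.0, 200) + " " + cutoff(1.0, 350));
        // Phrase-level example values from the documentation (factor 0.2).
        System.out.println(cutoff(0.2, 100) + " " + cutoff(0.2, 200) + " " + cutoff(0.2, 600));
    }
}
```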

Performance impact: very high

Results impact: Setting low values for this parameter will preserve infrequent phrases, which can result in more accurate clustering (especially at subcluster level), at the cost of slower processing. Setting high values of this parameter will increase performance at the cost of lower clustering accuracy.

See Also:
Lingo3GAttributes.phraseDfThesholdScalingFactor

accentFolding

public Lingo3GAttributesDescriptor.AttributeBuilder accentFolding(Boolean value)
Converts national characters to ASCII counterparts. When accent folding is switched on, all national characters (e.g. 'ü', 'ç', 'ó') will be internally replaced with their ASCII counterparts ('u', 'c', 'o'), which will make e.g. the words "Bücher" and "Bucher" equivalent. Please note that this is an instance-level parameter and changes of its value at request time will not be respected.
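For readers unfamiliar with accent folding, a common way to approximate it in plain Java is Unicode decomposition followed by removal of combining marks. The sketch below uses that technique with java.text.Normalizer; it is an assumption for illustration and may differ from the folding Lingo3G performs internally. The AccentFoldingSketch class is hypothetical.

```java
import java.text.Normalizer;

// Hypothetical illustration only; Lingo3G's internal folding may work differently.
final class AccentFoldingSketch {
    // Decomposes accented characters (NFD) and strips the combining marks.
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "B\u00fccher" is "Bücher" written with a Unicode escape.
        System.out.println(fold("B\u00fccher").equals(fold("Bucher")));
    }
}
```

Note that this approach only handles characters that have a canonical decomposition; letters such as 'ł' or 'ø' would need an explicit mapping table.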

Performance impact: high

See Also:
Lingo3GAttributes.accentFolding

languageRecognition

public Lingo3GAttributesDescriptor.AttributeBuilder languageRecognition(Boolean value)
Language recognition switch. When switched on, for those input documents that do not have the Document.LANGUAGE field set, the clustering engine will attempt to recognize their language. If a document already has the Document.LANGUAGE set, it will be used for further processing.

Performance impact: medium

See Also:
Lingo3GAttributes.languageRecognition

trailingGenitiveLabelFilter

public Lingo3GAttributesDescriptor.AttributeBuilder trailingGenitiveLabelFilter(Boolean value)
Filters out phrases ending in Saxon genitive of an English noun, e.g. "Discover World's", "For your computers'".

Performance impact: low

See Also:
Lingo3GAttributes.trailingGenitiveLabelFilter

numberOnlyLabelFilter

public Lingo3GAttributesDescriptor.AttributeBuilder numberOnlyLabelFilter(Boolean value)
Filters out labels that consist only of numeric tokens.

Performance impact: low

See Also:
Lingo3GAttributes.numberOnlyLabelFilter

dashedWordsLabelFilter

public Lingo3GAttributesDescriptor.AttributeBuilder dashedWordsLabelFilter(Boolean value)
Filters out labels containing words starting or ending in a dash character ('-').

Performance impact: low

See Also:
Lingo3GAttributes.dashedWordsLabelFilter

oneLetterWordLabelFilter

public Lingo3GAttributesDescriptor.AttributeBuilder oneLetterWordLabelFilter(Boolean value)
Filters out labels containing only one-letter words, e.g. "M a f".

Performance impact: low

See Also:
Lingo3GAttributes.oneLetterWordLabelFilter

minLengthLabelFilter

public Lingo3GAttributesDescriptor.AttributeBuilder minLengthLabelFilter(Boolean value)
Filters out labels whose string representation (excluding spaces) is shorter than 3 characters.
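The rule can be read as a simple predicate on the label text. The following sketch is a hypothetical restatement of that rule (the MinLengthLabelSketch class is not part of Lingo3G) and assumes that "spaces" means the plain space character.

```java
// Hypothetical illustration only; not part of the Lingo3G API.
final class MinLengthLabelSketch {
    // Accepts a label when it has at least 3 characters once spaces are removed.
    static boolean accept(String label) {
        return label.replace(" ", "").length() >= 3;
    }

    public static void main(String[] args) {
        System.out.println(accept("a b"));
        System.out.println(accept("New York"));
    }
}
```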

Performance impact: low

See Also:
Lingo3GAttributes.minLengthLabelFilter

repeatedWordsLabelFilter

public Lingo3GAttributesDescriptor.AttributeBuilder repeatedWordsLabelFilter(Boolean value)
Filters out labels containing repeated words (e.g. "New York York").

Performance impact: low

See Also:
Lingo3GAttributes.repeatedWordsLabelFilter

dictionaryLabelFilter

public Lingo3GAttributesDescriptor.AttributeBuilder dictionaryLabelFilter(Boolean value)
Removes or boosts labels based on a predefined dictionary of words, phrases and regular expressions. Impact on performance depends on the number of regular expression entries in the label dictionary -- the more regular expression entries, the lower the processing speed.

Performance impact: medium to very high

See Also:
Lingo3GAttributes.dictionaryLabelFilter

leftCompleteLabelFilter

public Lingo3GAttributesDescriptor.AttributeBuilder leftCompleteLabelFilter(Boolean value)
Truncated labels filter. Heuristically eliminates truncated cluster labels (e.g. "York Restaurants"), replacing them with complete phrases, e.g. "New York Restaurants", based on the context. It is recommended to use this filter in combination with 'Right complete label filter'. Strength of truncated label elimination is determined by the 'Label override threshold' parameter.

Performance impact: medium

See Also:
Lingo3GAttributes.leftCompleteLabelFilter

rightCompleteLabelFilter

public Lingo3GAttributesDescriptor.AttributeBuilder rightCompleteLabelFilter(Boolean value)
Truncated labels filter. Heuristically eliminates truncated cluster labels (e.g. "York Restaurants"), replacing them with complete phrases, e.g. "New York Restaurants", based on the context. It is recommended to use this filter in combination with 'Left complete label filter'. Strength of truncated label elimination is determined by the 'Label override threshold' parameter.

Performance impact: medium

See Also:
Lingo3GAttributes.rightCompleteLabelFilter

dictionaryWeightLabelScorerWeight

public Lingo3GAttributesDescriptor.AttributeBuilder dictionaryWeightLabelScorerWeight(Double value)
Boosts label scores by a factor specified in the label dictionary file. If this scorer has weight 0, label boosting will not be applied.

Performance impact: low

See Also:
Lingo3GAttributes.dictionaryWeightLabelScorerWeight

wordCountLabelScorerWeight

public Lingo3GAttributesDescriptor.AttributeBuilder wordCountLabelScorerWeight(Double value)
Assigns higher scores to labels that consist of 2, 3 or 4 words. Longer labels are penalized -- the longer the label, the higher the penalty.

Performance impact: low

See Also:
Lingo3GAttributes.wordCountLabelScorerWeight

titleWordLabelScorerWeight

public Lingo3GAttributesDescriptor.AttributeBuilder titleWordLabelScorerWeight(Double value)
Assigns higher scores to labels that contain words that appeared in input documents' titles.

Performance impact: low

See Also:
Lingo3GAttributes.titleWordLabelScorerWeight

capitalizedWordLabelScorerWeight

public Lingo3GAttributesDescriptor.AttributeBuilder capitalizedWordLabelScorerWeight(Double value)
Assigns higher scores to labels that contain capitalized words.

Performance impact: low

See Also:
Lingo3GAttributes.capitalizedWordLabelScorerWeight

maxWordDf

public Lingo3GAttributesDescriptor.AttributeBuilder maxWordDf(Double value)
Maximum word document frequency. The maximum document frequency allowed for words as a fraction of all documents. Words with document frequency larger than maxWordDf will be ignored.

For example, when maxWordDf is 0.4, words appearing in more than 40% of documents will be ignored. A value of 1.0 means that all words will be taken into account, no matter in how many documents they appear.

This attribute may be useful when certain words appear in most of the input documents (e.g. company name from header or footer) and such words dominate the cluster labels. In such a case, setting maxWordDf to a value lower than 1.0, e.g. 0.9, may improve the clusters.

Another useful application of this attribute is when there is a need to generate only very specific clusters, i.e. clusters containing small numbers of documents. This can be achieved by setting maxWordDf to extremely low values, e.g. 0.1 or 0.05.
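The thresholding described above amounts to comparing each word's document frequency, expressed as a fraction of all documents, against maxWordDf. The sketch below illustrates this with made-up counts; the MaxWordDfSketch class and the sample data are hypothetical and do not reflect how Lingo3G stores word statistics.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of the Lingo3G API.
final class MaxWordDfSketch {
    // Keeps words whose document frequency fraction does not exceed maxWordDf.
    static Map<String, Integer> keepAllowed(Map<String, Integer> wordDf, int totalDocs, double maxWordDf) {
        Map<String, Integer> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : wordDf.entrySet()) {
            double fraction = entry.getValue() / (double) totalDocs;
            if (fraction <= maxWordDf) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed sample: number of documents (out of 10) containing each word.
        Map<String, Integer> wordDf = new LinkedHashMap<>();
        wordDf.put("acme", 9);
        wordDf.put("camera", 3);
        wordDf.put("lens", 2);
        System.out.println(keepAllowed(wordDf, 10, 0.4).keySet());
        System.out.println(keepAllowed(wordDf, 10, 1.0).keySet());
    }
}
```

In this sample, a company name present in 9 of 10 documents is dropped at 0.4 but retained at 1.0.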

Performance impact: low

See Also:
Lingo3GAttributes.maxWordDf

unindexedWordLabelScorerWeight

public Lingo3GAttributesDescriptor.AttributeBuilder unindexedWordLabelScorerWeight(Double value)
Penalizes labels that contain too many function words.

Performance impact: low

See Also:
Lingo3GAttributes.unindexedWordLabelScorerWeight

queryWordLabelScorerWeight

public Lingo3GAttributesDescriptor.AttributeBuilder queryWordLabelScorerWeight(Double value)
Penalizes labels that contain query words.

Performance impact: low

See Also:
Lingo3GAttributes.queryWordLabelScorerWeight

tfDfRatioLabelScorerWeight

public Lingo3GAttributesDescriptor.AttributeBuilder tfDfRatioLabelScorerWeight(Double value)
Assigns higher score to more general/shorter labels.

Performance impact: low

See Also:
Lingo3GAttributes.tfDfRatioLabelScorerWeight

tfLabelScorerWeight

public Lingo3GAttributesDescriptor.AttributeBuilder tfLabelScorerWeight(Double value)
Assigns higher scores to labels with higher Term Frequency (TF).

Performance impact: low

See Also:
Lingo3GAttributes.tfLabelScorerWeight

documentCountLabelScorerWeight

public Lingo3GAttributesDescriptor.AttributeBuilder documentCountLabelScorerWeight(Double value)
Assigns higher scores to clusters whose number of documents in relation to the total number of documents is equal to or smaller than specified by the 'Maximum cluster size' parameter.

Performance impact: low

See Also:
Lingo3GAttributes.documentCountLabelScorerWeight

clusterSetDocumentOverlapLabelScorerWeight

public Lingo3GAttributesDescriptor.AttributeBuilder clusterSetDocumentOverlapLabelScorerWeight(Double value)
Assigns higher scores to labels that contain documents not present in the current cluster set.

Performance impact: low

See Also:
Lingo3GAttributes.clusterSetDocumentOverlapLabelScorerWeight

dictionarySynonymMarkerEnabled

public Lingo3GAttributesDescriptor.AttributeBuilder dictionarySynonymMarkerEnabled(Boolean value)
When switched on, the clustering engine will apply synonyms defined in the synonyms.[lang].xml file.

Performance impact: medium

See Also:
Lingo3GAttributes.dictionarySynonymMarkerEnabled

dashedWordsSynonymMarkerEnabled

public Lingo3GAttributesDescriptor.AttributeBuilder dashedWordsSynonymMarkerEnabled(Boolean value)
When switched on, the clustering engine will treat words separated by a space (' '), period ('.'), slash ('/') or a dash ('-') or written together and the corresponding phrases as synonymous, e.g. "data-mining", "data.mining", "datamining", "data/mining" and "data mining".
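One way to picture this equivalence is a normalization step that maps all the listed variants to a single key. The sketch below removes the four separators named above and lower-cases the result; it is a hypothetical illustration (the DashedWordsSketch class is not part of Lingo3G) and does not claim to match the engine's matching rules.

```java
import java.util.Locale;

// Hypothetical illustration only; not part of the Lingo3G API.
final class DashedWordsSketch {
    // Removes spaces, periods, slashes and dashes so that all variants share one key.
    static String key(String phrase) {
        return phrase.toLowerCase(Locale.ROOT).replaceAll("[ ./-]", "");
    }

    public static void main(String[] args) {
        String[] variants = {"data-mining", "data.mining", "datamining", "data/mining", "data mining"};
        for (String variant : variants) {
            System.out.println(variant + " -> " + key(variant));
        }
    }
}
```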

Performance impact: medium

See Also:
Lingo3GAttributes.dashedWordsSynonymMarkerEnabled

extraktSynonymMarkerEnabled

public Lingo3GAttributesDescriptor.AttributeBuilder extraktSynonymMarkerEnabled(Boolean value)
When switched on, the clustering engine will apply synonyms obtained from the Extrakt linguistic engine. This option is applicable only when the Extrakt engine is available, ignored otherwise.

Performance impact: high

See Also:
Lingo3GAttributes.extraktSynonymMarkerEnabled

titleFields

public Lingo3GAttributesDescriptor.AttributeBuilder titleFields(List<String> value)
Title fields to use for clustering. Specifies the list of document field names that provide the content for clustering. Depending on the value of the title-word-label-scorer-weight attribute, content of fields provided in this attribute can be given more weight during clustering.

See Also:
Lingo3GAttributes.titleFields

contentFields

public Lingo3GAttributesDescriptor.AttributeBuilder contentFields(List<String> value)
Content fields to use for clustering. Specifies the list of document field names that provide the content for clustering. As opposed to the title-fields attribute, fields provided in this attribute will not be given any extra weight during clustering.

See Also:
Lingo3GAttributes.contentFields

clusterScoringFields

public Lingo3GAttributesDescriptor.AttributeBuilder clusterScoringFields(Lingo3GAttributes.ClusterScoringFields value)
Extra fields to use for cluster scoring. If your input data contains structured data in addition to unstructured text, you can use the structured data to guide Lingo3G towards creating clusters having some specific properties.

Usage scenario

For example, let us assume your data describes e-commerce products and has the following fields:

While Lingo3G will draw cluster labels from the unstructured text of the title and description fields, it can also use the structured data to e.g. (see below for formal syntax specification):

Syntax

Cluster scoring field specification has the following form:

field:type:scoring:weight

where:

You can use commas to perform cluster scoring based on more than one field, e.g.:

field1:type1:scoring1:weight1, field2:type2:scoring2:weight2, ...
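To make the shape of the specification concrete, the sketch below splits a comma-separated specification into entries of exactly four colon-separated parts. It treats every part as an opaque string and does not validate the type, scoring or weight values, since their allowed values are not listed here. The ScoringSpecSketch class is hypothetical and is not the parser Lingo3G uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the parser used by Lingo3G.
final class ScoringSpecSketch {
    // Splits "field:type:scoring:weight, ..." into one 4-element array per entry.
    static List<String[]> parse(String spec) {
        List<String[]> entries = new ArrayList<>();
        for (String entry : spec.split(",")) {
            String[] parts = entry.trim().split(":");
            if (parts.length != 4) {
                throw new IllegalArgumentException(
                        "Expected field:type:scoring:weight but got: " + entry.trim());
            }
            entries.add(parts);
        }
        return entries;
    }

    public static void main(String[] args) {
        String spec = "field1:type1:scoring1:weight1, field2:type2:scoring2:weight2";
        for (String[] parts : parse(spec)) {
            System.out.println(Arrays.toString(parts));
        }
    }
}
```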

Adding extra fields to Carrot2 input XML

You can specify the extra field in Carrot2 XML documents using the field tag in the following way:

<document>
  <title>Canon 5D</title>
  <snippet>21MP camera</snippet>
  <url></url>
  <field key="price"><value type="java.lang.Double" value="149.90" /></field>
  <field key="votes"><value type="java.lang.Integer" value="4370" /></field>
  <field key="category"><value type="java.lang.String" value="Photo" /></field>
</document>

See Also:
Lingo3GAttributes.clusterScoringFields

clusterScoringFields

public Lingo3GAttributesDescriptor.AttributeBuilder clusterScoringFields(Class<? extends Lingo3GAttributes.ClusterScoringFields> clazz)
Extra fields to use for cluster scoring. If your input data contains structured data in addition to unstructured text, you can use the structured data to guide Lingo3G towards creating clusters having some specific properties.

Usage scenario

For example, let us assume your data describes e-commerce products and has the following fields:

While Lingo3G will draw cluster labels from the unstructured text of the title and description fields, it can also use the structured data to e.g. (see below for formal syntax specification):

Syntax

Cluster scoring field specification has the following form:

field:type:scoring:weight

where:

You can use commas to perform cluster scoring based on more than one field, e.g.:

field1:type1:scoring1:weight1, field2:type2:scoring2:weight2, ...

Adding extra fields to Carrot2 input XML

You can specify the extra field in Carrot2 XML documents using the field tag in the following way:

<document>
  <title>Canon 5D</title>
  <snippet>21MP camera</snippet>
  <url></url>
  <field key="price"><value type="java.lang.Double" value="149.90" /></field>
  <field key="votes"><value type="java.lang.Integer" value="4370" /></field>
  <field key="category"><value type="java.lang.String" value="Photo" /></field>
</document>

See Also:
Lingo3GAttributes.clusterScoringFields

unknownWordHandlingStrategy

public Lingo3GAttributesDescriptor.AttributeBuilder unknownWordHandlingStrategy(Lingo3GAttributes.UnknownWordHandlingStrategy value)
Handling of unknown words in persistent clusters. Defines how Lingo3G should treat unknown words in labels of persistent clusters. A word is unknown when it occurs in the persistent cluster's label but it is not present in any of the documents being clustered.

The two available options are:

Performance impact: none

See Also:
Lingo3GAttributes.unknownWordHandlingStrategy

unknownWordHandlingStrategy

public Lingo3GAttributesDescriptor.AttributeBuilder unknownWordHandlingStrategy(Class<? extends Lingo3GAttributes.UnknownWordHandlingStrategy> clazz)
Handling of unknown words in persistent clusters. Defines how Lingo3G should treat unknown words in labels of persistent clusters. A word is unknown when it occurs in the persistent cluster's label but it is not present in any of the documents being clustered.

The two available options are:

Performance impact: none

See Also:
Lingo3GAttributes.unknownWordHandlingStrategy

carrot2StemmerFactory

public Lingo3GAttributesDescriptor.AttributeBuilder carrot2StemmerFactory(IStemmerFactory value)
Stemmer factory. Creates the stemmers to be used by the clustering algorithm.

See Also:
Lingo3GAttributes.carrot2StemmerFactory

carrot2StemmerFactory

public Lingo3GAttributesDescriptor.AttributeBuilder carrot2StemmerFactory(Class<? extends IStemmerFactory> clazz)
Stemmer factory. Creates the stemmers to be used by the clustering algorithm.

See Also:
Lingo3GAttributes.carrot2StemmerFactory

carrot2TokenizerFactory

public Lingo3GAttributesDescriptor.AttributeBuilder carrot2TokenizerFactory(ITokenizerFactory value)
Tokenizer factory. Creates the tokenizers to be used by the clustering algorithm.

See Also:
Lingo3GAttributes.carrot2TokenizerFactory

carrot2TokenizerFactory

public Lingo3GAttributesDescriptor.AttributeBuilder carrot2TokenizerFactory(Class<? extends ITokenizerFactory> clazz)
Tokenizer factory. Creates the tokenizers to be used by the clustering algorithm.

See Also:
Lingo3GAttributes.carrot2TokenizerFactory

useBuiltInWordDatabaseForLabelFiltering

public Lingo3GAttributesDescriptor.AttributeBuilder useBuiltInWordDatabaseForLabelFiltering(boolean value)
Use built-in word database for label filtering. If enabled, Lingo3G will perform label filtering based on the built-in word databases in addition to the word dictionary XML files. Currently, a built-in word database is available only for the English language.

Results impact: If this option is enabled, Lingo3G should produce better-formed cluster labels. For example, labels being, starting or ending with a verb or adjective should appear less frequently. However, because of the limitations of the current part of speech tagging model (please see below), enabling this option is also likely to prevent certain well-formed cluster labels, e.g. if the built-in word database misinterprets a noun for a verb.

Limitations of the part of speech tagging model. Currently, Lingo3G uses a unigram model for assigning part of speech tags to words. This means that for each word having multiple part of speech tags (such as "program" in English, which, depending on the context, can be both a verb and a noun), one of the available tags needs to be chosen. To do that, Lingo3G employs a heuristic that takes into account the word frequency and the set of part of speech tags the word has. While the heuristic is fairly efficient in general, some words may be tagged erroneously. To provide a solution for such cases, the built-in part of speech database tags can be overridden in the user-defined XML word dictionary.

Performance impact: small.

See Also:
Lingo3GAttributes.useBuiltInWordDatabaseForLabelFiltering

useBuiltInWordDatabaseForStemming

public Lingo3GAttributesDescriptor.AttributeBuilder useBuiltInWordDatabaseForStemming(boolean value)
Use built-in word database for stemming. If enabled, Lingo3G will use the word inflection database rather than an algorithmic stemmer. Currently, a word inflection database is available only for the English language.

Stemmers or word inflection databases transform various forms of a word to one common root. This is required to make sure that a cluster labeled e.g. Programming contains documents referencing all variants of the word, such as programs, programmer or programmed.

Results impact: Algorithmic stemming tends to be more aggressive compared to stemming based on word inflection dictionaries shipping with Lingo3G. This means that with algorithmic stemming all the following forms: program, programming, programmer and programmable will be treated as the same concept, while with the word database based stemming, they will be treated as separate, different concepts. As a result, with algorithmic stemming, a cluster labeled Program will contain documents referring to all of program, programs, programming, programmer and programmable, while with the word database based stemming, the cluster will contain only documents referring to program and programs.

Enabling this option is recommended only when it is important to distinguish between slight variations of the same general concept, e.g. programming and program.

Performance impact: small.

See Also:
Lingo3GAttributes.useBuiltInWordDatabaseForStemming


Copyright (c) Dawid Weiss, Stanislaw Osinski