com.carrotsearch.lingo3g
Class Lingo3GAttributesDescriptor.AttributeBuilder
java.lang.Object
com.carrotsearch.lingo3g.Lingo3GAttributesDescriptor.AttributeBuilder
- Enclosing class:
- Lingo3GAttributesDescriptor
public static class Lingo3GAttributesDescriptor.AttributeBuilder
- extends Object
Attribute map builder for the Lingo3GAttributes component. You can use this
builder as a type-safe alternative to populating the attribute map using attribute keys.
|
Method Summary |
Lingo3GAttributesDescriptor.AttributeBuilder |
accentFolding(Boolean value)
Converts national characters to ASCII counterparts. |
Lingo3GAttributesDescriptor.AttributeBuilder |
aggressiveCloningControl(Boolean value)
Aggressive cluster cloning control switch. |
Lingo3GAttributesDescriptor.AttributeBuilder |
allowNumbersInLabels(Boolean value)
Allow numbers in labels switch. |
Lingo3GAttributesDescriptor.AttributeBuilder |
allowOneDocumentClusters(Boolean value)
When enabled, the algorithm will not prune clusters containing only one document. |
Lingo3GAttributesDescriptor.AttributeBuilder |
capitalizedWordLabelScorerWeight(Double value)
Assigns higher scores to labels that contain capitalized words. |
Lingo3GAttributesDescriptor.AttributeBuilder |
capitalizeNonFunctionWords(Boolean value)
Capitalize non function words in labels. |
Lingo3GAttributesDescriptor.AttributeBuilder |
carrot2StemmerFactory(Class<? extends IStemmerFactory> clazz)
Stemmer factory. |
Lingo3GAttributesDescriptor.AttributeBuilder |
carrot2StemmerFactory(IStemmerFactory value)
Stemmer factory. |
Lingo3GAttributesDescriptor.AttributeBuilder |
carrot2TokenizerFactory(Class<? extends ITokenizerFactory> clazz)
Tokenizer factory. |
Lingo3GAttributesDescriptor.AttributeBuilder |
carrot2TokenizerFactory(ITokenizerFactory value)
Tokenizer factory. |
Lingo3GAttributesDescriptor.AttributeBuilder |
cloningControl(Boolean value)
Cluster cloning control switch. |
Lingo3GAttributesDescriptor.AttributeBuilder |
clusterCountBase(Integer value)
The number of clusters discovered in each clustering pass. |
Lingo3GAttributesDescriptor.AttributeBuilder |
clusterScoringFields(Class<? extends Lingo3GAttributes.ClusterScoringFields> clazz)
Extra fields to use for cluster scoring. |
Lingo3GAttributesDescriptor.AttributeBuilder |
clusterScoringFields(Lingo3GAttributes.ClusterScoringFields value)
Extra fields to use for cluster scoring. |
Lingo3GAttributesDescriptor.AttributeBuilder |
clusterSetDocumentOverlapLabelScorerWeight(Double value)
Assigns higher scores to labels that contain documents not present in the current
cluster set. |
Lingo3GAttributesDescriptor.AttributeBuilder |
combinedClusterScoreBalance(Double value)
Decides whether document count or cluster label score should have larger impact on
the cluster score. |
Lingo3GAttributesDescriptor.AttributeBuilder |
contentFields(List<String> value)
Content fields to use for clustering. |
Lingo3GAttributesDescriptor.AttributeBuilder |
dashedWordsLabelFilter(Boolean value)
Filters out labels containing words starting or ending in a dash character ('-'). |
Lingo3GAttributesDescriptor.AttributeBuilder |
dashedWordsSynonymMarkerEnabled(Boolean value)
When switched on, the clustering engine will treat words separated by a space
(' '), period ('.'), slash ('/') or a dash ('-') or written together and the
corresponding phrases as synonymous, e.g. |
Lingo3GAttributesDescriptor.AttributeBuilder |
dictionaryLabelFilter(Boolean value)
Removes or boosts labels based on a predefined dictionary of words, phrases and
regular expressions. |
Lingo3GAttributesDescriptor.AttributeBuilder |
dictionarySynonymMarkerEnabled(Boolean value)
When switched on, the clustering engine will apply synonyms defined in the
synonyms.[lang].xml file. |
Lingo3GAttributesDescriptor.AttributeBuilder |
dictionaryWeightLabelScorerWeight(Double value)
Boosts label scores by a factor specified in the label dictionary file. |
Lingo3GAttributesDescriptor.AttributeBuilder |
documentCountLabelScorerWeight(Double value)
Assigns higher scores to clusters whose number of documents in relation to the
total number of documents is equal or smaller than specified by the 'Maximum
cluster size' parameter. |
Lingo3GAttributesDescriptor.AttributeBuilder |
documentCoverageTarget(Double value)
The percentage of input documents to be put in clusters. |
Lingo3GAttributesDescriptor.AttributeBuilder |
extraktSynonymMarkerEnabled(Boolean value)
When switched on, the clustering engine will apply synonyms obtained from the
Extrakt linguistic engine. |
Lingo3GAttributesDescriptor.AttributeBuilder |
flatMerging(Boolean value)
Flat merging switch. |
Lingo3GAttributesDescriptor.AttributeBuilder |
hierarchicalMerging(Boolean value)
Hierarchical merging switch. |
Lingo3GAttributesDescriptor.AttributeBuilder |
hierarchicalMergingWithLabels(Boolean value)
Label merging switch. |
Lingo3GAttributesDescriptor.AttributeBuilder |
labelOverrideThreshold(Double value)
Determines the strength of the truncated label filters. |
Lingo3GAttributesDescriptor.AttributeBuilder |
languageRecognition(Boolean value)
Language recognition switch. |
Lingo3GAttributesDescriptor.AttributeBuilder |
leftCompleteLabelFilter(Boolean value)
Truncated labels filter. |
Lingo3GAttributesDescriptor.AttributeBuilder |
license(Class<? extends IResource> clazz)
An explicit program license resource. |
Lingo3GAttributesDescriptor.AttributeBuilder |
license(IResource value)
An explicit program license resource. |
Lingo3GAttributesDescriptor.AttributeBuilder |
lowercaseFunctionWords(Boolean value)
Use lower case for function words in labels. |
Lingo3GAttributesDescriptor.AttributeBuilder |
maxClusteringPassesSub(Integer value)
Maximum number of clustering passes to perform on subclusters. |
Lingo3GAttributesDescriptor.AttributeBuilder |
maxClusteringPassesTop(Integer value)
Maximum number of clustering passes to perform on top hierarchy level. |
Lingo3GAttributesDescriptor.AttributeBuilder |
maxClusterSize(Double value)
Determines the maximum allowed size of a cluster in relation to the parent cluster
size. |
Lingo3GAttributesDescriptor.AttributeBuilder |
maxHierarchyDepth(Integer value)
The maximum number of cluster levels to create. |
Lingo3GAttributesDescriptor.AttributeBuilder |
maxImprovementIterations(Integer value)
The number of clustering improvement iterations to perform. |
Lingo3GAttributesDescriptor.AttributeBuilder |
maxLabelWords(Integer value)
Determines the maximum label length in words. |
Lingo3GAttributesDescriptor.AttributeBuilder |
maxTokensPerDocument(Integer value)
Maximum tokens per document to read. |
Lingo3GAttributesDescriptor.AttributeBuilder |
maxWordDf(Double value)
Maximum word document frequency. |
Lingo3GAttributesDescriptor.AttributeBuilder |
mergeThreshold(Double value)
Cluster merge threshold. |
Lingo3GAttributesDescriptor.AttributeBuilder |
minClusterSize(Double value)
Determines the minimum allowed size of a cluster in relation to the parent cluster
size. |
Lingo3GAttributesDescriptor.AttributeBuilder |
minClusterSizeForSubclusters(Integer value)
The minimum number of documents that must be assigned to a cluster before the
clustering engine attempts to create subclusters for that cluster. |
Lingo3GAttributesDescriptor.AttributeBuilder |
minLabelWords(Integer value)
Determines the minimum label length in words. |
Lingo3GAttributesDescriptor.AttributeBuilder |
minLengthLabelFilter(Boolean value)
Filters out labels whose string representation (excluding spaces) is shorter than 3
characters. |
Lingo3GAttributesDescriptor.AttributeBuilder |
neighborhoodSize(Integer value)
Maximum similar clusterings to examine. |
Lingo3GAttributesDescriptor.AttributeBuilder |
normalizeScores(Boolean value)
Cluster and label score normalization switch. |
Lingo3GAttributesDescriptor.AttributeBuilder |
numberOnlyLabelFilter(Boolean value)
Filters out labels that consist only of numeric tokens. |
Lingo3GAttributesDescriptor.AttributeBuilder |
oneLetterWordLabelFilter(Boolean value)
Filters out labels containing only one-letter words, e.g. |
Lingo3GAttributesDescriptor.AttributeBuilder |
phraseDfThesholdScalingFactor(Double value)
Phrase-level Document Frequency (DF) cut-off scaling factor. |
Lingo3GAttributesDescriptor.AttributeBuilder |
preciseDocumentAssignment(Boolean value)
When precise document assignment is switched off, clusters with multi-word labels
will contain all documents that contain the label's words in any order and at any
position. |
Lingo3GAttributesDescriptor.AttributeBuilder |
preferredLabelLength(double value)
Instructs the clustering engine to prefer cluster labels consisting of the
specified number of words. |
Lingo3GAttributesDescriptor.AttributeBuilder |
preferredLabelLengthDeviation(double value)
Allowed deviation from the preferred label length. |
Lingo3GAttributesDescriptor.AttributeBuilder |
putPromotedLabelsAtHierarchyRoot(Boolean value)
Put promoted labels at hierarchy root. |
Lingo3GAttributesDescriptor.AttributeBuilder |
queryWordLabelScorerWeight(Double value)
Penalizes labels that contain query words. |
Lingo3GAttributesDescriptor.AttributeBuilder |
queryWordLabelWeight(Double value)
Determines the weight of labels containing query words. |
Lingo3GAttributesDescriptor.AttributeBuilder |
reloadResources(Boolean value)
Forced resources reload switch. |
Lingo3GAttributesDescriptor.AttributeBuilder |
removeRepeatedSynonymsFromLabels(Boolean value)
Remove repeated synonyms from labels. |
Lingo3GAttributesDescriptor.AttributeBuilder |
repeatedWordsLabelFilter(Boolean value)
Filters out labels containing repeated words (e.g. "New York York"). |
Lingo3GAttributesDescriptor.AttributeBuilder |
resourceLookup(Class<? extends ResourceLookup> clazz)
|
Lingo3GAttributesDescriptor.AttributeBuilder |
resourceLookup(ResourceLookup value)
|
Lingo3GAttributesDescriptor.AttributeBuilder |
rightCompleteLabelFilter(Boolean value)
Truncated labels filter. |
Lingo3GAttributesDescriptor.AttributeBuilder |
singleWordLabelWeight(Double value)
Determines how willing the clustering engine will be to select single words as
cluster labels. |
Lingo3GAttributesDescriptor.AttributeBuilder |
tfDfRatioLabelScorerWeight(Double value)
Assigns higher score to more general/shorter labels. |
Lingo3GAttributesDescriptor.AttributeBuilder |
tfLabelScorerWeight(Double value)
Assigns higher scores to labels with higher Term Frequency (TF). |
Lingo3GAttributesDescriptor.AttributeBuilder |
titleFields(List<String> value)
Title fields to use for clustering. |
Lingo3GAttributesDescriptor.AttributeBuilder |
titleWordLabelScorerWeight(Double value)
Assigns higher scores to labels that contain words that appeared in input documents'
titles. |
Lingo3GAttributesDescriptor.AttributeBuilder |
trailingGenitiveLabelFilter(Boolean value)
Filters out phrases ending in Saxon genitive of an English noun, e.g. |
Lingo3GAttributesDescriptor.AttributeBuilder |
unindexedWordLabelScorerWeight(Double value)
Penalizes labels that contain too many function words. |
Lingo3GAttributesDescriptor.AttributeBuilder |
unknownWordHandlingStrategy(Class<? extends Lingo3GAttributes.UnknownWordHandlingStrategy> clazz)
Handling of unknown words in persistent clusters. |
Lingo3GAttributesDescriptor.AttributeBuilder |
unknownWordHandlingStrategy(Lingo3GAttributes.UnknownWordHandlingStrategy value)
Handling of unknown words in persistent clusters. |
Lingo3GAttributesDescriptor.AttributeBuilder |
useBuiltInWordDatabaseForLabelFiltering(boolean value)
Use built-in word database for label filtering. |
Lingo3GAttributesDescriptor.AttributeBuilder |
useBuiltInWordDatabaseForStemming(boolean value)
Use built-in word database for stemming. |
Lingo3GAttributesDescriptor.AttributeBuilder |
wordCountLabelScorerWeight(Double value)
Assigns higher scores to labels that consist of 2, 3 or 4 words. |
Lingo3GAttributesDescriptor.AttributeBuilder |
wordDfThesholdScalingFactor(Double value)
Word-level Document Frequency (DF) cut-off scaling factor. |
| Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
map
public final Map<String,Object> map
- The attribute map populated by this builder.
Lingo3GAttributesDescriptor.AttributeBuilder
protected Lingo3GAttributesDescriptor.AttributeBuilder(Map<String,Object> map)
- Creates a builder backed by the provided map.
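The map-backed builder idea can be illustrated with a minimal, self-contained sketch. It does not use the Lingo3G classes: `SketchBuilder` and the `sketch.*` keys are hypothetical stand-ins, and only the two method names mirror methods listed on this page. Each setter stores its value in the backing map and returns the builder, so calls can be chained and the compiler checks the argument types.

```java
import java.util.HashMap;
import java.util.Map;

class AttributeBuilderSketch {
    // Hypothetical stand-in for a map-backed attribute builder; NOT the Lingo3G class.
    static final class SketchBuilder {
        final Map<String, Object> map;

        SketchBuilder(Map<String, Object> map) {
            this.map = map;
        }

        SketchBuilder maxHierarchyDepth(Integer value) {
            map.put("sketch.maxHierarchyDepth", value); // hypothetical key
            return this;
        }

        SketchBuilder hierarchicalMerging(Boolean value) {
            map.put("sketch.hierarchicalMerging", value); // hypothetical key
            return this;
        }
    }

    // Builds a map for flat clustering: depth 1 with hierarchical merging disabled.
    static Map<String, Object> flatClusteringAttributes() {
        Map<String, Object> attributes = new HashMap<>();
        new SketchBuilder(attributes)
            .maxHierarchyDepth(1)
            .hierarchicalMerging(false);
        return attributes;
    }

    public static void main(String[] args) {
        System.out.println(flatClusteringAttributes());
    }
}
```

Compared with writing string keys by hand, a misspelled method name or a value of the wrong type fails at compile time rather than at clustering time.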
reloadResources
public Lingo3GAttributesDescriptor.AttributeBuilder reloadResources(Boolean value)
- Forced resources reload switch. Causes the clustering engine to reload lexical
resources (stopwords, label dictionaries, synonyms etc.) on every clustering
request. This is a debug-only switch, particularly useful when tuning lexical
resources.
When running Lingo3G within Lingo3G Workbench, the lexical resources are loaded
from the workspace subdirectory of the Lingo3G Workbench installation
directory. If resource reloading is enabled, all changes made to the lexical
resources will take effect immediately and will not require restarting Lingo3G
Workbench.
Performance impact: very high. Make sure resource reloading is switched off in
production settings.
- See Also:
Lingo3GAttributes.reloadResources
resourceLookup
public Lingo3GAttributesDescriptor.AttributeBuilder resourceLookup(ResourceLookup value)
- See Also:
Lingo3GAttributes.resourceLookup
resourceLookup
public Lingo3GAttributesDescriptor.AttributeBuilder resourceLookup(Class<? extends ResourceLookup> clazz)
- See Also:
Lingo3GAttributes.resourceLookup
license
public Lingo3GAttributesDescriptor.AttributeBuilder license(IResource value)
- An explicit program license resource. By default, the license is sought in a set of
default locations. This attribute provides an explicit license to be used. If
this attribute has a non-null value, default locations are not scanned.
- See Also:
Lingo3GAttributes.license
license
public Lingo3GAttributesDescriptor.AttributeBuilder license(Class<? extends IResource> clazz)
- An explicit program license resource. By default, the license is sought in a set of
default locations. This attribute provides an explicit license to be used. If
this attribute has a non-null value, default locations are not scanned.
- See Also:
Lingo3GAttributes.license
maxHierarchyDepth
public Lingo3GAttributesDescriptor.AttributeBuilder maxHierarchyDepth(Integer value)
- The maximum number of cluster levels to create. Setting this parameter to 1 will
disable hierarchical clustering. In that case, it is also recommended to disable
hierarchical merging, which will preserve smaller clusters.
Performance impact: high
- See Also:
Lingo3GAttributes.maxHierarchyDepth
clusterCountBase
public Lingo3GAttributesDescriptor.AttributeBuilder clusterCountBase(Integer value)
- The number of clusters discovered in each clustering pass. The higher the value of
this parameter, the larger the total number of clusters.
Performance impact: medium
- See Also:
Lingo3GAttributes.clusterCountBase
maxClusteringPassesTop
public Lingo3GAttributesDescriptor.AttributeBuilder maxClusteringPassesTop(Integer value)
- Maximum number of clustering passes to perform on top hierarchy level. Determines
the maximum number of cluster discovery passes the clustering engine should perform
to discover the top-level clusters. The first clustering pass discovers large/more
general clusters, while further passes find smaller/more specific clusters. Setting
the maximum number of passes to 0 will force the algorithm to stop clustering only
when no more clusters can be created or the 'Document coverage target' has been
reached.
Performance impact: high
Results impact: With the lowest value of this parameter, the clustering engine will
discover only the largest clusters, while with higher values, smaller and more
specific clusters will also be created. Setting this parameter to 0 will cause the
clustering algorithm to create the maximum possible number of clusters.
- See Also:
Lingo3GAttributes.maxClusteringPassesTop
maxClusteringPassesSub
public Lingo3GAttributesDescriptor.AttributeBuilder maxClusteringPassesSub(Integer value)
- Maximum number of clustering passes to perform on subclusters. Determines the
maximum number of cluster discovery passes the clustering engine should perform to
discover subclusters. The first clustering pass discovers large/more general
clusters, while further passes find smaller/more specific clusters. Setting the
maximum number of passes to 0 will force the algorithm to stop clustering only when
no more subclusters can be created or the 'Document coverage target' has been
reached for the parent cluster.
Performance impact: high
Results impact: With the lowest value of this parameter, the clustering engine will
discover only the largest clusters, while with higher values, smaller and more
specific clusters will also be created. Setting this parameter to 0 will cause the
clustering algorithm to create the maximum possible number of subclusters for each
cluster.
- See Also:
Lingo3GAttributes.maxClusteringPassesSub
documentCoverageTarget
public Lingo3GAttributesDescriptor.AttributeBuilder documentCoverageTarget(Double value)
- The percentage of input documents to be put in clusters. Determines the percentage
of documents the clustering engine should assign to clusters. After each clustering
pass, the clustering engine will check if the required document coverage has been
achieved. If so, it will not perform further clustering passes. The required
document coverage may not always be achieved, especially if the maximum number of
clustering passes is set to a low value. To cause the clustering engine to always
perform the maximum number of clustering passes, set the value of this parameter to
1.0.
Performance impact: high
- See Also:
Lingo3GAttributes.documentCoverageTarget
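The interaction between the pass limit and the coverage target can be sketched as a simple loop. This is an illustration of the stopping rule as described above, not the engine's actual code: the class, the method and the per-pass document counts are hypothetical, and the `>=` comparison against the target is an assumption.

```java
import java.util.Arrays;
import java.util.List;

class ClusteringPassSketch {
    /**
     * Returns the number of passes performed. Each list entry is the (made-up) number
     * of documents newly clustered by that pass; running out of entries models
     * "no more clusters can be created". A maxPasses of 0 means no pass limit.
     */
    static int passesPerformed(int maxPasses, double coverageTarget,
                               int totalDocs, List<Integer> newlyCoveredPerPass) {
        int covered = 0;
        int passes = 0;
        for (int newlyCovered : newlyCoveredPerPass) {
            if (maxPasses > 0 && passes >= maxPasses) {
                break; // pass limit reached
            }
            covered += newlyCovered;
            passes++;
            if ((double) covered / totalDocs >= coverageTarget) {
                break; // coverage target reached, checked after each pass (assumed >=)
            }
        }
        return passes;
    }

    public static void main(String[] args) {
        List<Integer> perPass = Arrays.asList(40, 30, 20, 10);
        System.out.println(passesPerformed(2, 0.9, 100, perPass));
        System.out.println(passesPerformed(0, 0.6, 100, perPass));
    }
}
```

With a low pass limit the loop can end before the target is met, which matches the note that the required coverage may not always be achieved.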
maxImprovementIterations
public Lingo3GAttributesDescriptor.AttributeBuilder maxImprovementIterations(Integer value)
- The number of clustering improvement iterations to perform. Determines the maximum
number of clustering improvement cycles the clustering engine should perform.
During each cycle, it will examine clusterings similar to the current one, and if
any of them is better, the current cluster arrangement will be replaced.
Performance impact: very high
- See Also:
Lingo3GAttributes.maxImprovementIterations
neighborhoodSize
public Lingo3GAttributesDescriptor.AttributeBuilder neighborhoodSize(Integer value)
- Maximum similar clusterings to examine. Determines the maximum number of similar
clusterings the clustering engine should examine during each improvement cycle.
This parameter is meaningful only when 'Maximum improvement iterations' is greater
than 0.
Performance impact: very high
- See Also:
Lingo3GAttributes.neighborhoodSize
mergeThreshold
public Lingo3GAttributesDescriptor.AttributeBuilder mergeThreshold(Double value)
- Cluster merge threshold. If the overlap between clusters is larger than the value
of this parameter, these clusters will be merged.
Performance impact: none
Results impact: Low values of this parameter will cause the clustering engine to
eagerly merge clusters, which will create larger clusters in which some documents
may be irrelevant. High values of this parameter will cause it to merge clusters
rarely, which will result in large numbers of small clusters with more relevant
documents.
- See Also:
Lingo3GAttributes.mergeThreshold
flatMerging
public Lingo3GAttributesDescriptor.AttributeBuilder flatMerging(Boolean value)
- Flat merging switch. When switched on, the clustering engine will perform cluster
merging using a strategy specific for flat (non-hierarchical) clusters. With this
strategy the clustering engine will merge only clusters of similar size.
Performance impact: low
- See Also:
Lingo3GAttributes.flatMerging
hierarchicalMerging
public Lingo3GAttributesDescriptor.AttributeBuilder hierarchicalMerging(Boolean value)
- Hierarchical merging switch. When switched on, the clustering engine will use a
cluster merging strategy specially designed for hierarchical clustering, and will
be more eager to move clusters from the top level positions to subclusters. If the
algorithm is set to perform flat clustering (max-hierarchy-depth = 1), disabling
hierarchical merging is recommended to preserve smaller clusters.
Performance impact: low
- See Also:
Lingo3GAttributes.hierarchicalMerging
hierarchicalMergingWithLabels
public Lingo3GAttributesDescriptor.AttributeBuilder hierarchicalMergingWithLabels(Boolean value)
- Label merging switch. When switched on, the clustering engine will take cluster
labels into account during hierarchical merging of clusters. This parameter is
meaningful only when 'Hierarchical merging' is switched on.
Performance impact: low
Results impact: With label merging switched on, the clustering engine may move some
additional clusters from the top level to subclusters.
- See Also:
Lingo3GAttributes.hierarchicalMergingWithLabels
cloningControl
public Lingo3GAttributesDescriptor.AttributeBuilder cloningControl(Boolean value)
- Cluster cloning control switch. When switched on, the clustering engine will not
allow the same cluster label to appear both at the top- and subcluster-level of the
hierarchy.
Performance impact: low
- See Also:
Lingo3GAttributes.cloningControl
aggressiveCloningControl
public Lingo3GAttributesDescriptor.AttributeBuilder aggressiveCloningControl(Boolean value)
- Aggressive cluster cloning control switch. When switched on, the clustering engine
will not allow the same label to appear at any level of the hierarchy. This
parameter is meaningful only if 'Cluster cloning control' is switched on.
Performance impact: low
- See Also:
Lingo3GAttributes.aggressiveCloningControl
minClusterSize
public Lingo3GAttributesDescriptor.AttributeBuilder minClusterSize(Double value)
- Determines the minimum allowed size of a cluster in relation to the parent cluster
size. E.g. a value of 0.4 means that clusters must not contain fewer than 40% of the
parent cluster's documents (of all documents in case of top-level clusters). This
parameter is meaningful only if 'Document count label scorer weight' is greater
than 0.
Performance impact: none
- See Also:
Lingo3GAttributes.minClusterSize
maxClusterSize
public Lingo3GAttributesDescriptor.AttributeBuilder maxClusterSize(Double value)
- Determines the maximum allowed size of a cluster in relation to the parent cluster
size. E.g. a value of 0.4 means that clusters must not contain more than 40% of the
parent cluster's documents (of all documents in case of top-level clusters). This
parameter is meaningful only if 'Document count label scorer weight' is greater
than 0.
Performance impact: none
- See Also:
Lingo3GAttributes.maxClusterSize
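Read literally, the minimum and maximum cluster size are fractions of the parent cluster's document count. The sketch below only illustrates that arithmetic. The class and method are hypothetical, and treating the bounds as a hard inclusive check is an assumption; according to the text above, the engine applies them through the document count label scorer.

```java
class ClusterSizeSketch {
    // True when the cluster's share of the parent's documents lies within [min, max].
    // For top-level clusters, parentDocs would be the total number of input documents.
    static boolean withinBounds(int clusterDocs, int parentDocs,
                                double minClusterSize, double maxClusterSize) {
        double share = (double) clusterDocs / parentDocs;
        return share >= minClusterSize && share <= maxClusterSize;
    }

    public static void main(String[] args) {
        // With maxClusterSize = 0.4, a 30-document cluster under a 100-document parent fits,
        // while a 50-document cluster exceeds the 40% limit.
        System.out.println(withinBounds(30, 100, 0.0, 0.4));
        System.out.println(withinBounds(50, 100, 0.0, 0.4));
    }
}
```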
combinedClusterScoreBalance
public Lingo3GAttributesDescriptor.AttributeBuilder combinedClusterScoreBalance(Double value)
- Decides whether document count or cluster label score should have larger impact on
the cluster score. Setting this parameter to 0.5 will cause the clustering engine
to assign equal weight to document count and cluster label score during cluster
score calculation. A value equal to 1.0 will cause the clustering engine to use
only document count for cluster scoring. Similarly, with the 0.0 value, only the
cluster label score will be used.
Performance impact: none
- See Also:
Lingo3GAttributes.combinedClusterScoreBalance
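One way to picture the balance parameter is as a linear blend of the two scores. The formula below is an assumption chosen because it reproduces the three cases described above (0.0, 0.5 and 1.0); the page does not state the actual formula, and the class and method names are hypothetical.

```java
class ScoreBalanceSketch {
    // Assumed linear blend: balance = 1.0 uses only the document count score,
    // balance = 0.0 uses only the label score, 0.5 weights both equally.
    static double combinedScore(double balance, double documentCountScore, double labelScore) {
        return balance * documentCountScore + (1.0 - balance) * labelScore;
    }

    public static void main(String[] args) {
        System.out.println(combinedScore(1.0, 0.8, 0.2));
        System.out.println(combinedScore(0.0, 0.8, 0.2));
        System.out.println(combinedScore(0.5, 0.8, 0.2));
    }
}
```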
minClusterSizeForSubclusters
public Lingo3GAttributesDescriptor.AttributeBuilder minClusterSizeForSubclusters(Integer value)
- The minimum number of documents that must be assigned to a cluster before the
clustering engine attempts to create subclusters for that cluster.
Performance impact: high
- See Also:
Lingo3GAttributes.minClusterSizeForSubclusters
preciseDocumentAssignment
public Lingo3GAttributesDescriptor.AttributeBuilder preciseDocumentAssignment(Boolean value)
- When precise document assignment is switched off, clusters with multi-word labels
will contain all documents that contain the label's words in any order and at any
position. When precise document assignment is switched on, only documents
containing all of the cluster label's words close to each other will be placed in the
cluster.
Performance impact: high
- See Also:
Lingo3GAttributes.preciseDocumentAssignment
normalizeScores
public Lingo3GAttributesDescriptor.AttributeBuilder normalizeScores(Boolean value)
- Cluster and label score normalization switch. When switched on, the clustering
engine will normalize cluster and label scores so that they fall in the 0.0 to 1.0
range.
Performance impact: none
Results impact: As the value of this parameter does not have any impact on the
order and structure of clusters generated by the clustering engine, this switch
will be useful only for applications that depend on absolute values of cluster or
label scores.
- See Also:
Lingo3GAttributes.normalizeScores
allowOneDocumentClusters
public Lingo3GAttributesDescriptor.AttributeBuilder allowOneDocumentClusters(Boolean value)
- When enabled, the algorithm will not prune clusters containing only one document.
Tip: For collections larger than 100 documents, to get
one-document clusters, you also need to set Lingo3GAttributes.wordDfThesholdScalingFactor
and Lingo3GAttributes.phraseDfThesholdScalingFactor to 0.0.
Tip: When one-document clusters are allowed, the number of larger
clusters may decrease. To obtain more large clusters while keeping the
one-document ones, increase Lingo3GAttributes.maxClusteringPassesTop and
Lingo3GAttributes.maxClusteringPassesSub or set them to 0.
Performance impact: medium.
- See Also:
Lingo3GAttributes.allowOneDocumentClusters
singleWordLabelWeight
public Lingo3GAttributesDescriptor.AttributeBuilder singleWordLabelWeight(Double value)
- Determines how willing the clustering engine will be to select single words as
cluster labels. The higher the value of this parameter, the more clusters described
with single-word labels will be produced.
Performance impact: none
- See Also:
Lingo3GAttributes.singleWordLabelWeight
minLabelWords
public Lingo3GAttributesDescriptor.AttributeBuilder minLabelWords(Integer value)
- Determines the minimum label length in words. Labels consisting of fewer words will
not be generated.
Performance impact: none
Results impact: Setting the minimum label length to some higher value (e.g. 4 or 5)
may create more specific clusters.
- See Also:
Lingo3GAttributes.minLabelWords
maxLabelWords
public Lingo3GAttributesDescriptor.AttributeBuilder maxLabelWords(Integer value)
- Determines the maximum label length in words. Labels consisting of more words will
not be generated.
Performance impact: none
Results impact: Setting the maximum label length to some lower value (e.g. 2 or 3)
may create more general clusters.
This setting can also be useful when the input collection contains duplicate
documents. In such cases, Lingo3G may create overlong cluster labels taken directly
from the duplicate documents. While the best solution to this problem would be
eliminating duplicate documents from input, lowering the maximum label length can
serve as a simple workaround.
- See Also:
Lingo3GAttributes.maxLabelWords
preferredLabelLength
public Lingo3GAttributesDescriptor.AttributeBuilder preferredLabelLength(double value)
- Instructs the clustering engine to prefer cluster labels consisting of the
specified number of words. The strength of the preference is determined by the
Lingo3GAttributes.preferredLabelLengthDeviation attribute.
Fractional preferred label lengths are also allowed. For example, preferred label
length of 2.5 will result in labels of length 2 and 3 being treated equally
preferred; a value of 2.2 will prefer two-word labels more than three-word ones.
Performance impact: none
- See Also:
Lingo3GAttributes.preferredLabelLength
preferredLabelLengthDeviation
public Lingo3GAttributesDescriptor.AttributeBuilder preferredLabelLengthDeviation(double value)
- Allowed deviation from the preferred label length. Determines how far the
clustering engine is allowed to deviate from the
Lingo3GAttributes.preferredLabelLength. A
value of 0.0 allows no deviation: all labels must have the preferred length. Larger
values allow more and more deviation, with the value of 20.0 meaning almost no
preference at all.
When the preferred label length deviation is 0.0 and the fractional part of the
preferred label length is 0.5, then the only allowed label lengths will be the two
integers closest to the preferred label length value. For example, if preferred
label length deviation is 0.0 and preferred label length is 2.5, the clustering
engine will create only labels consisting of 2 or 3 words. If the fractional part
of the preferred label length is other than 0.5, only the closest integer label
length will be preferred.
Performance impact: none
- See Also:
Lingo3GAttributes.preferredLabelLengthDeviation
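The zero-deviation rule can be written out as a small function. This is a sketch of the rule exactly as worded above, with hypothetical class and method names; it says nothing about how the engine scores labels when the deviation is greater than 0.0.

```java
import java.util.Arrays;

class PreferredLabelLengthSketch {
    // Label lengths allowed when the preferred label length deviation is 0.0.
    static int[] allowedLengths(double preferredLabelLength) {
        double floor = Math.floor(preferredLabelLength);
        if (preferredLabelLength - floor == 0.5) {
            // Fractional part of exactly 0.5: the two closest integers are both allowed.
            return new int[] {(int) floor, (int) floor + 1};
        }
        // Otherwise only the closest integer length is allowed.
        return new int[] {(int) Math.round(preferredLabelLength)};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(allowedLengths(2.5)));
        System.out.println(Arrays.toString(allowedLengths(2.2)));
    }
}
```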
labelOverrideThreshold
public Lingo3GAttributesDescriptor.AttributeBuilder labelOverrideThreshold(Double value)
- Determines the strength of the truncated label filters. The lowest value means
the strongest elimination of truncated labels, which may lead to overlong cluster labels
and many unclustered documents. The highest value effectively disables the filter,
which may result in short or truncated labels.
Performance impact: low
- See Also:
Lingo3GAttributes.labelOverrideThreshold
queryWordLabelWeight
public Lingo3GAttributesDescriptor.AttributeBuilder queryWordLabelWeight(Double value)
- Determines the weight of labels containing query words. Lower values mean that
phrases containing query words are less likely to appear as cluster labels. In
particular, the value of 0.0 will totally eliminate query words from cluster
labels. The value of 1.0, on the other hand, will cause the clustering engine to
treat labels with and without query words equally.
Performance impact: low
- See Also:
Lingo3GAttributes.queryWordLabelWeight
allowNumbersInLabels
public Lingo3GAttributesDescriptor.AttributeBuilder allowNumbersInLabels(Boolean value)
- Allow numbers in labels switch. When switched on, the clustering engine will allow
numbers to appear in cluster labels.
Performance impact: low
- See Also:
Lingo3GAttributes.allowNumbersInLabels
lowercaseFunctionWords
public Lingo3GAttributesDescriptor.AttributeBuilder lowercaseFunctionWords(Boolean value)
- Use lower case for function words in labels. When switched on, the clustering
engine will convert all function words in labels into lower case. When switched
off, particular function words will appear in labels in the case they appeared in
the majority of input documents.
Performance impact: low
- See Also:
Lingo3GAttributes.lowercaseFunctionWords
capitalizeNonFunctionWords
public Lingo3GAttributesDescriptor.AttributeBuilder capitalizeNonFunctionWords(Boolean value)
- Capitalize non function words in labels. When switched on, the clustering engine
will capitalize all non function words in labels. When switched off, particular
words will appear in labels in the case they appeared in the majority of input
documents.
Performance impact: low
- See Also:
Lingo3GAttributes.capitalizeNonFunctionWords
removeRepeatedSynonymsFromLabels
public Lingo3GAttributesDescriptor.AttributeBuilder removeRepeatedSynonymsFromLabels(Boolean value)
- Remove repeated synonyms from labels. When switched on, no synonymous words will
appear in a single label. For example, if 'photos' and 'pictures' are declared
synonyms, labels such as "Tiger Photos Pictures" or "Photos and Pictures" will not
be generated.
Performance impact: low
- See Also:
Lingo3GAttributes.removeRepeatedSynonymsFromLabels
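The check described above can be sketched as follows. The class, the method and the synonym-group map are hypothetical and only illustrate the idea; Lingo3G's synonym resources and tokenization are not modeled here.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class RepeatedSynonymSketch {
    // Example data: both words belong to the same (made-up) synonym group.
    static Map<String, String> exampleGroups() {
        Map<String, String> groups = new HashMap<>();
        groups.put("photos", "images");
        groups.put("pictures", "images");
        return groups;
    }

    // True when two words of the label map to the same synonym group.
    // Note: a literally repeated word from a group is also reported.
    static boolean hasRepeatedSynonyms(String label, Map<String, String> synonymGroupOf) {
        Set<String> seenGroups = new HashSet<>();
        for (String word : label.toLowerCase(Locale.ROOT).split("\\s+")) {
            String group = synonymGroupOf.get(word);
            if (group != null && !seenGroups.add(group)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasRepeatedSynonyms("Tiger Photos Pictures", exampleGroups()));
        System.out.println(hasRepeatedSynonyms("Tiger Photos", exampleGroups()));
    }
}
```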
putPromotedLabelsAtHierarchyRoot
public Lingo3GAttributesDescriptor.AttributeBuilder putPromotedLabelsAtHierarchyRoot(Boolean value)
- Put promoted labels at hierarchy root. When switched on, labels promoted using the
label dictionary will be always put at the top level of the cluster hierarchy. When
switched off, promoted labels will not be forced to appear at the hierarchy root
and will be placed where they naturally belong, e.g. as subclusters of larger
clusters.
Results impact: a lot of labels can get promoted as a result of boosting e.g.
proper nouns defined in the built-in POS database. With this option enabled, all
such labels will be put at the root of cluster hierarchy, which may result in a
clearly visible cluster overlap. For example, clusters Bill Clinton,
President Bill Clinton and U.S. President
Bill Clinton will all show at the root of the cluster tree, while with this
option disabled, only the Bill Clinton cluster would be placed at root of
the hierarchy.
Performance impact: low
- See Also:
Lingo3GAttributes.putPromotedLabelsAtHierarchyRoot
maxTokensPerDocument
public Lingo3GAttributesDescriptor.AttributeBuilder maxTokensPerDocument(Integer value)
- Maximum tokens per document to read. Determines the maximum number of tokens
(words) the clustering engine will read from each input document. When this
parameter is set to 0, all tokens will be read.
Performance impact: high
- See Also:
Lingo3GAttributes.maxTokensPerDocument
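The semantics of maxTokensPerDocument, including the special value 0, can be sketched as below. TokenLimitSketch is a hypothetical helper operating on an already tokenized document; it only illustrates the truncation rule and is not the engine's code.

```java
import java.util.List;

final class TokenLimitSketch {
    // Keeps at most maxTokensPerDocument leading tokens; 0 means "read all tokens".
    static List<String> limit(List<String> tokens, int maxTokensPerDocument) {
        if (maxTokensPerDocument == 0 || tokens.size() <= maxTokensPerDocument) {
            return tokens;
        }
        return tokens.subList(0, maxTokensPerDocument);
    }
}
```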
wordDfThesholdScalingFactor
public Lingo3GAttributesDescriptor.AttributeBuilder wordDfThesholdScalingFactor(Double value)
- Word-level Document Frequency (DF) cut-off scaling factor. Determines how fast the
word DF cut-off should grow with the increase of the number of documents. A value
of 1.0 means that the word DF cut-off will increase by 1.0 per every 100 documents.
Thus, for 100 documents the word DF cut-off will be 1.0, for 200 documents it will
be 2.0, for 350 documents it will be 3.5 etc.
Performance impact: very high
Results impact: Setting low values for this parameter will preserve infrequent
words, which can result in more accurate clustering (especially at subcluster
level), at the cost of slower processing. Setting high values of this parameter
will increase performance at the cost of lower clustering accuracy.
- See Also:
Lingo3GAttributes.wordDfThesholdScalingFactor
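The word DF cut-off figures quoted for wordDfThesholdScalingFactor follow a simple linear rule, sketched below. DfCutoffSketch and wordDfCutoff are hypothetical names; the sketch only reproduces the arithmetic of the description, and the engine may apply additional rounding or bounds.

```java
final class DfCutoffSketch {
    // The cut-off grows by scalingFactor for every 100 input documents.
    static double wordDfCutoff(double scalingFactor, int documentCount) {
        return scalingFactor * documentCount / 100.0;
    }
}
```

With a scaling factor of 1.0 this gives 1.0 for 100 documents, 2.0 for 200 documents and 3.5 for 350 documents, matching the values above.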
phraseDfThesholdScalingFactor
public Lingo3GAttributesDescriptor.AttributeBuilder phraseDfThesholdScalingFactor(Double value)
- Phrase-level Document Frequency (DF) cut-off scaling factor. Determines how fast
the phrase DF cut-off should grow with the increase of the number of documents. A
value of 0.2 means that the phrase DF cut-off will increase by 0.2 per every 100
documents. Thus, for 100 documents the phrase DF cut-off will be 1.0, for 200
documents it will be 1.2, for 600 documents it will be 2.0 etc.
Performance impact: very high
Results impact: Setting low values for this parameter will preserve infrequent
phrases, which can result in more accurate clustering (especially at subcluster
level), at the cost of slower processing. Setting high values of this parameter
will increase performance at the cost of lower clustering accuracy.
- See Also:
Lingo3GAttributes.phraseDfThesholdScalingFactor
accentFolding
public Lingo3GAttributesDescriptor.AttributeBuilder accentFolding(Boolean value)
- Converts national characters to ASCII counterparts. When accent folding is switched
on, all national characters (e.g. 'ü', 'ç', 'ó') will be internally replaced with
their ASCII counterparts ('u', 'c', 'o'), which will make e.g. the words "Bücher"
and "Bucher" equivalent. Please note that this is an instance-level parameter and
changes of its value at request time will not be respected.
Performance impact: high
- See Also:
Lingo3GAttributes.accentFolding
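One common way to fold accents in Java is Unicode decomposition followed by removal of combining marks, as sketched below. This illustrates the effect described for accentFolding and is not necessarily how Lingo3G implements it; AccentFoldingSketch is a hypothetical class.

```java
import java.text.Normalizer;

final class AccentFoldingSketch {
    // Decomposes characters (e.g. 'ü' into 'u' plus a combining diaeresis)
    // and then strips the combining marks.
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

After folding, "Bücher" and "Bucher" compare as equal strings.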
languageRecognition
public Lingo3GAttributesDescriptor.AttributeBuilder languageRecognition(Boolean value)
- Language recognition switch. When switched on, for those input documents that do
not have the Document.LANGUAGE field set, the clustering engine will attempt to
recognize their language. If a document already has the Document.LANGUAGE field
set, it will be used for further processing.
Performance impact: medium
- See Also:
Lingo3GAttributes.languageRecognition
trailingGenitiveLabelFilter
public Lingo3GAttributesDescriptor.AttributeBuilder trailingGenitiveLabelFilter(Boolean value)
- Filters out phrases ending in Saxon genitive of an English noun, e.g.
"Discover World's", "For your computers'".
Performance impact: low
- See Also:
Lingo3GAttributes.trailingGenitiveLabelFilter
numberOnlyLabelFilter
public Lingo3GAttributesDescriptor.AttributeBuilder numberOnlyLabelFilter(Boolean value)
- Filters out labels that consist only of numeric tokens.
Performance impact: low
- See Also:
Lingo3GAttributes.numberOnlyLabelFilter
dashedWordsLabelFilter
public Lingo3GAttributesDescriptor.AttributeBuilder dashedWordsLabelFilter(Boolean value)
- Filters out labels containing words starting or ending in a dash character ('-').
Performance impact: low
- See Also:
Lingo3GAttributes.dashedWordsLabelFilter
oneLetterWordLabelFilter
public Lingo3GAttributesDescriptor.AttributeBuilder oneLetterWordLabelFilter(Boolean value)
- Filters out labels containing only one-letter words, e.g. "M a f".
Performance impact: low
- See Also:
Lingo3GAttributes.oneLetterWordLabelFilter
minLengthLabelFilter
public Lingo3GAttributesDescriptor.AttributeBuilder minLengthLabelFilter(Boolean value)
- Filters out labels whose string representation (excluding spaces) is shorter than 3
characters.
Performance impact: low
- See Also:
Lingo3GAttributes.minLengthLabelFilter
repeatedWordsLabelFilter
public Lingo3GAttributesDescriptor.AttributeBuilder repeatedWordsLabelFilter(Boolean value)
- Filters out labels containing repeated words (e.g. "New York York").
Performance impact: low
- See Also:
Lingo3GAttributes.repeatedWordsLabelFilter
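The simple label filters above (number-only, dashed words, one-letter words, minimum length, repeated words) can each be expressed as a predicate on the label text. The sketch below is an illustration under the assumption that labels are split on whitespace; LabelFilterSketch and its method names are hypothetical and do not belong to the Lingo3G API.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class LabelFilterSketch {
    static String[] words(String label) {
        return label.trim().split("\\s+");
    }

    // True when every token consists of digits only.
    static boolean isNumberOnly(String label) {
        for (String w : words(label)) if (!w.matches("\\d+")) return false;
        return true;
    }

    // True when any word starts or ends with a dash.
    static boolean hasDashedWord(String label) {
        for (String w : words(label)) if (w.startsWith("-") || w.endsWith("-")) return true;
        return false;
    }

    // True when the label contains only one-letter words, e.g. "M a f".
    static boolean hasOnlyOneLetterWords(String label) {
        for (String w : words(label)) if (w.length() != 1) return false;
        return true;
    }

    // True when the label, excluding spaces, is shorter than 3 characters.
    static boolean isTooShort(String label) {
        return label.replace(" ", "").length() < 3;
    }

    // True when the same word occurs more than once, ignoring case.
    static boolean hasRepeatedWords(String label) {
        Set<String> seen = new HashSet<>();
        for (String w : words(label)) if (!seen.add(w.toLowerCase(Locale.ROOT))) return true;
        return false;
    }
}
```

A label for which any enabled predicate returns true would be filtered out.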
dictionaryLabelFilter
public Lingo3GAttributesDescriptor.AttributeBuilder dictionaryLabelFilter(Boolean value)
- Removes or boosts labels based on a predefined dictionary of words, phrases and
regular expressions. Impact on performance depends on the number of regular
expression entries in the label dictionary -- the more regular expression entries,
the lower the processing speed.
Performance impact: medium to very high
- See Also:
Lingo3GAttributes.dictionaryLabelFilter
leftCompleteLabelFilter
public Lingo3GAttributesDescriptor.AttributeBuilder leftCompleteLabelFilter(Boolean value)
- Truncated labels filter. Heuristically eliminates truncated cluster labels (e.g.
"York Restaurants"), replacing them with complete phrases, e.g.
"New York Restaurants", based on the context. It is recommended to use this filter
in combination with 'Right complete label filter'. Strength of truncated label
elimination is determined by the 'Label override threshold' parameter.
Performance impact: medium
- See Also:
Lingo3GAttributes.leftCompleteLabelFilter
rightCompleteLabelFilter
public Lingo3GAttributesDescriptor.AttributeBuilder rightCompleteLabelFilter(Boolean value)
- Truncated labels filter. Heuristically eliminates truncated cluster labels (e.g.
"York Restaurants"), replacing them with complete phrases, e.g.
"New York Restaurants", based on the context. It is recommended to use this filter
in combination with 'Left complete label filter'. Strength of truncated label
elimination is determined by the 'Label override threshold' parameter.
Performance impact: medium
- See Also:
Lingo3GAttributes.rightCompleteLabelFilter
dictionaryWeightLabelScorerWeight
public Lingo3GAttributesDescriptor.AttributeBuilder dictionaryWeightLabelScorerWeight(Double value)
- Boosts label scores by a factor specified in the label dictionary file. If this
scorer has weight 0, label boosting will not be applied.
Performance impact: low
- See Also:
Lingo3GAttributes.dictionaryWeightLabelScorerWeight
wordCountLabelScorerWeight
public Lingo3GAttributesDescriptor.AttributeBuilder wordCountLabelScorerWeight(Double value)
- Assigns higher scores to labels that consist of 2, 3 or 4 words. Longer labels are
penalized -- the longer the label, the higher the penalty.
Performance impact: low
- See Also:
Lingo3GAttributes.wordCountLabelScorerWeight
titleWordLabelScorerWeight
public Lingo3GAttributesDescriptor.AttributeBuilder titleWordLabelScorerWeight(Double value)
- Assigns higher scores to labels that contain words that appeared in input documents'
titles.
Performance impact: low
- See Also:
Lingo3GAttributes.titleWordLabelScorerWeight
capitalizedWordLabelScorerWeight
public Lingo3GAttributesDescriptor.AttributeBuilder capitalizedWordLabelScorerWeight(Double value)
- Assigns higher scores to labels that contain capitalized words.
Performance impact: low
- See Also:
Lingo3GAttributes.capitalizedWordLabelScorerWeight
maxWordDf
public Lingo3GAttributesDescriptor.AttributeBuilder maxWordDf(Double value)
- Maximum word document frequency. The maximum document frequency allowed for words
as a fraction of all documents. Words with document frequency larger than maxWordDf
will be ignored.
For example, when maxWordDf is 0.4, words appearing in more than 40% of documents
will be ignored. A value of 1.0 means that all words will be taken into account,
no matter in how many documents they appear.
This attribute may be useful when certain words appear in most of the input
documents (e.g. company name from header or footer) and such words dominate the
cluster labels. In such case, setting maxWordDf to a value lower than 1.0, e.g. 0.9
may improve the clusters.
Another useful application of this attribute is when there is a need to generate
only very specific clusters, i.e. clusters containing small numbers of documents.
This can be achieved by setting maxWordDf to extremely low values, e.g. 0.1 or
0.05.
Performance impact: low
- See Also:
Lingo3GAttributes.maxWordDf
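The maxWordDf rule can be sketched as a filter over per-word document frequencies. The example assumes document frequencies are already available as a map from word to the number of documents containing it; MaxWordDfSketch and retainedWords are hypothetical names used for illustration only.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class MaxWordDfSketch {
    // Keeps words whose document frequency, as a fraction of all documents,
    // does not exceed maxWordDf.
    static Set<String> retainedWords(Map<String, Integer> documentFrequency,
                                     int documentCount, double maxWordDf) {
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : documentFrequency.entrySet()) {
            double fraction = e.getValue() / (double) documentCount;
            if (fraction <= maxWordDf) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }
}
```

With 10 documents and maxWordDf set to 0.4, a word appearing in 9 documents is ignored, while words appearing in 4 or 2 documents are retained.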
unindexedWordLabelScorerWeight
public Lingo3GAttributesDescriptor.AttributeBuilder unindexedWordLabelScorerWeight(Double value)
- Penalizes labels that contain too many function words.
Performance impact: low
- See Also:
Lingo3GAttributes.unindexedWordLabelScorerWeight
queryWordLabelScorerWeight
public Lingo3GAttributesDescriptor.AttributeBuilder queryWordLabelScorerWeight(Double value)
- Penalizes labels that contain query words.
Performance impact: low
- See Also:
Lingo3GAttributes.queryWordLabelScorerWeight
tfDfRatioLabelScorerWeight
public Lingo3GAttributesDescriptor.AttributeBuilder tfDfRatioLabelScorerWeight(Double value)
- Assigns higher score to more general/shorter labels.
Performance impact: low
- See Also:
Lingo3GAttributes.tfDfRatioLabelScorerWeight
tfLabelScorerWeight
public Lingo3GAttributesDescriptor.AttributeBuilder tfLabelScorerWeight(Double value)
- Assigns higher scores to labels with higher Term Frequency (TF).
Performance impact: low
- See Also:
Lingo3GAttributes.tfLabelScorerWeight
documentCountLabelScorerWeight
public Lingo3GAttributesDescriptor.AttributeBuilder documentCountLabelScorerWeight(Double value)
- Assigns higher scores to clusters whose number of documents in relation to the
total number of documents is equal to or smaller than specified by the 'Maximum
cluster size' parameter.
Performance impact: low
- See Also:
Lingo3GAttributes.documentCountLabelScorerWeight
clusterSetDocumentOverlapLabelScorerWeight
public Lingo3GAttributesDescriptor.AttributeBuilder clusterSetDocumentOverlapLabelScorerWeight(Double value)
- Assigns higher scores to labels that contain documents not present in the current
cluster set.
Performance impact: low
- See Also:
Lingo3GAttributes.clusterSetDocumentOverlapLabelScorerWeight
dictionarySynonymMarkerEnabled
public Lingo3GAttributesDescriptor.AttributeBuilder dictionarySynonymMarkerEnabled(Boolean value)
- When switched on, the clustering engine will apply synonyms defined in the
synonyms.[lang].xml file.
Performance impact: medium
- See Also:
Lingo3GAttributes.dictionarySynonymMarkerEnabled
dashedWordsSynonymMarkerEnabled
public Lingo3GAttributesDescriptor.AttributeBuilder dashedWordsSynonymMarkerEnabled(Boolean value)
- When switched on, the clustering engine will treat words separated by a space
(' '), period ('.'), slash ('/') or a dash ('-') or written together and the
corresponding phrases as synonymous, e.g. "data-mining", "data.mining",
"datamining", "data/mining" and "data mining".
Performance impact: medium
- See Also:
Lingo3GAttributes.dashedWordsSynonymMarkerEnabled
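One way to picture the effect of dashedWordsSynonymMarkerEnabled is to map every variant to a common key by removing the separators. The sketch below is an illustration only; DashedWordsSketch and canonicalKey are hypothetical, and the engine's actual synonym marking is not documented here.

```java
import java.util.Locale;

final class DashedWordsSketch {
    // Removes spaces, periods, slashes and dashes so that all variants share one key.
    static String canonicalKey(String phrase) {
        return phrase.toLowerCase(Locale.ROOT).replaceAll("[ ./-]", "");
    }
}
```

All five variants listed above ("data-mining", "data.mining", "datamining", "data/mining" and "data mining") map to the key "datamining".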
extraktSynonymMarkerEnabled
public Lingo3GAttributesDescriptor.AttributeBuilder extraktSynonymMarkerEnabled(Boolean value)
- When switched on, the clustering engine will apply synonyms obtained from the
Extrakt linguistic engine. This option is applicable only when the Extrakt engine
is available, ignored otherwise.
Performance impact: high
- See Also:
Lingo3GAttributes.extraktSynonymMarkerEnabled
titleFields
public Lingo3GAttributesDescriptor.AttributeBuilder titleFields(List<String> value)
- Title fields to use for clustering. Specifies the list of document field names that
provide the content for clustering. Depending on the value of the
title-word-label-scorer-weight attribute, content of fields provided
in this attribute can be given more weight during clustering.
- See Also:
Lingo3GAttributes.titleFields
contentFields
public Lingo3GAttributesDescriptor.AttributeBuilder contentFields(List<String> value)
- Content fields to use for clustering. Specifies the list of document field names
that provide the content for clustering. As opposed to the
title-fields attribute, fields provided in this attribute will not be
given any extra weight during clustering.
- See Also:
Lingo3GAttributes.contentFields
clusterScoringFields
public Lingo3GAttributesDescriptor.AttributeBuilder clusterScoringFields(Lingo3GAttributes.ClusterScoringFields value)
- Extra fields to use for cluster scoring. If your input data contains structured
data in addition to unstructured text, you can use the structured data to guide
Lingo3G towards creating clusters having some specific properties.
Usage scenario
For example, let us assume your data describes e-commerce products and has the
following fields:
- title, description: unstructured text,
- price: product price expressed as a number, e.g. 149.90,
- category: high level product category, e.g. Fashion.
While Lingo3G will draw cluster labels from the unstructured text of the
title and description fields, it can also use the structured
data to e.g. (see below for formal syntax specification):
- Minimize category variety: avoid creating clusters
containing a mix of products from different categories; each cluster should ideally
contain products from one category only.
category:nominal:MINIMIZE_VARIETY:1.0
- Maximize category variety: avoid creating clusters with
products from the same category; each cluster should ideally contain a mix of
products from as many categories as possible.
category:nominal:MAXIMIZE_VARIETY:1.0
- Minimize price variety: promote clusters of similarly
priced products.
price:numeric:MINIMIZE_VARIETY:1.0
- Maximize price variety: promote clusters containing a
wide range of product prices.
price:numeric:MAXIMIZE_VARIETY:1.0
- Minimize/maximize price value: promote clusters with the
smallest/largest total product price.
price:numeric:MINIMIZE_VALUE:1.0
or
price:numeric:MAXIMIZE_VALUE:1.0
Syntax
Cluster scoring field specification has the following form:
field:type:scoring:weight
where:
You can use commas to perform cluster scoring based on more than one field, e.g.:
field1:type1:scoring1:weight1, field2:type2:scoring2:weight2, ...
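The field:type:scoring:weight form can be validated before it is passed to the engine. The parser below is a hypothetical sketch and not part of Lingo3G: ScoringFieldSpecSketch, Spec and Scoring are illustrative names, and the scoring values are limited to the four that appear in the examples above.

```java
import java.util.ArrayList;
import java.util.List;

final class ScoringFieldSpecSketch {
    // Scoring names taken from the examples in this description.
    enum Scoring { MINIMIZE_VARIETY, MAXIMIZE_VARIETY, MINIMIZE_VALUE, MAXIMIZE_VALUE }

    static final class Spec {
        final String field;
        final String type;
        final Scoring scoring;
        final double weight;

        Spec(String field, String type, Scoring scoring, double weight) {
            this.field = field;
            this.type = type;
            this.scoring = scoring;
            this.weight = weight;
        }
    }

    // Parses a comma-separated list of field:type:scoring:weight entries.
    static List<Spec> parse(String value) {
        List<Spec> specs = new ArrayList<>();
        for (String item : value.split(",")) {
            String[] parts = item.trim().split(":");
            if (parts.length != 4) {
                throw new IllegalArgumentException(
                        "Expected field:type:scoring:weight, got: " + item.trim());
            }
            specs.add(new Spec(parts[0], parts[1],
                    Scoring.valueOf(parts[2]), Double.parseDouble(parts[3])));
        }
        return specs;
    }
}
```

Malformed entries, unknown scoring names and non-numeric weights all raise IllegalArgumentException (NumberFormatException is a subclass of it).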
Adding extra fields to Carrot2 input XML
You can specify the extra field in Carrot2 XML documents using the field
tag in the following way:
<document>
<title>Canon 5D</title>
<snippet>21MP camera</snippet>
<url></url>
<field key="price"><value type="java.lang.Double" value="149.90" /></field>
<field key="votes"><value type="java.lang.Integer" value="4370" /></field>
<field key="category"><value type="java.lang.String" value="Photo" /></field>
</document>
- See Also:
Lingo3GAttributes.clusterScoringFields
clusterScoringFields
public Lingo3GAttributesDescriptor.AttributeBuilder clusterScoringFields(Class<? extends Lingo3GAttributes.ClusterScoringFields> clazz)
- Extra fields to use for cluster scoring. If your input data contains structured
data in addition to unstructured text, you can use the structured data to guide
Lingo3G towards creating clusters having some specific properties.
Usage scenario
For example, let us assume your data describes e-commerce products and has the
following fields:
- title, description: unstructured text,
- price: product price expressed as a number, e.g. 149.90,
- category: high level product category, e.g. Fashion.
While Lingo3G will draw cluster labels from the unstructured text of the
title and description fields, it can also use the structured
data to e.g. (see below for formal syntax specification):
- Minimize category variety: avoid creating clusters
containing a mix of products from different categories; each cluster should ideally
contain products from one category only.
category:nominal:MINIMIZE_VARIETY:1.0
- Maximize category variety: avoid creating clusters with
products from the same category; each cluster should ideally contain a mix of
products from as many categories as possible.
category:nominal:MAXIMIZE_VARIETY:1.0
- Minimize price variety: promote clusters of similarly
priced products.
price:numeric:MINIMIZE_VARIETY:1.0
- Maximize price variety: promote clusters containing a
wide range of product prices.
price:numeric:MAXIMIZE_VARIETY:1.0
- Minimize/maximize price value: promote clusters with the
smallest/largest total product price.
price:numeric:MINIMIZE_VALUE:1.0
or
price:numeric:MAXIMIZE_VALUE:1.0
Syntax
Cluster scoring field specification has the following form:
field:type:scoring:weight
where:
You can use commas to perform cluster scoring based on more than one field, e.g.:
field1:type1:scoring1:weight1, field2:type2:scoring2:weight2, ...
Adding extra fields to Carrot2 input XML
You can specify the extra field in Carrot2 XML documents using the field
tag in the following way:
<document>
<title>Canon 5D</title>
<snippet>21MP camera</snippet>
<url></url>
<field key="price"><value type="java.lang.Double" value="149.90" /></field>
<field key="votes"><value type="java.lang.Integer" value="4370" /></field>
<field key="category"><value type="java.lang.String" value="Photo" /></field>
</document>
- See Also:
Lingo3GAttributes.clusterScoringFields
unknownWordHandlingStrategy
public Lingo3GAttributesDescriptor.AttributeBuilder unknownWordHandlingStrategy(Lingo3GAttributes.UnknownWordHandlingStrategy value)
- Handling of unknown words in persistent clusters. Defines how Lingo3G should treat
unknown words in labels of persistent clusters. A word is unknown when it occurs in
the persistent cluster's label but it is not present in any of the documents being
clustered.
The two available options are:
- DO_NOT_ASSIGN_DOCUMENTS: ignore the persistent cluster as a
whole. No documents will be assigned to persistent clusters with unknown words in
their labels. This option favours assignment precision at the cost of some
potentially relevant documents not being assigned to persistent clusters.
- ASSIGN_DOCUMENTS: ignore the missing word. Documents will be
assigned to persistent clusters even if some of their label's words do not occur in
the input documents. This option favours assignment recall at the cost of some
potentially irrelevant documents being assigned to persistent clusters.
Performance impact: none
- See Also:
Lingo3GAttributes.unknownWordHandlingStrategy
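The difference between the two options can be sketched as a single decision per persistent cluster. The code below is a simplified illustration, not the engine's logic: UnknownWordSketch is hypothetical, its Strategy enum merely mirrors the two option names above, and the input vocabulary is modelled as a plain set of words.

```java
import java.util.List;
import java.util.Set;

final class UnknownWordSketch {
    enum Strategy { DO_NOT_ASSIGN_DOCUMENTS, ASSIGN_DOCUMENTS }

    // Decides whether a persistent cluster may receive documents at all.
    static boolean mayReceiveDocuments(List<String> labelWords,
                                       Set<String> inputWords, Strategy strategy) {
        // A label word is unknown when it occurs in none of the input documents.
        boolean hasUnknownWord = !inputWords.containsAll(labelWords);
        return !hasUnknownWord || strategy == Strategy.ASSIGN_DOCUMENTS;
    }
}
```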
unknownWordHandlingStrategy
public Lingo3GAttributesDescriptor.AttributeBuilder unknownWordHandlingStrategy(Class<? extends Lingo3GAttributes.UnknownWordHandlingStrategy> clazz)
- Handling of unknown words in persistent clusters. Defines how Lingo3G should treat
unknown words in labels of persistent clusters. A word is unknown when it occurs in
the persistent cluster's label but it is not present in any of the documents being
clustered.
The two available options are:
- DO_NOT_ASSIGN_DOCUMENTS: ignore the persistent cluster as a
whole. No documents will be assigned to persistent clusters with unknown words in
their labels. This option favours assignment precision at the cost of some
potentially relevant documents not being assigned to persistent clusters.
- ASSIGN_DOCUMENTS: ignore the missing word. Documents will be
assigned to persistent clusters even if some of their label's words do not occur in
the input documents. This option favours assignment recall at the cost of some
potentially irrelevant documents being assigned to persistent clusters.
Performance impact: none
- See Also:
Lingo3GAttributes.unknownWordHandlingStrategy
carrot2StemmerFactory
public Lingo3GAttributesDescriptor.AttributeBuilder carrot2StemmerFactory(IStemmerFactory value)
- Stemmer factory. Creates the stemmers to be used by the clustering algorithm.
- See Also:
Lingo3GAttributes.carrot2StemmerFactory
carrot2StemmerFactory
public Lingo3GAttributesDescriptor.AttributeBuilder carrot2StemmerFactory(Class<? extends IStemmerFactory> clazz)
- Stemmer factory. Creates the stemmers to be used by the clustering algorithm.
- See Also:
Lingo3GAttributes.carrot2StemmerFactory
carrot2TokenizerFactory
public Lingo3GAttributesDescriptor.AttributeBuilder carrot2TokenizerFactory(ITokenizerFactory value)
- Tokenizer factory. Creates the tokenizers to be used by the clustering algorithm.
- See Also:
Lingo3GAttributes.carrot2TokenizerFactory
carrot2TokenizerFactory
public Lingo3GAttributesDescriptor.AttributeBuilder carrot2TokenizerFactory(Class<? extends ITokenizerFactory> clazz)
- Tokenizer factory. Creates the tokenizers to be used by the clustering algorithm.
- See Also:
Lingo3GAttributes.carrot2TokenizerFactory
useBuiltInWordDatabaseForLabelFiltering
public Lingo3GAttributesDescriptor.AttributeBuilder useBuiltInWordDatabaseForLabelFiltering(boolean value)
- Use built-in word database for label filtering. If enabled, Lingo3G will perform
label filtering based on the built-in word databases in addition to the word
dictionary XML files. Currently, a built-in word database is available only for the
English language.
Results impact: If this option is enabled, Lingo3G should produce better-formed
cluster labels. For example, labels being, starting or ending with a verb or
adjective should appear less frequently. However, because of the limitations of the
current part of speech tagging model (please see below), enabling this option is
also likely to prevent certain well-formed cluster labels, e.g. if the built-in
word database misinterprets a noun for a verb.
Limitations of the part of speech tagging model. Currently, Lingo3G uses a unigram
model for assigning part of speech tags to words. This means that for each word
having multiple part of speech tags (such as "program" in English, which, depending
on the context, can be both a verb and a noun), one of the available tags needs to
be chosen. To do that, Lingo3G employs a heuristic that takes into account the word
frequency and the set of part of speech tags the word has. While the heuristic is
fairly efficient in general, some words may be tagged erroneously. To provide a
solution for such cases, the built-in part of speech database tags can be
overridden in the user-defined XML word dictionary.
Performance impact: small.
- See Also:
Lingo3GAttributes.useBuiltInWordDatabaseForLabelFiltering
useBuiltInWordDatabaseForStemming
public Lingo3GAttributesDescriptor.AttributeBuilder useBuiltInWordDatabaseForStemming(boolean value)
- Use built-in word database for stemming. If enabled, Lingo3G will use the word
inflection database rather than an algorithmic stemmer. Currently, word inflection
database is available only for the English language.
Stemmers or word inflection databases transform various forms of a word to one
common root. This is required to make sure that a cluster labeled e.g.
Programming contains documents referencing all variants of the
word, such as programs, programmer or
programmed.
Results impact: Algorithmic stemming tends to be more aggressive compared to
stemming based on word inflection dictionaries shipping with Lingo3G. This means
that with algorithmic stemming all the following forms: program,
programming, programmer and programmable will be treated
as the same concept, while with the word database based stemming, they will be
treated as separate, different concepts. As a result, with algorithmic stemming, a
cluster labeled Program will contain documents referring to all
program, programs, programming, programmer and
programmable, while with the word database based stemming, the cluster
will contain only documents referring to program and programs.
Enabling this option is recommended only when it is important to distinguish
between slight variations of the same general concept, e.g. programming
and program.
Performance impact: small.
- See Also:
Lingo3GAttributes.useBuiltInWordDatabaseForStemming
Copyright (c) Dawid Weiss, Stanislaw Osinski