Lingo3G v1.15.1 API Documentation

This is the documentation of the Java API of the Lingo3G Document Clustering Engine.

Lingo3G Algorithm

  Package                       Description
  com.carrotsearch.lingo3g      Lingo3G clustering algorithm component. The algorithm uses the infrastructure defined by the Carrot2 framework.

Licensing

  Package                       Description
  com.carrotsearch.licensing    License file verification.

Carrot2 Core

  Package                       Description
  org.carrot2.core              Definitions of Carrot2 core interfaces and their implementations.
  org.carrot2.core.attribute    Attribute annotations for Carrot2 core interfaces.

For complete examples of Lingo3G Java API usage, please see the source code located in the examples/ directory of the Lingo3G Java API distribution archive. For the clustering controller API and other miscellaneous examples, refer to the Carrot2 project documentation.

Java API usage examples

The Lingo3G Java API is based on the framework defined by the Carrot2 open source project. You can use the components available in Carrot2 to fetch documents from various sources (public search engines, Lucene, Solr), serialize the results to JSON or XML, and more, while the clusters are generated by Lingo3G. Below is some example code for the most common use cases. Please see the examples/ directory in the Lingo3G Java API distribution archive for complete source code. You can also browse the Carrot2 code repository for further examples.

Clustering text documents

The easiest way to get started with Lingo3G is to cluster a collection of Documents. Each document can consist of:

  • document content: a query-in-context snippet, document abstract or full text,
  • document title: optional, some clustering algorithms give more weight to document titles,
  • document URL: optional, used by the ByUrlClusteringAlgorithm, ignored by other algorithms.

To make the example short, the code shown below clusters only 5 documents. Use at least 20 to get reasonable clusters. If you have access to the query that generated the documents being clustered, you should also provide it to Lingo3G to get better clusters.

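            /*
             * Imports assumed by the fragments on this page: java.util.*, org.carrot2.core.*
             * and com.carrotsearch.lingo3g.Lingo3GClusteringAlgorithm. Document sources and
             * attribute descriptors need their own imports.
             */
            /* A few example documents. Normally you would need at least 20 for reasonable clusters. */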
            final String [][] data = new String [] []
            {
                {
                    "http://en.wikipedia.org/wiki/Data_mining",
                    "Data mining - Wikipedia, the free encyclopedia",
                    "Article about knowledge-discovery in databases (KDD), the practice of automatically searching large stores of data for patterns."
                },

                {
                    "http://www.ccsu.edu/datamining/resources.html",
                    "CCSU - Data Mining",
                    "A collection of Data Mining links edited by the Central Connecticut State University ... Graduate Certificate Program. Data Mining Resources. Resources. Groups ..."
                },

                {
                    "http://www.kdnuggets.com/",
                    "KDnuggets: Data Mining, Web Mining, and Knowledge Discovery",
                    "Newsletter on the data mining and knowledge industries, offering information on data mining, knowledge discovery, text mining, and web mining software, courses, jobs, publications, and meetings."
                },

                {
                    "http://en.wikipedia.org/wiki/Data-mining",
                    "Data mining - Wikipedia, the free encyclopedia",
                    "Data mining is considered a subfield within the Computer Science field of knowledge discovery. ... claim to perform \"data mining\" by automating the creation ..."
                },

                {
                    "http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm",
                    "Data Mining: What is Data Mining?",
                    "Outlines what knowledge discovery, the process of analyzing data from different perspectives and summarizing it into useful information, can do and how it works."
                },
            };

            /* Prepare Carrot2 documents */
            final ArrayList<Document> documents = new ArrayList<Document>();
            for (String [] row : data)
            {
                documents.add(new Document(row[1], row[2], row[0]));
            }

            /* A controller to manage the processing pipeline. */
            final Controller controller = ControllerFactory.createSimple();

            /*
             * Perform clustering by topic using the Lingo3G algorithm. Lingo3G can 
             * take advantage of the original query, so we provide it along with the documents.
             */
            final ProcessingResult result = controller.process(documents, "data mining",
                Lingo3GClusteringAlgorithm.class);
            final List<Cluster> clusters = result.getClusters();

Clustering documents from document sources

With default settings

One common way to use the Lingo3G Java API is to fetch a number of documents from an IDocumentSource and cluster them. The simplest, though least flexible, way to do this is to call the Controller.process(String, Integer, Class...) method. The code shown below retrieves 100 search results for the query "data mining" from EToolsDocumentSource and clusters them using Lingo3GClusteringAlgorithm.

            /* A controller to manage the processing pipeline. */
            final Controller controller = ControllerFactory.createSimple();
            
            /* Perform processing */
            final ProcessingResult result = controller.process("data mining", 100,
                EToolsDocumentSource.class, Lingo3GClusteringAlgorithm.class);
    
            /* Documents fetched from the document source, clusters created by Lingo3G. */
            final List<Document> documents = result.getDocuments();
            final List<Cluster> clusters = result.getClusters();

With custom settings

If your production code needs to fetch documents from popular search engines, it is very important that you generate and use your own API key. You can pass the API key along with the query and the requested number of results in an attribute map. The Lingo3G manual lists all supported attributes along with their keys, types and allowed values. The code shown below fetches and clusters 50 results from Bing5DocumentSource.

            /* A controller to manage the processing pipeline. */
            final Controller controller = ControllerFactory.createSimple();
    
            /* Prepare attributes */
            final Map<String, Object> attributes = new HashMap<String, Object>();

            /* Put your own API key here! */
            Bing5DocumentSourceDescriptor.attributeBuilder(attributes)
                .apiKey(BingKeyAccess.getKey());
    
            /* Query and the required number of results */
            attributes.put(CommonAttributesDescriptor.Keys.QUERY, "clustering");
            attributes.put(CommonAttributesDescriptor.Keys.RESULTS, 50);
    
            /* Perform processing */
            final ProcessingResult result = controller.process(attributes, 
                Bing5DocumentSource.class, Lingo3GClusteringAlgorithm.class);

            /* Documents fetched from the document source, clusters created by Lingo3G. */
            final List<Document> documents = result.getDocuments();
            final List<Cluster> clusters = result.getClusters();

Setting attributes of clustering algorithms and document sources

By attribute keys

You can change the default behaviour of Lingo3G by changing its attributes. For a complete list of available attributes, their identifiers, types and allowed values, please see the Lingo3G manual.

To pass attributes to Lingo3G, put them into a Map, along with the documents being clustered. The code shown below searches the web using Bing5DocumentSource and clusters the results using Lingo3GClusteringAlgorithm customized to create a flat clustering.

            /* A controller to manage the processing pipeline. */
            final Controller controller = ControllerFactory.createSimple();
            
            /* Prepare attribute map */
            final Map<String, Object> attributes = new HashMap<String, Object>();

            /* Put attribute values using direct keys. */
            attributes.put(CommonAttributesDescriptor.Keys.QUERY, "data mining");
            attributes.put(CommonAttributesDescriptor.Keys.RESULTS, 100);
            attributes.put("max-hierarchy-depth", 1);

            /* Put your own API key here! */
            attributes.put(Bing5DocumentSourceDescriptor.Keys.API_KEY, BingKeyAccess.getKey()); 

            /* Perform processing */
            final ProcessingResult result = controller.process(attributes,
                Bing5DocumentSource.class, Lingo3GClusteringAlgorithm.class);
    
            /* Documents fetched from the document source, clusters created by Lingo3G. */
            final List<Document> documents = result.getDocuments();
            final List<Cluster> clusters = result.getClusters();

Using attribute builders

As an alternative to the raw attribute map used in the previous example, you can use attribute map builders. Attribute map builders have a number of advantages:

  • Type-safety: the correct type of the value will be enforced at compile time
  • Error prevention: unexpected results caused by typos in attribute name strings are avoided
  • Early error detection: in case an attribute's key changes, your compiler will detect that
  • IDE support: your IDE will suggest the right method names and parameters

A possible disadvantage of attribute builders is that one algorithm's attributes can be divided among a number of builders and hence may not be readily available in your IDE's auto-complete window. Please consult the attribute documentation in the Lingo3G manual for pointers to the appropriate builder classes and methods.
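
The type-safety and typo-prevention points above can be illustrated with plain Java. The sketch below is not the Lingo3G or Carrot2 builder API: the AttributeBuilderSketch class and its nested Builder are hypothetical stand-ins written for this illustration, and only the "max-hierarchy-depth" key is taken from the raw-map example earlier on this page.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in; not part of the Lingo3G or Carrot2 API. */
final class AttributeBuilderSketch {

    /** Attribute key borrowed from the raw-map example on this page. */
    static final String MAX_HIERARCHY_DEPTH = "max-hierarchy-depth";

    /** A tiny typed builder that writes into a caller-supplied map. */
    static final class Builder {
        private final Map<String, Object> target;

        Builder(Map<String, Object> target) {
            this.target = target;
        }

        /** Only an int compiles here, and the key string lives in one place. */
        Builder maxHierarchyDepth(int depth) {
            target.put(MAX_HIERARCHY_DEPTH, depth);
            return this;
        }
    }

    /** Both mistakes below compile, because the map accepts any key and any value. */
    static Map<String, Object> rawMapWithMistakes() {
        Map<String, Object> attributes = new HashMap<>();
        attributes.put("max-hierachy-depth", 1);   // misspelled key: just another map entry
        attributes.put(MAX_HIERARCHY_DEPTH, "1");  // String where an integer was intended
        return attributes;
    }

    /** The same intent expressed through the typed builder. */
    static Map<String, Object> builderMap() {
        Map<String, Object> attributes = new HashMap<>();
        new Builder(attributes).maxHierarchyDepth(1);
        // new Builder(attributes).maxHierarchyDepth("1");  // would not compile
        // new Builder(attributes).maxHierachyDepth(1);     // would not compile
        return attributes;
    }

    public static void main(String[] args) {
        System.out.println("raw map:     " + rawMapWithMistakes());
        System.out.println("builder map: " + builderMap());
    }
}
```

With the raw map, both mistakes surface only at run time, if at all. With the builder, the compiler rejects them.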

The code shown below clusters an example collection of Documents using Lingo3G tuned to return slightly fewer clusters than by default, but with increased hierarchy depth.

            /* A controller to manage the processing pipeline. */
            final Controller controller = ControllerFactory.createSimple();
            
            /* Prepare attribute map */
            final Map<String, Object> attributes = new HashMap<String, Object>();

            /* Put values using attribute builders */
            CommonAttributesDescriptor
                .attributeBuilder(attributes)
                    .documents(SampleDocumentData.DOCUMENTS_DATA_MINING);

            Lingo3GClusteringAlgorithmDescriptor
                .attributeBuilder(attributes).attributes()
                    .maxHierarchyDepth(3)
                    .clusterCountBase(4);
            
            /* Perform processing */
            final ProcessingResult result = controller.process(attributes,
                Lingo3GClusteringAlgorithm.class);
    
            /* Documents passed in through the attribute map, clusters created by Lingo3G. */
            final List<Document> documents = result.getDocuments();
            final List<Cluster> clusters = result.getClusters();

Pooling of processing component instances, caching of processing results

The examples shown above used a simple controller to manage the clustering process. While the simple controller is enough for one-shot requests, for long-running applications, such as web applications, it's better to use a controller which supports pooling of processing component instances and caching of processing results.

        /*
         * Create the caching controller. You need only one caching controller instance
         * per application life cycle. This controller instance will cache the results
         * fetched from any document source and also clusters generated by the Lingo3G
         * algorithm.
         */
        final Controller controller = ControllerFactory.createCachingPooling(
            IDocumentSource.class, Lingo3GClusteringAlgorithm.class);

        /*
         * Before using the caching controller, you must initialize it. On initialization,
         * you can set default values for some attributes. In this example, we'll set the
         * default results number to 50, set the API key and set Lingo3G to generate three
         * levels of cluster hierarchy.
         */
        final Map<String, Object> globalAttributes = new HashMap<String, Object>();
        CommonAttributesDescriptor
            .attributeBuilder(globalAttributes)
                .results(50);

        /* Put your own API key here */
        Bing5DocumentSourceDescriptor
            .attributeBuilder(globalAttributes)
                .apiKey(BingKeyAccess.getKey());

        Lingo3GClusteringAlgorithmDescriptor
            .attributeBuilder(globalAttributes).attributes()
                .maxHierarchyDepth(3);

        controller.init(globalAttributes);

        /*
         * The controller is now ready to perform queries. To show that the documents from
         * the document input are cached, we will perform the same query twice and measure
         * the time for each query.
         */
        ProcessingResult result;
        long start, duration;

        final Map<String, Object> attributes;
        attributes = new HashMap<String, Object>();
        CommonAttributesDescriptor.attributeBuilder(attributes).query("data mining");

        start = System.currentTimeMillis();
        result = controller.process(attributes, Bing5DocumentSource.class,
            Lingo3GClusteringAlgorithm.class);
        duration = System.currentTimeMillis() - start;
        System.out.println(duration + " ms (empty cache)");

        start = System.currentTimeMillis();
        result = controller.process(attributes, Bing5DocumentSource.class,
            Lingo3GClusteringAlgorithm.class);
        duration = System.currentTimeMillis() - start;
        System.out.println(duration + " ms (documents and clusters from cache)");
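
To make the idea of caching processing results concrete, here is a conceptual sketch in plain Java. It is not how the caching controller is implemented: ResultCacheSketch and the "query" key are hypothetical names used only for illustration. The sketch assumes that results are keyed by the contents of the attribute map, so that a second request with equal attributes skips the expensive step.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical illustration of result caching; not Lingo3G's or Carrot2's implementation. */
final class ResultCacheSketch {

    private final Map<Map<String, Object>, String> cache = new ConcurrentHashMap<>();
    private final Function<Map<String, Object>, String> expensiveStep;

    ResultCacheSketch(Function<Map<String, Object>, String> expensiveStep) {
        this.expensiveStep = expensiveStep;
    }

    /** Returns the cached result for an equal attribute map, computing it at most once. */
    String process(Map<String, Object> attributes) {
        // Copy the key so that later changes to the caller's map cannot affect the cache.
        Map<String, Object> key = new HashMap<>(attributes);
        return cache.computeIfAbsent(key, expensiveStep);
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        ResultCacheSketch controller = new ResultCacheSketch(attrs -> {
            calls.incrementAndGet();  // stands in for fetching and clustering
            return "result for " + attrs.get("query");
        });

        Map<String, Object> attributes = new HashMap<>();
        attributes.put("query", "data mining");

        controller.process(attributes);  // computed
        controller.process(attributes);  // served from the cache
        System.out.println("expensive calls: " + calls.get());
    }
}
```

The design choice worth noting is the defensive copy of the attribute map. Using the caller's mutable map directly as a cache key would break lookups if the caller later changed it.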

Clustering non-English content

This example shows how to cluster non-English content. By default, Lingo3G assumes that the documents provided for clustering are written in English. When clustering content written in a different language, it is important to indicate the language to Lingo3G, so that it can use the lexical resources (stop words, tokenizer, stemmer) appropriate for that language.

There are two ways to indicate the desired clustering language to Lingo3G:

  1. By setting the language of each document in its Document.LANGUAGE field. The language does not have to be the same for all input documents; the algorithm can also handle multiple languages in one document set. Please see the MultilingualClustering.languageAggregationStrategy attribute for more details.
  2. By setting the fallback language. For documents with an undefined Document.LANGUAGE field, Lingo3G will assume the fallback language, which is English by default. You can change the fallback language by setting the MultilingualClustering.defaultLanguage attribute.

If the language of the documents is unknown, it can be detected automatically by setting the language-recognition attribute to true.

        try (Controller controller = ControllerFactory.createCachingPooling(IDocumentSource.class)) {
          /*
           * In the first call, we'll cluster a document list, 
           * setting the language for each document separately.
           */
          final List<Document> documents = Lists.newArrayList();
          for (Document document : SampleDocumentData.DOCUMENTS_DATA_MINING)
          {
              documents.add(new Document(
                  document.getTitle(), 
                  document.getSummary(),
                  document.getContentUrl(), 
                  LanguageCode.ENGLISH));
          }
  
          /* Prepare attributes */
          final Map<String, Object> attributes = new HashMap<String, Object>();
          CommonAttributesDescriptor
              .attributeBuilder(attributes)
                  .documents(documents);
  
          /* Perform clustering and display results */
          System.out.println("Clustering in English");
          ConsoleFormatter.displayClusters(controller.process(attributes,
              Lingo3GClusteringAlgorithm.class).getClusters(), 3);
  
          /*
           * If none of the documents have the language property set and
           * language detection is disabled, the fallback
           * clustering language will be English. You can change the global fallback
           * language using the "MultilingualClustering.defaultLanguage" attribute.
           */
          final String germanQuery = "bundestag";
          final List<Document> germanDocuments = getGermanDocuments(germanQuery);
  
          /* Prepare attributes */
          attributes.clear();
          CommonAttributesDescriptor
              .attributeBuilder(attributes)
                  .documents(germanDocuments)
                  .query(germanQuery);
  
          // Disable automatic language recognition.
          Lingo3GClusteringAlgorithmDescriptor.attributeBuilder(attributes)
              .attributes()
              .languageRecognition(false);
  
          // Set the default fallback language.
          Lingo3GClusteringAlgorithmDescriptor.attributeBuilder(attributes)
              .multilingualClustering()
                  .defaultLanguage(LanguageCode.GERMAN);
  
          /* Perform clustering and display results */
          System.out.println("Clustering in German (enforced)");
          ConsoleFormatter.displayClusters(
              controller.process(attributes, Lingo3GClusteringAlgorithm.class).getClusters(), 3);
  
          /*
           * If you don't know the language in advance, language recognition will handle
           * this for you.
           * 
           * In that case, Lingo3G will try to determine the language for each document,
           * perform clustering for each language separately and aggregate the results
           * according to the "MultilingualClustering#languageAggregationStrategy"
           * attribute.
           */
          List<Document> mixedLanguages = Lists.newArrayList();
          mixedLanguages.addAll(getGermanDocuments("bundestag"));
          mixedLanguages.addAll(getSpanishDocuments("parlamento"));
          mixedLanguages.addAll(getFrenchDocuments("parlement"));
  
          /* Prepare attributes */
          attributes.clear();
          CommonAttributesDescriptor
              .attributeBuilder(attributes)
                  .documents(mixedLanguages);
          // Enable language recognition.
          Lingo3GClusteringAlgorithmDescriptor.attributeBuilder(attributes)
              .attributes()
              .languageRecognition(true);
          // And define no-flattening strategy so that we can see which languages have been
          // discovered. Note that very short snippets of text may be classified incorrectly. 
          Lingo3GClusteringAlgorithmDescriptor.attributeBuilder(attributes)
              .multilingualClustering()
                  .languageAggregationStrategy(LanguageAggregationStrategy.FLATTEN_NONE);
  
          /* Perform clustering and display results */
          System.out.println("Clustering a mix of languages (French, German, Spanish)");
          ConsoleFormatter.displayClusters(controller.process(attributes,
              Lingo3GClusteringAlgorithm.class).getClusters(), 3);
        }
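
The language selection rules described in this section can be summarised in a small plain-Java sketch. It is a reading of the text above, not Lingo3G code: LanguageResolutionSketch, Doc and the toy detector are hypothetical. The precedence (explicit document language first, then recognition if enabled, then the fallback) and the fallback when detection yields nothing are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical illustration; not Lingo3G's implementation or API. */
final class LanguageResolutionSketch {

    /** Minimal stand-in for a document: text plus an optional language (null when unset). */
    static final class Doc {
        final String text;
        final String language;

        Doc(String text, String language) {
            this.text = text;
            this.language = language;
        }
    }

    /**
     * Picks the language for one document: the explicit value if set, otherwise the
     * detector's answer when recognition is enabled, otherwise the fallback.
     */
    static String resolve(Doc doc, boolean recognitionEnabled,
                          Function<String, String> detector, String fallback) {
        if (doc.language != null) {
            return doc.language;
        }
        if (recognitionEnabled) {
            String detected = detector.apply(doc.text);
            if (detected != null) {
                return detected;
            }
        }
        return fallback;
    }

    /** Groups documents by resolved language, keeping first-seen language order. */
    static Map<String, List<Doc>> groupByLanguage(List<Doc> docs, boolean recognitionEnabled,
                                                  Function<String, String> detector,
                                                  String fallback) {
        Map<String, List<Doc>> groups = new LinkedHashMap<>();
        for (Doc doc : docs) {
            String language = resolve(doc, recognitionEnabled, detector, fallback);
            groups.computeIfAbsent(language, k -> new ArrayList<>()).add(doc);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Toy detector, for illustration only.
        Function<String, String> detector =
            text -> text.contains("Bundestag") ? "GERMAN" : null;

        List<Doc> docs = Arrays.asList(
            new Doc("Data mining overview", "ENGLISH"),
            new Doc("Der Bundestag tagt", null),
            new Doc("untagged text", null));

        Map<String, List<Doc>> groups = groupByLanguage(docs, true, detector, "ENGLISH");
        for (Map.Entry<String, List<Doc>> entry : groups.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue().size() + " document(s)");
        }
    }
}
```

Each resulting group corresponds to one per-language clustering run. How the per-language results are then combined is governed by the MultilingualClustering.languageAggregationStrategy attribute and is not modelled here.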

 

Copyright (c) Dawid Weiss, Stanislaw Osinski