UseCase "Manage indexing resource silos (entitizing and feature extraction and synthesis)"

Component1262227671

Content

IsCarriedOutBy SearchEngineer

Example: indexing a Semantic MediaWiki

When you index a Semantic MediaWiki (which is a resource silo), you create a job file, named something like myIndexingJob.rb, that sets up the following:

Elasticsearch

  • oESC = Dataspects::ElasticsearchCluster
  • sIndexName

TIKA

  • $sTIKAServerURL

Semantic MediaWiki

  • oSMW = Dataspects::SemanticMediaWiki

Facets containing the pages to index

  • oSMWPages = Dataspects::Facet.from_oSEMANTICMEDIAWIKI(oSMW).from_mCATEGORIES('Subject') do |oResource|

Resource entitization

A resource instance (e.g. a Semantic MediaWiki page) can instantiate a ResourceEntitizer, passing itself, and specify, for example, that each section represents an entity. For an example, see aEntities() for Dataspects::SemanticMediaWikiPage.
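
The idea of a resource handing itself to an entitizer can be sketched roughly as follows. This is only an illustration under stated assumptions: the module EntitizerSketch, its Page, ResourceEntitizer and Entity classes, and the :section_per_entity mode are hypothetical stand-ins and not the actual Dataspects API. The sketch treats each "== Heading ==" section of a page's wikitext as one entity.

```ruby
# Hypothetical stand-ins for illustration; not the actual Dataspects classes.
module EntitizerSketch
  Entity = Struct.new(:sTitle, :sContent)

  class ResourceEntitizer
    # oResource must respond to #sName and #sText; sMode says what counts as one entity.
    def initialize(oResource, sMode)
      @oResource = oResource
      @sMode = sMode
    end

    def aEntities
      # Default: the whole resource is a single entity.
      return [Entity.new(@oResource.sName, @oResource.sText)] unless @sMode == :section_per_entity

      # Split the wikitext in front of every level-2 heading line ("== ... ==").
      aChunks = @oResource.sText.split(/^(?=[=]{2}[^=])/)
      aChunks.map do |sChunk|
        oMatch = sChunk.match(/\A==\s*(.+?)\s*==\s*\n(.*)\z/m)
        # Text before the first heading has no title and is skipped.
        Entity.new(oMatch[1], oMatch[2].strip) if oMatch
      end.compact
    end
  end

  class Page
    attr_reader :sName, :sText

    def initialize(sName, sText)
      @sName = sName
      @sText = sText
    end

    # The resource passes itself to the entitizer and picks the mode.
    def aEntities
      ResourceEntitizer.new(self, :section_per_entity).aEntities
    end
  end
end

oPage = EntitizerSketch::Page.new("Demo", "Intro\n== First ==\nAlpha\n== Second ==\nBeta\n")
oPage.aEntities.each { |oEntity| puts "#{oEntity.sTitle}: #{oEntity.sContent}" }
```

Keeping the splitting logic in the entitizer rather than in the page class means other resource types could reuse it with a different mode.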

Entities annotation

See Concept "Compiling Dataspects::Entity annotations"

Annotations for simple SMW entities (i.e. an SMW page represents nothing but an entire entity) are collected like this:

  1. by your customized version of aEntities for Dataspects::SemanticMediaWikiPage < Dataspects::Resource, e.g. declared in your myIndexingJob.rb
  2. by aEntities in Dataspects::SemanticMediaWikiPage < Dataspects::Resource
  3. by initialize in Dataspects::Subject < Dataspects::Entity
  4. by initialize in Dataspects::Entity
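
The four places listed above can be pictured with a small, self-contained Ruby sketch. Everything in it is an assumption made for illustration: the module AnnotationSketch, the annotate helper, the [predicate, object] pair representation and the example values are hypothetical, and the real Dataspects classes may collect and prioritize annotations differently. The sketch only shows how each layer of the class hierarchy can contribute its own annotations to the same entity.

```ruby
# Hypothetical stand-ins for illustration; not the actual Dataspects classes.
module AnnotationSketch
  class Entity
    attr_reader :aAnnotations

    def initialize(hDefaults = {})
      @aAnnotations = []
      # Place 4: annotations every entity receives in Entity#initialize.
      hDefaults.each { |sPredicate, sObject| annotate(sPredicate, sObject) }
    end

    def annotate(sPredicate, sObject)
      @aAnnotations << [sPredicate, sObject]
    end
  end

  class Subject < Entity
    def initialize(hDefaults = {})
      super
      # Place 3: annotations specific to subjects (value assumed here).
      annotate('HasEntityClass', 'Subject')
    end
  end

  class Resource
    def initialize(sName)
      @sName = sName
    end
  end

  class SemanticMediaWikiPage < Resource
    # Place 2: the stock aEntities of the page class.
    def aEntities
      oEntity = Subject.new('HasResourceName' => @sName)
      oEntity.annotate('HasEntityTitle', @sName)
      [oEntity]
    end
  end
end

# Place 1: a job file reopens the class and redeclares aEntities,
# here by wrapping the stock version and adding one more annotation.
class AnnotationSketch::SemanticMediaWikiPage
  alias_method :aEntitiesDefault, :aEntities

  def aEntities
    aEntitiesDefault.each { |oEntity| oEntity.annotate('HasEntityKeyword', 'demo') }
  end
end

oPage = AnnotationSketch::SemanticMediaWikiPage.new('Main Page')
oPage.aEntities.first.aAnnotations.each { |sP, sO| puts "#{sP} = #{sO}" }
```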

How to compile an entity's Elasticsearch document

For example, an instance of Dataspects::SemanticMediaWikiPage < Dataspects::Resource < Object sets these instance variables:

@oResourceSilo
@oFullHTMLSource

@hBrowseBySubject

@sHasResourceType
@sHasResourceName
@sHasResourceURL

Entitization

Dataspects::SemanticMediaWikiPage.aEntities then entitizes this resource.

Dataspects::SemanticMediaWikiPage.aEntities can be customized by redeclaring it in myIndexingJob.rb.

Simple entities

If the resource represents a simple SMW entity (that is, the page as a whole represents a single entity), then Dataspects::SemanticMediaWikiPage.aEntities contains exactly one oEntity, which is a Dataspects::Subject < Dataspects::Entity < Object.

Populate an entity's Elasticsearch document fields

hEntityDoc = {
  ####################################################################################################################################################################
  # added to Dataspects::Entity.aAnnotations in Dataspects::Entity.initialize
  # default to values set in Dataspects::SemanticMediaWikiPage.initialize and if not set there, then in Dataspects::Resource < Object
  # Resource silo level
    OriginatedFromResourceSiloID: oEntity.get_sObjectValue_for_sPREDICATENAME('OriginatedFromResourceSiloID'),
  # Resource level
    HasResourceName: oEntity.get_sObjectValue_for_sPREDICATENAME('HasResourceName'),
    HasResourceURL: oEntity.get_sObjectValue_for_sPREDICATENAME('HasResourceURL'),
    HasResourceType: oEntity.get_sObjectValue_for_sPREDICATENAME('HasResourceType'),
  ####################################################################################################################################################################
  # Entity/subject level
    # added to Dataspects::Entity.aAnnotations in Dataspects::Subject.initialize
    HasEntityClass: oEntity.get_sObjectValue_for_sPREDICATENAME('HasEntityClass'),
    # added to Dataspects::Entity.aAnnotations in Dataspects::Entity.initialize
    # default to values set in Dataspects::SemanticMediaWikiPage < Resource < Object and if not set there, then in Dataspects::Resource < Object
    HasEntityName: oEntity.get_sObjectValue_for_sPREDICATENAME('HasEntityName'),
    HasEntityType: oEntity.get_sObjectValue_for_sPREDICATENAME('HasEntityType'),
    HasEntityTitle: oEntity.get_sObjectValue_for_sPREDICATENAME('HasEntityTitle'),
    HasEntityContent: oEntity.get_sObjectValue_for_sPREDICATENAME('HasEntityContent'),
    HasEntityKeywords: oEntity.get_aObjectValues_for_sPREDICATENAME('HasEntityKeyword'),
    # Synthetic
    HasEntityTypeAndEntityTitle: "#{oEntity.get_sObjectValue_for_sPREDICATENAME('HasEntityType')} \"#{oEntity.get_sObjectValue_for_sPREDICATENAME('HasEntityTitle')}\"",
    # Quadruples
    HasEntityAnnotations: oEntity.get_aHasEntityAnnotations
}
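
To make the accessor calls used in hEntityDoc more concrete, here is a minimal sketch of how such lookups might behave over a list of annotations. It is an assumption, not the Dataspects implementation: EntitySketch, the [predicate, object] pair storage, the helper compile_hEntityDoc and the sample values are all hypothetical. Only the accessor names and the field names are taken from the listing above.

```ruby
require 'json'

# Hypothetical stand-in for Dataspects::Entity, for illustration only.
class EntitySketch
  def initialize(aAnnotations)
    @aAnnotations = aAnnotations # assumed shape: [[predicate, object], ...]
  end

  # All object values annotated with the given predicate.
  def get_aObjectValues_for_sPREDICATENAME(sPredicateName)
    @aAnnotations.select { |sPredicate, _| sPredicate == sPredicateName }
                 .map { |_, sObject| sObject }
  end

  # The first object value for the predicate, or nil if there is none.
  def get_sObjectValue_for_sPREDICATENAME(sPredicateName)
    get_aObjectValues_for_sPREDICATENAME(sPredicateName).first
  end
end

# Builds a reduced document with a few of the fields shown above.
def compile_hEntityDoc(oEntity)
  sType = oEntity.get_sObjectValue_for_sPREDICATENAME('HasEntityType')
  sTitle = oEntity.get_sObjectValue_for_sPREDICATENAME('HasEntityTitle')
  {
    HasEntityType: sType,
    HasEntityTitle: sTitle,
    HasEntityKeywords: oEntity.get_aObjectValues_for_sPREDICATENAME('HasEntityKeyword'),
    # Synthetic field: entity type followed by the quoted title.
    HasEntityTypeAndEntityTitle: "#{sType} \"#{sTitle}\""
  }
end

oEntity = EntitySketch.new([
  ['HasEntityType', 'UseCase'],
  ['HasEntityTitle', 'Manage silos'],
  ['HasEntityKeyword', 'indexing'],
  ['HasEntityKeyword', 'entitizing']
])

# The resulting hash serializes directly to a JSON document body.
puts JSON.pretty_generate(compile_hEntityDoc(oEntity))
```

Note the distinction the listing relies on: the singular accessor returns one value, while the plural accessor returns an array, which is why the multi-valued HasEntityKeyword predicate feeds the HasEntityKeywords field.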