UseCase "Manage indexing resource silos (entitizing and feature extraction and synthesis)"
IsCarriedOutBy SearchEngineer
Example: indexing a Semantic MediaWiki
When you index a Semantic MediaWiki (which is a resource silo), you create a file named something like myIndexingJob.rb:
Elasticsearch
- oESC = Dataspects::ElasticsearchCluster
- sIndexName
TIKA
- $sTIKAServerURL
Semantic MediaWiki
- oSMW = Dataspects::SemanticMediaWiki
Facets containing the pages to index
- oSMWPages = Dataspects::Facet.from_oSEMANTICMEDIAWIKI(oSMW).from_mCATEGORIES('Subject') do |oResource|
Resource entitization
A resource instance (e.g. a Semantic MediaWiki page) can instantiate a ResourceEntitizer, passing itself, and specify, for example, that each section represents an entity. For an example, see aEntities() for Dataspects::SemanticMediaWikiPage.
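The idea of section-based entitization can be sketched as follows. This is a minimal illustration only: the module SectionSketch, the Struct-based Entity, the method aEntities_from_sections and the sWikitext reader are hypothetical names, not the actual Dataspects API, and the sketch assumes that sections are marked by level-2 wikitext headings.

```ruby
# Hypothetical stand-ins; these are NOT the real Dataspects classes.
module SectionSketch
  Entity = Struct.new(:sTitle, :sContent)

  class ResourceEntitizer
    def initialize(oResource)
      @oResource = oResource
    end

    # Assumption: every "== Heading ==" section of the wikitext is one entity.
    def aEntities_from_sections
      aParts = @oResource.sWikitext.split(/^==\s*(.+?)\s*==\s*$/)
      aParts.shift # drop any text before the first heading
      aParts.each_slice(2).map { |sTitle, sBody| Entity.new(sTitle, sBody.to_s.strip) }
    end
  end

  class Page
    attr_reader :sWikitext

    def initialize(sWikitext)
      @sWikitext = sWikitext
    end

    # The resource passes itself to the entitizer.
    def aEntities
      ResourceEntitizer.new(self).aEntities_from_sections
    end
  end
end

oPage = SectionSketch::Page.new("Intro\n== Install ==\nRun setup.\n== Configure ==\nEdit the file.\n")
oPage.aEntities.each { |oEntity| puts "#{oEntity.sTitle}: #{oEntity.sContent}" }
```

Here the resource hands itself to the entitizer, which decides how the resource is cut into entities; a different entitization rule would only replace aEntities_from_sections.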
Entities annotation
See Concept "Compiling Dataspects::Entity annotations"
Annotations for simple SMW entities (i.e. a SMW page represents nothing but an entire entity) are collected like this:
- by your customized version of aEntities for Dataspects::SemanticMediaWikiPage < Dataspects::Resource, e.g. declared in your myIndexingJob.rb
- by aEntities in Dataspects::SemanticMediaWikiPage < Dataspects::Resource
- by initialize in Dataspects::Subject < Dataspects::Entity
- by initialize in Dataspects::Entity
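The layering listed above might look roughly like the following sketch. All classes here are simplified stand-ins inside a hypothetical AnnotationSketch module; the constructor arguments, the add_annotation helper and the 'Subject' value for HasEntityClass are assumptions made for illustration and are not taken from Dataspects itself.

```ruby
# Hypothetical stand-ins; the real Dataspects classes are not reproduced here.
module AnnotationSketch
  class Entity
    attr_reader :aAnnotations

    def initialize(hResourceDefaults)
      @aAnnotations = []
      # Resource-level annotations default to values taken from the resource.
      hResourceDefaults.each { |sPredicate, sObject| add_annotation(sPredicate, sObject) }
    end

    def add_annotation(sPredicate, sObject)
      @aAnnotations << [sPredicate.to_s, sObject]
    end
  end

  class Subject < Entity
    def initialize(hResourceDefaults, sEntityClass)
      super(hResourceDefaults) # Entity#initialize runs first
      add_annotation('HasEntityClass', sEntityClass)
    end
  end

  class SemanticMediaWikiPage
    def initialize(sName, sURL)
      @sHasResourceName = sName
      @sHasResourceURL = sURL
    end

    # Simple entity: the page as a whole is one Subject.
    def aEntities
      hDefaults = { HasResourceName: @sHasResourceName, HasResourceURL: @sHasResourceURL }
      [Subject.new(hDefaults, 'Subject')]
    end
  end
end

oPage = AnnotationSketch::SemanticMediaWikiPage.new('Main Page', 'https://wiki.example.org/Main_Page')
oPage.aEntities.first.aAnnotations.each { |sPredicate, sObject| puts "#{sPredicate} = #{sObject}" }
```

Each level contributes its own annotations: the page supplies resource-level defaults, Entity#initialize records them, and Subject#initialize adds the entity class on top.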
How to compile an entity's Elasticsearch document
E.g. an instance of Dataspects::SemanticMediaWikiPage < Dataspects::Resource < Object instantiates these variables:
- @oResourceSilo
- @oFullHTMLSource
- @hBrowseBySubject
- @sHasResourceType
- @sHasResourceName
- @sHasResourceURL
Entitization
Dataspects::SemanticMediaWikiPage.aEntities then entitizes this resource. Dataspects::SemanticMediaWikiPage.aEntities can be customized by redeclaring it in myIndexingJob.rb.
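Redeclaring a method in a job file relies on Ruby's open classes: reopening a class and defining a method again replaces the earlier definition. The sketch below uses a stand-in class in a hypothetical DataspectsSketch module, and the hash-based entities are an assumption chosen only to keep the example short.

```ruby
# Stand-in for the library class; the real Dataspects::SemanticMediaWikiPage
# is not reproduced here.
module DataspectsSketch
  class SemanticMediaWikiPage
    def initialize(sHasResourceName)
      @sHasResourceName = sHasResourceName
    end

    # Default behaviour in this sketch: the whole page is one entity.
    def aEntities
      [{ HasEntityName: @sHasResourceName }]
    end
  end
end

# What a job file could do: reopen the class and redeclare aEntities.
module DataspectsSketch
  class SemanticMediaWikiPage
    def aEntities
      [{ HasEntityName: @sHasResourceName, HasEntityKeywords: ['customized'] }]
    end
  end
end

p DataspectsSketch::SemanticMediaWikiPage.new('Main Page').aEntities
```

Because the later definition wins, every page object created after the job file is loaded uses the customized aEntities.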
Simple entities
If the resource represents a simple SMW entity (that is, the page as a whole represents a single entity), then Dataspects::SemanticMediaWikiPage.aEntities contains one single oEntity, which is a Dataspects::Subject < Dataspects::Entity < Object.
Populate an entity's Elasticsearch document fields
hEntityDoc = {
####################################################################################################################################################################
# added to Dataspects::Entity.aAnnotations in Dataspects::Entity.initialize
# default to values set in Dataspects::SemanticMediaWikiPage.initialize and if not set there, then in Dataspects::Resource < Object
# Resource silo level
OriginatedFromResourceSiloID: oEntity.get_sObjectValue_for_sPREDICATENAME('OriginatedFromResourceSiloID'),
# Resource level
HasResourceName: oEntity.get_sObjectValue_for_sPREDICATENAME('HasResourceName'),
HasResourceURL: oEntity.get_sObjectValue_for_sPREDICATENAME('HasResourceURL'),
HasResourceType: oEntity.get_sObjectValue_for_sPREDICATENAME('HasResourceType'),
####################################################################################################################################################################
# Entity/subject level
# added to Dataspects::Entity.aAnnotations in Dataspects::Subject.initialize
HasEntityClass: oEntity.get_sObjectValue_for_sPREDICATENAME('HasEntityClass'),
# added to Dataspects::Entity.aAnnotations in Dataspects::Entity.initialize
# default to values set in Dataspects::SemanticMediaWikiPage < Resource < Object and if not set there, then in Dataspects::Resource < Object
HasEntityName: oEntity.get_sObjectValue_for_sPREDICATENAME('HasEntityName'),
HasEntityType: oEntity.get_sObjectValue_for_sPREDICATENAME('HasEntityType'),
HasEntityTitle: oEntity.get_sObjectValue_for_sPREDICATENAME('HasEntityTitle'),
HasEntityContent: oEntity.get_sObjectValue_for_sPREDICATENAME('HasEntityContent'),
HasEntityKeywords: oEntity.get_aObjectValues_for_sPREDICATENAME('HasEntityKeyword'),
# Synthetic
HasEntityTypeAndEntityTitle: "#{oEntity.get_sObjectValue_for_sPREDICATENAME('HasEntityType')} \"#{oEntity.get_sObjectValue_for_sPREDICATENAME('HasEntityTitle')}\"",
# Quadruples
HasEntityAnnotations: oEntity.get_aHasEntityAnnotations
}
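To make the lookups in the hash above concrete, here is a minimal sketch of how the two getters could behave. It assumes that annotations are stored as predicate/object pairs and that the singular getter returns the first matching value; the GetterSketch module and the add_annotation helper are hypothetical, and the real Dataspects::Entity may store and resolve annotations differently.

```ruby
# Hypothetical stand-in; NOT the real Dataspects::Entity.
module GetterSketch
  class Entity
    def initialize
      @aAnnotations = []
    end

    # Returns self so that calls can be chained.
    def add_annotation(sPredicateName, sObjectValue)
      @aAnnotations << [sPredicateName, sObjectValue]
      self
    end

    # All object values annotated with the given predicate.
    def get_aObjectValues_for_sPREDICATENAME(sPredicateName)
      @aAnnotations.select { |sP, _| sP == sPredicateName }.map { |_, sO| sO }
    end

    # First object value for the predicate, or nil if there is none.
    def get_sObjectValue_for_sPREDICATENAME(sPredicateName)
      get_aObjectValues_for_sPREDICATENAME(sPredicateName).first
    end
  end
end

oEntity = GetterSketch::Entity.new
oEntity.add_annotation('HasEntityType', 'Subject')
       .add_annotation('HasEntityTitle', 'Main Page')
       .add_annotation('HasEntityKeyword', 'wiki')
       .add_annotation('HasEntityKeyword', 'search')

hEntityDoc = {
  HasEntityType: oEntity.get_sObjectValue_for_sPREDICATENAME('HasEntityType'),
  HasEntityTitle: oEntity.get_sObjectValue_for_sPREDICATENAME('HasEntityTitle'),
  HasEntityKeywords: oEntity.get_aObjectValues_for_sPREDICATENAME('HasEntityKeyword'),
  # Synthetic field built from two other values
  HasEntityTypeAndEntityTitle: "#{oEntity.get_sObjectValue_for_sPREDICATENAME('HasEntityType')} \"#{oEntity.get_sObjectValue_for_sPREDICATENAME('HasEntityTitle')}\""
}

puts hEntityDoc[:HasEntityTypeAndEntityTitle] # prints: Subject "Main Page"
```

In this sketch a predicate without any annotation yields nil from the singular getter and an empty array from the plural one, so a field such as HasEntityKeywords is an array even when only one keyword exists.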