WIP

This part of the documentation will be converted to notebooks under tutorials.

Transformation Engine

The transformation engine is the core of NEAT. It is responsible for transforming the source knowledge graph into the target knowledge graph. The engine is implemented as a set of transformation rules (also called transformation directives), which are defined in the Properties sheet of the Rules object and executed in the order in which they appear in that sheet.

Transformation Rules Types

NEAT currently supports three types of transformation rules (transformation directives):

sparql
rdfpath
rawlookup

rdfpath directives resolve to SPARQL queries, which are executed against the RDF knowledge graph stored in the triple store configured through NeatGraphStore. NEAT has four subtypes of rdfpath directives, each used in a different situation when transforming the source knowledge graph into the target knowledge graph.

rawlookup, on the other hand, resolves to a combination of a SPARQL query against NeatGraphStore and a query against CDF RAW. This type of rule is used to enrich the target knowledge graph with information that is not present in the source knowledge graph, though it can also serve to transform non-linked data into linked data.

Before diving into the directive types and subtypes in more detail, it is worth introducing the concepts of prefixes, namespaces and entity (class, property and instance) references. The rest of this page uses the following notation:

prefix:EntityName

where prefix is the short name of the namespace in which the entity is defined and EntityName is the name of the entity. Together they form an entity reference, which is a globally unique identifier (URI) for the given entity. For example, in the entity reference cim:Substation, the prefix cim corresponds to the namespace http://iec.ch/TC57/2013/CIM-schema-cim16#, while Substation corresponds to a specific class in that namespace.

The SPARQL engine expands this short form into http://iec.ch/TC57/2013/CIM-schema-cim16#Substation. The short form is much more readable and easier to write than the full URI.
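
The expansion of a short form into a full URI can be illustrated with a few lines of Python. This is only a sketch of the idea, not NEAT code: the `NAMESPACES` table and the `expand` helper are hypothetical names, and only the cim namespace is taken from the text above.

```python
# Hypothetical prefix table; only the cim namespace comes from the text above.
NAMESPACES = {
    "cim": "http://iec.ch/TC57/2013/CIM-schema-cim16#",
}


def expand(reference: str, namespaces: dict[str, str] = NAMESPACES) -> str:
    """Expand a prefix:EntityName reference into its full URI."""
    prefix, _, name = reference.partition(":")
    if not name or prefix not in namespaces:
        raise ValueError(f"Cannot expand {reference!r}: unknown or missing prefix")
    return namespaces[prefix] + name


print(expand("cim:Substation"))
# http://iec.ch/TC57/2013/CIM-schema-cim16#Substation
```

A reference with a prefix that is not in the table raises a ValueError instead of silently producing a wrong URI.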

`sparql` rule

This is the most flexible rule type in NEAT. It contains a raw SPARQL query which is executed against the source knowledge graph (stored in NeatGraphStore). The result of the query is then used to create triples in the target knowledge graph (also stored in NeatGraphStore). You have full flexibility in defining the query, but the query must return three columns: subject, predicate and object.

It is therefore recommended to use the sparql rule type only if you are familiar with SPARQL. Use the NEAT UI to test your query and make sure it returns the expected results before you use it in a rule.
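
Because the query has to return the three columns subject, predicate and object, a quick sanity check before putting a query into a rule may be useful. The sketch below is not part of NEAT: `projected_variables` and `returns_triples` are hypothetical helpers that naively read the variables between SELECT and WHERE with a regular expression, and the sketch assumes the columns are named and ordered exactly as listed above. It does not validate SPARQL syntax and will misjudge forms such as `SELECT *` or projected expressions.

```python
import re

# An example query of the shape a sparql rule expects (illustrative only).
QUERY = """
PREFIX cim: <http://iec.ch/TC57/2013/CIM-schema-cim16#>

SELECT DISTINCT ?subject ?predicate ?object
WHERE {
    ?subject a cim:Substation .
    ?subject cim:IdentifiedObject.name ?object .
    BIND(cim:IdentifiedObject.name AS ?predicate)
}
"""


def projected_variables(query: str) -> list[str]:
    """Naively list the variable names between SELECT and WHERE."""
    match = re.search(
        r"SELECT\s+(?:DISTINCT\s+)?(.*?)\s+WHERE", query, re.IGNORECASE | re.DOTALL
    )
    return re.findall(r"\?(\w+)", match.group(1)) if match else []


def returns_triples(query: str) -> bool:
    """True when the query projects exactly subject, predicate and object."""
    return projected_variables(query) == ["subject", "predicate", "object"]


print(returns_triples(QUERY))
```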

Limitations

NEAT is not capable of detecting syntax errors in the SPARQL query.

`rdfpath` rule: SingleProperty

This query type is used to get a single property from all instances of a class. It is defined in the Excel file by the rdfpath value prefix:ClassName(prefix:PropertyName).

Let's look at one row of the Properties sheet in the Excel file:

Class	Property	...	Rule Type	Rule
Substation	mRID	...	rdfpath	cim:Substation(cim:IdentifiedObject.mRID)

This row defines a query that gets all values of the property IdentifiedObject.mRID of all Substation instances in the source graph (also known as the domain graph) and stores them as values of the property mRID of Substation instances in the target graph (also known as the solution or application graph).

Beware that the property and class references in the target graph use the target graph prefix and namespace. Both come from the Metadata sheet: the prefix field defines the prefix, and combining it with the namespace field yields the full namespace URI, such as http://purl.org/cognite/prefix#.

We omit the prefix when defining target classes and properties, so the target prefix is implicit. However, the original references of the source objects (e.g., references of substation instances) are used in the target graph. Retaining the original references, rather than renaming their namespace to the target namespace, helps with debugging and with comparing objects between graphs.

The rdfpath value cim:Substation(cim:IdentifiedObject.mRID) is converted by the transformation engine into the following SPARQL query:

PREFIX cim: <http://iec.ch/TC57/2013/CIM-schema-cim16#>

SELECT DISTINCT ?subject ?predicate ?object
    WHERE {
        ?subject a cim:Substation .
        ?subject cim:IdentifiedObject.mRID ?object .
        BIND(cim:IdentifiedObject.mRID AS ?predicate)
        }

The above query will result in triples (subject, predicate, object), where:

Subject is the reference of the instance
Predicate is the property whose value we are looking to get
Object is the value of the property

Here is an example of the result:

subject	predicate	object
cim:Substation.1	cim:IdentifiedObject.mRID	"f176964e-9aeb-11e5-91da-b8763fd99c5f"
cim:Substation.2	cim:IdentifiedObject.mRID	"b176964e-9aeb-11e5-91da-b8763fd99c5f"

As mentioned earlier, NEAT will convert the predicate to the target namespace and prefix. In this case the predicate will be http://purl.org/cognite/prefix#IdentifiedObject.mRID, or in short prefix:IdentifiedObject.mRID.
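
The translation of a SingleProperty rdfpath into SPARQL can be sketched in Python. This is a minimal illustration and not NEAT's actual implementation: the `single_property_query` function and the `SINGLE_PROPERTY` pattern are hypothetical, and the sketch assumes the rule follows the prefix:ClassName(prefix:PropertyName) form given above.

```python
import re

CIM = "http://iec.ch/TC57/2013/CIM-schema-cim16#"

# Assumed shape of a SingleProperty rule: prefix:ClassName(prefix:PropertyName)
SINGLE_PROPERTY = re.compile(r"^(?P<cls>\w+:[\w.]+)\((?P<prop>\w+:[\w.]+)\)$")


def single_property_query(rule: str, prefixes: dict[str, str]) -> str:
    """Build a SPARQL query for a SingleProperty rdfpath rule."""
    match = SINGLE_PROPERTY.match(rule.strip())
    if match is None:
        raise ValueError(f"Not a SingleProperty rdfpath: {rule!r}")
    cls, prop = match["cls"], match["prop"]
    header = "\n".join(f"PREFIX {p}: <{ns}>" for p, ns in prefixes.items())
    return (
        f"{header}\n\n"
        "SELECT DISTINCT ?subject ?predicate ?object\n"
        "WHERE {\n"
        f"    ?subject a {cls} .\n"
        f"    ?subject {prop} ?object .\n"
        # The property itself is bound as the predicate column.
        f"    BIND({prop} AS ?predicate)\n"
        "}"
    )


print(single_property_query("cim:Substation(cim:IdentifiedObject.mRID)", {"cim": CIM}))
```

The printed query has the same graph patterns as the query shown in this section; only the whitespace differs.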

`rdfpath` rule: AllReferences

This query type is used to get the references of all instances of a given class. It is defined in the Excel file by the rdfpath value prefix:ClassName. Let's look at one row of the Properties sheet in the Excel file:

Class	Property	...	Rule Type	Rule
Substation	mRID	...	rdfpath	cim:Substation

This row defines a query that gets the references of all Substation instances from the source graph (also known as the domain graph) and stores them as values of the property mRID of Substation instances in the target graph (also known as the solution or application graph). The query is defined by the rdfpath value cim:Substation.

NEAT in turn generates this SPARQL query:

PREFIX dct: <http://purl.org/dc/terms/>
PREFIX cim: <http://iec.ch/TC57/2013/CIM-schema-cim16#>

SELECT DISTINCT ?subject ?predicate ?object
    WHERE {
            ?subject a cim:Substation
                {
                BIND(?subject AS ?object)
                BIND(dct:identifier AS ?predicate)
                }
          }

Notice the BIND statements. They guarantee that the result of the query is a list of triples (subject, predicate, object), where:

Subject is the reference of the instance
Predicate is dct:identifier, which is used as a temporary predicate before it is converted to the target property
Object is also the reference of the instance

Here is an example of the result:

subject	predicate	object
cim:Substation.1	dct:identifier	cim:Substation.1
cim:Substation.2	dct:identifier	cim:Substation.2

In a follow-up step, NEAT will convert the predicate to the target namespace and prefix. In this case the predicate will be http://purl.org/cognite/prefix#IdentifiedObject.mRID, or in short prefix:IdentifiedObject.mRID.

This query type is often used as a convenience when, for example, some properties are missing in the source graph and we want to have them in the target graph. For example, in the case of a TSO customer, a number of class instances were missing the mRID property. We created them in the target graph using the rule above, additionally specifying that the namespace should be dropped, which converts the URI to a literal value. For example, cim:Substation.1 is converted to "Substation.1".
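
Dropping the namespace amounts to keeping only the local part of the URI. The following Python sketch illustrates the idea under the assumption that the local part starts after the last `#` or `/`; `drop_namespace` is a hypothetical helper, not a NEAT function, and the sample rows assume that the instances live in the cim namespace.

```python
CIM = "http://iec.ch/TC57/2013/CIM-schema-cim16#"


def drop_namespace(uri: str) -> str:
    """Return the part of a URI after the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[cut + 1:]


# AllReferences yields the instance reference as both subject and object.
rows = [
    (CIM + "Substation.1", CIM + "Substation.1"),
    (CIM + "Substation.2", CIM + "Substation.2"),
]

# Keep the subject as a URI, turn the object into a plain literal value.
literals = [(subject, drop_namespace(obj)) for subject, obj in rows]
print(literals[0][1])
# Substation.1
```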

`rdfpath` rule: AllProperties

This query type is used to get all properties of all instances of a given class. It is defined in the Excel file by the rdfpath value prefix:ClassName(*). Let's look at one row of the Properties sheet in the Excel file:

Class	Property	...	Rule Type	Rule
Substation	*	...	rdfpath	cim:Substation(*)

This row defines a query that gets all properties of all Substation instances. The query is defined by the rdfpath value cim:Substation(*), where the * character is a wildcard meaning that we want all properties of the instances of the given class.

NEAT in turn generates this SPARQL query:

PREFIX cim: <http://iec.ch/TC57/2013/CIM-schema-cim16#>

SELECT DISTINCT ?subject ?predicate ?object
    WHERE {
        ?subject a cim:Substation .
        ?subject ?predicate ?object .
        }

The above query will result in triples (subject, predicate, object) that define all Substation instances. Here is an example of the result:

subject	predicate	object
cim:Substation.1	rdf:type	cim:Substation
cim:Substation.1	cim:Substation.Region	cim:SubGeographicalRegion.1
cim:Substation.1	cim:Substation.EquipmentContainer	cim:VoltageLevel.1
cim:Substation.1	cim:IdentifiedObject.name	"Substation 1"
cim:Substation.1	cim:IdentifiedObject.description	"Substation 1"
cim:Substation.1	cim:IdentifiedObject.mRID	"Substation.1"
cim:Substation.2	rdf:type	cim:Substation
cim:Substation.2	cim:Substation.Region	cim:SubGeographicalRegion.1
cim:Substation.2	cim:Substation.EquipmentContainer	cim:VoltageLevel.2
. . .	. . .	. . .

Warning

The AllProperties rdfpath rule and the corresponding query are discouraged, since they can result in a huge amount of data. Also, since we do not control which property each value lands in within the solution graph, the rule can produce many duplicates. In addition, the original property references are not converted to the target namespace and prefix. For example, cim:Substation.Region is not converted to prefix:Substation.Region.

`rdfpath` rule: Hop

This query type is used to traverse (aka hop) the source graph and get desired class instance references or their properties.

Hop which fetches all references

In the most typical scenario, we extract the desired class instances from the source graph and store them in the target graph under a new property that did not exist in the source graph. By doing this we "shorten" the path and the query time in the target graph. This process also "flattens" the target graph in comparison to the source graph.

Let's look at the picture below to see how this works:

In the above picture we are hopping (i.e., traversing the graph) from Terminal to Substation via the intermediate nodes ConnectivityNode and VoltageLevel. We want to extract the connections between terminals and substations and make them directly connected in the target graph. As the picture shows, our graph is directed, so there are properties which connect:

Terminal to ConnectivityNode
ConnectivityNode to VoltageLevel
VoltageLevel to Substation

Accordingly, the rdfpath rule defines the hops, as shown in this row of the Properties sheet in the Excel file:

Class	Property	...	Rule Type	Rule
Terminal	Terminal.Substation	...	rdfpath	cim:Terminal->cim:ConnectivityNode->cim:VoltageLevel->cim:Substation

Notice that the arrows "->" indicate the direction in which the nodes are connected. Besides generating the SPARQL query, NEAT uses these arrows to find and insert the connecting properties for us (so we do not need to know them by heart).

The resulting SPARQL query for the above rule looks like this:

PREFIX dct: <http://purl.org/dc/terms/>
PREFIX cim: <http://iec.ch/TC57/2013/CIM-schema-cim16#>

SELECT DISTINCT ?subject ?predicate ?object
    WHERE { ?subject a cim:Terminal .
            ?subject cim:Terminal.ConnectivityNode ?ConnectivityNodeID .
            ?ConnectivityNodeID cim:ConnectivityNode.ConnectivityNodeContainer ?VoltageLevelID .
            ?VoltageLevelID cim:VoltageLevel.Substation ?object .
            ?object a cim:Substation .
            BIND(dct:relation AS ?predicate) }

The above query will result in triples (subject, predicate, object) that define all Terminal instances and their Substation references. Here is an example of the result:

subject	predicate	object
cim:Terminal.1	dct:relation	cim:Substation.1
cim:Terminal.2	dct:relation	cim:Substation.1
cim:Terminal.3	dct:relation	cim:Substation.2
cim:Terminal.4	dct:relation	cim:Substation.2
...	...	...

Here again, as in the case of AllReferences, a temporary predicate is used, this time dct:relation. In the downstream processing, NEAT converts the predicate to the target property Terminal.Substation with the corresponding namespace and prefix. In this case the predicate will be http://purl.org/cognite/prefix#Terminal.Substation, or in short prefix:Terminal.Substation.

The hop rdfpath can be bidirectional, for example when we want to check which ACLineSegments the Substations are connected to:

cim:Substation<-cim:VoltageLevel<-cim:ConnectivityNode<-cim:Terminal->cim:ACLineSegment

which results in the following SPARQL query:

PREFIX dct: <http://purl.org/dc/terms/>
PREFIX cim: <http://iec.ch/TC57/2013/CIM-schema-cim16#>

SELECT DISTINCT ?subject ?predicate ?object WHERE {
    ?subject a cim:Substation .
    ?VoltageLevelID cim:VoltageLevel.Substation ?subject .
    ?ConnectivityNodeID cim:ConnectivityNode.ConnectivityNodeContainer ?VoltageLevelID .
    ?TerminalID cim:Terminal.ConnectivityNode ?ConnectivityNodeID .
    ?TerminalID cim:Terminal.ConductingEquipment ?object .
    ?object a cim:ACLineSegment .
    BIND(dct:relation AS ?predicate) }
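
How a hop path could be turned into graph patterns is sketched below in Python. This is an illustration under stated assumptions, not NEAT's implementation: `hop_query` is a hypothetical function, and the `PROPERTIES` table is hard-coded here from the queries in this section, whereas NEAT finds the connecting properties itself. The sketch names the first class `?subject`, the last class `?object` and every intermediate class `?<ClassName>ID`, and it omits the PREFIX declarations.

```python
import re

# Hypothetical lookup of the property that points from one class to another.
PROPERTIES = {
    ("cim:Terminal", "cim:ConnectivityNode"): "cim:Terminal.ConnectivityNode",
    ("cim:ConnectivityNode", "cim:VoltageLevel"): "cim:ConnectivityNode.ConnectivityNodeContainer",
    ("cim:VoltageLevel", "cim:Substation"): "cim:VoltageLevel.Substation",
    ("cim:Terminal", "cim:ACLineSegment"): "cim:Terminal.ConductingEquipment",
}


def hop_query(path: str, properties: dict[tuple[str, str], str]) -> str:
    """Build the SELECT query (without PREFIX lines) for a hop rdfpath."""
    parts = re.split(r"(->|<-)", path.replace(" ", ""))
    classes, arrows = parts[0::2], parts[1::2]
    if len(classes) < 2:
        raise ValueError(f"A hop needs at least two classes: {path!r}")
    middle = [f"?{c.split(':')[1]}ID" for c in classes[1:-1]]
    variables = ["?subject", *middle, "?object"]
    lines = [f"?subject a {classes[0]} ."]
    for i, arrow in enumerate(arrows):
        # "->" means the left node points to the right node, "<-" the reverse.
        src, dst = (i, i + 1) if arrow == "->" else (i + 1, i)
        prop = properties[(classes[src], classes[dst])]
        lines.append(f"{variables[src]} {prop} {variables[dst]} .")
    lines.append(f"?object a {classes[-1]} .")
    lines.append("BIND(dct:relation AS ?predicate)")
    body = "\n".join("    " + line for line in lines)
    return "SELECT DISTINCT ?subject ?predicate ?object\nWHERE {\n" + body + "\n}"


print(hop_query("cim:Terminal->cim:ConnectivityNode->cim:VoltageLevel->cim:Substation", PROPERTIES))
```

For the Terminal-to-Substation rule, the sketch produces the same graph patterns as the first hop query in this section. A "<-" arrow only swaps which side of the triple pattern each variable appears on.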

Hop which fetches single property

The above hop rule can be extended to fetch a specific property, for example cim:IdentifiedObject.name:

cim:Substation<-cim:VoltageLevel<-cim:ConnectivityNode<-cim:Terminal->cim:ACLineSegment(cim:IdentifiedObject.name)

which results in the following SPARQL query:

PREFIX dct: <http://purl.org/dc/terms/>
PREFIX cim: <http://iec.ch/TC57/2013/CIM-schema-cim16#>

SELECT DISTINCT ?subject ?predicate ?object WHERE {
    ?subject a cim:Substation .
    ?VoltageLevelID cim:VoltageLevel.Substation ?subject .
    ?ConnectivityNodeID cim:ConnectivityNode.ConnectivityNodeContainer ?VoltageLevelID .
    ?TerminalID cim:Terminal.ConnectivityNode ?ConnectivityNodeID .
    ?TerminalID cim:Terminal.ConductingEquipment ?ACLineSegmentID .
    ?ACLineSegmentID a cim:ACLineSegment .
    ?ACLineSegmentID cim:IdentifiedObject.name ?object .
    BIND(cim:IdentifiedObject.name AS ?predicate) }
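
The difference between the two hop variants lies only in the optional property in parentheses and in the closing patterns of the query. The Python sketch below illustrates this; `split_hop` and `closing_patterns` are hypothetical helpers written for this page, not NEAT functions.

```python
import re
from typing import Optional

# A hop path, optionally followed by a single property in parentheses.
HOP_RULE = re.compile(r"(?P<path>[^()]+)(?:\((?P<prop>[^()]+)\))?")


def split_hop(rule: str) -> tuple[str, Optional[str]]:
    """Split a hop rdfpath into its path and its optional property."""
    match = HOP_RULE.fullmatch(rule.strip())
    if match is None:
        raise ValueError(f"Not a hop rdfpath: {rule!r}")
    return match["path"], match["prop"]


def closing_patterns(last_class: str, prop: Optional[str]) -> list[str]:
    """Return the final patterns of the query for the last class in the path."""
    if prop is None:
        # Reference hop: the last node itself is the object.
        return [f"?object a {last_class} .", "BIND(dct:relation AS ?predicate)"]
    node = f"?{last_class.split(':')[-1]}ID"
    # Property hop: the property value of the last node is the object.
    return [
        f"{node} a {last_class} .",
        f"{node} {prop} ?object .",
        f"BIND({prop} AS ?predicate)",
    ]


rule = (
    "cim:Substation<-cim:VoltageLevel<-cim:ConnectivityNode"
    "<-cim:Terminal->cim:ACLineSegment(cim:IdentifiedObject.name)"
)
path, prop = split_hop(rule)
last_class = re.split(r"->|<-", path)[-1]
print("\n".join(closing_patterns(last_class, prop)))
```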

Limitation

Hops which fetch all the properties are not yet supported in NEAT.
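
To summarise the four rdfpath subtypes, the sketch below tells them apart by the shape of the rule string. It is an illustration only: `classify` is a hypothetical helper, the regular expressions assume that every class and property is written as prefix:Name, and NEAT's own parsing may differ.

```python
import re

# Assumed shape of an entity reference: prefix:Name (the name may contain dots).
ENTITY = r"\w+:[\w.]+"


def classify(rule: str) -> str:
    """Return the rdfpath subtype that a rule string appears to belong to."""
    rule = rule.replace(" ", "")
    if "->" in rule or "<-" in rule:
        if rule.endswith("(*)"):
            raise NotImplementedError("Hops which fetch all properties are not supported")
        return "Hop"
    if re.fullmatch(rf"{ENTITY}\(\*\)", rule):
        return "AllProperties"
    if re.fullmatch(rf"{ENTITY}\({ENTITY}\)", rule):
        return "SingleProperty"
    if re.fullmatch(ENTITY, rule):
        return "AllReferences"
    raise ValueError(f"Unrecognised rdfpath: {rule!r}")


for example in (
    "cim:Substation(cim:IdentifiedObject.mRID)",
    "cim:Substation",
    "cim:Substation(*)",
    "cim:Terminal->cim:ConnectivityNode->cim:VoltageLevel->cim:Substation",
):
    print(example, "=>", classify(example))
```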
