New CI Tests on DBpedia Releases

Hi all,
we looked at the CI tests in the framework: https://github.com/dbpedia/extraction-framework/tree/master/core/src/main/scala/org/dbpedia

Since we now publish releases more frequently at http://databus.dbpedia.org/marvin/, we devised a new testing methodology that works on artifacts and checks subjects, predicates and objects (S, P, O) individually.
We published a draft vocabulary that uses triggers and validators for IRIs: https://github.com/dbpedia/extraction-framework/blob/master/new_release_based_ci_tests_draft.ttl

Coverage is defined as:
C_s := subjects with at least one trigger / all subjects
C_p := predicates with at least one trigger / all predicates
C_o := objects with at least one trigger / all objects
C_total := AVG(C_s, C_p, C_o)

This lets us measure each part of a triple individually and more systematically.
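
As a rough illustration of the definition above, the coverage arithmetic could be sketched in Python as follows. The function names and the counts are made up for the example, and treating a position without any IRIs as coverage 0.0 is an assumption, not something the draft vocabulary prescribes.

```python
def coverage(triggered, total):
    """Share of IRIs matched by at least one trigger; 0.0 if there are no IRIs (assumption)."""
    return triggered / total if total else 0.0


def total_coverage(c_s, c_p, c_o):
    """C_total: plain average of the subject, predicate and object coverage."""
    return (c_s + c_p + c_o) / 3


# Hypothetical counts, for illustration only.
c_s = coverage(8, 10)   # 8 of 10 subjects have a trigger
c_p = coverage(3, 4)    # 3 of 4 predicates have a trigger
c_o = coverage(0, 0)    # no object IRIs at all, e.g. a literals-only artifact
c_total = total_coverage(c_s, c_p, c_o)

print(c_s, c_p, c_o, round(c_total, 4))
```

With these counts the three coverages are 0.8, 0.75 and 0.0, so a position without any IRIs pulls the average down.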
The first test we will implement is:

v:doesNotContainCharacters "&" , " ", "?", "\"", "'" .
It applies to dbpedia.org/resource IRIs.
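
A minimal Python sketch of what this first test checks, assuming a plain substring check on the whole IRI and the http://dbpedia.org/resource/ prefix as the trigger. The name `forbidden_characters` is hypothetical and this is not the framework's implementation.

```python
# Trigger: only IRIs in the DBpedia resource namespace are tested (assumed prefix).
RESOURCE_PREFIX = "http://dbpedia.org/resource/"
# Characters listed in the v:doesNotContainCharacters statement.
FORBIDDEN = ["&", " ", "?", '"', "'"]


def forbidden_characters(iri):
    """Return the forbidden characters found in a resource IRI.

    IRIs outside the resource namespace are not triggered and yield an empty list.
    """
    if not iri.startswith(RESOURCE_PREFIX):
        return []
    return [ch for ch in FORBIDDEN if ch in iri]


print(forbidden_characters("http://dbpedia.org/resource/Berlin"))
print(forbidden_characters("http://dbpedia.org/resource/Foo Bar?"))
```

An empty list means the IRI passes; a non-empty list names the characters that make it fail.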

Hi, I wrote a first IRI test prototype using Apache Spark, based on the test case file posted earlier.

It implements the regex-based trigger functions. On the validator side, the v:regexPattern and v:doesNotContainChar statements work, but v:basedOnVocab is still missing, so DBpedia ontology properties don't get validated correctly yet.
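
To make the interplay of triggers and validators more concrete, here is a small plain-Python sketch (not Spark, and not the prototype's code) that derives all, trg and vld counts for a list of IRIs. The two test cases, the `count_iris` name and the choice to count distinct IRIs are assumptions for illustration. An IRI only counts as validated if at least one validator is attached to a matching trigger, so a trigger without a validator still raises coverage but adds nothing to vld.

```python
import re

# Hypothetical test cases: (trigger regex, validator function or None).
TEST_CASES = [
    # Resource IRIs must not contain a space, "?" or a double quote.
    (re.compile(r"^http://dbpedia\.org/resource/"),
     lambda iri: not any(ch in iri for ch in ' ?"')),
    # Ontology IRIs are triggered, but no validator is attached yet.
    (re.compile(r"^http://dbpedia\.org/ontology/"), None),
]


def count_iris(iris):
    """Count distinct IRIs (all), triggered IRIs (trg) and validated IRIs (vld)."""
    distinct = set(iris)
    triggered = 0
    valid = 0
    for iri in distinct:
        matching = [validator for trigger, validator in TEST_CASES if trigger.search(iri)]
        if not matching:
            continue  # no trigger fired, the IRI is not covered
        triggered += 1
        validators = [v for v in matching if v is not None]
        if validators and all(v(iri) for v in validators):
            valid += 1
    return {"all": len(distinct), "trg": triggered, "vld": valid}


iris = [
    "http://dbpedia.org/resource/Berlin",
    "http://dbpedia.org/resource/Foo Bar",
    "http://dbpedia.org/ontology/birthPlace",
    "http://example.org/x",
    "http://dbpedia.org/resource/Berlin",
]
print(count_iris(iris))
```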


Here are some first coverage results for selected artifacts of the mappings and wikidata groups (all: IRIs at that position, trg: IRIs with at least one trigger, vld: validated IRIs).

mappings/geo-coordinates-mappingbased/2019.07.01
C_s: 1.0 all: 2296717 trg: 2296717 vld: 2255526
C_p: 0.6923077 all: 13 trg: 9 vld: 0
C_o: 0.9999836 all: 61031 trg: 61030 vld: 60853
C_T: 0.8974304
mappings/instance-types/2019.07.01
C_s: 1.0 all: 75239131 trg: 75239131 vld: 72782555
C_p: 0.0 all: 1 trg: 0 vld: 0
C_o: 0.9292124 all: 1003 trg: 932 vld: 294
C_T: 0.64307076
mappings/mappingbased-literals/2019.07.01
C_s: 1.0 all: 67051956 trg: 67051956 vld: 64892384
C_p: 0.98234 all: 1359 trg: 1335 vld: 0
C_o: 0.0 all: 0 trg: 0 vld: 0
C_T: 0.66078
mappings/mappingbased-objects-uncleaned/2019.07.01
C_s: 0.99999994 all: 52245414 trg: 52245414 vld: 50500891
C_p: 0.98216057 all: 1009 trg: 991 vld: 0
C_o: 0.23599745 all: 30796889 trg: 7267987 vld: 7130245
C_T: 0.739386
mappings/specific-mappingbased-properties/2019.07.01
C_s: 1.0 all: 1684622 trg: 1684622 vld: 1648354
C_p: 1.0 all: 80 trg: 80 vld: 0
C_o: 0.0 all: 0 trg: 0 vld: 0
C_T: 0.6666667

wikidata/geo-coordinates/2019.07.01
C_s: 1.0 all: 7244176 trg: 7244176 vld: 0
C_p: 0.0 all: 4 trg: 0 vld: 0
C_o: 0.0 all: 1 trg: 0 vld: 0
C_T: 0.33333334
wikidata/instance-types/2019.07.01
C_s: 1.0 all: 16355016 trg: 16355016 vld: 0
C_p: 0.0 all: 1 trg: 0 vld: 0
C_o: 0.93421054 all: 684 trg: 639 vld: 296
C_T: 0.6447368
wikidata/mappingbased-literals/2019.07.01
C_s: 1.0 all: 31073498 trg: 31073498 vld: 0
C_p: 1.0 all: 47 trg: 47 vld: 0
C_o: 0.0 all: 0 trg: 0 vld: 0
C_T: 0.6666667
wikidata/mappingbased-objects-uncleaned/2019.07.01
C_s: 1.0 all: 46739096 trg: 46739096 vld: 0
C_p: 0.9882353 all: 85 trg: 84 vld: 0
C_o: 0.036438867 all: 80713652 trg: 2941114 vld: 0
C_T: 0.6748914
wikidata/sameas-all-wikis/2019.07.01
C_s: 0.99999994 all: 24726699 trg: 24726699 vld: 0
C_p: 0.0 all: 1 trg: 0 vld: 0
C_o: 0.98954886 all: 65120906 trg: 64440318 vld: 63382822
C_T: 0.6631829

After testing I realised that no validator was assigned to “trigger:wikidata_dbpedia_extraction”.

Hi all,
the CI tests are working now.

Run on Minidump

git clone https://github.com/dbpedia/extraction-framework.git
cd extraction-framework/dump
mvn test

Configure tests

The tests are configured in this file: https://github.com/dbpedia/extraction-framework/blob/master/dump/src/test/resources/dbpedia-specific-ci-tests.ttl
It currently contains:

validator:dissallowed_chars
	a v:IRI_Validator ;
	rdfs:comment """Disallowed in URIs, cf. https://www.ietf.org/rfc/rfc3987.txt: Systems accepting IRIs MAY also deal with the printable characters in US-ASCII that are not allowed in URIs, namely "<", ">", '"', space, "{", "}", "|", "\\", "^", and "`", in step 2 above. If these characters are found but are not converted, then the conversion SHOULD fail. Please note that the number sign ("#"), the percent sign ("%"), and the square bracket characters ("[", "]") are not part of the above list and MUST NOT be converted.""" ;
	v:doesNotContain "<" , ">", "\"" , " ", "{", "}", "|", "\\", "^" , "`" .

validator:dbpedia_resource_delims
	a v:IRI_Validator ;
	rdfs:comment """
	1. gen-delims are not allowed, except ":" and "@" per rfc3987
	"ipchar = iunreserved / pct-encoded / sub-delims / ":" / "@" "
	2. sub-delims are allowed:
	These are allowed in DBpedia URIs, so we check that they are not encoded
	sub-delims  =  "%21", "%24", "%26", "%27", "%28", "%29", "%2A", "%2B", "%2C", "%3B", "%3D"
	sub-delims = "!", "$", "&", "'", "(", ")", "*", "+", ",", ";", "="
	reserved gen-delims from above """ ;
	v:doesNotContain  "?", "#", "[", "]" ;
	v:doesNotContain  "%21", "%24", "%26", "%27", "%28", "%29", "%2A", "%2B", "%2C", "%3B", "%3D" .
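
The two validators above boil down to substring checks. The following Python sketch mirrors their token lists and reports which validator an IRI fails. The `failed_validators` name is hypothetical, the check is case-sensitive (so a lowercase "%2a" would slip through), and for brevity it applies both validators to every IRI instead of selecting them through triggers.

```python
# Token lists copied from the v:doesNotContain statements of the two validators.
VALIDATORS = {
    "dissallowed_chars": ["<", ">", '"', " ", "{", "}", "|", "\\", "^", "`"],
    "dbpedia_resource_delims": [
        "?", "#", "[", "]",
        "%21", "%24", "%26", "%27", "%28", "%29",
        "%2A", "%2B", "%2C", "%3B", "%3D",
    ],
}


def failed_validators(iri):
    """Map each failed validator name to the offending tokens found in the IRI."""
    failures = {}
    for name, tokens in VALIDATORS.items():
        found = [tok for tok in tokens if tok in iri]
        if found:
            failures[name] = found
    return failures


print(failed_validators("http://dbpedia.org/resource/AT&T"))
print(failed_validators("http://dbpedia.org/resource/AT%26T"))
print(failed_validators("http://dbpedia.org/resource/A|B"))
```

The literal "&" passes while its encoded form "%26" is reported, which is the behaviour the comment of the second validator describes.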

Extend the minidump

Add more Wikipedia articles to the minidump here:

Results


Cov_s: 1.0 ( 5 triggered of 5 total ), Success_rate_s: 1.0 ( 5 )
Cov_p: 1.0 ( 33 triggered of 33 total ), Success_rate_p: 1.0 ( 33 )
Cov_o: 0.9464286 ( 106 triggered of 112 total ), Success_rate_o: 1.0 ( 106 )
Cov:   0.98214287

This will put an end to faulty URIs and datatypes.