Some of my smart colleagues at Google have run a few heuristics and algorithms to discover Wikipedia articles in different languages that cover the same topic but are missing language links between them. The results contain more than 35,000 missing links that these algorithms rate as high confidence. We estimate a precision of at least 92% (i.e. based on our evaluation, we assume that fewer than 8% of those are wrong). The dataset covers 60 Wikipedia language editions.
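To put the precision estimate in perspective, here is a rough back-of-the-envelope sketch. It assumes exactly 35,000 links and an error rate of exactly 8%, which are the bounds quoted above rather than measured values; the variable names are only for illustration.

```python
# Rough bounds implied by the figures above (assumed: exactly 35,000 links, 8% error).
links = 35_000
max_error_pct = 8

# Integer arithmetic keeps the result exact.
wrong_upper_bound = links * max_error_pct // 100
correct_lower_bound = links - wrong_upper_bound

print(f"at most about {wrong_upper_bound} wrong links")
print(f"at least about {correct_lower_bound} correct links")
```

So even under the pessimistic bound, human verification would only need to weed out a few thousand wrong pairs.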
Here are the missing links, available for download from the WMF labs servers:
The data is published under CC-0.
What can you do with the data? Since it is CC-0, you can do anything you want, obviously, but here are a few suggestions:
There’s a small tool on WMF labs that you can use to verify the links (it displays the articles side by side from a language pair you select, and then you can confirm or contradict the merge):
The tool does not make the change in Wikidata itself, though (we thought it would be too invasive if we did that). Instead, the results of the human evaluation are saved on WMF labs. You are welcome to take the tool and extend it so that it uploads changes directly to Wikidata, if you so wish, or to upload the results once the data is verified.
Also, Magnus Manske is already busy uploading the data to the Wikidata game, so you can very soon also play the merge game on the data directly. He is also creating the missing items on Wikidata. Thanks Magnus for a very pleasant cooperation!
I want to call out to my colleagues at Google who created the dataset - Jiang Bian and Si Li - and to Yicheng Huang, the intern who developed the tool on labs.
I hope that this small data release can help a little with further improving the quality of Wikidata and Wikipedia! Thank you all, you are awesome!
But my 20% time is geared towards that :)
So we've decided to help transfer the data in Freebase to Wikidata, and in mid-2015 we’ll wind down the Freebase service as a standalone project. Freebase has also supported developer access to the data, so before we retire it, we’ll launch a new API for entity search powered by Google's Knowledge Graph.
Loading Freebase into Wikidata as-is wouldn't meet the Wikidata community's guidelines for citation and sourcing of facts -- while a significant portion of the facts in Freebase came from Wikipedia itself, those facts were attributed to Wikipedia and not the actual original non-Wikipedia sources. So we’ll be launching a tool for Wikidata community members to match Freebase assertions to potential citations from either Google Search or our Knowledge Vault, so these individual facts can then be properly loaded to Wikidata.
We believe this is the best first step we can take toward becoming a constructive participant in the Wikidata community, but we’ll look to continually evolve our role to support the goal of a comprehensive open database of common knowledge that anyone can use.
Here are the important dates to know:
Before the end of March 2015
- We’ll launch a Wikidata import review tool
- We’ll announce a transition plan for the Freebase Search API & Suggest Widget to a Knowledge Graph-based solution
March 31, 2015
- Freebase as a service will become read-only
- The website will no longer accept edits
- We’ll retire the MQL write API
June 30, 2015
- We’ll retire the Freebase website and APIs
- The last Freebase data dump will remain available, but developers should check out the Wikidata dump
The Knowledge Graph team at Google
'Sex' is a property used for items about people or animals to specify the sex of the subject of an item. In most cases it is 'sex:male' or 'sex:female' but there can be other values.
The first question that arose was whether 'man' and 'woman' would be better values than 'male' and 'female'. That was resolved by noting that this property is intended to be used for animals other than humans as well - there are Wikipedia articles on a number of racehorses, for instance. It was agreed that men would be 'instance of (P31): human (Q5)' and 'sex (P21): male (Q6581097)', and similarly for females.
The next issue that arose was that in many languages there is a sharp distinction between male humans and male animals, with different words used for the different concepts. We considered creating a different property for the sex of animals, but in the end we kept P21 the same and added two new items that P21 could link to - male animal (Q44148) and female animal (Q43445).
If you look at the labels in other languages you can see the difference:
English: male and male animal
German: männlich and männliches Geschlecht
Spanish: masculino and macho
French: masculin and mâle
Dutch: man and mannelijk
The next issue that came up was highlighted by the Chelsea Manning affair. How are we to describe the sex of Chelsea Manning?
I will summarise the arguments raised below, together with the responses to these arguments:
Sex and Gender are different things. Combining them in one property is a poor design choice. We should have separate properties for these.
Response: There are millions of humans on Wikidata, and for most of them we are judging their sex from their names and how they appeared to whatever sources wrote about them - in other words, their gender presentation. There are very few people for whom we have reliable information on their genitalia, much less their chromosomes. If we have two different specific properties, then we need specific information to use them, which means that in practice we will not be able to use these properties on most humans. Better to have an ambiguous property that reflects the information we have.
Response: We don't have to leave it blank. We can use the information we have, based on names etc. and we will be right most of the time - there aren't that many transgender people anyway and there are even fewer clandestine transgender people.
Response: There are a lot of people on Wikidata. If we are wrong 1% of the time that is thousands of people. We should find a way to express this that reflects our lack of specific information.
The conclusion was to have one ambiguous property and use qualifiers in those specific cases where we have more information. We changed the English label for this property from 'Sex' to 'sex or gender' with aliases 'gender identity', 'gender expression', 'gender', 'biological sex', 'man', 'woman', 'male', 'female', 'intersex', 'sex'. The English description was changed to "male (Q6581097), female (Q6581072), intersex (Q1097630), transgender female (Q1052281), transgender male (Q2449503), genderqueer (Q48270); for animals use male animal (Q44148) or female animal (Q43445). Add qualifiers as appropriate."
'Qualifiers' as mentioned here refers to a feature of Wikidata. Each statement on Wikidata can have qualifiers that give additional information. For example, 'sex or gender:male' can have the qualifier 'end date:3 June 2012', meaning that the statement 'sex or gender:male' is only true up to that date.
As well as qualifiers, Wikidata also allows a property to have more than one value, each with its own qualifiers. So we can have:
'sex or gender:male', 'end date:3 June 2012'
and 'sex or gender:transgender female', 'start date:4 June 2012'
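To make the idea concrete, here is a minimal sketch of how several values with date qualifiers could be modelled and queried. This is not the Wikidata API or its actual data model: the `Statement` class and the `holds_on` and `values_on` helpers are hypothetical names made up for this illustration, and the dates are simply the example dates used above.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Statement:
    """One property value with optional start/end date qualifiers (hypothetical model)."""
    prop: str
    value: str
    start: Optional[date] = None  # 'start date' qualifier, if any
    end: Optional[date] = None    # 'end date' qualifier, if any

    def holds_on(self, day: date) -> bool:
        """True if the statement is valid on the given day (bounds inclusive)."""
        if self.start is not None and day < self.start:
            return False
        if self.end is not None and day > self.end:
            return False
        return True


def values_on(statements: List[Statement], prop: str, day: date) -> List[str]:
    """All values of a property that hold on the given day."""
    return [s.value for s in statements if s.prop == prop and s.holds_on(day)]


# The two example statements from the text above.
statements = [
    Statement("sex or gender", "male", end=date(2012, 6, 3)),
    Statement("sex or gender", "transgender female", start=date(2012, 6, 4)),
]

print(values_on(statements, "sex or gender", date(2010, 1, 1)))
print(values_on(statements, "sex or gender", date(2013, 1, 1)))
```

Both values stay on the item; the qualifiers decide which one applies at a given point in time, so nothing has to be overwritten or deleted.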
This blog post just gives a brief and personal viewpoint of the discussions on the talk page for property P21. For the whole discussion see https://www.wikidata.org/wiki/Property_talk:P21
For the ontological side of this topic, see "Representing the Reality Underlying Demographic Data" http://ceur-ws.org/Vol-833/paper20.pdf and "Gender versus sex – why you should observe the difference" http://www.dorisandbertie.com/goodcopybadcopy/2009/05/18/gender-versus-sex-%E2%80%93-why-you-should-observe-the-difference/
When it comes to how to represent this, I think the Wikidata model, with qualifiers for properties and the ability for a property to have more than one value, is super powerful. It is also a model I think we should consider for clinical trial data and health care data standards, especially given the RDF representation of the model described in the great paper by you and co-authors http://korrekt.org/papers/Wikidata-RDF-export-2014.pdf