I have a ridiculous problem and I'm curious to see how others might solve it.

I have a number of documents in a language that I do not know. (I do know the word order and text orientation of that language, so I can treat each document as an ordered list of words whose meanings I happen not to know.) I also lack the ability to translate these documents into a language I do know. I need to do my best (knowing that I probably won't get it exactly right) to arrange the documents in an order that makes sense, so that someone who does know the language can make the most sense of them when reading. That is, to the extent that some knowledge relies on other knowledge as a prerequisite, I want to present my foreign-language client with a list that puts the documents about simpler concepts first, followed by documents that build on those, etc.

How do you do it?
 
Outsource it cheaply to someone who can read them?
 
In this post, "I" am a computer, and the foreign language is English. If this process could be automated, then human users could be offered personalized document sets for whatever they intend to learn about. But if I have enough users learning about the same sorts of things, I could get data from them, so there is a way in which your answer could work.
 
I don't personally see how you can put something in any reasonable order if you have zero understanding of it. I'm all for hearing the answer, though.
 
I don't have "the" answer, but I have a possible (dumb) method.  What I really hope to do is gather a few different methods and see whether I can combine them into something that works better than any of them individually.

Here's my silly method: Start by trying to determine the relationships between the words. If (within any given document) a word X tends to appear before a word Y, we'll pretend that X is a prerequisite for understanding Y. To be a little more precise: we take the probability that X is a prerequisite for Y to be the probability that a randomly selected occurrence of X lies before a randomly selected occurrence of Y.
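A minimal sketch of that per-document estimate, in Python. The function names are made up for illustration, and it assumes the document has already been split into a list of words; it just counts the fraction of (X occurrence, Y occurrence) pairs in which X comes first.

```python
from bisect import bisect_left
from collections import defaultdict


def word_positions(words):
    """Map each word to the sorted list of positions where it occurs."""
    positions = defaultdict(list)
    for index, word in enumerate(words):
        positions[word].append(index)
    return positions


def precedence_probability(words, x, y):
    """Probability that a random occurrence of x lies before a random occurrence of y.

    Returns None when either word is missing from the document.
    """
    positions = word_positions(words)
    xs, ys = positions.get(x, []), positions.get(y, [])
    if not xs or not ys:
        return None
    # For each occurrence of y, count the occurrences of x strictly before it.
    before = sum(bisect_left(xs, j) for j in ys)
    return before / (len(xs) * len(ys))


doc = "add add multiply add multiply".split()
print(precedence_probability(doc, "add", "multiply"))
```

For two different words the two directions always sum to 1, so a value near 0.5 means the document says nothing useful about that pair.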

For each document, we run this process, trying to find what that document implies about word relationships.  We combine the resulting probabilities, maybe weighting them by document length (presumably the longer documents will give better data).

So this would hopefully tell you that, e.g., the word "add" is highly likely to be a prerequisite for the word "multiply".
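One possible way to do the combining step, sketched in Python. The weighting here (a plain average weighted by document length, skipping documents that lack either word) is only one reading of "weighting them by document length", and the function names are hypothetical.

```python
def precedence_probability(words, x, y):
    """Fraction of (x occurrence, y occurrence) pairs in which x comes first."""
    xs = [i for i, w in enumerate(words) if w == x]
    ys = [i for i, w in enumerate(words) if w == y]
    if not xs or not ys:
        return None
    before = sum(1 for i in xs for j in ys if i < j)
    return before / (len(xs) * len(ys))


def combined_precedence(documents, x, y):
    """Length-weighted average of the per-document precedence probabilities.

    Documents that do not contain both words contribute nothing.
    Returns None when no document contains both words.
    """
    weighted_sum = 0.0
    total_weight = 0
    for words in documents:
        p = precedence_probability(words, x, y)
        if p is None:
            continue
        weighted_sum += len(words) * p
        total_weight += len(words)
    if total_weight == 0:
        return None
    return weighted_sum / total_weight


documents = [
    ["add", "multiply"],
    ["multiply", "add", "add", "multiply"],
    ["divide"],
]
print(combined_precedence(documents, "add", "multiply"))
```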

Then, we try to determine the relationships between documents, using what we know about the words. If you choose a word from document A and a word from document B, and the A word is a prerequisite for the B word, we'll treat that as evidence that document A is a prerequisite for document B. We don't choose these words at random, because most of the time such a process would give you uninformative pairs (pairs where neither word is likely to be a prerequisite for the other). Instead, you shoot for distinctive words from each document: words that occur frequently in the document compared to how frequently they occur in English.
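A rough sketch of this document-level step in Python. Several things here are assumptions rather than part of the method as described: `background_freq` is a hypothetical table of general English word frequencies, `word_precedence` is a hypothetical dictionary of word-pair probabilities from the first phase, distinctiveness is a simple frequency ratio, and the pairwise evidence is combined by plain averaging.

```python
from collections import Counter


def distinctive_words(words, background_freq, k=3, floor=1e-6):
    """Return the k words most over-represented relative to background English.

    Words missing from background_freq are treated as very rare (floor).
    """
    counts = Counter(words)
    total = len(words)

    def ratio(word):
        return (counts[word] / total) / background_freq.get(word, floor)

    # Sort by descending ratio, breaking ties alphabetically for determinism.
    return sorted(counts, key=lambda w: (-ratio(w), w))[:k]


def document_precedence(doc_a, doc_b, word_precedence, background_freq, k=3):
    """Average P(a is prerequisite to b) over distinctive word pairs (a, b).

    Returns None when no pair has a known word-level probability.
    """
    scores = []
    for a in distinctive_words(doc_a, background_freq, k):
        for b in distinctive_words(doc_b, background_freq, k):
            p = word_precedence.get((a, b))
            if p is not None:
                scores.append(p)
    return sum(scores) / len(scores) if scores else None


background_freq = {"the": 0.05, "add": 0.001, "multiply": 0.0005}
word_precedence = {("add", "multiply"): 0.9}
doc_a = "the add the add the".split()
doc_b = "the multiply the multiply".split()
print(document_precedence(doc_a, doc_b, word_precedence, background_freq, k=1))
```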

Once you have the relationships between documents worked out using that system, there's an off-the-shelf algorithm (topological sort) that will order them according to those relationships.
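For the ordering itself, here is a small sketch using `graphlib.TopologicalSorter` from the Python standard library (3.9+). The `threshold` for turning a score into a hard "A before B" edge is an assumption, as are the example document names and scores.

```python
from graphlib import TopologicalSorter


def order_documents(doc_ids, prerequisite_score, threshold=0.6):
    """Order documents so that likely prerequisites come first.

    prerequisite_score maps (a, b) to the estimated probability that
    document a is a prerequisite for document b.
    """
    sorter = TopologicalSorter()
    for doc in doc_ids:
        sorter.add(doc)  # make sure unconnected documents still appear
    for (a, b), score in prerequisite_score.items():
        if score > threshold:
            sorter.add(b, a)  # a must come before b
    return list(sorter.static_order())


scores = {
    ("arithmetic", "algebra"): 0.9,
    ("algebra", "calculus"): 0.8,
    ("calculus", "arithmetic"): 0.1,
}
print(order_documents(["calculus", "algebra", "arithmetic"], scores))
```

One caveat: noisy scores can produce cycles (A before B before C before A), in which case `static_order()` raises `graphlib.CycleError`. A real version would need some way of dropping the weakest edges first.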

One nice thing about this method is that the first phase (where it looks at word relationships) can take advantage of well-structured documents outside of what it's trying to put in order, e.g. dictionary entries (where each definition is a "document"), Wikipedia entries, etc.
 
That "dumb method" is probably the route I'd take. That, or if the documents are added manually by a user, require them to enter a general "complexity score" along with each one, and use that for the sorting. That would rely on the person entering the document, though, and that's not always reliable.
 
This ( http://www.metacademy.org/ ) is what I was hoping to build towards with this question.  It's awesome that it exists, but it's clearly generated by humans (and so has a fairly narrow scope).  While that helps it give good quality results, I still think that developing an automated means to do this could help seed the DAG of concepts to ease the task of the human curators.