Over the last couple of years I have been researching "Independent Self-Hosting Web Services" (ISHWS) that could serve as substitutes for Google and other web-related services.
The reason is that it has become apparent to me that the Internet isn't a very safe place when it comes to personal security, privacy, and censorship. I think we can all agree on that.
The following are what I consider the primary categories; once these problems are addressed, we can start to consider sub-categories.
1. Search Engine and Web Crawler
2. Cloud Storage
3. Code and Documentation Hosting
4. E-Mail Server
5. VoIP (SIP, XMPP, WebRTC, Voicemail, ...)
6. Web Site, Blog, Forums, Social Networking, ...
At first glance you may dismiss the idea of ISHWS as an unrealistic goal when it comes to competing with companies like Google, Yahoo, Facebook, and so on.
You may well be right, but then again, when faced with this complex problem, the usual approach has been to attack it as a whole instead of breaking it up into manageable portions.
I'm going to attempt the latter.
There has been one ironic benefit to all the "Data Mining" done at the expense of "Personal Privacy" over the years: we now have a clear picture of the average "Net Citizen's" internet usage.
For instance, take the US:
1. Time Spent Online: 32 hours/month
2. 78.90% of American citizens have access to the Internet.
3. What they use it for:
   Email = 92%
   Using Search Engines = 92%
   Health or Medical Info = 83%
   Hobbies = 83%
   Searching for Directions = 82%
   Checking the Weather = 81%
   Researching Products to Buy = 78%
   Reading News = 76%
   Entertainment = 72%
   Buying a Product = 71%
Fastest-Growing Trends:
1. Location-Based Services = 27%
2. Time-Shifted TV = 27%
3. Internet Banking = 19%
One of the hardest problems to solve is "Internet Search". There are third-party APIs that can help, such as "Yahoo BOSS", but there are two problems with this:
1. It defeats the purpose of ISHWS.
2. Acquiring every "Index Result" that Yahoo currently possesses would cost in excess of $4 billion, and that's not even counting the duplicate entries that are common with the BOSS API. And that covers site search results only.
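For completeness, here is roughly what leaning on such an API looks like. This is a hypothetical sketch: the endpoint, key, parameter names, and response shape below are placeholders standing in for whichever provider you pick, not the actual BOSS interface.

    import requests

    # Hypothetical endpoint and key -- adjust to your provider's actual
    # URL, auth scheme, and response format.
    SEARCH_URL = "https://api.example-search.com/v1/web"
    API_KEY = "your-api-key"

    def search(query, count=10, start=0):
        """Fetch one page of results from a third-party search API."""
        resp = requests.get(SEARCH_URL, params={
            "q": query, "count": count, "start": start, "key": API_KEY,
        }, timeout=10)
        resp.raise_for_status()
        # Assumed shape: {"results": [{"url": ..., "abstract": ...}, ...]}
        return resp.json()

    for hit in search("self hosted web services")["results"]:
        print(hit["url"], "-", hit["abstract"][:80])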
I performed a test on myself: I recorded every unique URL I visited for the entire year of 2013. Interestingly enough, I only visited about 1,000 unique websites, and 90% of those were relevant to my personal interests. Now imagine if I had indexed those 1,000 sites prior to performing my test. Granted, it's not likely I would have known which sites I was going to visit for the year before visiting them, but what if I had?
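If you want to run the same test on yourself, here is a minimal sketch of the counting step, assuming Firefox (whose history lives in an SQLite database named places.sqlite, in a table called moz_places):

    import sqlite3
    from urllib.parse import urlparse

    # Point this at a *copy* of places.sqlite from your Firefox profile
    # directory -- Firefox locks the live database while it is running.
    DB_PATH = "places.sqlite"

    conn = sqlite3.connect(DB_PATH)
    rows = conn.execute("SELECT url FROM moz_places WHERE url LIKE 'http%'")

    # Reduce full URLs to unique hostnames.
    sites = {urlparse(url).netloc for (url,) in rows}
    conn.close()

    print(len(sites), "unique sites visited")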
Now imagine I was running the mysterious platform described above, and that I had thoroughly indexed content related to my interests, and my interests alone. And what if, say, nine others did the same, but all ten of us were connected in some way and had similar interests? Ask yourself this: how much of the content you come across is actually valuable to you, and how much would you consider junk?
Now let's assume that I have 15 different categories of interest:
1. Cooking (Ethnic, From Scratch)
2. News (Tech and World)
3. Computer Graphics (2D/3D, VFX (Movies), Games)
4. Books (Classics)
5. Electrical Engineering (Arduino, Raspberry Pi, Robotics and Communications)
6. Bushcraft (Wilderness Survival)
7. Indie Music (Jamendo, Magnatune, ...)
8. Short Stories (Screenplays, Science Fiction, Ancient Cultures, Science Non-Fiction, Adventure)
OK, that's not 15, but it's good enough. So out of the eight I listed, how many of my interests are interesting to you?
Now what if my other nine friends shared half my interests, but also had seven or eight independent interests of their own?
Let's assume all the data the ten of us had indexed was of the highest quality, meaning there were no advertisements or what I like to refer to as junk content, and each of us had stored around 100,000 documents, translating to an index database of around 1 GB, give or take.
Could you even imagine having 1 million documents of the highest quality indexed between 10 people to search?
Granted, it might only cover 75 different interests and God knows how many topics, but that's beside the point.
Think about that for a minute.
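The platform itself doesn't exist yet, but the plumbing is not exotic. Here is a sketch of the search side, assuming each of the ten friends runs a node exposing a simple /search endpoint that returns JSON results with a url, title, and score; all the names here are my own placeholders, not a finished protocol.

    import requests

    # Hypothetical list of friends' nodes, ~100,000 documents (~1 GB of
    # index) each. Each node is assumed to answer GET /search?q=...
    # with a JSON list like [{"url": ..., "title": ..., "score": ...}].
    PEERS = [
        "https://alice.example.net",
        "https://bob.example.net",
        # ... up to ten nodes
    ]

    def federated_search(query, limit=20):
        """Query every peer's local index and merge the results."""
        merged = []
        for peer in PEERS:
            try:
                resp = requests.get(peer + "/search",
                                    params={"q": query}, timeout=5)
                merged.extend(resp.json())
            except requests.RequestException:
                continue  # a friend being offline shouldn't break the search
        # Deduplicate by URL, keeping the best-scoring copy.
        best = {}
        for hit in merged:
            url = hit["url"]
            if url not in best or hit["score"] > best[url]["score"]:
                best[url] = hit
        return sorted(best.values(),
                      key=lambda h: h["score"], reverse=True)[:limit]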
Another interesting experiment I did was to take the Google Search, Yahoo BOSS, and Faroo search APIs and build a mock internet search engine using Python and the Sphinx full-text search engine. The data I was most interested in was the site URL and summary. I collected around 10,000 query results, which I then saved in an XML file scheme I had created for easy extraction and parsing. The goal of the experiment was to see how many results I could archive per TB of storage. Interestingly enough, I could have stored around 1 billion results per TB.
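The arithmetic behind that figure is straightforward: a URL plus a short summary, wrapped in XML, comes out to roughly 1 KB per record. The per-field sizes below are my own rough estimates, not measured values from the experiment:

    # Back-of-envelope check on the results-per-TB figure.
    AVG_URL_BYTES = 100        # assumed average URL length
    AVG_SUMMARY_BYTES = 700    # assumed average summary length
    XML_OVERHEAD_BYTES = 200   # tags, attributes, whitespace, escaping

    record_size = AVG_URL_BYTES + AVG_SUMMARY_BYTES + XML_OVERHEAD_BYTES
    terabyte = 10 ** 12        # 1 TB in bytes

    print(terabyte // record_size, "records per TB")  # 1,000,000,000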
If we were to attempt to crawl and index sites in the same fashion as Google or Yahoo, it would be not only cost prohibitive but also extremely inefficient.
But what if we narrowed our scope to "1 million documents indexed between 10 people containing data of the highest quality to search"? At that scale the problem is far more manageable.
Taking all of the above into consideration, let's look at the programming language and frameworks we can use.
I chose Python, mainly because of the sheer volume of support it has for everything above.
Web Framework: Flask
Full-Text Search: Whoosh
Documentation Generator: Sphinx
Cloud Storage: Tahoe-LAFS
Version Control System: Mercurial
VoIP (SIP, XMPP, WebRTC, Voicemail, ...): Plivo
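To make the first two choices concrete, here is a minimal sketch of one personal search node: Flask serving a Whoosh index over two endpoints. The endpoint names and the two-field schema are placeholders of my own, not a finished design; each friend running one of these is the federated setup sketched earlier.

    import os
    from flask import Flask, request, jsonify
    from whoosh import index
    from whoosh.fields import Schema, ID, TEXT
    from whoosh.qparser import QueryParser

    # One stored URL plus the searchable text of the page.
    schema = Schema(url=ID(stored=True, unique=True),
                    content=TEXT(stored=True))

    if not os.path.exists("indexdir"):
        os.mkdir("indexdir")
        ix = index.create_in("indexdir", schema)
    else:
        ix = index.open_dir("indexdir")

    app = Flask(__name__)

    @app.route("/add", methods=["POST"])
    def add_document():
        # update_document() replaces any existing entry with the same URL.
        writer = ix.writer()
        writer.update_document(url=request.form["url"],
                               content=request.form["content"])
        writer.commit()
        return jsonify(status="ok")

    @app.route("/search")
    def search():
        with ix.searcher() as searcher:
            query = QueryParser("content", ix.schema).parse(
                request.args.get("q", ""))
            hits = searcher.search(query, limit=20)
            return jsonify([{"url": h["url"], "score": h.score}
                            for h in hits])

    if __name__ == "__main__":
        app.run(port=5000)

POST a page to /add, then hit /search?q=arduino to query your own 100,000-document slice; the federated_search() sketch above is what stitches ten of these together.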