Discussion
 
Does anyone have recommendations for a database suited to log-like data? It will be write-heavy with zero updates, and it will need to support moderately complex queries to drive some dashboards. PostgreSQL will probably be fine, but I thought it would be worthwhile to take a look around.
 
Tim - what is the purpose / application? Is this in a production setting or something you are doing on your own?
 
+George Pace The db will be a sink for lots of machine-generated events/logging data. I have several worker machines crawling websites and I would like them to report to a central store. I'm pre-production, but you can get a taste at 404ohno.com.
 
Have you looked at something called Splunk? It does a lot of what you are looking for ..

The link below allows you to download a free copy for 60 days - not really sure of all of the licensing info after that .. 

http://www.splunk.com/download
 
Interesting, you're the second person to recommend Splunk. Will take a look.
 
It is designed SPECIFICALLY for what you are asking for - namely log files.

It natively knows how to ingest a variety of log file layouts (you can also define your own), and once ingested it has its own query language where you can chop/slice/dice the data ..

Also, I believe the data can be streamed in real time ..
 
It's a little out of left field, but I use sphinxsearch to do this. It can run standalone or with PostgreSQL or MySQL - I run it standalone using RT (real-time) indexes. It has nice analytic functions for building reports - full-text search, group-by, filtering, sorting, avg, max, min - over millions of logs with sub-second queries. Splunk uses the same idea, being a search engine to do log analytics, but theirs is proprietary.
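To give a flavour of what those queries look like, here's a rough sketch in Python + pymysql, talking SphinxQL to searchd over the MySQL protocol. The index name "logs_rt", its columns, and the timestamp cutoff are just placeholders from my setup:

import pymysql

# searchd's SphinxQL listener (9306 by default); no real MySQL server involved
conn = pymysql.connect(host="127.0.0.1", port=9306, user="")
try:
    with conn.cursor() as cur:
        # full-text match plus group-by/aggregates - the "slice and dice" part
        cur.execute(
            "SELECT host, COUNT(*) AS hits, MAX(status) AS worst "
            "FROM logs_rt WHERE MATCH('timeout') AND ts > 1356998400 "
            "GROUP BY host ORDER BY hits DESC LIMIT 20"
        )
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()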
 
Thanks for sharing, Mark - I had not heard of this project ..

Do you just periodically copy the log files to a SphinxSearch drive and then point the product at it?
 
If the logs are on Unix/Linux, I use rsyslog, which forwards the logs via UDP or TCP. I wrote a simple TCP server to listen for all the rsyslog traffic coming in, then push the logs into a Sphinx real-time index via a MySQL client (you don't actually need MySQL). You can then query the logs from any programming language you want, again via a MySQL client (and again, no MySQL needed). The best part is that this gives you real-time analysis of logs arriving from many machines - rough sketch below.

This is the same idea behind splunk/loggly, but using sphinx.
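Here's roughly what that listener looks like in Python; the index name "logs_rt", its columns, and port 5140 are stand-ins for whatever your setup uses, and real code would want batching and error handling:

import socket
import time
import pymysql

# SphinxQL connection to searchd (not a real MySQL server)
sphinx = pymysql.connect(host="127.0.0.1", port=9306, user="", autocommit=True)

# point rsyslog at this with something like: *.* @@central-host:5140
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("0.0.0.0", 5140))
srv.listen(5)

doc_id = 0
while True:
    conn, addr = srv.accept()
    buf = b""
    while True:
        chunk = conn.recv(4096)
        if not chunk:
            break
        buf += chunk
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            doc_id += 1
            with sphinx.cursor() as cur:
                # RT indexes want an explicit document id
                cur.execute(
                    "INSERT INTO logs_rt (id, message, host, ts) VALUES (%s, %s, %s, %s)",
                    (doc_id, line.decode("utf-8", "replace"), addr[0], int(time.time())),
                )
    conn.close()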
 
I've done work with Splunk. It is built for that use case. Definitely recommended.
 
Depending on how big you want it to be, you might want to look at something like Cassandra/HBase. Combine those with Hive/Pig and you can do complex queries with great scalability. Cassandra gives you that nice Federated model too.
 
I use Splunk for pulling in syslog data from all my network gear as well as IIS logs. It does take some work to get out what you're looking for, but I think it has certainly been worth the effort. If you are recording less than 500 MB of logs per day you can even use it for free; above that, pricing is based on your ingestion volume.
 
It's really about the type of queries you run - for example, how much they are dominated by time selection. If you need mostly text (or token) search along with a few numeric attributes, then Xapian or Lucene might be quite good, because they are pretty low-level engines. But assuming that you a) always select data temporally and b) always select some domains, the very simple HDFS layout of "<temporalwindow>/<hashcodeofdomain>" likely beats any "database", because it is a distributed index heavily optimized for exactly your type of queries. Last 7 days x 2 domains = 14 chunks to process - ideal for distributed processing. You can easily adjust the temporal resolution of the index to hours if you want, or scale the hash function to produce more or fewer buckets to get the desired number of chunks.
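A tiny Python sketch of that layout, just to make it concrete - the prefix, bucket count and hash choice are arbitrary:

import hashlib
from datetime import datetime

PREFIX = "/logs"      # hypothetical HDFS root
NUM_BUCKETS = 16      # scale this to get the desired number of chunks

def chunk_path(ts, domain):
    # "<temporalwindow>/<hashcodeofdomain>" - day resolution here, could be hours
    day = datetime.utcfromtimestamp(ts).strftime("%Y-%m-%d")
    bucket = int(hashlib.md5(domain.encode("utf-8")).hexdigest(), 16) % NUM_BUCKETS
    return "%s/%s/%02d" % (PREFIX, day, bucket)

# last 7 days x 2 domains = 14 such paths to feed to the distributed job
print(chunk_path(1357000000, "example.com"))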
 
Redis, depending on how complex the queries are.
 
What about not using a relational database? If it is really big data, give map-reduce (e.g. Hadoop) a chance.
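For example, a toy Hadoop Streaming mapper in Python that emits one (day, domain) key per log line; pair it with a summing reducer and you get per-day counts out of arbitrarily large logs. The field positions here are made up:

import sys

# mapper.py - hadoop streaming; a reducer just sums the trailing "1"s per key
for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        continue
    day, domain = fields[0][:10], fields[1]    # assumes ISO timestamp, then domain
    print("%s|%s\t1" % (day, domain))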
 
Another recommendation for Splunk here.  Too many reasons to list, but yes, log data is a typical starting point for leveraging Splunk.
 
Agreed, Jason - the general idea of Splunk is that it is "ready to go". Other solutions are possible, but you need to do more work to get the results ..

At least that is how I would word it
 
Redis is great for many use cases too. You need to build your own plumbing in this case, but there's a lot of interesting algo + infra work which Redis makes quite simple. The atomicity of macros, sorted sets, etc. are amazing. And in performance testing, we had to struggle to find ways to create a significant load... on a single process - saturating the NIC was more often the issue.
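As a tiny example of that plumbing (Python + redis-py; the key names are placeholders): bump a per-day sorted set for each event and keep a capped list of recent entries, all inside one MULTI/EXEC pipeline so the writes land together.

import time
import redis

r = redis.StrictRedis(host="localhost", port=6379)

def record_event(domain, status):
    day = time.strftime("%Y-%m-%d", time.gmtime())
    pipe = r.pipeline(transaction=True)          # queued, then applied atomically
    pipe.zincrby("hits:%s" % day, 1, domain)     # per-domain counter for dashboards
    pipe.lpush("recent:%s" % domain, "%d %s" % (int(time.time()), status))
    pipe.ltrim("recent:%s" % domain, 0, 999)     # cap the recent-events list
    pipe.execute()

# top 10 domains today, straight off the sorted set
today = time.strftime("%Y-%m-%d", time.gmtime())
print(r.zrevrange("hits:%s" % today, 0, 9, withscores=True))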
 
I have used the Infobright Community Edition (ICE) variant of MySQL very successfully for machine-generated data. To quote from http://www.infobright.org/:
"Infobright's column oriented, open-source analytic database provides both speed and efficiency.  Coupled with 10:1 average compression, ICE takes your machine generated data and gives you the ultimate power to slice-n'-dice your data.  Plus, it's free!"