Simple Image Based Persistence in Squeak
By Ramon Leon - 14 January 2008 under Databases, Programming, Smalltalk, Sql
One of the nicest things about prototyping in Smalltalk is that you can delay the need to hook up a database during much of your development, and if you're lucky, possibly even forever.
It's a mistake to assume every application needs a relational database or even a proper database at all. It's all too common for developers to wield a relational database as a golden hammer that solves all problems, but for many applications it introduces a level of complexity that can make development feel like wading through a pond full of molasses, where you spend much of your time trying to keep the database schema and the object schema in sync. It kills both productivity and fun, and goddammit, programming should be fun!
This is sometimes justified, but many times it's not. Many business applications and prototypes are built to replace manual processes using Email, Word, and Excel. Word and Excel, by the way, aren't ACID compliant, don't support transactions, and manage to successfully run most small businesses. MySql became wildly popular long before it supported transactions, so it's pretty clear a wide range of apps just don't need that, no matter how much relational weenies say it's required.
It shouldn't come as a surprise that one can take a single step up the complexity ladder and build simple applications that aren't ACID compliant, don't support transactions, and manage to successfully run most small businesses better than Word and Excel while purposely not taking a further step and moving up to a real database which would introduce a level of complexity that might blow the budget and make the app infeasible.
No object-relational mapping layer (not even Rails and ActiveRecord) can match the simplicity, performance, and speed of development one can get just using plain old objects that are kept in memory all the time. Most small office apps with no more than a handful of users can easily fit everything into memory; this is the idea behind Prevayler.
The basic idea is to use a command pattern to apply changes to your model; you can then log the commands, snapshot the model, and replay the log in case of a crash to bring the last snapshot up to date. Nice idea, if you're OK creating commands for every state-changing action in your application and being careful with how you use timestamps so replaying the logs works properly. I'm not OK with that; it introduces a level of complexity that is overkill for many apps and is likely the reason more people don't use a Prevayler-like approach.
One might attempt to use the Smalltalk image itself as a database (and many try), but this is rife with problems. My average image is well over 30 megs; saving it takes a bit of time, and saving it while processing HTTP requests risks all kinds of things going wrong as the image prepares for what is essentially a shutdown/restart cycle.
Using a ReferenceStream to serialize objects to disk Prevayler style, but ignoring the command pattern part and just treating it more like crash proof image persistence is a viable option if your app won't ever have that much data. Rather than trying to minimize writes with commands, you just snapshot the entire model on every change. This isn't as crazy as it might sound, most apps just don't have that much data. This blog, for example, a year and a half old, around 100 posts, 1500 comments, has a 2.1 megabyte MySql database, which would be much smaller as serialized objects.
If you're going to have a lot of data, clearly this is a bad approach, but if you're already thinking about how to use the image for simple persistence because you know your data will fit in ram, here's how I do it.
It only takes a few lines of code in a single abstract class that you can subclass for each project to make a Squeak image fairly robust and crash-proof and more than capable enough to allow you just to use the image, no database necessary. We'll start with a class...
    Object subclass: #SMFileDatabase
        instanceVariableNames: ''
        classVariableNames: ''
        poolDictionaries: ''
        category: 'SimpleFileDb'

    SMFileDatabase class
        instanceVariableNames: 'lock'
All the methods that follow are class side methods. First, we'll need a method to fetch the directory where rolling snapshots are kept.
    backupDirectory
        ^ (FileDirectory default directoryNamed: self name) assureExistence

The approach I'm going to take is simple: a subclass will implement #repositories to return the root object that needs to be serialized. I just return an array containing the root collection of each domain class.

    repositories
        self subclassResponsibility
The subclass will also implement #restoreRepositories: which will restore those repositories to wherever they belong in the image for the application to use them.
    restoreRepositories: someRepositories
        self subclassResponsibility

Should the image crash for any reason, I want the last backup to be fetched from disk and restored. So I need a method to detect the latest version of the backup file, which I'll stick a version number in when saving.

    lastBackupFile
        ^ self backupDirectory fileNames
            detectMax: [:each | each name asInteger]
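As a rough illustration of the two hooks, here is a sketch of a subclass for a hypothetical blog application. The names BlogDatabase, BlogPost and BlogComment, and their class-side #repository / #repository: accessors, are assumptions made up for this example and are not part of the code above. The `Class >> selector` headers are just browser-style notation for where each method lives.

```smalltalk
"Hypothetical subclass. BlogPost and BlogComment are assumed to keep their
 instances in a class-side collection reachable via #repository / #repository:."
SMFileDatabase subclass: #BlogDatabase
    instanceVariableNames: ''
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Blog-Persistence'

"--- class side ---"
BlogDatabase class >> repositories
    "Answer every root collection that should end up in the snapshot."
    ^ Array
        with: BlogPost repository
        with: BlogComment repository

BlogDatabase class >> restoreRepositories: someRepositories
    "Put the collections back in the same order #repositories answered them."
    BlogPost repository: someRepositories first.
    BlogComment repository: someRepositories second
```

The only contract between the two methods is the order of the array, so adding a new domain class means touching both.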
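To see what #detectMax: picks, the expression below can be tried in a workspace. It is only a sketch: the file names are invented, and it assumes String>>asInteger pulls out the first run of digits it finds in the string, which is how the Squeak images I'm aware of behave.

```smalltalk
"Invented file names in the <class name>.<version> shape used by the snapshots."
| names |
names := #('BlogDatabase.9' 'BlogDatabase.10' 'BlogDatabase.2').

"Compare by the embedded version number rather than alphabetically,
 so version 10 ranks above version 9."
names detectMax: [ :each | each asInteger ]
```

Under that assumption the expression answers the name carrying the highest version number. One likely consequence of the same assumption: a subclass whose own name contains digits could confuse the version lookup, so it is probably safest to keep digits out of the class name.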
Once I have the file name, I'll deserialize it with a read-only reference stream (don't want to lock the file if I don't plan on editing it)
    lastBackup
        | lastBackup |
        lastBackup := self lastBackupFile.
        lastBackup ifNil: [ ^ nil ].
        ^ ReferenceStream
            readOnlyFileNamed: (self backupDirectory fullNameFor: lastBackup)
            do: [ : f | f next ]
This requires you to extend ReferenceStream with #readOnlyFileNamed:do:, just steal the code from FileStream so nicely provided by Avi Bryant that encapsulates the #close of the streams behind #do:. Much nicer than having to remember to close your streams.
Now I can provide a method to restore the latest backup. Later, I'll make sure this happens automatically.
    restoreLastBackup
        self lastBackup ifNotNilDo: [ : backup | self restoreRepositories: backup ]
I like to keep around the last x number of snapshots to give me a warm fuzzy feeling that I can get old versions should something crazy happen. I'll provide a hook for an overridable default value in case I want to adjust this for different projects.
    defaultHistoryCount
        ^ 15
Now, a quick method to trim the older versions so I'm not filling up the disk with data I don't need.
    trimBackups
        | entries versionsToKeep |
        versionsToKeep := self defaultHistoryCount.
        entries := self backupDirectory entries.
        entries size < versionsToKeep ifTrue: [ ^ self ].
        ((entries sortBy: [ : a : b | a first asInteger < b first asInteger ])
            allButLast: versionsToKeep)
                do: [ : entry | self backupDirectory deleteFileNamed: entry first ]

OK, I'm ready to serialize the data. I don't want multiple processes all trying to do this at the same time, so I'll wrap the save in a critical section, #trimBackups, figure out the next version number, and serialize the data (#newFileNamed:do: is another stolen FileStream method), making sure to #flush it to disk before continuing (don't want the OS doing any write caching).

    saveRepository
        | version |
        lock critical: [
            self trimBackups.
            version := self lastBackupFile
                ifNil: [ 1 ]
                ifNotNil: [ self lastBackupFile name asInteger + 1 ].
            ReferenceStream
                newFileNamed: (self backupDirectory fullPathFor: self name) , '.' , version asString
                do: [ : f | f nextPut: self repositories ; flush ] ]
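The trimming hinges on #allButLast:, which answers everything except the final n elements of a sorted collection. A small workspace sketch, using plain integers as stand-ins for sorted version numbers (an assumption for illustration only):

```smalltalk
| versions |
"Pretend 20 snapshots exist, already sorted oldest to newest."
versions := (1 to: 20) asArray.

"With 15 versions to keep, everything except the newest 15 is up for deletion."
versions allButLast: 15    "answers #(1 2 3 4 5)"
```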
So far so good, let's automate it. I'll add a method to schedule the subclass to be added to the start-up and shutdown sequence. You must call this for each subclass, not for this class itself.
UPDATE: This method also initializes the lock and must be called prior to using #saveRepository, this seems cleaner.
    enablePersistence
        lock := Semaphore forMutualExclusion.
        Smalltalk addToStartUpList: self.
        Smalltalk addToShutDownList: self
So on shutdown, if the image is going down, just save the current data to disk.
    shutDown: isGoingDown
        isGoingDown ifTrue: [ self saveRepository ]
And on startup, we can #restoreLastBackup.
    startUp: isComingUp
        isComingUp ifTrue: [ self restoreLastBackup ]

Now, if you want a little extra snappiness and would rather not make the user wait for the flush to disk, I'll add a little convenience method for saving the repository on a background thread.

    takeSnapshot
        [self saveRepository]
            forkAt: Processor systemBackgroundPriority
            named: 'snapshot: ' , self class name

And that's it: half a Prevayler, and a more robust, easy-to-use method that's a bit better than trying to shoehorn the image into being your database for those small projects where you don't want to bother with a real database (blogs, wikis, small apps, etc). Just sprinkle a few MyFileDbSubclass saveRepository or MyFileDbSubclass takeSnapshot's around your application whenever you feel it important, and you're done.
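Put together, wiring this into an application might look roughly like the workspace sketch below. BlogDatabase and BlogPost are hypothetical names (a subclass of SMFileDatabase and a domain class with a class-side #repository collection); only #enablePersistence, #saveRepository and #takeSnapshot come from the code above.

```smalltalk
"One-time setup, for example from a workspace: creates the lock and
 registers the (hypothetical) subclass for startup and shutdown."
BlogDatabase enablePersistence.

"Hypothetical domain change followed by an explicit, synchronous save."
BlogPost repository add: BlogPost new.
BlogDatabase saveRepository.

"Or, when the caller shouldn't wait for the disk, save in the background."
BlogDatabase takeSnapshot.
```

Since #enablePersistence is what creates the lock, it has to run before the first #saveRepository.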
Here's a file out if you just want the code fast, SMFileDatabase.st
Comments (automatically disabled after 1 year)
Yes, better to use early initialization.
Yea, Paolo's right, that should be in an initialize, however, it's a one time event that you pretty much trigger yourself while setting up and testing your repository. So the likelihood of multiple processes hitting this is virtually nil.
Being a class instance variable, I can't use initialize since that'd initialize this superclass's lock rather than the lock of each subclass required to use it. Each subclass has its own lock. I didn't want every subclass to have to initialize it, so I chose this method given the insanely low chance you'd ever have a nil by the time you put it under multiple processes.
Thinking about it... maybe this is why some people prefer using an instance as a singleton rather than using the class instance itself, easier to initialize.
Hey Ramon, great article. Added a link from the squeak wiki: http://wiki.squeak.org/squeak/512
Wonderful article. My comments (including a fun graphic) are at http://methodsandmessages.vox.com/library/post/ramon-on-keep-it-simple-persistence.html
Thanks, glad you enjoyed the code!
Have just enjoyed it.
It's all we all just need. Just simple persistence.
Ramon, have you had a look at the Java Persistence API (JPA)? It is very simple to use. It is really worth using. The best articles to get started with JPA are: ejb3-persistence-api-for-client-side-developer and more-ejb3-persistence-api-for-client-side-developer
Otherwise most documentation of JPA is so rubbish that I stayed away from learning JPA for one week. (Especially Sun's official Java EE Tutorial.)
You do realize this is a Smalltalk site right? I wouldn't touch Java with a ten foot pole, it's a horrible language built to enslave programmers and turn them into cogs in the giant corporate wheels of big companies who don't care about them.
Very nice solution, elegant and simple yet not simplistic - leading to a potent empowerment for developers' applications - desk, server, web, client, remote, autonomous, or embedded. Sweet. Keep up the excellent work Ramon.
Now if there was only a way to snapshot only those objects in the image that have changed since the last snapshot or "checkpoint". In a way a "bulk" transaction without the bulk of saving each and every object each and every snapshot. Hey if it hasn't changed the prior checkpoints or snapshots should have it - as long as you never delete the most recent FULL snapshot.
Of course I do realize that. But sometimes you can find pleasant surprises in Java also. I mentioned JPA only because it is something different than everything else in wild J2EE land. Simple like Smalltalk. However, I do not have any idea how complex its implementation is.
Great post, I've already started using this thanks to you Ramon.
One thing you might want to mention --- I don't know how ReferenceStreams handle changes to schema, in case you change things between save & restore. Could also mention SixxReadStream and SixxWriteStream as a (probably slower) option that might handle some schema changes a bit better.
Yea, it was your trying to use image persistence that made me decide to share this, it was already available in my image, but I figured a post about it would be better.
As for changing your classes (schema), it's serialization, so just don't change your class names and lazily initialize your instance variables in their accessors and you're golden. New instance vars will be nil by default and get a value on first access, and inst vars you remove will just disappear.
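A lazily initializing accessor of the kind described here might look like the sketch below. BlogPost and its 'tags' instance variable are hypothetical, invented only to show the shape of the accessor.

```smalltalk
"Hypothetical: BlogPost gained a 'tags' instance variable after
 snapshots had already been written without it."
BlogPost >> tags
    "Restored instances have nil here, so initialize on first access."
    ^ tags ifNil: [ tags := OrderedCollection new ]
```

As long as the rest of the code goes through the accessor rather than touching the variable directly, objects restored from an older snapshot pick up the new default the first time they are asked.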
However, since all the data is stored in the image anyway, you can just live upgrade the code and save a fresh backup and not worry so much about restoring old data. The backups are just that, backups, just in case, the real objects live in the image.
You can use Sixx streams instead of Ref streams if you prefer XML to a binary format, but it's a lot slower, and you get much larger snapshot file sizes, so it won't scale nearly as well as Ref streams.
Sophie, you've been very active on the list, you seem to have learned a lot, any chance you'll take the plunge and start another Seaside blog?
Thanks for the clarification on serialization. On blogging ... not quite confident yet, not ready with infrastructure (what do you blog with?)
Sophie
Don't wait till you get confident, you'll rob readers of going through the learning experience with you because you'll forget what things you got stuck on. I use Wordpress on my own Linux server, no need to reinvent the wheel, but you can start up a blog in minutes on wordpress.com and move it later if you ever get your own thing going.
Sophie, you can also use Pier and the blog component... I found it works great and is robust (http://lukas-renggli.ch/blog). Maybe you'll have to spend a bit more time to set up the Pier instance (not even sure), but it's 100% Smalltalk. You can use seasidehosting.st or your personal box for hosting.
Excellent post Ramon, one of my favorites... Thanks a lot :)
Hello Ramon,
I like the idea very much - it is a bit more sophisticated than the simple ObjectFiler approach I am using in two of my (rather simple but handy) VSE programs. The absence of a backup is a bit annoying there, so I am thinking of adding something of your approach to that.
About the JPA poster: yes, it is interesting, but certainly overblown (as all J2EE stuff) for anything but real enterprise applications. And most even mid-sized companies simply do not need that.
Greetings from Germany, Claus
Hello Ramon.
I have been reading your blog for a while now, and enjoy it very much.
I have been working on an app to learn Seaside and Smalltalk recently and thought the idea in this post was great, but I have been having trouble making it work. I am pretty new to Squeak, so it is entirely possible I am missing something obvious.
I don't think any of your code is the problem. #takeSnapshot works, #enablePersistence works. The problem is in #restoreRepositories: in my subclass (GLDatabase). I have a single model class (SystemInfo, with a class instance var 'repository' and an accessor/mutator for it) until I get it working.
Here is the relevant code:
    SystemInfo class>>repository
        repository ifNil: [self repository: #()].
        ^repository

    GLDatabase class>>repositories
        ^Array with: (SystemInfo repository)

    GLDatabase class>>restoreRepositories: someRepositories
        SystemInfo repository: (someRepositories first)
I should note that I had to copy several methods (#concreteStream, #detectFile:do:, #fullName: and #readOnlyFileNamed:) which you did not mention, in order to get the code to run.
I made a little test to figure out where the problem is:
    | written read |
    Transcript clear.
    ReferenceStream newFileNamed: 'temp' do: [: f|
        written := f nextPut: #(#()).
        f flush ].
    read := ReferenceStream readOnlyFileNamed: 'temp' do: [ : f | f next ].
    Transcript
        show: 'written = '; show: written; cr;
        show: 'read = '; show: read; cr.

In my case the transcript reads:

    written = #(#())
    read =
If I Ctrl-p on read, it says 'Character backspace', which explains why I was getting 'Character doesNotUnderstand: #first' error from #restoreRepositories.
I think the error is coming from #concreteStream
    ReferenceStream>>concreteStream
        "Who should we really direct class queries to? "
        ^ MultiByteFileStream.
Am I supposed to be using a different Stream class for object serialization?
Anyway, sorry for the long comment. I hope someone can tell me what I am doing wrong.
Thanks
You're stealing too much from FileStream, you really just need to encapsulate the #close, here are my extensions to DataStream...

    fileNamed: aName do: aBlock
        | file |
        file := self fileNamed: aName.
        ^ file
            ifNil: [ nil ]
            ifNotNil: [ [ aBlock value: file ] ensure: [ file close ] ]

    newFileNamed: aName do: aBlock
        | file |
        file := self newFileNamed: aName.
        ^ file
            ifNil: [ nil ]
            ifNotNil: [ [ aBlock value: file ] ensure: [ file close ] ]

    oldFileNamed: aName do: aBlock
        | file |
        file := self oldFileNamed: aName.
        ^ file
            ifNil: [ nil ]
            ifNotNil: [ [ aBlock value: file ] ensure: [ file close ] ]

    readOnlyFileNamed: aString
        | strm |
        strm := self on: (FileStream readOnlyFileNamed: aString).
        strm byteStream setFileTypeToObject.
        ^ strm

    readOnlyFileNamed: aName do: aBlock
        | file |
        file := self readOnlyFileNamed: aName.
        ^ file
            ifNil: [ nil ]
            ifNotNil: [ [ aBlock value: file ] ensure: [ file close ] ]
It works now. Thank you.
I was literally copying the code from FileStream to make it run, rather than modifying it. There are two differences between your code and what I had.
#readOnlyFileNamed:do: was using #detectFile:do:, which encapsulates the #ifNil:ifNotNil: call on file. And my #readOnlyFileNamed: looked like this:

    readOnlyFileNamed: fileName
        ^ self concreteStream readOnlyFileNamed: (self fullName: fileName)

Which is where I got the #concreteStream and #fullName: methods.
I still don't really get why it wouldn't work before. Why would calling concreteStream from within ReferenceStream be any different than delegating it to FileStream?
Anyway, thank you for your help.
A question about the Prevayler approach:
For those cases where the UI framework knows which action will be invoked on which object (e.g. the #on:of: style callbacks, or through some other means like the #actions in your SSForm), would it be possible for the logging to be done by the framework itself, rather than any explicit Command-pattern objects by the programmer? There seem to be multiple ways to do the wrapping, from method-wrappers to ByteSurgeon to ...
- Bill
I'm sure one could come up with something that worked, but it'd be an approach I wouldn't like because it's not simple and obvious. I used to think that auto-tracking changes with some kind of write barrier or change observer and automatically committing was simple too, but with every such system I've worked with, I find I'm always doing battle with the auto thing, trying to make it behave how I want.
The conclusion I've come to, especially after seeing the success of ActiveRecord (a pattern I used to think sucked) in the Ruby community, is that nothing beats an explicit call to save by the programmer because it allows things to work outside of some special context where the auto-magic is available, and it's easy for the programmer to see exactly what's going on. Explaining how ActiveRecord works to another programmer and getting him productive fast is trivial; explaining how Glorp works and getting him productive fast is near impossible, too much magic for many programmers to keep in mind.
I'm becoming less and less a fan of magic and more and more a fan of simple straight forward solutions that just work (like saving objects to a file).
Hi,
very interesting post. Thank you.
I have a doubt about your #saveRepository method.
    saveRepository
        lock ifNil: [ lock := Semaphore forMutualExclusion ].
        lock critical: [...]

Imagine two processes enter the method at the same time and lock is nil. They will both enter the #ifNil: block. Imagine the first process assigns the instance variable lock and enters the #critical: block. Then the second process overwrites the instance variable, and I don't see anything preventing it from entering the #critical: block, because the lock is not the same. What do you think?
Bye