So the problem is that I wanted to randomly generate these user profiles, so I ended up making this data type an instance of Random which is entirely uninteresting & boilerplate code.
Now, essentially the logic for generating the list of profiles looked like, exaggerated for full effect,
import Control.Monad
import Random
filepath = "crud.csv"
generateProfiles :: Int -> IO [Int]
generateProfiles i = foldM aux [] [1..i]
where aux is i = do
sub <- randomIO
return (sub:is)
main = generateProfiles 10000000 >>= (writeFile filepath . unlines . map show)
Run this! Run this and weep for your feeble RAM as it is withers under the gaze of my space leak!
Alright, so this is pretty awful isn't it? It came from trying to think about the problem as "well, I want to generate 'i' many subscribers so I'm going to iterate over a list of i length, generating a subscriber each time". Bzzt, wrong. That's trying to (badly) emulate a C style for-loop in Haskell, not writing Haskell, and is a very bad idea.
Well, I took a look at this code with a fresh eye, was terrified, and then realized what would probably be a more idiomatic Haskell program: generate an infinite list of subscribers, and just write the first 'i' to a file. After all, if I'm truly being lazy then only as many as I need to write should be created. So the new approach looks something like
betterGenerate :: (RandomGen g) => g -> [Int]
betterGenerate = randoms
main = do
std <- newStdGen
let subs = betterGenerate std
(writeFile filepath . unlines . map show . take 10000000) subs
which takes a few seconds, but doesn't eat RAM all to Hell. The main point here, for me, was to stop trying to write C# in Haskell and just write Haskell. There's probably still ways I could improve the logic of my little utility, but I can now generate 1 million subscribers, or about 120 MB of data, in roughly 2 minutes which is the upper limit of what I need.
3 comments:
The head of generateProfiles' list is sub_i, so it has to build the entire list in memory to write out the first line. Of course (is ++ [sub]) doesn't do much better, and mapM (const randomIO) [1..i] is only a little better.
Right, because all of those ways still force the data. Before I came up with the proper solution I tried mapM, but when that still failed I looked at the definition of the function and saw that it was sequence . map.
Well, looking at the definition of sequence it didn't look like it would play nicely with space constraints. As a test, I tried
take 10 $ sequence $ repeat $ return ()
at the command line, blowing the stack.
Yes, by default, everything you do in the IO monad is strict (because lazyness don't play well with side-effects), it's not the fault of sequence or mapM (which are lazy in lazy monad, you can try with the State monad or Maybe monad or...).
To introduce lazyness in a IO context you have to use unsafeInterleaveIO, which as its name imply should only be used with precaution since it can have nasty effects (look at lazy IO for some of the dangers) though in your case it would have been safe.
Much better in your case (as you've probably done) was to write an instance of Random for Profile and call randoms.
Post a Comment