Presuming identical hardware and a platform-independent language (that is, sadly, not explicitly parallelizable without modification to the interpreter itself), does anyone know how performance stacks up for running a batch job on Linux vs OpenBSD vs FreeBSD vs Solaris 10 vs Cygwin?

Specifically, I will be running iterations of large, nasty network simulations using ns2, which uses a derivative of Object Tcl (OTcl) to describe the network (and which does not provide a facility to parallelize the task), and I have a machine that should be sufficiently butch to handle the load. I intend to create the simulation and leave it crunching overnight, but does anyone know how performance changes as a function of the number of nodes and the rate of traffic generated? Are the constraining factors for these simulations something that can be compensated for by choosing an optimal operating environment?

I've run much smaller, simpler simulations on a time-sharing Solaris server (E3500) and on my laptop (AMD Sempron, Cygwin under XP Pro), and even with the more performant machine on which to run the simulation (mirrored 10kRPM U320 drives, dual-core AMD processor, 4 GBytes RAM), I am not optimistic about the speed of the runs, or about being able to iteratively tweak large simulations (probably around 1M nodes, many linked to more than one other node) in any sane amount of time. If the simulator were parallelizable, I would rent time on a cluster (or, rather, try to barter to get someone to rent time on my behalf if the price were out of my range).

Thanks, and sorry for the rambling question.
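P.S. For anyone who hasn't seen ns2: the network gets described in OTcl, and a scaled-down sketch of the sort of script I mean looks roughly like the stock tutorial examples (the numbers here are toy values, not my actual parameters):

    # toy ns2/OTcl sketch: two nodes, one link, one CBR flow
    set ns [new Simulator]
    set n0 [$ns node]
    set n1 [$ns node]
    $ns duplex-link $n0 $n1 1Mb 10ms DropTail

    # constant-bit-rate traffic over UDP from n0 to a sink on n1
    set udp [new Agent/UDP]
    $ns attach-agent $n0 $udp
    set cbr [new Application/Traffic/CBR]
    $cbr attach-agent $udp
    $cbr set packetSize_ 500
    $cbr set rate_ 100Kb
    set sink [new Agent/Null]
    $ns attach-agent $n1 $sink
    $ns connect $udp $sink

    # schedule the traffic and the end of the run, then go
    $ns at 0.5 "$cbr start"
    $ns at 100.0 "exit 0"
    $ns run

Scale that up to the topology sizes I mentioned and you have the general shape of the job.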
5 responses total.
Will you be doing mostly computation? Disk I/O? Network I/O? Memory allocation and freeing? Process and/or thread setup and control? Not that I know the answer to your question anyway, but *nobody* could begin to answer it without knowing more about what the performance chokepoints in your program are likely to be.
I realize I had a badly formulated and incomplete question. Part of the problem is that I do not know what the chokepoints are for this software: searching via Google and a couple of other resources turns up lots of information about using ns2 to find chokepoints in networks, and those results drown out anything about chokepoints in the simulation software itself. From what I know about the underlying system, there will be a large amount of disk access, though it will all be to a pair of files, so I presume that on a fresh install of the OS those files should be fairly contiguous, taking out the overhead from seeking. Additionally, there will be a lot of threads spawned for creating the topology, with the actual simulation then running in-order. I know memory consumption will also be huge (my understanding is that it keeps the entire topology in memory, along with a route map for each node that performs packet-switching duties, at the very least in the beginning). In short, I guess this will be slow in every way except for use of the network (since the simulation runs entirely on one machine). Thanks for forcing me to think about the question in a more sane way.
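For reference, my guess is that the pair of files is the usual trace output arrangement, i.e. something like this in the script:

    # open the packet trace and (optionally) the nam trace, and register them
    set tf [open out.tr w]
    $ns trace-all $tf
    set nf [open out.nam w]
    $ns namtrace-all $nf

    # flush and close the traces at the end of the run
    proc finish {} {
        global ns tf nf
        $ns flush-trace
        close $tf
        close $nf
        exit 0
    }
    $ns at 100.0 "finish"

Every packet event gets appended to those files, so the disk access should be essentially sequential writes, which is why I'm hoping seeking is not the issue.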
Looking deeper into the archives of the mailing list for the software turned up some information. Namely, since each node represents a stand-alone device, each node keeps as much information about the network as would be kept in a kernel routing table, so when the simulation is running, a lot of information is held in memory. I'll try to put some markers into the early simulations (roughly the sketch below) to see if they can tell me more about where the bottlenecks are, and will re-ask the question in a more sane way, with more/better information.
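The markers will probably be nothing fancier than a recurring event that prints simulated time against wall-clock time, roughly like this (the 10-second interval is arbitrary):

    # print simulated time vs. wall-clock time every 10 simulated seconds
    proc marker {} {
        global ns
        puts stderr "sim=[$ns now] wall=[clock seconds]"
        $ns at [expr [$ns now] + 10.0] "marker"
    }
    $ns at 0.0 "marker"

If the wall-clock time per simulated second keeps climbing as the run goes on, that would point at memory pressure or a growing event queue rather than a constant per-event cost.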
Sounds like memory will be huge. Did I gather correctly that there will be a thread for each node? If things haven't changed since I last looked, the Sun should be way faster at the context switches involved in all those threads. I'd look at the real clock time for context switches on each of your candidate systems, both under normal loading and in the pathological case where the number of contexts greatly exceeds the number of hardware context frames. I'd also be interested in the amount of work done on any one thread before switching to another, with respect to memory locality and cache hit rates. Specifics of the cache hardware and of cache clearing on context switches may play a big part. There's a good chance that hardware will play as much of a role as the OS.
Memory will be huge when I run the *really big* simulations. The box has 4 GBytes of physical RAM and will probably wind up having as many swap files as the OS allows (I believe Linux allows 8; not sure about the BSDs or Solaris, and if I go with Cygwin, I'll get Windows's pathologically enthusiastic paging setup). I am pretty sure there is a separate thread for each node's creation, and then a single thread for the whole network during the actual run, though I could be wrong. The truly hardcore, deep system-level performance analysis (memory locality, cache hit rates, context switching) is out of my depth; I will read up on it, though. If the programme were more parallelizable and would benefit from the Sun architecture, I would set up the three Netra clones, the Netra, and the Ultra 5 as a compute cluster. If I thought I would be done before graduation, I would just run this on the Sun at school (I forget whether it is an E3000 or an E3500, but it is big and butch and spiffy). Anyone have a spare V880 they don't want? I'll take it off your hands and give it a good home doing scientific computing (and never need a heater in the winter).