Field notes on the Web: ridiculous amounts of data

Tuesday, April 16, 2019

Distributed astronomy

Recently, news sources all over the place have been reporting on the imaging of a black hole, or more precisely, the immediate vicinity of a black hole. The black hole itself, more or less by definition, can't be imaged (as far as we know so far). Confusing things a bit more, any image of a black hole will look like a black disc surrounded by a distorted image of what's actually in the vicinity, but this is because the black hole distorts space-time due to its gravitational field, not because you're looking at something black. It's the most natural thing in the world to look at the image and think "Oh, that round black area in the middle is the black hole", but it's not.

Full disclosure: I don't completely understand what's going on here. Katie Bouman has given a really good lecture on how the images were captured, and Matt Strassler has an equally good, though somewhat long, overview of how to interpret all this. I'm relying heavily on both.

Imaging a black hole in a nearby galaxy has been likened to "spotting a bagel on the moon". A supermassive black hole at the middle of a galaxy is big, but even a "nearby" galaxy is far, far away.

To do such a thing you need more than a telescope with a high degree of magnification. The laws of optics place a limit on how detailed an image you can get from a telescope or similar instrument, regardless of the magnification. At a given wavelength, the larger the telescope, the higher the resolution, that is, the sharper the image. This applies equally well to ordinary optical telescopes, X-ray telescopes, radio telescopes and so forth. For purposes of astronomy, what they detect is all considered "light", since it's all electromagnetic radiation and so follows the same laws.

Actual telescopes can only be built so big, so in order to get sharper images astronomers use interferometry to combine images from multiple telescopes. If you have a telescope at the South Pole and one in the Atacama desert in Chile, you can combine their images to get the same resolution you would with a giant telescope that spanned from Atacama to the pole. The drawback is that since you're only sampling a tiny fraction of the light falling on that area, you have to reconstruct the rest of the image using highly sophisticated image processing techniques. It helps to have more than two telescopes. The Event Horizon Telescope project that produced the image used eight, across six sites.
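To put rough numbers on why baseline matters, here is a minimal sketch using the textbook diffraction-limit rule of thumb (smallest resolvable angle is about 1.22 times wavelength over aperture). The wavelength, dish size and baseline are assumptions for illustration, not figures from the sources above, and the 1.22 factor strictly applies to a filled circular aperture, so for an interferometer treat the result as a ballpark only.

```python
import math

# Conversion factor: radians -> microarcseconds.
RAD_TO_MICROARCSEC = math.degrees(1) * 3600 * 1e6

def diffraction_limit_rad(wavelength_m, aperture_m):
    """Rule-of-thumb smallest resolvable angle, in radians."""
    return 1.22 * wavelength_m / aperture_m

# Assumed values, for illustration only.
WAVELENGTH_M = 1.3e-3       # assumed millimetre-wave observing wavelength
SINGLE_DISH_M = 10.0        # hypothetical single dish
EARTH_BASELINE_M = 1.27e7   # roughly the diameter of the Earth

single = diffraction_limit_rad(WAVELENGTH_M, SINGLE_DISH_M)
array = diffraction_limit_rad(WAVELENGTH_M, EARTH_BASELINE_M)

print(f"single dish: {single * RAD_TO_MICROARCSEC:,.0f} microarcseconds")
print(f"Earth-sized baseline: {array * RAD_TO_MICROARCSEC:,.0f} microarcseconds")
print(f"improvement: {single / array:,.0f}x")
```

Under these assumptions the improvement is simply the ratio of the two apertures, which is why spreading telescopes across the globe pays off so dramatically.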

Even putting together images from several telescopes, you don't have enough information to know precisely what the full image would really be, and you have to be really careful to make sure that the image you reconstruct shows things that are actually there and not artifacts of the processing itself (again, Bouman's lecture goes into detail). In this case, four teams worked with the raw data independently for seven weeks, using two fundamentally different techniques, to produce the images that were combined into the image sent to the press. In preparation for that, the image processing techniques themselves were thoroughly tested for their ability to recover images accurately from test data. All in all, a whole lot of good, careful work by a large number of people went into that (deliberately) somewhat blurry picture.

All of this requires very precise synchronization among the individual telescopes, because interferometry only works for images taken at the same time, or at least to within very small tolerances (once again, the details are ... more detailed). The limiting factor is the frequency of the light used in the image, which for radio telescopes is on the order of gigahertz. This means that images from the telescopes have to be recorded on the order of a billion times a second. The total image data ran into the petabytes (quadrillions of bytes), with the eight telescopes producing hundreds of terabytes (that is, hundreds of trillions of bytes) each.

That's a lot of data, which brings us back to the web (as in "Field notes on the ..."). I haven't dug up the exact numbers, but accounts in the popular press say that the telescopes used to produce the black hole images produced "as much data as the LHC produces in a year", which in approximate terms is a staggering amount of data. A radio interferometer comprising multiple radio telescopes at distant points on the globe is essentially an extremely data-heavy distributed computing system.

Bear in mind that one of the telescopes in question is at the South Pole. Laying cable there isn't a practical option, nor is setting up and maintaining a set of radio relays. Even satellite communication is spotty. According to the Wikipedia article, the total bandwidth available is under 10 MB/s (consisting mostly of a 50 megabit/second link), which is nowhere near enough for the telescope images, even if stretched out over days or weeks. Instead, the data was recorded on physical media and flown back to the site where it was actually processed.

I'd initially thought that this only applied to the South Pole station, but in fact all six sites flew their data back rather than try to send it over the internet (just to throw numbers out, receiving a petabyte of data over a 10 GB/s link would take about a day). The South Pole data just took longer because they had to wait for the Antarctic summer.
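For anyone who wants to check the arithmetic, here is a quick sketch. The petabyte and the link speeds come from the text above; the 100 TB figure is just an assumed stand-in for "hundreds of terabytes" from a single telescope, and the calculation ignores protocol overhead and outages.

```python
SECONDS_PER_DAY = 86_400
TERABYTE = 1e12
PETABYTE = 1e15

def transfer_days(size_bytes, rate_bytes_per_s):
    """Days needed to move size_bytes at a sustained rate, ignoring overhead."""
    return size_bytes / rate_bytes_per_s / SECONDS_PER_DAY

# A petabyte over a 10 gigabyte/second link.
fast_link = transfer_days(PETABYTE, 10e9)

# The South Pole's 50 megabit/second link is 6.25 megabytes/second.
# 100 TB is an assumed lower bound for one telescope's share of the data.
pole_link = transfer_days(100 * TERABYTE, 50e6 / 8)

print(f"1 PB at 10 GB/s: {fast_link:.1f} days")
print(f"100 TB at 50 Mbit/s: {pole_link:.0f} days")
```

Even at the assumed low end, the satellite link would be tied up for about half a year, which makes a crate of disks on a plane look pretty good.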

Not sure if any carrier pigeons were involved.

Friday, April 6, 2012

What's twice as big as the internet?

(Yikes ... I went 0 for March!)

I've mentioned before that telescopes can generate a lot of data. IBM seems inclined to drive the point home by collaborating with ASTRON (the Netherlands Institute for Radio Astronomy) to put together "exascale" computing horsepower behind the world's largest radio telescope.

The telescope is actually (or rather, will be) an array of millions of antennas with a combined collecting area of about a square kilometer, hence the name SKA, for Square Kilometer Array. This array is expected to produce on the order of an exabyte of data per day. This is an absolutely ridiculous amount of data by today's standards. Think one million terabyte disk drives, or twenty million feature films' worth of Blu-ray, or ... according to IBM, twice the daily volume currently carried on the internet.

I'm a little skeptical as to exactly how one measures that, but hey, you've got to trust a press release, right?

So where do you put an exabyte a day worth of data? Well, you don't. You're certainly not going to upload it to the web. Particle physicists are faced with the same problem of having to figure out what portion of a huge data set to keep for later analysis, and a large part of running an experiment is setting up the "trigger" criteria by which the software collecting the data will decide what to keep and what to throw away. IBM and ASTRON's system will be dealing with the same problem, but on an even larger scale.

Or I suppose you could sign up two million people and somehow stream an equal share of the data to each at Blu-ray resolution all day every day, but somehow I doubt that kind of crowdsourcing will help much.
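The back-of-envelope figures in this post can be reproduced in a few lines. This is only a sketch: the 50 GB disc capacity is my assumption (a dual-layer Blu-ray), and "exabyte" is taken as a decimal 10^18 bytes.

```python
EXABYTE = 1e18
TERABYTE = 1e12
BLU_RAY_BYTES = 50e9        # assumed capacity of a dual-layer Blu-ray disc
SECONDS_PER_DAY = 86_400

daily_bytes = EXABYTE       # the array's expected daily output

terabyte_drives = daily_bytes / TERABYTE
blu_rays = daily_bytes / BLU_RAY_BYTES

# Split the day's data evenly among two million viewers,
# expressed as a sustained bit rate per viewer.
viewers = 2_000_000
per_viewer_mbps = daily_bytes * 8 / viewers / SECONDS_PER_DAY / 1e6

print(f"terabyte drives per day: {terabyte_drives:,.0f}")
print(f"Blu-ray discs per day: {blu_rays:,.0f}")
print(f"per-viewer stream: {per_viewer_mbps:.1f} Mbit/s")
```

The per-viewer rate comes out in the tens of megabits per second, which is roughly the territory of a Blu-ray disc's peak bit rate.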

Wednesday, October 5, 2011

Crowdsourcing the sky

Astronomy has been likened to watching a baseball game through a soda straw. For example, the Hubble Deep Field, assembled from 342 images taken over the course of ten days, covers about one 24-millionth of the sky, or about the size of a tennis ball seen a hundred yards away. It's quite possible to survey large portions of the sky, but there are trade-offs involved since you can only collect so much light so fast. To cover a large area and still pick up faint objects, you need some combination of a big telescope and a lot of time. The bigger the telescope (technically, there's more to it than sheer size) the faster you can cover a given area down to a given magnitude (how astronomers measure faintness).

The Large Synoptic Survey Telescope (LSST) is designed to cover the entire sky visible from its location every three days, using a 3.2 gigapixel camera and three very large mirrors. In doing this, it will produce stupefying amounts of data -- somewhere around 100 petabytes, or 100,000 terabytes, over the course of its survey. So imagine 100,000 terabyte disk drives, or over 2 million two-sided Blu-ray disks. Mind, the thing hasn't been built yet, but two of its three mirrors have been cast, which is a reasonable indication people are serious. Even if it's never finished, there are other sky surveys in progress, for example the Palomar Transient Factory.

Got a snazzy 100 gigabit ethernet connection? Great! You can transfer the whole dataset in a season -- start at the spring equinox and you'll be done by the summer solstice. The rest of us would have to wait a little longer. My not-particularly-impressive "broadband" connection gets more like 10 megabits, order-of-magnitude, so that'd be more like 2500 years, assuming I don't upgrade in the meantime and leaving aside the small question of where I'd put it all.
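Those two estimates are easy to check. A minimal sketch, assuming a decimal 100 petabytes and a link that runs flat out with no overhead:

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.25

DATASET_BITS = 100e15 * 8   # 100 petabytes, expressed in bits

def transfer_seconds(bits, bits_per_second):
    """Seconds to move the given number of bits at a sustained rate."""
    return bits / bits_per_second

# 100 gigabit ethernet.
fast_days = transfer_seconds(DATASET_BITS, 100e9) / SECONDS_PER_DAY

# A 10 megabit home connection.
slow_years = transfer_seconds(DATASET_BITS, 10e6) / SECONDS_PER_DAY / DAYS_PER_YEAR

print(f"100 Gbit/s: {fast_days:.0f} days")
print(f"10 Mbit/s: {slow_years:,.0f} years")
```

The fast case lands at about three months and the slow one at roughly two and a half millennia.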

Nonetheless, the LSST's mammoth dataset is well within reach of crowdsourcing, even as we know it today:

Galaxy Zoo claims that 250,000 people have participated in the project. Many of them are deadbeats like me who haven't logged in for ages, but suppose there are even 10,000 active participants.
The LSST is intended to produce its data over ten years, for an average of around 2-3Gbps. Still fairly mind-bending -- about a thousand channels worth of HD video, but ...
Divide that by our hypothetical 10,000 crowdsourcers and you get 200-300Kbps, not too much at all these days. That works out to roughly 3GB a day, which each crowdsourcer could download in under an hour in the middle of the night or spread out through the day without noticeably hurting performance.
Assuming you kept all the data, 100 petabytes split 10,000 ways is 10 terabytes each over ten years, so you'd need a new terabyte disk about once a year. That's not prohibitive either.
The hard part is probably uploading a steady stream of 2-3Gbps (bittorrent wouldn't help here, since each recipient gets a unique chunk of data). As far as I can tell the bandwidth is there, but at that volume I'm guessing the cost would be significant.
In reality, there would probably be various reasons not to ship out all the raw data in real time, but instead send a selection or a condensed version.
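The arithmetic behind the list above can be sketched as follows. The 100 petabytes, ten years, 10,000 participants and 10 megabit home connection all come from the post; even splitting and a perfectly steady stream are simplifying assumptions.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY

dataset_bytes = 100e15      # 100 petabytes over the whole survey
survey_years = 10
participants = 10_000       # the hypothetical active crowd
home_link_bps = 10e6        # a roughly 10 megabit home connection

# Average rate the survey produces data, in bits per second.
total_bps = dataset_bytes * 8 / (survey_years * SECONDS_PER_YEAR)

# Each participant's even share of that stream.
per_person_bps = total_bps / participants

# One day's worth of that share, and how long it takes to fetch at home.
daily_chunk_bytes = per_person_bps / 8 * SECONDS_PER_DAY
download_minutes = daily_chunk_bytes * 8 / home_link_bps / 60

print(f"total: {total_bps / 1e9:.2f} Gbit/s")
print(f"per person: {per_person_bps / 1e3:.0f} kbit/s")
print(f"daily chunk: {daily_chunk_bytes / 1e9:.1f} GB")
print(f"download time: {download_minutes:.0f} minutes")
```

Changing `participants` or `home_link_bps` shows how quickly the per-person burden shrinks as the crowd grows.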

Bottom line, it's at least technically possible with today's technology, to say nothing of that available when the LSST actually goes online, to distribute all the raw data to a waiting crowd of amateur astronomers.

Wikipedia references a 2007 press release saying Google has signed up to help. As usual I don't know anything beyond that, but it does seem like a googley thing to do.
