Recently, I was reading an article on The Wall Street Journal Online that discussed “data scraping”: collecting information without the consent of the creators of the data. According to Wikipedia, data scraping is a technique in which a computer program extracts data from human-readable output coming from another program.
My first reaction was that I am familiar with that technique, but haven’t given it much active thought over the last several years. When I worked for a company that did CRM software development back in the early ’90s, we did something similar. As part of the design of the application I led, we wanted to pre-populate look-up tables with various entries. Simply put, if a field was designated to be used for ‘Country’, for example, we wanted to make sure all of the countries were pre-populated. Very straightforward. However, to pre-populate the ‘Title’ field look-up table, for instance, or the ‘First Name’ field, we read the text of a computerized directory, parsed it, and then indexed it to come up with a list of values that we could import into the respective look-up table. No great shakes.
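For the curious, the parse-then-index step can be sketched in a few lines of Python. This is only an illustration, not the code we wrote back then: the function name `build_lookup_values`, the `min_count` parameter, and the assumed directory layout of one `Last, First; Title` entry per line are all hypothetical.

```python
from collections import Counter


def build_lookup_values(directory_text, min_count=1):
    """Collect distinct 'First Name' and 'Title' values from directory text.

    Assumes (hypothetically) one entry per line in the form
    'Last, First; Title'. Lines that do not fit that layout are skipped.
    """
    first_names = Counter()
    titles = Counter()

    for line in directory_text.splitlines():
        if ";" not in line:
            continue  # no title separator, not an entry
        name_part, title = line.split(";", 1)
        if "," not in name_part:
            continue  # no 'Last, First' separator, not an entry
        _last, first = name_part.split(",", 1)

        # Collapse runs of whitespace so near-duplicates index together.
        first_names[" ".join(first.split())] += 1
        titles[" ".join(title.split())] += 1

    def keep(counter):
        # Keep non-empty values seen at least min_count times, sorted.
        return sorted(v for v, n in counter.items() if v and n >= min_count)

    return {"First Name": keep(first_names), "Title": keep(titles)}


sample = (
    "Smith, John; Sales Manager\n"
    "Doe, Jane; CEO\n"
    "this line is not an entry\n"
    "Brown, John;  Sales   Manager\n"
)
lookups = build_lookup_values(sample)
```

Raising `min_count` is one simple way to keep one-off typos in the directory out of the look-up table; each resulting list could then be imported into its respective table.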
More recently, back in 2003, TriAxis used a company that could take PDF-based directories and import the text into database fields. I still see them out there positioning their product as a way to glean data off the web for acquiring more prospects.
My second, and near immediate, reaction was one of violation. When a company surreptitiously goes onto sites in order to ‘scrape’ data, without permission or regard for whom the data belongs to or where it is obtained from, then I think that should be dealt with swiftly and decisively. The fact that they don’t ask if they can do this says they, corporately, would rather “beg forgiveness than ask permission”. We are too lenient about this, which is why companies don’t fear reprisal and continue to push the limits on data acquisition. Google still argues that its Wi-Fi scraping is “… for the good of its customers …”, although it claims it has stopped that practice. That type of mentality is reminiscent of why we have Obamacare, but that is not for this column…
So, we have technology that can be misused. It’s all about having information that can be used to an advantage, and the collecting and storing of all types of data will continue because the technology exists to do so. It makes me think of the ‘Jurassic Park’ quote by Dr. Ian Malcolm, “Yeah, but your scientists were so preoccupied with whether or not they could, they didn’t stop to think if they should.”
I believe the question of whether we should or should not needs to be decided up front, long before anyone goes ahead simply because they can.
I’m not really sure where I’m going with this; maybe this is a bit of a rant. But I will conclude that, as long as the possessor of information sees power in possession, these tactics will find new ways to proliferate. And we storage guys will continue to look for new ways to store, manage and protect all of that data.
