Wednesday, December 11, 2013

"So Much Data, So Little Time": Data Mining and What to Do with the Data

This is a topic that I have thought about on and off for 20 years or so, but with the release of material from Edward Snowden about NSA spying on us I thought I would comment on data mining, particularly BIG data mining and what to do with the data obtained.

What is data mining? It's where some person or group collects large amounts of data from the internet and then searches this data for whatever information they want to study. I first became aware of this around 1992-93 when I was involved with my first webpage. This was when the web was first starting to run (I only knew of the browser Mosaic, whose developers later went on to build Netscape). What was happening was that search engines (Yahoo; Google didn't exist then) sent out little programs, I can't remember what they were called, that looked for websites and collected information from them so that particular websites came up when you did a web search. When you search on a topic you might ask yourself, "How does the search engine know what sites out in internet land are related to my topic?" Well, it's because the search engine did data mining of the internet to locate websites and obtain information from them. This was then analyzed so that you got websites that were related to the topic you were curious about. So the search engines were mining the internet looking for information that they could present to you when you did a search. How they went about sorting through all of this information to find you the websites you wanted is what made Jerry Yang, and the other founders of Yahoo, very wealthy men.
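
To make the idea concrete, here is a minimal sketch in Python of how collected pages might be turned into something searchable. It is only an illustration under my own assumptions, not how Yahoo or any real search engine did it: the page addresses and text in `PAGES` are made up, and the names `tokenize`, `build_index`, and `search` are hypothetical. The technique shown is usually called an inverted index, a table that maps each word to the pages containing it.

```python
import re
from collections import defaultdict

# Hypothetical pages standing in for what a crawler might have collected.
PAGES = {
    "http://example.com/fishing": "Trout fishing tips for lake and river fishing",
    "http://example.com/lakes": "A guide to every lake in the county",
    "http://example.com/mining": "Data mining finds patterns in large data sets",
}


def tokenize(text):
    """Split text into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page addresses that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every word in the query, sorted by address."""
    words = tokenize(query)
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)


INDEX = build_index(PAGES)
print(search(INDEX, "lake fishing"))
```

The expensive work (reading every page) happens once in `build_index`; each search afterward is just a few table lookups. A real search engine also has to rank the matching pages, which this sketch does not attempt.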

If you think about sorting through all of the information that the internet has and then presenting you with the information you want, that is a very difficult problem. Nowadays this is what big data mining is all about. It took the release of information about the NSA's data mining to bring the subject to everyone's attention. The NSA was and is doing huge, sweeping collections of data located on the net. They say they were interested in collecting information about terrorists, but in doing so they were and are collecting data on everyone and everything (maybe that is why the Supreme Court gives corporations and other nonhuman entities the rights that humans have). The problem is sorting through all that information to get the information that you want. Imagine wanting to catch a particular fish, yet you collect everything in the lake. You now have to sort through it all to get the fish. Now the sorter is a computer, not a human, so the programmer has to figure out how to get the computer to locate that particular fish. It is a tough problem, but the results are incredibly valuable, both intellectually and monetarily.
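
As a toy version of the fish problem, here is a small Python sketch. Everything in it is made up for illustration: the `Fish` record, the contents of `CATCH`, and the `find_fish` function are my own hypothetical names. The point is only that the programmer has to spell out, in terms the computer can test, exactly what "that particular fish" means.

```python
from dataclasses import dataclass


@dataclass
class Fish:
    species: str
    length_cm: float


# Hypothetical "everything in the lake".
CATCH = [
    Fish("perch", 18.0),
    Fish("trout", 42.5),
    Fish("bass", 30.0),
    Fish("trout", 12.0),
]


def find_fish(catch, species, min_length_cm):
    """Keep only the fish of the wanted species that are at least the given length."""
    return [
        fish
        for fish in catch
        if fish.species == species and fish.length_cm >= min_length_cm
    ]


print(find_fish(CATCH, "trout", 40.0))
```

With four fish this is trivial. With everything on the net, the hard parts are deciding which tests actually pick out what you want and running them fast enough to be useful.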

What you look at online can be very valuable to people selling stuff. Advertisers can decide which sites to buy ads on by how many hits a site gets. I remember when the web first started everyone was putting web counters at the bottom of their webpage that counted how many times their site was viewed. You can think of all the possibilities this led to in selling information about how people used the web and for what. Now the scary thing comes in. What about knowing what individuals do on the web? Imagine someone knowing all the sites that you visit and trying to figure out what you are doing. Someone knowing all the phone numbers that you call and the content of all your text messages? Knowing about your banking? Knowing all your medical history, even down to obtaining your MRI and X-ray images? How would you like it if everything that you do on the web, and everything known about you that is stored electronically, were available for someone to look at? Remember, everything attached to the web, even remotely, can be accessed. It isn't just computers and the information that they store; it's also all the sensors, monitors, and other equipment. Recently a person hacked into another person's microwave. Think about it.

However, with all the negative reporting about data mining, there are also a lot of wonderful things that can be done with data obtained from the web. This has not been publicized as much. Data from cities, such as street lights, cab locations, police and fire stations, and anything else attached to the web, can be analyzed. The use of this data to operate large cities more efficiently and effectively has been written about in a variety of articles (I will learn to put links into my posts soon). An institute in Manhattan, I believe in cooperation with NYU, headed by physicist Steven Koonin, former provost of Caltech, has been established to use the data collected in cities, particularly New York City, to continue and extend such work to make cities operate better. There are a variety of models that can use such data to better predict the future needs of a city. With all of your medical data available on the web, programs can be used to help doctors understand and diagnose the problem that prompted you to visit them. Better yet, by understanding all your medical data, including genetic information, they may be able to help cut off problems before they start.

The availability of all of this data, and the question of how to effectively mine and then interpret it, is a new area of study with tremendous possibilities. Think of all the data and what can be done with it. As we used to say years ago at one of the labs where I worked, "so much data, so little time".
