KT Companion: Bonus workshop questions: Week 1

Reasoning about data is crucial for Knowledge Technologies.

Consider the bar chart showing the projected growth of structured and unstructured data in the article "Data Science and Prediction" by Vasant Dhar. (The article is linked in the readings on the LMS; the graph was also reproduced on the slide "Some Data about Data" from Lecture 1, under the Fair Use provision of the Australian Copyright Act.)

a) The height of the bars indicates the total capacity of archived data for the years 2008-2015; there seems to be an exponential increase. Do you expect this to continue? Why or why not?

b) The share of the total archived data belonging to databases has decreased over the years in question: in 2008, this was about 13% of the total data; in 2015, it is projected to be less than 11%. Why do you think this is happening?

(Answers below the fold.)
a)

The greatest shortcoming of the human race is our inability to understand the exponential function. - Albert A. Bartlett

I don't even need to look at the graph to tell you that the exponential growth will continue in the short term but plateau in the long term. You might wish to familiarise yourself with Moore's Law to see both why this is happening and why it will stop.
More fundamentally, no exponential growth can continue forever: the numbers just get too large. [citation needed] If I fit an exponential curve to the given data, I find that the year-over-year increase is roughly 50%. A 1TB hard disc today weighs maybe 500g; even allowing for fairly extreme improvements in storage density, the storage media required to hold the data would weigh as much as the Earth by 2125, and as much as all the matter in the observable universe by 2285!
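If you'd like to sanity-check that last jump, here is a minimal Python sketch. It only uses the 50% growth rate quoted above; the two masses are approximate figures I am assuming for illustration (they are not from the article), and the mass of matter in the observable universe in particular is only an order-of-magnitude estimate. The nice thing is that the gap between the two dates depends only on the ratio of the two masses and the growth rate, not on how much data we start with or how dense the discs are.

```python
import math

GROWTH_RATE = 0.50        # year-over-year growth quoted in the answer above
EARTH_MASS_KG = 5.97e24   # assumption: approximate mass of the Earth
UNIVERSE_MASS_KG = 1e53   # assumption: order-of-magnitude estimate for the
                          # ordinary matter in the observable universe

def years_to_grow(factor, rate=GROWTH_RATE):
    """Years for a quantity growing by `rate` per year to increase by `factor`."""
    return math.log(factor) / math.log(1.0 + rate)

# Time to go from "as heavy as the Earth" to "as heavy as the universe".
gap = years_to_grow(UNIVERSE_MASS_KG / EARTH_MASS_KG)
print(f"Earth-mass to universe-mass: about {gap:.0f} years")   # about 160 years

# For comparison: how long each doubling takes at this rate.
print(f"Doubling time: about {years_to_grow(2):.1f} years")    # about 1.7 years
```

Under these assumptions the gap comes out at roughly 160 years, which matches the distance between 2125 and 2285 above. Twenty-eight orders of magnitude sounds like a lot, but at one doubling every 1.7 years or so it goes by quickly.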

b) We actually need to look at the graph for a few seconds for this question. Observe that the scale is in petabytes (1000s of terabytes). I don't think it's an exaggeration to describe this as lots of data.
Generally speaking, for a database to be useful, it needs to be designed and maintained by a human with some technical knowledge. [citation needed] What's the problem? Well, we're producing (or, perhaps more accurately, storing) data much more quickly than we're producing people who have the time to organise that data into databases! (This observation has been made in a variety of places; see, for example, Grossman et al. (2001).)

References
Dhar, Vasant. (2013). "Data Science and Prediction". Communications of the ACM, 56(12), pp. 64-73. doi:10.1145/2500499
Grossman, Robert, Chandrika Kamath, Philip Kegelmeyer, Vipin Kumar and Raju Namburu. (2001). Data Mining for Scientific and Engineering Applications. Norwell, USA: Kluwer.

Tuesday, 29 July 2014