(This article appeared in Kompas KLASS, December 2015. The second part can be found here.)
Why is Google able to create a driverless car? How can the retail chain Target detect a customer's pregnancy and, at the same time, offer her a variety of pregnancy products?
This is the big data phenomenon. Since the invention of platforms for sharing stories, photos, and videos, the amount of data has risen exponentially. When combined with all the data from devices, cameras, and sensors in roads, buildings, public facilities, and factories, the volume is multiplied many times over. Your hard disk probably holds gigabytes or terabytes, but the data collected by Facebook, Google, YouTube, Twitter, and LinkedIn has already reached the petabyte level.
For a data scientist, this amount of data is like a treasure trove which, when processed, will reveal a lot of things and even change the way we work in various fields.
Data science is so powerful that the Harvard Business Review published an article titled “Data Scientist: The Sexiest Job of the 21st Century.” The claim is not without basis. In 2011, the McKinsey Global Institute predicted that the United States alone would face a shortage of nearly 200 thousand data scientists by 2018. No wonder companies often lure these scientists with huge salaries.
A Major in Data Science
This relatively new discipline sits at the intersection of computer science and statistics. The objective is to draw new understanding from very large amounts of data through quantitative analysis, in order to make decisions in various fields. Some data are structured, such as financial or demographic data, while others are unstructured, such as e-mail, videos, photos, social media, and other content. Both types of data can create benefits.
In the early stage, called datafication, data from various sources have to be “prepared” so that a computer program can read them. This phase requires expertise in computer science. The skills required include, for example, Advanced Databases, Data Warehouses, Algorithms, and programming with Python, R, and Hadoop and its various tools.
Once the data is ready, the next stage involves more statistics, optimization, and mathematical reasoning. Naturally, students are required to master Statistics for Data Science, Bayesian Decision Theory, and Predictive Analytics, as well as Probability and Data. It is the mastery of these statistical skills that will reveal the “secret” behind a huge load of data.
The benefits of big data can be very diverse. For example, in the US there are approximately 25 million people with asthma. Social media data analytics reveals that people whose asthma flares up generally tweet about it even before reaching for their inhaler.
As we all know, Twitter records the time and location of any “tweet”. By filtering tens of millions of tweets by hashtags, links, and keywords, many patients’ locations can be mapped. Thus, hospitals in the region have time to prepare physicians, asthma medication, and beds or rooms before those patients arrive.
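The filtering step described above can be pictured with a small sketch. It assumes each tweet is already available as a record with its text and a city; the sample records, the keyword list, and the function names are made up for illustration and do not come from Twitter or from any real study.

```python
from collections import Counter

# Hypothetical sample records; a real analysis would read millions of tweets.
SAMPLE_TWEETS = [
    {"text": "Can't breathe, where is my #inhaler", "city": "Dallas"},
    {"text": "Great game tonight!", "city": "Dallas"},
    {"text": "My asthma is acting up again.", "city": "Dallas"},
    {"text": "Wheezing all morning #asthma", "city": "Austin"},
    {"text": "Traffic is terrible today", "city": "Austin"},
]

# Assumed keyword list; a real one would be much longer and carefully vetted.
ASTHMA_TERMS = {"asthma", "inhaler", "wheezing"}


def mentions_asthma(text):
    """Return True if any word or hashtag in the text is an asthma term."""
    words = {word.strip("#.,!?").lower() for word in text.split()}
    return not words.isdisjoint(ASTHMA_TERMS)


def count_by_city(tweets):
    """Count asthma-related tweets per city."""
    return Counter(t["city"] for t in tweets if mentions_asthma(t["text"]))


if __name__ == "__main__":
    for city, total in count_by_city(SAMPLE_TWEETS).most_common():
        print(city, total)
```

A hospital-facing system would also have to use the time of each tweet and deal with tweets that carry no location, both of which this sketch leaves out.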
“The internet of things”
When I met Prof. Eryk Dutkiewicz, a wireless engineering expert from Macquarie University in Australia, he mentioned that 5G technology would soon be released. With this technology, not only will the volume of phone conversation data increase, but so will data from millions of sensors, cameras, and other electronic devices. In the era of the ‘internet of things’, many devices are connected to the Internet and send data. Analyzing this machine-generated data can reveal many new insights.
In the health sector, for example, data on thousands of patients are presented on a computer along with predictions. Doctors can thus find out why certain drugs are effective in one patient but not in others. Various devices attached to a patient’s body supply millions of important data points that would be impossible to interpret without data science.
Improving the Machine
With certain algorithms, some sets of data are statistically correlated with others. As the amount of data increases, the computer finds more correlations. In short, the computer gets smarter when supplied with more data. This is called machine learning. There are many applications of this principle, from simple ones to driverless cars and the Google Translate service.
In the case of Google Translate, Google does not translate a text word by word. Texts taken from various international conferences, scientific publications, and library collections are each placed side by side with their translations and stored in digital form. Every phrase and sentence is linked to its translation, and the computer then looks for correlations.
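As a toy illustration of "looking for correlations" in side-by-side texts, the sketch below counts how often each English word appears alongside each Indonesian word in four aligned phrases. This is not how Google Translate is actually built; the tiny corpus and the function names are assumptions chosen only to show the idea.

```python
from collections import Counter, defaultdict

# Hypothetical aligned phrases (English, Indonesian), invented for this sketch.
PARALLEL_TEXTS = [
    ("red house", "rumah merah"),
    ("red car", "mobil merah"),
    ("big house", "rumah besar"),
    ("big car", "mobil besar"),
]


def count_cooccurrences(pairs):
    """For each source word, count the target words seen in the same pair."""
    counts = defaultdict(Counter)
    for source, target in pairs:
        for source_word in source.split():
            for target_word in target.split():
                counts[source_word][target_word] += 1
    return counts


def best_match(counts, word):
    """Return the target word that co-occurs most often with the given word."""
    return counts[word].most_common(1)[0][0]


if __name__ == "__main__":
    counts = count_cooccurrences(PARALLEL_TEXTS)
    for word in ("red", "big", "house", "car"):
        print(word, "->", best_match(counts, word))
```

With only the first phrase pair, "red" would appear equally often with "rumah" and "merah", so the program could not tell them apart. The extra pairs settle the question, which is a small-scale version of why more data leads to better results.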
Over time, millions of texts have accumulated. Computers are getting increasingly smart and able to produce better translations. That is why machine learning is one of the main subjects in data science. One day, Google may be able to produce accurate translations, including translations of conversations, and replace the profession of translator, similar to what will befall many other professions because of technology.
Data science can be used for optimization, for example in designing the most efficient marketing campaigns. It also enables predictive analytics, such as predicting events or anticipating future demand for certain goods.
With data science, we can understand customer behavior more deeply. For example, Amazon.com has successfully developed a system that recommends other items to purchase to each individual visitor. Even more interesting is the ability of data science to detect financial fraud, and even to drive a car without a driver, as Google has done.
One large insurance company, Aviva, measures an applicant's insurance risk by lifestyle. The data come from the applicant's hobbies, the web pages he visits, how often he watches TV, what shows he watches, his estimated income, and more. This way, Aviva spends only $5 per applicant. If the company used a blood test and a urine sample instead, it would have to bear a cost of $125 per customer. What incredible savings!
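The size of the savings follows directly from the two figures quoted above. The short calculation below uses only those figures; the number of applicants is a hypothetical value picked to show how the difference adds up.

```python
# Costs per applicant as quoted in the article, in US dollars.
LIFESTYLE_DATA_COST = 5
LAB_TEST_COST = 125

# Hypothetical number of applicants, used only for illustration.
APPLICANTS = 10_000


def saving_per_applicant():
    """Dollar difference between lab testing and lifestyle-data screening."""
    return LAB_TEST_COST - LIFESTYLE_DATA_COST


def saving_percent():
    """Saving expressed as a percentage of the lab-test cost."""
    return 100 * saving_per_applicant() / LAB_TEST_COST


if __name__ == "__main__":
    print("Saving per applicant: $", saving_per_applicant())
    print("Saving in percent:", saving_percent())
    print("Saving for all applicants: $", saving_per_applicant() * APPLICANTS)
```

In other words, the lifestyle-based approach costs $120 less per applicant, a reduction of 96 percent compared with the lab tests.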
Aviva is not the only “creeper” of personal data. Amazon, INRIX, Netflix, Target, and many others do the same thing. The biggest issue for those working in big data is confidentiality and privacy. To what extent does a company or organization have the right to play around with our personal data?
In response, Columbia University in the USA offers a Data Science Capstone and Ethics course in its Master of Science in Data Science. In this course, students apply all their knowledge to solve problems in industry, government, and the nonprofit sector. The objective of this one-semester project is to bring together statistics, computing, engineering, and social problems to find solutions to real-world problems in an ethical manner.