What is data science? | Opensource.com
Data science is a broad field that's expanding rapidly across many industries. Here's what you need to know about it.
Data science is a branch of computer science dealing with capturing, processing, and analyzing data to gain new insights about the systems being studied. Data scientists deal with vast amounts of information from different sources and in different contexts, so the processing they must do is usually unique to each study, combining custom algorithms, artificial intelligence (AI), machine learning, and human interpretation. It's a broad field that's expanding rapidly across many industries, including medicine, astronomy, meteorology, marketing, sociology, visual effects, and many more.
Why is data science important?
Science is based on gathering evidence and interpreting that evidence to draw logical conclusions. This principle has served civilization well enough to enable trans-Atlantic flights, telephony, disease treatments, landing rovers on the surface of Mars, and much more. In the modern world, more data is being gathered than ever before: data about lifestyle habits, dietary preferences, music choices, purchasing habits, energy consumption, weather systems, migratory patterns, seismic activity, flight times, and so much more. Computers are everywhere, so there's almost constant input into a pool of big data.
That's more information about the world around us than we've ever had access to, and it's spread across a wider sample set than ever. Analyzing large data sets can lead to surprising revelations. Sometimes patterns and correlations are found in places not previously expected or that had only been theorized before. Observing and analyzing the environment is important for humans to learn, grow, and become a better-informed species. A lot of data science is applied to frivolous pursuits—and sometimes ethically questionable ones—but there is just as much analysis happening around worthwhile, healthy, and helpful causes that open source should be proud to support.
And it turns out that open source software is vital to the growth and development of data science.
Because of the vast amount of data that data science analyzes, the field requires a solid computing infrastructure. The datasets involved in serious data science are often too large to process on a single machine or even a small cluster, so hybrid clouds are used to store and process information and to make correlations among what's been parsed. This means that a data scientist's toolbox includes a platform like OpenShift for running processing services, distributed computing software like Apache Hadoop or Apache Spark, a distributed file system like Ceph or Gluster for scalable and highly available storage, and so on. A data scientist's job is as much about statistics and math as it is about programming and computer engineering.
What does a data scientist do?
A data scientist gathers data, parses and normalizes it, and then creates routines for a computer to run on the data in search of a pattern, trend, or just a helpful visualization. For instance, if you have ever created a pie chart or bar graph from the fields of a spreadsheet, then you've acted as a low-level data scientist by interpreting a dataset and visualizing the data to help others understand it.
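The gather, parse, normalize, and visualize loop described above can be sketched with nothing but Python's standard library. The survey data, the column names, and the `normalize`, `count_languages`, and `text_bar_chart` helpers are all made up for illustration. A real project would more likely read a file and draw the chart with a plotting library.

```python
import csv
import io
from collections import Counter

# Hypothetical raw data: a tiny survey with inconsistent spacing and capitalization.
RAW_CSV = """name,favorite_language
Ana, Python
Ben,python
Chloe,R
Dev,  JULIA
Eli,Python
"""

def normalize(value):
    """Trim whitespace and unify capitalization so equal answers compare equal."""
    return value.strip().lower()

def count_languages(raw_csv):
    """Parse the CSV text and count each normalized language."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    return Counter(normalize(row["favorite_language"]) for row in reader)

def text_bar_chart(counts):
    """Render the counts as a plain-text bar chart, most common first."""
    lines = []
    for label, count in counts.most_common():
        lines.append(f"{label:<8}{'#' * count} ({count})")
    return "\n".join(lines)

counts = count_languages(RAW_CSV)
print(text_bar_chart(counts))
```

Without the normalization step, "Python", " Python", and "python" would be counted as three different answers, which is why cleaning usually comes before any analysis or chart.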
When data is being analyzed for patterns, there's no way to tell a computer what to look for (because "what to look for" hasn't been found yet). While AI and machine learning can scour vast datasets to find arbitrary patterns, it takes human ingenuity to look for the irrational and interpret what's found. That means data scientists must be able to design custom routines with programming languages like Python, R, Scala, and Julia. They must be familiar with important libraries, like Beautiful Soup, NumPy, and Pandas, so they can scrape, sanitize, and organize data. They need to be able to version-control and iterate upon their code so they can refine the way they look at data as they come to understand the relationships they discover.
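As a rough sketch of what "scrape, sanitize, and organize" can mean in practice, the example below pulls a small table out of an HTML fragment. It uses the standard library's `html.parser` as a stand-in for Beautiful Soup so that it runs without extra packages. The HTML fragment, the `CellCollector` class, and the `scrape_temperatures` function are hypothetical names invented for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; in practice this text would come from a downloaded page.
HTML = """
<table>
  <tr><th>City</th><th>Temp (C)</th></tr>
  <tr><td> Oslo </td><td>4.5</td></tr>
  <tr><td>Lima</td><td> 19 </td></tr>
  <tr><td>Cairo</td><td>n/a</td></tr>
</table>
"""

class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data

def scrape_temperatures(html):
    """Scrape the table, sanitize each cell, and organize the result as a dict."""
    parser = CellCollector()
    parser.feed(html)
    temperatures = {}
    for row in parser.rows:
        if len(row) != 2:
            continue  # the header row holds <th> cells, so it arrives empty
        city, raw_temp = (cell.strip() for cell in row)
        try:
            temperatures[city] = float(raw_temp)
        except ValueError:
            continue  # drop readings that are not numbers, such as "n/a"
    return temperatures

print(scrape_temperatures(HTML))
```

Silently dropping the unparseable reading is only one possible design choice. Depending on the study, it may be better to record such rows separately so that missing data can be counted and reported.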
How to start learning data science
Data science is a career, so you can't learn everything you need to know in a year or two of study and call yourself a data scientist. Instead, start studying now, maybe on your own or maybe through formalized training, and then apply what you've learned in a real-world situation. Repeat that process until you have either solved all of the world's problems or retired.
Fortunately, data science is largely driven by open source software that is freely available to everyone. A good first step is to try a Linux distribution, as it can serve as a good platform for your work. Linux is an open source operating system, so it's not only free to use but also uncommonly flexible, making it ideal for a field known for its constant need to adapt. Most Linux distributions also ship with Python, which is a leading language in data science today. The NumPy and Pandas libraries are specifically designed for number crunching and data analytics, and their documentation is very thorough.
As is often the case, though, one of the greatest struggles when learning a new language or library is finding a way to apply the tools to something in your life. In data science, unlike in many other disciplines, even a negative result is a useful answer. You can apply the principles of data science to any set of data. At worst, you'll discover that there's no correlation between two sets of data or that there's no pattern in a seemingly random event. But that's valid research, so not only will you have learned about data science, you'll also have tested a hypothesis.
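Checking whether two sets of data are correlated is a small enough task to try right away. The sketch below computes the Pearson correlation coefficient by hand, where a value near 1 or -1 suggests a strong linear relationship and a value near 0 suggests little or none. The `pearson` function is written for this illustration, and the three lists are made-up numbers, not real measurements.

```python
import math

def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of the same length, with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a sample never varies")
    return covariance / (spread_x * spread_y)

# Made-up numbers for illustration only.
hours_of_sunshine = [2, 4, 6, 8, 10]
ice_creams_sold = [12, 19, 31, 42, 48]
umbrellas_sold = [9, 3, 7, 2, 8]

print(round(pearson(hours_of_sunshine, ice_creams_sold), 2))
print(round(pearson(hours_of_sunshine, umbrellas_sold), 2))
```

With only five points per list, neither number proves anything on its own. The point is that a weak coefficient is as legitimate a finding as a strong one.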
Thanks to the influence of open source, open data sets are easy to find. There are data sets available from Data.gov, the World Bank, Google (including data from NASA, GitHub, the US Census, etc.), and many more. These are excellent resources you can use to learn how to scrape the web for data, parse it into a format you can easily process, and analyze it with specialized libraries.
Why use Python for data science?
You can use several different languages for data science, but Python is one of the most popular. Nearly any language is capable of analyzing data, but some languages and libraries are designed with data analysis in mind; for instance, the NumPy library provides tools for processing matrices so that you don't have to write a matrix library on your own.
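To give a sense of what NumPy's matrix tools save you from writing, here is a minimal sketch that assumes NumPy is installed. The store and product figures are invented for the example, and the variable names are arbitrary.

```python
import numpy as np

# Hypothetical data: units sold per product (columns) in each store (rows).
units_sold = np.array([
    [10, 4, 0],
    [3, 8, 5],
])

# Hypothetical price of each product, in the same column order.
prices = np.array([2.5, 4.0, 10.0])

# Matrix-vector product: one revenue figure per store.
revenue_per_store = units_sold @ prices

# Transposes and per-column sums are built in as well.
units_by_product = units_sold.T
total_units_per_product = units_sold.sum(axis=0)

print(revenue_per_store)
print(units_by_product.shape)
print(total_units_per_product)
```

Each of these operations would otherwise need its own nested loops and bounds checks, which is the kind of work a purpose-built library takes off your hands.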
Julia and Jupyter
Python isn't the only language capable of analyzing data, and some others arguably surpass it in particular areas. The Julia language is popular with data scientists for its focus on performance and data visualization. Julia's popularity was noticed by the developers of IPython, an interactive computing environment for Python. In 2014, they spun off the language-independent parts of the project under the name Jupyter, an intentional reference to Julia, Python, and R.
Today, Jupyter notebooks are used for interactive computing, which gives data scientists instant feedback (both as code output and as visualizations) while they code. Viewing someone's Jupyter notebook can be a multimedia experience, with documentation, source code, and results all in the same place. It's a powerful tool, but it's easy enough to get started with even if you're just learning to code.
Data science and the future
As computers continue to proliferate, available data grows. If you're the sort who wants to understand how the world works, there's no better way to start than data science. And whatever you do in data science, remember to keep it open so everyone benefits.