Working with data differs from a typical software development workflow. When you get your hands on a dataset for the first time, you often have little idea what to expect from it, especially if you were not the one who collected the data. Are there missing values? Are there duplicate records? Is there erroneous data? Are there outliers or anomalies? To figure out whether the data can be useful, you need to explore it first, and there’s no single exploration path that will work for all datasets.
One tool that can help you iteratively explore, visualize, and get familiar with your data is Jupyter Notebook.
In this tutorial, you’ll learn what Jupyter Notebook is, what the pros and cons of working with it are, how to install it and set up a Kotlin kernel, and how to start exploring and visualizing your data with it.
So what is Jupyter Notebook, anyway?
The Jupyter Notebook is an open-source web application that allows you to create and share documents (often referred to simply as “notebooks”) that can contain code, visualizations, and Markdown text. Notebooks consist of cells that you can run one at a time, and you can combine different types of cells in a single notebook. For instance, you can complement your research code with narrative and equations.
Jupyter Notebook is an extremely popular tool in the Data Science community. It is quite versatile and can serve many purposes: teaching, data wrangling and analysis, creating reports for business stakeholders, training Machine Learning models, and so on. There are a number of reasons for Jupyter’s popularity:
- The interactive environment is convenient for quick experiments: whether you’re trying out a library, familiarizing yourself with a new data set, or comparing several machine learning algorithms, you can iterate through your ideas quickly in a notebook.
- It’s easy to set up, and you can spin up Jupyter Notebook either locally, or on a powerful remote machine to run some resource-hungry computations.
- Jupyter makes it easy to visualize data: explore the distribution of your data, find outliers, plot trends, explore correlations, etc.
- Notebooks make it easy to share your findings with colleagues.
- Finally, Jupyter supports multiple programming languages: Python, R, Scala, Java, and now Kotlin.
While Jupyter Notebook is great for a good number of tasks, no tool is perfect. It is important to understand its limitations as well as its strengths. Here are a few things to be aware of when using Jupyter Notebook:
- Jupyter allows you to execute cells out of order, which quickly makes things messy.
- Most common software engineering practices, like testing, debugging, and version control, are rather difficult in Jupyter Notebook, which makes it ill-suited for production code.
- If you’re used to advanced IDE functionality like navigation, error analysis, code completion, documentation lookup, and so on, most Jupyter kernels offer only a small subset of it, which can be frustrating.
In other words, just as with any tool, when used for the right purpose Jupyter Notebook can be a great addition to your tool arsenal.
If you are looking to explore a data set, build up your data science and machine learning skills, or play with some new libraries, but don’t want to leave the comfort of the JVM world in exchange for dynamic Python, I’ve got some great news for you! You can do all of those things using the Kotlin Kernel for Jupyter Notebook.
Getting up and running
To set up Jupyter Notebook itself, you first need to have Python installed on your machine. The easiest way to get started is then to install the Anaconda distribution, which includes Jupyter Notebook.
For the Kotlin kernel to work, make sure you have Java 8 or later installed. Once you do, add the Kotlin Kernel with the following command:
conda install -c jetbrains kotlin-jupyter-kernel
Start Jupyter Notebook with:
jupyter notebook
Once Jupyter Notebook starts and opens in your browser, you are all set to start creating new notebooks with the Kotlin kernel.
Working with a Kotlin notebook
Now, let’s see what you can do in a Jupyter Notebook with a Kotlin Kernel.
Kotlin Kernel supports a number of libraries commonly used for working with data, such as krangl, Spark, kmath, Exposed, deeplearning4j, and more. You can start using these libraries by adding a simple line magic:
%use [library1]
For this tutorial, I’m going to use the krangl library for working with tabular data, and lets-plot, a visualization library:
%use krangl, lets-plot
Now we need data. You can find many datasets to play with on Kaggle, like this one about ramen ratings.
Unlike most kernels, the Kotlin Kernel gives you code completion as you type.
It can even help you with file paths.
Once we’ve read the data into a DataFrame, we can start poking it to see what it looks like. A couple of things I typically look at first are the schema and a few rows.
Already at this point we can notice that, for some reason, the ratings themselves were inferred to be of type String. That might be due to some weirdness in the data itself. When exploring various datasets, you’ll encounter all sorts of strange things. Some of them are easy to fix, as in this case. It turns out that three records have a rating of “Unrated”, and since there are so few of them, it’s easier to just drop those records.
We can now confirm that the data has been fixed and, say, check what the average ramen rating is.
Of course, we could have performed these kinds of manipulations even without a Jupyter Notebook, but here’s where Jupyter really shines: we can visualize the data.
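As a rough illustration of the cleanup and the average described above, here is a minimal sketch in plain Kotlin. It deliberately uses only the standard library rather than krangl, and the `RamenRow` type, the `numericRatings` helper, and the sample rows are all made up for this example; the real dataset has more columns and far more records.

```kotlin
// Hypothetical record type standing in for one row of the ramen dataset.
data class RamenRow(val brand: String, val country: String, val stars: String)

// Parse each rating and drop rows whose rating is not a number (e.g. "Unrated").
fun numericRatings(rows: List<RamenRow>): List<Double> =
    rows.mapNotNull { it.stars.trim().toDoubleOrNull() }

fun main() {
    // Illustrative rows only, not taken from the actual dataset.
    val rows = listOf(
        RamenRow("BrandA", "Japan", "3.5"),
        RamenRow("BrandB", "Japan", "Unrated"),
        RamenRow("BrandC", "Thailand", "4.5")
    )
    val ratings = numericRatings(rows)
    println("Kept ${ratings.size} of ${rows.size} rows")
    println("Average rating: ${ratings.average()}")
}
```

Using `toDoubleOrNull` means any non-numeric rating is dropped, not only “Unrated”, so it is worth checking how many rows disappear before trusting the average.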
Certain things, such as the distribution of the data, are much easier to understand from a plot than from simple stats. Let’s see what the ratings distribution looks like.
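If you want a feel for what a distribution plot computes, the binning behind a histogram can be sketched in plain Kotlin. This is not the lets-plot call used in the notebook, only a text stand-in; the `ratingHistogram` helper, the half-star bin width, and the sample ratings are assumptions made for illustration.

```kotlin
import kotlin.math.floor

// Count how many ratings fall into each bin of the given width.
// Keys are the lower edge of each bin, in ascending order.
fun ratingHistogram(ratings: List<Double>, binWidth: Double = 0.5): Map<Double, Int> =
    ratings
        .groupingBy { floor(it / binWidth) * binWidth }
        .eachCount()
        .toSortedMap()

fun main() {
    // Illustrative ratings only, not taken from the actual dataset.
    val ratings = listOf(3.5, 3.75, 4.0, 4.25, 5.0, 2.0)
    for ((bin, count) in ratingHistogram(ratings)) {
        // One '#' per rating in the bin gives a crude text histogram.
        println("%.1f | %s".format(bin, "#".repeat(count)))
    }
}
```

A plotting library does the same counting for you and draws bars instead of `#` characters, which is why a single plot cell is usually enough in the notebook.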
Of course, you may have questions about your data that require some manipulation, such as grouping. For example, let’s find out how many unique ramen brands there are per country.
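The grouping question above boils down to “group by country, then count distinct brands”. Here is a minimal sketch of that logic using only the Kotlin standard library rather than krangl; the `RamenEntry` type, the `uniqueBrandsPerCountry` helper, and the sample rows are hypothetical.

```kotlin
// Hypothetical record type: only the two columns this question needs.
data class RamenEntry(val brand: String, val country: String)

// Group rows by country, then count the distinct brand names in each group.
fun uniqueBrandsPerCountry(rows: List<RamenEntry>): Map<String, Int> =
    rows.groupBy({ it.country }, { it.brand })
        .mapValues { (_, brands) -> brands.distinct().size }

fun main() {
    // Illustrative rows only, not taken from the actual dataset.
    val rows = listOf(
        RamenEntry("BrandA", "Japan"),
        RamenEntry("BrandA", "Japan"),
        RamenEntry("BrandB", "Japan"),
        RamenEntry("BrandC", "Thailand")
    )
    uniqueBrandsPerCountry(rows)
        .entries
        .sortedByDescending { it.value }
        .forEach { (country, count) -> println("$country: $count") }
}
```

Counting distinct brands instead of rows matters here: a brand with many reviewed products would otherwise inflate its country’s total.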
These were just a few examples to give you a taste of what you can do with Jupyter Notebook and Kotlin Kernel. As every data wrangler’s path is different, I would encourage you to grab a dataset that interests you and explore it!
The post Exploring data using Jupyter Notebook and Kotlin appeared first on JAXenter.