This blog post was originally published at Digica’s website. It is reprinted here with the permission of Digica.
When you want to take the best possible care of your brain, certain things are recommended, such as fatty fish, vegetables, doing some brain exercises, and learning new things. But what about artificial neural networks? Of course, they don’t need human food, but they need the best quality and quantity of data. For this purpose you need to ask yourself exactly what task you want to solve, and where can you find relevant datasets. Are you searching datasets that apply to specific tasks like “medical image segmentation” or “entity recognition”? Or maybe you are searching datasets in a particular field of science, such as ion spectrometry or ECG signals.
Here are some possible sources of the best datasets for your machine learning project:
Google dataset search [https://datasetsearch.research.google.com/]
In 2018, Google launched a search engine for datasets. From my perspective as Data Scientist, it was a game changer. Finally, I have a dedicated web search engine for datasets. You can search through datasets based on their description on the relevant webpages. Many of the datasets have descriptions of measurement processes and content. Also, you have direct links to the datasets and information about the type of dataset (.zip, .xml, .pdf, .csv, etc.). Some of the datasets even have information about the type of license that applies to the dataset. As you know, license information is crucial, especially in commercial projects.
I would say that every Data Scientist knows of this website, and has visited it at least. You may have heard of Kaggle as a platform where you can compete against others in Data Science problems. But have you ever thought about it as a good source of datasets?
On Kaggle you can search datasets based on main categories such as Computer Vision or Natural Language Processing, and also based on file types, size of datasets or type of licenses. If an interesting dataset is in table format, you can look into a summary of each column to see, for example a histogram of numerical values or a list of unique strings. You can even do basic filtering in this dataset or have a summary of samples number per class, so that you can check if your potential dataset is balanced or not.
UCI Machine Learning Repository [https://archive.ics.uci.edu/ml/index.php]
On this website, you can search datasets based on a number of criteria: type of task (classification, regression, clustering), type of data attributes (categorical, numerical or mixed), data type (text, sequential, time-series, multivariate) or area of datasets. Each dataset has a description and links to be able download the dataset. Many of the datasets published on this website were already used in scientific research, which means that they have high quality descriptions and also have a list of papers where they are mentioned. This is useful because, with this information, you can check whether someone has already done what you want to do. Unfortunately, there are only 622 datasets there.
Papers with code [https://paperswithcode.com/datasets]
For me, this is one of the best web pages for every Data Scientist and I’m excited about explaining to you why I feel that way. If you want to do some quick research on a specific topic, for example, image semantic segmentation, you can search this web page by using an existing tag or searching in the browser. You can search papers with an implementation that is stored in GitHub along with the datasets on which any models were trained. For the most common task, you have a plot with summary results and a list of the top five solutions with links to the repository and the name of the deep learning framework (usually, PyTorch and TensorFlow).
So, on this website, when you’re looking for the best dataset, you can also search for the best architecture for the problem that you want to solve. I highly recommend this website if you are doing a research task and have a summary of existing methods with implementations.
Hugging face [https://huggingface.co/docs/datasets/index]
Hugging Face is the platform with datasets, pre-trained models and even “spaces” where you can check how the solution works on some examples. Some of the solutions presented on the platform have access to the GitHub implementation. Also, there are some tutorials and how-to guides, which makes this platform appropriate for less experienced data scientists. I can especially recommend this website for projects related to NLP and audio processing because there are so many datasets for multiple languages. However, if you are more interested in computer vision, then you can also find interesting things on this website.
Create your own dataset
Sometimes you want to use data from specific web services, for example, Spotify, or websites such as social media, for example, Twitter. Some of these services or sites have an API, so you only need create a developer account, and download the relevant data. Using an API usually requires you to have authorisation, in most cases, OAuth, which gives you, as a developer, the required access token, access secret, etc. In Python, you also have special packages dedicated to specific platforms, such as, for example, Spotipy (to access Spotify), Tweepy, (to access Twitter), or python-facebook-api. Those packages have built-in functions for a specific type of request, like “Find account names which contain the specific substring”, or “Get 10 most popular tweets about football”. For each API you need to review the documentation to be sure that you get the data which you want. You also need to remember to preprocess the data before using it in your machine learning and deep learning model.
If the service from which you want to get data does not have its own API, or you just want to take data from, for example,. a shopping website you can do, so-called,web scrapping. For that, you need to know a little about HTML, and it helps if you can write your own regular expressions. Web-scraping is probably more time-consuming than the other ways describe above in this article, but sometimes web-scraping is the only possible way to get the data that you want. For web scraping in Python, I recommend using the Beautiful Soup package.
Summing up, as a Data Scientist you have access to many sources of datasets. Some sources contain well-prepared, high quality datasets, whilst other datasets are more challenging to source, but you often have more control over the latter sources of data.You will have to choose the strategy which works best in a given context.I hope that this article gives you the information that you need to take an informed decision about the best approach.
Data Scientist, Digica