We are three indexes in, and the dataset problem is the hardest one we have to solve for this project. In this article, we will solve it (or at least try 🙂).
Generally, collecting data online for an ML project seems easy. There are multiple methods available to anyone, for almost any use case (at least, that is what we thought at the beginning).
We genuinely thought that for any machine learning project we could rely on:
- Web scraping
- GitHub
- Free online datasets
- …
Web scraping
This seems to be the most common way to get data online, and we wanted to do exactly that with dixto: crawl websites, collect comments, and then manually label those comments as spam or non-spam. But it is not that simple, for many reasons. First, crawling many websites is against their terms of service (or even illegal), it is a tedious process, your IP can get banned because of the volume of requests you send, and much more. In our case, the hard part is collecting the spam and labelling it by hand. With that solution, we would need a lot of computing resources and a lot of manual work to find the websites and label those comments.
So we decided not to go with that solution for now, but it can work well for many other use cases.
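If you do go down that road, a comment crawl could look like the minimal sketch below. The page URL and the `comment` CSS selector are assumptions for illustration only, not dixto's actual targets.

```python
# Minimal comment-scraping sketch (illustrative only).
# Assumptions: the URL is a placeholder and comments live in elements
# carrying a "comment" CSS class on the target page.
import requests
from bs4 import BeautifulSoup

def fetch_comments(url: str) -> list[str]:
    """Download a page and return the text of its comment elements."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return [node.get_text(strip=True) for node in soup.select(".comment")]

if __name__ == "__main__":
    # Hypothetical page; real crawling needs permission, politeness and rate limiting.
    for text in fetch_comments("https://example.com/blog/some-post"):
        print(text)  # each comment would still need a manual spam/non-spam label
```

Every comment collected this way still has to be labelled by hand, which is exactly the manual work mentioned above.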
GitHub
GitHub is a really good source for people looking for free and open-source datasets and projects. Many companies and organisations publish datasets there for free, mostly for non-profit use cases, but you can also find interesting datasets for your own use case.
For the dixto use case, we found some interesting email spam datasets on GitHub, like that one.
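Once you have located such a dataset, loading it is usually a one-liner. Here is a quick sketch assuming a CSV file hosted on GitHub; the raw URL and the label column are placeholders, not the dataset we actually used.

```python
# Loading a CSV spam dataset hosted on GitHub (sketch; URL and column are placeholders).
import pandas as pd

# The "raw" GitHub URL serves the file directly, so pandas can read it like a local path.
RAW_URL = "https://raw.githubusercontent.com/<user>/<repo>/main/spam.csv"

df = pd.read_csv(RAW_URL)
print(df.head())                    # inspect the first rows
print(df["label"].value_counts())   # spam vs. ham counts, if the file has a "label" column
```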
Free online datasets
Most of the datasets we ended up with were free datasets published online by universities, organisations and individuals. It took a couple of days of browsing the internet, but we finally found some interesting websites offering gigabytes of text content for free that can be used for almost any use case. That is a really great source for dataset research online.
More importantly, the method that helped us find all those datasets was really simple, and we thought it would be interesting to share it with you. It is mainly the content of the video index on YouTube.
Method to find free real-world datasets online
Research departments in most countries have a good budget for anything that helps them develop and benchmark new ideas, especially in the data field. It is really important for researchers to test their solutions with real-world data, so they spend a good amount of time on the whole dataset-building task that you, as a data scientist, would otherwise be doing alone with no resources.
In our case, we had to browse a huge number of research papers: first the ones entirely focused on datasets for our use case, then the ones proposing a solution to the problem we are trying to solve, where we looked at the datasets they used to benchmark their solution. That approach gave us gigabytes of free datasets: clean, used in the real world, and well organised for our exact use case.
We went to Google Scholar, ResearchGate, and similar sites, browsed the research papers related to spam filtering, and then searched their reference sections for the datasets they had used.
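We did that browsing by hand, but the first step can also be sketched in code. The example below assumes the third-party scholarly package for querying Google Scholar; it only lists candidate papers, and the reference-section digging is still manual.

```python
# Sketch: listing Google Scholar hits for a query with the third-party
# "scholarly" package (pip install scholarly). Only an illustration of
# automating the paper search; we did this step by hand.
from itertools import islice

from scholarly import scholarly

search = scholarly.search_pubs("spam filtering benchmark dataset")
for paper in islice(search, 10):        # look at the first few hits only
    bib = paper.get("bib", {})
    print(bib.get("title"), "-", bib.get("pub_year"))
    # Next step (manual): open the paper and check its references for the datasets used.
```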
And you can do the same for most of the use cases and problems you may want to solve with data science.
We hope this index was useful for you too. Thank you, and see you in the next index.