The only ML model I created for my own usage 😀 (part 1 of 3)

Scrapy, Python

(Note: This post was migrated from my previous blog, written on 24th September 2018, via web.archive.org. That was a lossy migration and the images could not be recovered from the web archive. See What Happened to the previous blog.)

After playing with the homemade Cajon that I built (of course my father helped 😇😇), I eagerly wanted to have an acoustic drum set (this dream has been there since day one).

This is the Cajon that I built (the image on the right shows the Cajon before polishing).

Well, I found a great wholesale seller on Alibaba and he agreed to send a single set to me (if you are interested, contact me and I can give you his WeChat contact). All seemed well until I realized the shipping prices and taxes were too high for a single set 😑. The government applies a tax of around 45% on a set 😑. At that moment I stopped dreaming about a brand-new set. Poor me then started checking second-hand products on Ikman. Most of the time the ads are all about electronic drums 🙁, and a lot of cool deals come to Ikman and vanish very quickly. Most of the time I forgot to check Ikman regularly. This was the time I was taking Data Mining & Information Retrieval and Machine Learning in semester seven (here are the spiders that I developed to crawl newsfirst.lk). So I mixed everything together and built an ML model to identify whether a new ad personally matches my requirements. Here are the main things I did.

  1. Scraping the existing ads.
  2. Pre-processing the scraped data.
  3. Building the ML model and training it.
  4. Developing a pipeline for classifying a new ad in real time.

1. Scraping the existing ads.

Here I scraped all the ads related to drums that were available on Ikman at the time (not that many ads were available, since ikman deletes each ad after 60 days). Well, I used the knowledge that I gained from Data Mining and Information Retrieval to do this. I scraped around 600 ads from the site and saved them as JSON/CSV. The following attributes are in ads.csv.

  • Price – price of the drum item
  • Title – title of the ad
  • Link – URL of the advertisement
  • Details – description the seller has given
  • Location – item location

I used a Python library called Scrapy to scrape the ads. There are very nice tutorials available for learning Scrapy, and even the official documentation is easy to understand. In Scrapy you define 'spiders' that crawl the webpages for you. I created a spider called "ikman" to crawl all the ads related to drums and drum items.

Here is a sample of the webpage on ikman; you can also follow this link to see the latest ads 😛

I wrote the following spider to get all the ads.
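The original code listing did not survive the blog migration, so what follows is only a minimal sketch of what such a Scrapy spider might look like, reconstructed from the notes below. Apart from the "ui-item" div mentioned there, the start URL, CSS selectors, and field handling are my assumptions, not the exact original code.

import scrapy


class IkmanSpider(scrapy.Spider):
    name = "ikman"
    # Assumed starting point: the drum search results on ikman
    start_urls = ["https://ikman.lk/en/ads/sri-lanka?query=drums"]

    def parse(self, response):
        # Loop over every ad item on the current results page
        for ad in response.css("div.ui-item"):
            link = ad.css("a::attr(href)").get()
            if not link:
                continue
            item = {
                "price": ad.css(".price::text").get(default="").strip(),
                "title": ad.css("a::text").get(default="").strip(),
                "link": response.urljoin(link),
                "location": ad.css(".location::text").get(default="").strip(),
            }
            # The description only appears on the ad page itself, so follow
            # the link and finish building the item in parse_next
            yield scrapy.Request(item["link"], callback=self.parse_next,
                                 meta={"item": item})

        # When the current page is done, move on to the next results page
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_next(self, response):
        item = response.meta["item"]
        # XPath is handy for pulling the whole description block
        item["details"] = " ".join(
            response.xpath("//div[contains(@class, 'description')]//text()").getall()
        ).strip()
        yield item

The class names used in these selectors are illustrative; ikman's markup has changed over time, so they would need to be checked against the live page before running anything like this.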

I would like to point out 2 main things about this spider.

I. It was required to go page by page to crawl all the ads.

The for loop in the parse method loops through and finds every ad item in the crawled web page; the ads are in a div with the class "ui-item". Here I have used CSS selectors to extract the elements on the page (you may need to use XPath instead of CSS when you need to extract the absolute path to an element, as in the description extraction). The last few lines of the parse method go to the next page when the crawling of the current page is over.

II. It was required to go inside each ad to get the ad details.

If you look at the web page closely, you will see that in order to get the ad description you need to click the ad link. To do that, we send another request to the extracted 'link' of the ad. That response is handled separately in the parse_next function.

In addition to that, I would like to mention the following facts, which could help you when building a spider.

I. Setting the encoding format.

This sets the encoding of the output file. In my case some ads contain the local language "Sinhala" and some weird emojis (marketing things 😅😅). If you want to capture these properly, you have to add the following setting to the settings.py of your Scrapy project.

FEED_EXPORT_ENCODING = "utf-8"
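As a side note (my addition, not part of the original setup), Scrapy also accepts the same option per spider through the custom_settings class attribute, which keeps it next to the spider code instead of in settings.py:

import scrapy

class IkmanSpider(scrapy.Spider):
    name = "ikman"
    # Per-spider override of the feed export encoding
    custom_settings = {"FEED_EXPORT_ENCODING": "utf-8"}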

II. Don't check your output CSV (if you decide to use CSV as the output file format) using MS Excel.

I fell into this trap and thought I had done something wrong while scraping. MS Excel does not interpret the UTF-8 encoded data properly, so you would see some senseless characters if you open the file there. Better to try a code editor like Visual Studio Code (CSV is nothing but a tabular format for keeping data: the columns are separated by commas and the rows by newlines).
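If you want to double-check the file from Python instead of an editor, a quick sketch (assuming the column names listed above) is to read it back with an explicit UTF-8 encoding:

import csv

# Read the scraped output with an explicit UTF-8 encoding so Sinhala
# text and emojis come through intact
with open("scrapedData.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row["title"], "|", row["price"], "|", row["location"])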

You can run the following command to scrape the data (please keep those instructions in mind 😀).

scrapy crawl ikman -o scrapedData.csv
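Since Scrapy picks the feed format from the file extension, the JSON copy mentioned earlier can be produced with the same command (my assumption about how it was generated):

scrapy crawl ikman -o scrapedData.json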

2. Pre-processing the scraped data.

Here is a sample of 'scrapedData.csv':

price,title,link,details,location
"Rs 17,500",SABIAN PRO High hat 14 inches pair,https://ikman.lk/en/ad/sabian-pro-high-hat-14-inches-pair-for-sale-colombo,SABIAN PRO High hat pair 14 inches . superb condition.not used in Srilanka,Colombo
"Rs 64,000",ROLAND SPD SX,https://ikman.lk/en/ad/roland-spd-sx-for-sale-anuradhapura,Brand new.,Anuradhapura
"Rs 16,500",PAISTE 101 High Hat 14 inch,https://ikman.lk/en/ad/paiste-101-high-hat-14-inch-for-sale-colombo,Paiste 101 hihat pair. not used in Srilanka.Imported  from Japan,Colombo
"Rs 160,000",Drum kit ( Pearl ),https://ikman.lk/en/ad/drum-kit-pearl-for-sale-kalutara,Drum kit ( Pearl )8/10/12/16 toms.Hat 14,Kalutara
"Rs 16,000",LASER High hat 14 inch( Germany),https://ikman.lk/en/ad/laser-high-hat-14-inch-germany-for-sale-colombo,"Laser high hat pair , made in Germany. not used in Srilanka. Imported from Japan",Colombo

First, take a copy of scrapedData.csv (called training.csv) and add another column called "output". Now add "Y" or "N" to this attribute for each entry in the file: "Y" means I'm interested in this ad and "N" means I'm not 😀. Well, I was the boss here, so I had to tag all the data myself 😑. Here is a sample of 'training.csv' (a small script for bootstrapping this file is sketched after the sample).

price,title,link,details,location,output
"Rs 17,500",SABIAN PRO High hat 14 inches pair,https://ikman.lk/en/ad/sabian-pro-high-hat-14-inches-pair-for-sale-colombo,SABIAN PRO High hat pair 14 inches . superb condition.not used in Srilanka,Colombo, Y
"Rs 64,000",ROLAND SPD SX,https://ikman.lk/en/ad/roland-spd-sx-for-sale-anuradhapura,Brand new.,Anuradhapura, N
"Rs 16,500",PAISTE 101 High Hat 14 inch,https://ikman.lk/en/ad/paiste-101-high-hat-14-inch-for-sale-colombo,Paiste 101 hihat pair. not used in Srilanka.Imported  from Japan,Colombo, Y
"Rs 160,000",Drum kit ( Pearl ),https://ikman.lk/en/ad/drum-kit-pearl-for-sale-kalutara,Drum kit ( Pearl )8/10/12/16 toms.Hat 14,Kalutara, N
"Rs 16,000",LASER High hat 14 inch( Germany),https://ikman.lk/en/ad/laser-high-hat-14-inch-germany-for-sale-colombo,"Laser high hat pair , made in Germany. not used in Srilanka. Imported from Japan",Colombo, N

Now we have the data to train a model. I will discuss the model that I built to classify these advertisements in the next article; that is the most interesting part. (After that I will discuss the pipelines that I developed to crawl live data and test it against the model using Scrapyd, scrapyd-client, and SendGrid on Bitnami hosting.) Let me know if anything here is unclear. Wait for the Dark Art 😀!
