The only ML model I created for my own usage (part 1 of 3)
Scrapy, Python
(Note: This post was migrated from my previous blog, written on 24th September 2018, via web.archive.org. The migration was lossy and the images could not be recovered from the Web Archive. See What Happened to the previous blog.)
After playing with the homemade Cajon that I built (of course, father helped), I eagerly wanted an acoustic drum set (a dream I have had since day one).
This is the Cajon that I built (the image on the right shows the Cajon before polishing).
Well, I found a great wholesale seller on Alibaba, and he agreed to send a single set for me (if you are interested, contact me and I can give you his WeChat contact). All seemed well until I realized that the shipping prices and taxes are too high for a single set. The government applies a tax of around 45% on a set. At that moment I stopped dreaming about a brand-new set. Poor me then started checking second-hand products on Ikman. Most of the time the ads are about electronic drums, a lot of cool deals come to Ikman and vanish quickly, and I often forget to check Ikman regularly. This was the time I was taking Data Mining & Information Retrieval and Machine Learning in semester seven (here are the spiders that I developed to crawl newsfirst.lk). So I mixed everything together and built an ML model to identify whether a new ad matches my personal requirements. Here are the main things I did:
- Scraping the existing ads.
- Pre-processing the scraped data.
- Building and training the ML model.
- Developing a pipeline for classifying new ads in real time.
1. Scraping the existing ads.
Here I scraped all the ads related to drums that were available on Ikman at the time (not that many ads were available, since Ikman deletes each ad after 60 days). I used the knowledge I gained from Data Mining and Information Retrieval to do this. I scraped around 600 ads from the site and saved them as JSON/CSV. The following attributes are in ads.csv:
- Price - price of the drum item
- Title - title of the ad
- Link - URL of the advertisement
- Details - the description the seller has given
- Location - item location
I used a Python library called Scrapy to scrape the ads. There are very nice tutorials available for learning Scrapy, and even the official documentation is easy to understand. In Scrapy you define "spiders" that crawl the web pages for you. I created a spider called "ikman" to crawl all the ads related to drums and drum items.
Here is a sample of the web page on Ikman; you can also follow this link to see the latest ads.
I wrote the following spider to get all the ads.
I would like to point out two main things about this spider.
I. It was required to go page by page to crawl all the ads.
The for loop at line 14 loops over every ad item in the crawled web page; the ads are in a div of class "ui-item". Here I have used CSS selectors to extract the elements on the page (you may need to use XPath instead of CSS when you need to extract the absolute path to an element, e.g. line 31). Lines 23-25 move to the next page when the crawling of the current page is over.
II. It was required to go inside each ad to get the ad details.
If you look at the web page closely, you will see that in order to get the ad description you need to click the ad link. To do that, we send another request to the extracted link of the ad (lines 20-21). That response is handled separately in the parse_next function.
In addition to that, I would like to mention the following points, which could help you when building a spider.
I. Setting the encoding format.
This sets the encoding method of the output file. In my case, some ads contained the local language, Sinhala, and some weird emojis (marketing things). If you want to capture these properly, you have to add the following setting to the settings.py of your Scrapy project.
FEED_EXPORT_ENCODING = "utf-8"
II. Don't check your output CSV (if you choose CSV as the output file format) using MS Excel.
I fell into this trap and thought I had done something wrong while scraping. Excel does not detect the UTF-8 encoding by default, so non-ASCII scraped data is not displayed properly; you will see some senseless characters if you open the file in Excel. Better to try a code editor like Visual Studio Code (CSV is nothing but a tabular format for keeping data: the columns are separated by commas and the rows by newlines).
You can run the following command to scrape the data (please keep the instructions above in mind).
scrapy crawl ikman -o scrapedData.csv
2. Pre-processing the scraped data.
Here is a sample of the scraped data in scrapedData.csv:
price,title,link,details,location
"Rs 17,500",SABIAN PRO High hat 14 inches pair,https://ikman.lk/en/ad/sabian-pro-high-hat-14-inches-pair-for-sale-colombo,SABIAN PRO High hat pair 14 inches . superb condition.not used in Srilanka,Colombo
"Rs 64,000",ROLAND SPD SX,https://ikman.lk/en/ad/roland-spd-sx-for-sale-anuradhapura,Brand new.,Anuradhapura
"Rs 16,500",PAISTE 101 High Hat 14 inch,https://ikman.lk/en/ad/paiste-101-high-hat-14-inch-for-sale-colombo,Paiste 101 hihat pair. not used in Srilanka.Imported from Japan,Colombo
"Rs 160,000",Drum kit ( Pearl ),https://ikman.lk/en/ad/drum-kit-pearl-for-sale-kalutara,Drum kit ( Pearl )8/10/12/16 toms.Hat 14,Kalutara
"Rs 16,000",LASER High hat 14 inch( Germany),https://ikman.lk/en/ad/laser-high-hat-14-inch-germany-for-sale-colombo,"Laser high hat pair , made in Germany. not used in Srilanka. Imported from Japan",Colombo
First, take a copy of scrapedData.csv (call it training.csv) and add another column called "output". Now add "Y" or "N" to this attribute for each entry in the file: "Y" means I am interested in the ad, and "N" means I am not. Well, I was the boss here, so I had to tag all the data myself. Here is a sample of training.csv.
price,title,link,details,location,output
"Rs 17,500",SABIAN PRO High hat 14 inches pair,https://ikman.lk/en/ad/sabian-pro-high-hat-14-inches-pair-for-sale-colombo,SABIAN PRO High hat pair 14 inches . superb condition.not used in Srilanka,Colombo, Y
"Rs 64,000",ROLAND SPD SX,https://ikman.lk/en/ad/roland-spd-sx-for-sale-anuradhapura,Brand new.,Anuradhapura, N
"Rs 16,500",PAISTE 101 High Hat 14 inch,https://ikman.lk/en/ad/paiste-101-high-hat-14-inch-for-sale-colombo,Paiste 101 hihat pair. not used in Srilanka.Imported from Japan,Colombo, Y
"Rs 160,000",Drum kit ( Pearl ),https://ikman.lk/en/ad/drum-kit-pearl-for-sale-kalutara,Drum kit ( Pearl )8/10/12/16 toms.Hat 14,Kalutara, N
"Rs 16,000",LASER High hat 14 inch( Germany),https://ikman.lk/en/ad/laser-high-hat-14-inch-germany-for-sale-colombo,"Laser high hat pair , made in Germany. not used in Srilanka. Imported from Japan",Colombo, N
Now we have the data to train a model. I will discuss the model I built to classify these advertisements in the next article; that is the most interesting part. (After that, I will discuss the pipeline I developed to crawl live data and test it against the model using Scrapyd, Scrapyd-client, and SendGrid on Bitnami hosting.) Let me know if anything here is unclear. Wait for the dark art!