#+options: ':nil *:t -:t ::t <:t H:3 \n:nil ^:t arch:headline author:t #+options: broken-links:nil c:nil creator:nil d:(not "LOGBOOK") date:t e:t #+options: email:nil f:t inline:t num:t p:nil pri:nil prop:nil stat:t tags:t #+options: tasks:t tex:t timestamp:t title:t toc:t todo:t |:t #+title: Spare Room Manchester #+date: today #+author: Craig Oates #+email: craig@craigoates.net #+language: en #+select_tags: export #+exclude_tags: noexport #+creator: Emacs 28.2 (Org mode 9.5.5) #+cite_export: * Setup Common Lisp Environment You will not need to execute this code block if you've already set up SLIME in another ORG file. This is just in case this is the only file you're working on today, or it's your first file of the day. *Run ~m-x slime~ before running the following code.* And, make note of the ~:session~ attribute. It allows you to use the code in the code block to be use in other code blocks which also use the ~:session~ attribute. #+begin_src lisp :session :results silent (ql:quickload :com.inuoe.jzon) ; JSON parser. (ql:quickload :dexador) ; HTTP requests. (ql:quickload :plump) ; HTML/XML parser. (ql:quickload :lquery) ; HTML/DOM manipulation. (ql:quickload :lparallel) ; Parallel programming. (ql:quickload :cl-ppcre) ; RegEx. library. (ql:quickload :plot/vega) ; Vega plotting library. (ql:quickload :lisp-stat) ; Stat's library. (ql:quickload :data-frame) ; Data frame library eqv. to Python's Numpy. (ql:quickload :str) ; String library, expands on 'string' library. #+end_src * Gather Spare Room Data Having done a quick manual search on [[https://www.spareroom.co.uk][Spare Room]], using =Manchester= as the location/search term, the results page says there are =Showing 1-10 of 1000+ results=. So the total page count is, #+begin_src calc :results raw 1000 / 10 #+end_src #+RESULTS: 100 (obvious, but kept for the sake of completeness) Because I’m going to grabbing that many pages, it’ll be better if I do this once and save the raw HTML on my computer. That way, I can parse and sift through the data without putting needless stress on the Spare Room servers – by repeatedly scraping and parsing the data as I work out the bugs. #+begin_src shell :results silent # If you don't already have this directory. It includes files ignored by Git. mkdir "raw-data/external" #+end_src #+begin_src shell cd raw-data/external DIRECTORY="$(date '+%Y-%m-%d')_spare-room-manc" mkdir $DIRECTORY for OFFSET in {0..990..10} do curl -o "$DIRECTORY/spare-room-manc-$OFFSET.html" \ "https://www.spareroom.co.uk/flatshare/index.cgi?offset=$OFFSET&search_id=1281027880&sort_by=price_low_to_high&mode=list" sleep 5 done # Change back to the project's root directory, so I don't call code whilst still # in this directory. cd ../../ #+end_src * Clean Up and Parse Data ** Failed Attempt (Parse Whole Page) This is my initial attempt to parse the HTML files I gathered from Spare Room's website. #+begin_src lisp :results silent ;; This is here for future reference, I don't recommend you run it. (loop for file-path in (directory #P"raw-data/external/2024-02-23_spare-room-manc/spare-room-manc-0.html") do (with-open-file (stream file-path) (let* ((doc (plump:parse stream)) (locations (lquery:$ doc ".listingLocation")) (prices (lquery:$ doc ".listingPrice")) (data (loop for loc across locations for price across prices collect (list (plump:text loc) (plump:text price))))) (with-open-file (stream #P"working-data/2024-02-23-spare-room-manc.csv" :direction :output :if-exists :supersede) (format stream "Location,Price~%") (dolist (item data) (format stream "~a,~a~%" (first item) (second item))))))) #+end_src Unfortunately, this approach produces inconsistent results. There are too many lines of text with excessive spacing. There are newlines with no rhyme or reason keep breaking the formatting, also. Having had a play around with the data, I don't feel confident each entry has the rent price in the same place. So, I think there is a chance the code which processes the files will produce misaligned data. In other words, the rent for one location might be placed next to the location which came before it in the processing pipeline (i.e. an /off-by-one/ bug). Because of this, I think a more secure approach would be to extract each listing into its own file and process one file at a time. This will, hopefully, make it easier to identify if a listing has missing or ill-formed pricing, location etc. data. ** Separate Each Listing into Their Own File #+begin_src shell mkdir "raw-data/external/2024-02-23_spare-room-manc-listings/" #+end_src #+begin_src lisp :results silent (let ((counter 0)) (loop for file-path in (directory #P"raw-data/external/2024-02-23_spare-room-manc/*.html") do (with-open-file (in-stream file-path) (let* ((doc (plump:parse in-stream)) (listings (lquery:$ doc ".listing-result" (serialize)))) (loop for item across listings do (let ((out-path (merge-pathnames "raw-data/external/2024-02-23_spare-room-manc-listings/" (format nil "listing-~a.html" (write-to-string counter))))) (with-open-file (out-stream out-path :direction :output :if-exists :supersede) (format out-stream "~a" item)) (incf counter))))))) #+end_src * Create CSV of Listings #+begin_src lisp :results silent (with-open-file (out-stream #P"working-data/2024-02-23-spare-room-manc.csv" :direction :output :if-exists :supersede) (format out-stream "ROW-ID, LISTING-ID, LOCATION, POSTCODE, RAW-PRICE, PRICE, FREQUENCY, PRICE-DETAILS, URL~%") (let ((row-id 0)) (loop for file-path in (directory #P"raw-data/external/2024-02-23_spare-room-manc-listings/*.html") do (with-open-file (in-stream file-path) (let* ((doc (plump:parse in-stream)) (listing-id (lquery:$ doc ".listing-result" (attr "data-listing-id"))) (price-details (lquery:$ doc ".listingPriceDetails" (text))) (location (lquery:$ doc ".listingLocation" (text))) (url (lquery:$ doc "article" "a" (attr "href"))) (postcode (lquery:$ doc ".listing-result" (attr "data-listing-postcode"))) (price (lquery:$ doc ".listingPrice" (text))) (cleaned-price (if (str:contains? "," (aref price 0)) (str:replace-all "," "" (aref price 0)) (aref price 0)))) (format out-stream "~a,~a,~a,~a,~a,~a,~a,~a,~a" (write-to-string row-id) (aref listing-id 0) (aref location 0) (aref postcode 0) cleaned-price (first (cl-ppcre:all-matches-as-strings "\\d+" cleaned-price)) (first (cl-ppcre:all-matches-as-strings "(pw|pcm)" cleaned-price)) (aref price-details 0) (format nil "https//www.spareroom.co.uk~a~%" (aref url 0))) (incf row-id)))))) #+end_src * Explore CSV Data for Spare Room Manchester (2024-02-23) I tried to create a data-frame with the CSV file but lisp-stat couldn't read it. Having had a quick glance over it, there are rows with the following entry in them, =this room is no longer available');=. The error, for lisp-stat, is this bit of text has been separated from the =URL= column, leaving an odd number or column entries on some rows. | URL | (Extra Row, Not Needed) | |----------------------------------------------------------------------------------+---------------------------------------| | =https//www.spareroom.co.uk/flatshare/flatshare_detail.pl?flatshare_id=17119950= | | | =https//www.spareroom.co.ukjavascript:alert('Sorry= | =this room is no longer available');= | Run the following script to remove these entries from the CSV file. #+begin_src shell :results silent # CSV file to process. csv_file="working-data/2024-02-23-spare-room-manc.csv" # Temporary file for intermediate processing. temp_file="working-data/2024-02-23-spare-room-manc-temp.csv" # String to search for... search_string="this room is no longer available" # Use awk to process the file and remove rows containing the search string. awk -v s="$search_string" -F',' '{ for(i=1; i<=NF; i++) { if($i ~ s) { next } } print $0 }' $csv_file > $temp_file # Move the temporary file to the original file. mv $temp_file $csv_file #+end_src I had to manually fix one row in the file in the =PRICE-DETAILS= column. I did it manually because to save time. #+begin_src lisp :session (lisp-stat:defdf *sr-manchester* (lisp-stat:read-csv #P"working-data/2024-02-23-spare-room-manc.csv")) #+end_src #+RESULTS: : # The 1096 entries is a little to big to put into a bar chart. I'm going to skip this and look at the data in just tables and use the data-frame outputs. ** Filter Results for Weekly and Monthly Rent (and Bills Inc.) The data in the CSV file includes both weekly and monthly rent rates. I need to separate them out so I can workout the average rent for the listings on Spare Room. *I'm just going to exclude all listing which don't include bills in the rent rate.* There are too many missing variables to work out and the data will vary quite a bit. The most obvious influence on differing rent rates is Council Tax bands. I don't have enough time to work out on each listing and work out an (badly formed) estimate. #+begin_src lisp :session :results silent (lisp-stat:write-csv (lisp-stat:filter-rows *sr-manchester* '(and (string= "pw" frequency) (< 100 price) (str:contains? "Bills inc." price-details :ignore-case t))) #P"working-data/2024-02-23-spare-room-manc-weekly.csv" :add-first-row t) #+end_src The ~(< 100 price)~ filter is to stop the average from getting skewed. I saw some listings at £1 and use the description to explain the (most likely convoluted) offer. This is a bit like the car park listings in the Right Move listings data. #+begin_src lisp :session :results silent (lisp-stat:write-csv (lisp-stat:filter-rows *sr-manchester* '(and (string= "pcm" frequency) (< 100 price) (str:contains? "Bills inc." price-details :ignore-case t))) #P"working-data/2024-02-23-spare-room-manc-monthly.csv" :add-first-row t) #+end_src If you want to look at the data from within this file, run the code block below. The file can clog up a fair bit of the file, and lead to excessive scrolling. So, I'd prefer to delete the data after looking at it before I commit any changes made in this file. #+begin_src shell :results output drawer csvlook working-data/2024-02-23-spare-room-manc-weekly.csv #+end_src #+begin_src lisp :session (lisp-stat:defdf *sr-manc-wk* (lisp-stat:read-csv #P"working-data/2024-02-23-spare-room-manc-weekly.csv")) #+end_src #+RESULTS: : # #+begin_src lisp :session (lisp-stat:defdf *sr-manc-mth* (lisp-stat:read-csv #P"working-data/2024-02-23-spare-room-manc-monthly.csv")) #+end_src #+RESULTS: : # Total number of listings being used, #+begin_src lisp :results output raw (let* ((wk (lisp-stat:mean *sr-manc-wk*:price)) (wk-to-mth (* 4 (lisp-stat:mean *sr-manc-wk*:price))) (mth (lisp-stat:mean *sr-manc-mth*:price)) (diff (- mth wk-to-mth))) (format t "|Avg. Weekly Listings (£)|Weekly to Monthly (£) |Avg. Monthly Listings (£)|Difference (£)|~%") (format t "|-|-|-|-|~%") (format t "|~f|~f|~f|~f|~%" wk wk-to-mth mth diff)) #+end_src #+RESULTS: | Avg. Weekly Listings (£) | Weekly to Monthly (£) | Avg. Monthly Listings (£) | Difference (£) | |--------------------------+-----------------------+---------------------------+----------------| | 145.44775 | 581.791 | 596.6906 | 14.899553 | I've hard coded the rent prices into the chart below because I was having trouble getting the values from the data-frames into something either the ~vega:defplot~ or ~vega:write-html~ functions understand. I think it has something to do with the values, usually, taking the form of a ~ratio~ (i.e. 234/345) when you call ~(lisp-stat:mean *sr-manc-wk*:price)~ (same when using ~*sr-manc-mth*:price~). Getting these types of values into something resembling a /normal/ number value is something I just can't figure out here. So, I'm just going to take advantage of the tiny amount of data needed and input it manually. #+begin_src lisp :results file (let* ((values (lisp-stat:plist-df `(:x-axis #("Weekly (Scaled up to Monthly)" "Monthly") :y-axis #(581.79 596.69))))) (vega:defplot weekly-monthly-mean `(:title "Difference in Rent Mean Between Monthly & Weekly Listings on Spare Room (23/02/2024)" :mark :bar :width 600 :height 600 :data ,values :encoding (:x (:field :x-axis :title "Payment Frequency" :type :nominal :axis ("labelAngle" 0)) :y (:field :y-axis :title "Rent Mean (£)" :type :quantitative) :tooltip (:field :y-axis)))) (vega:write-html weekly-monthly-mean "renders/2024-02-23-spare-room-average-rent-diffs.html")) #+end_src #+RESULTS: [[file:renders/2024-02-23-spare-room-average-rent-diffs.html]] [[file:renders/2024-02-23-spare-room-average-rent-diffs.png]] * Summary of Spare Room Data (Not Inc. Income Tax) #+begin_src lisp :results raw (lisp-stat:summarize-column '*sr-manc-mth*:price) #+end_src #+RESULTS: 584 reals, min=300, q25=547.6666, q50=599.8028, q75=657.125, max=1260 #+begin_src lisp :results raw (lisp-stat:summarize-column '*sr-manc-wk*:price) #+end_src #+RESULTS: 63 reals, min=110, q25=141.7857, q50=153, q75=162.75, max=167 #+begin_src lisp :results output raw (format t "- Total Listings: ~a~%" (+ 584 63)) (format t "- Week Min. Adjusted to Month: ~a~%" (* 110 4)) (format t "- Week Max. Adjusted to Month: ~a" (* 167 4)) #+end_src #+RESULTS: - Total Listings: 647 - Week Min. Adjusted to Month: 440 - Week Max. Adjusted to Month: 668 | Total Listing Used | Min. (£) | Max. (£) | |--------------------+----------+----------| | 647 | 300 | 1260 | The average rent (including bills) between the weekly and monthly listings (23/02/2024), #+begin_src lisp :session :results output (let* ((wk (lisp-stat:mean *sr-manc-wk*:price)) (wk-to-mth (* 4 (lisp-stat:mean *sr-manc-wk*:price))) (mth (lisp-stat:mean *sr-manc-mth*:price)) (avg (/ (+ (float wk-to-mth) (float mth)) 2))) (format t "Monthly Average: £~a~%" avg) (format t "Annual Average: £~a" (* avg 12))) #+end_src #+RESULTS: : Monthly Average: £589.24084 : Annual Average: £7070.89 So, I need to earn around £7,100/yr to cover my living costs. This does not include things like food, travel, clothing and socialising. Realistically, I'm probably looking at £13,000/yr to keep me going, at a basic level. #+begin_src lisp :results output raw (let* ((ann (float (+ 7100 5000))) (mth (float (/ ann 12))) ;; Hardcoded rent price because it's quicker and easier to do that here. (wk (float (/ (- mth 589.24) 4))) (day (float (/ wk 7)))) (format t "|Time Span|Value (£)|~%") (format t "|-|-|~%") (format t "|Annual|£~a|~%" ann) (format t "|Monthly|£~a|~%" mth) (format t "|Weekly (After Rent)|£~a|~%" wk) (format t "|Daily (After Rent)|£~a|~%" day)) #+end_src #+RESULTS: | Time Span | Value (£) | |---------------------+------------| | Annual | £12100.0 | | Monthly | £1008.3333 | | Weekly (After Rent) | £104.77333 | | Daily (After Rent) | £14.967619 | At £14.96/day, this is 'bottom of the barrel' type of stuff. Having said that, I should emphasise the rent calculated here is *the average*. There are rent rates which are lower. Unfortunately, I don't feel comfortable thinking I have a buffer zone here. From here on out, I'm going to treat these values as the lowest I can go. I could do with going over the [[file:./uk-wage-tax.org][minimum wage and tax stuff]] for the UK, to see how far off these numbers are. * Recalulate Figures with Income Tax Having gone through the number in [[file:./ideal-flatmate-manchester.org][Ideal Flatmate (Manc.)]], I realised I haven't taken into account the Income Tax with the figures above. I realise the difference won't be that big because the annual salary target, used above, barely passes the Personal Allowance band (Income Tax). #+begin_src lisp :results output raw (let* ((earning-target 12100) (p-allow 12570) (taxable-income (- p-allow earning-target)) (income-after-tax (* taxable-income 0.2)) (total (- earning-target income-after-tax))) (format t "- Annual Target Salary: £~a~%" earning-target) (format t "- Part of Salary which is Taxable: £~a~%" taxable-income) (format t "- After Tax Deducted: £~a~%" income-after-tax) (format t "- Salary After Tax: £~a~%" total)) #+end_src #+RESULTS: - Annual Target Salary: £12100 - Part of Salary which is Taxable: £470 - After Tax Deducted: £94.0 - Salary After Tax: £12006.0 | Time Span | Value After Tax (£) | Mean Rent (£) | |-----------------------+---------------------+---------------| | Annually | 12006 | 589.24 | | Monthly (Before Rent) | 1000.5 | | | Monthly (After Rent) | 411.26 | | | Weekly (After Rent) | 102.815 | | | Daily (After Rent) | 14.687857 | | #+TBLFM: @3$2=@-1/12::@4$2=@-1-@-2$+1::@5$2=@-1/4::@6$2=@-1/7 The recalculation, with Income Tax, changes my spend limit to £14.68/day. It's not a massive drop from £14.96, but it's not helping the situation. I should point out here, these figures use the bare minimum salary target (I've set for myself) of £12,100. Realistically, I'm going have to go with the £13,000 I mentioned in section [[Summary of Spare Room Data (Not Inc. Income Tax)][Summary of Spare Room Data (Not Inc. Income Tax)]] above.