by Vas Mylko, Roman Bilusiak
This is a technical post on how we are designing Curiosio v20. We would prefer coding and debugging to writing, but it’s too dark to code nowadays.
Product Concept
What if you could start planning a journey based on the movie you watched? Or the book you read? From a YouTube vlog? From a National Geographic or Discovery documentary? And from a blog post or travelogue, of course. Even from a thread of comments on a social network suggesting controversial best picks.
This is going to become possible with Curiosio. We are going to make it as easy as a single click. You are finding what you like and, with the help of a magical red button, you are planning what you want on top of it. We call this button Supertrip because you are creating your trip from some foundational trip. A single click will give you the context of the trip, then you can augment and change the points and places, dates and budget, people and cars according to your requirements.
Technical Concept
We have to index everything that is a potential source of a trip context. First off, index text-based sources: guidebooks, travelogues, travel blogs, travel mags, and travel channels on social networks. Text sources could be explicit travel articles/guides or implicit. An implicit source of a travel context could be a book, e.g. the memoir Wild by Cheryl Strayed.
It is worth mentioning that even simple web pages are not just text. The text must be extracted from them. Extracting the main article content from a web page, away from all the other fluff, is still a technical problem as of today. More about this below in the Parsing section.
Then, index video-based sources: movies, vlogs, podcasts, documentaries. Many movies have a corresponding text known as the plot or script. Though the script is not easily accessible, the plot is usually present on Wikipedia and other websites. The hardest to “understand” would be videos without speech. Auto-annotation is needed to turn them into text.
After getting the text we could geoparse it. Geoparsing, or geo entity recognition, is needed to extract the geographical regions, points & places to visit and experience. There is a lot of noise in travel stories. It is natural to encounter wording like: “Like in London’s Tate I enjoyed new exhibitions in MoMA in New York”. How to understand that the point is New York City, not the state of New York, and that the place is MoMA, not Tate? More on geo NER below in the Geoparsing section.
Finally, there is Supertrip itself: Curiosio already allows you to create supertrips right away by entering the known or wanted points & places manually. The Supertrip API and web UI are already built.
High-Level Architecture
How to index all travel stories, travel magazines, books, movies, and social tubes? Do we need some equivalent of PageRank? Not at all. Since websites started maintaining sitemaps, it is not necessary to code and run a web spider that discovers links. It is sufficient to run a web crawler that visits the links from known lists. Recognition/classification of the visited documents can produce static ranks. Relevance of the search results to the user’s query can produce dynamic (real-time) ranks. A niche search engine doesn’t require a PageRank-like algorithm to start off.
Some machine “understanding” of the documents is needed. It is doable via metadata extraction/creation. Saving original documents is not legit, except for the initial and temporary fetching of a document; saving the metadata and text properties is legit. Semantic data to be extracted includes a caption, a very short summary, keywords, and key phrases. Geospatial data includes countries, regions, and toponyms. Technical data includes size, date-time, data types, etc. Quality classification could be built on top of those text pieces.
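For illustration, a per-document metadata record could look like the sketch below; the field names are our assumption, not a fixed schema.

```python
# A hypothetical metadata record kept instead of the original document.
# Field names and values are illustrative only.
article_meta = {
    "url": "https://www.example.com/pacific-coast-road-trip",
    "caption": "Pacific Coast Highway in 7 Days",
    "summary": "A week-long drive from San Francisco to San Diego along Highway 1.",
    "keywords": ["road trip", "Pacific Coast Highway", "California"],
    "geo": {
        "countries": ["US"],
        "regions": ["California"],
        "toponyms": ["San Francisco", "Big Sur", "Santa Barbara", "San Diego"],
    },
    "technical": {
        "size_bytes": 48213,
        "fetched_at": "2020-11-20T10:15:00Z",
        "content_type": "text/html",
    },
}
```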
Then those documents are loaded into the Apache Lucene indexer. Lucene was originally written by Doug Cutting, and the Elasticsearch engine was later built on top of the Lucene library. Elasticsearch is convenient for schema-free JSON documents, with good full-text search and proven high availability. You will be able to find tons of travel and travel-friendly articles just like you do when googling. When you hit the red [Supertrip] button, you are in the context of a trip creation related to the article.
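As an illustration, indexing and searching such metadata with the official Elasticsearch Python client could look like this minimal sketch (8.x-style client calls; the index name and fields are assumptions):

```python
from elasticsearch import Elasticsearch

# Address, index name, and field names are assumptions for illustration.
es = Elasticsearch("http://localhost:9200")

doc = {
    "caption": "Pacific Coast Highway in 7 Days",
    "summary": "A week-long drive from San Francisco to San Diego along Highway 1.",
    "keywords": ["road trip", "California", "Highway 1"],
    "countries": ["US"],
}

# Index one schema-free JSON document.
es.index(index="travel-articles", id="pch-7-days", document=doc)

# Full-text search over summaries, like a niche Google for travel stories.
hits = es.search(index="travel-articles", query={"match": {"summary": "california road trip"}})
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["caption"], hit["_score"])
```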
Context recognition is going to be hard. Recognition of multiple possible contexts could be easier. As Curiosio is not designed to duplicate any travel guides, recreation of an identical context is not needed at all. Overlap to a certain degree is appreciated as a foundation; then it’s up to you and up to the AI that empowers Curiosio.
After the context is recognized, you could get candidate alternatives of possible journeys: geo perimeter, starting/finishing or destination points, the topology of the route, duration, and estimated budget. All this data is thrown onto the Create Trip form, where you can do your edits and create interactive trip plans within those requirements.
Additional technical sections below give some more details on the problems and challenges experienced during real hacking in the Lab or predicted during thought experiments.
Crawling
Instead of programming or configuring a spider, it is rational to read directly from the sitemaps of respected web sources such as National Geographic or Condé Nast Traveler. There is a sitemap protocol and there are tools that read sitemaps using the protocol, e.g. Trafilatura. It has a command-line interface, mixable with Bash commands such as grep, uniq, etc.
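For illustration, reading a sitemap and extracting pages through Trafilatura’s Python API could look like the sketch below (the function names reflect our reading of the library docs; the source URL is an example):

```python
from trafilatura import fetch_url, extract
from trafilatura.sitemaps import sitemap_search

# Discover article links of a known web source from its sitemap(s).
links = sitemap_search("https://www.example-travel-mag.com/")

# Fetch and extract the main text of the first few pages.
for url in links[:5]:
    downloaded = fetch_url(url)
    if downloaded:
        text = extract(downloaded)
        print(url, len(text or ""))
```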
Some sitemaps are not standard and require custom parsing. Some sitemaps are gigantic and deeply hierarchical, also requiring custom parsing. Some websites don’t have sitemaps at all, hence some kind of mechanization is needed to collect the list of links. For some sites, there is nothing to do but use a spider. It is unclear how to crawl all the personal blogs running on top of WordPress: the names of the blogs are not known a priori, and the sitemaps are often missing.
Crawling is relatively easy but takes a long time. The crawler must respect the website’s policy on the number of hits per time interval. Sometimes robots.txt is present with the allowed rate, sometimes it’s absent.
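A minimal sketch of a polite fetch loop, assuming the links were collected from a sitemap as above; the 10-second fallback delay is our own choice, not a standard:

```python
import time
from urllib.robotparser import RobotFileParser

from trafilatura import fetch_url

# Links would come from the sitemap step; one example URL for illustration.
links = ["https://www.example-travel-mag.com/pacific-coast-road-trip"]

rp = RobotFileParser()
rp.set_url("https://www.example-travel-mag.com/robots.txt")
rp.read()

# Honor the declared crawl delay when robots.txt provides one; otherwise fall
# back to a conservative default (the 10 seconds is our assumption).
delay = rp.crawl_delay("*") or 10

for url in links:
    if rp.can_fetch("*", url):
        downloaded = fetch_url(url)
        # ... hand the HTML over to parsing and metadata extraction ...
    time.sleep(delay)
```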
Parsing
Parsing is relatively quick but difficult. The problem of extracting the main content (the article) from a web page is not solved yet. It may seem solved when you use the browser’s reader view, though. Firefox’s and Safari’s reader views are based on Readability, and Chrome’s reader view is partially based on Boilerpipe.
At least Python Readability and Boilerpipe aren’t good enough on the web pages we process. Many article extraction tools have been created over time. There is no silver bullet that can process any web page at an acceptable quality level, hence we have to use the best of them, and use each in the operating window where it works well.
On some kinds of web pages Boilerpipe is good, but when it isn’t good it is terrible: it can miss the entire main text or take only a small chunk of it. On other kinds of web pages Trafilatura is good, but it often doesn’t extract the main content in full. All other tools from the comparison list, including Readability, performed worse than the two mentioned explicitly. And there is no browser reader view available for pages like Reddit /r/roadtrip threaded comments.
Custom parsing can be done with BeautifulSoup using the robust and lenient html5lib parser. Though there are still attempts to improve the good old Boilerpipe by some 15%, a Computer Vision approach could turn out to be the best universal one. Parsing is hard but feasible.
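For illustration, custom parsing with BeautifulSoup and html5lib could look like the sketch below; the selectors are examples, and every site needs its own:

```python
from bs4 import BeautifulSoup

def extract_main_text(html: str) -> str:
    # html5lib parses broken markup the way a browser would, at the cost of speed.
    soup = BeautifulSoup(html, "html5lib")

    # Drop the obvious non-content elements first.
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()

    # Site-specific part: the container of the main article differs per website,
    # so this selector is only an example.
    article = soup.find("article") or soup.body
    paragraphs = [p.get_text(" ", strip=True) for p in article.find_all("p")]
    return "\n".join(p for p in paragraphs if p)
```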
Geoparsing
Having the main article extracted from a web page, the paramount task is to extract toponyms. This problem is known as Geo NER or geoparsing. It is still a big problem because it is both language context-sensitive and geospatial context-sensitive. There is so much ambiguity out there: 91 Washingtons, 45 Franklins, 39 Clintons in the United States alone. So we looked at ready-made solutions to the geoparsing problem.
One candidate solution was Mordecai: “Full-text geoparsing as a Python library. Extract the place names from a piece of English-language text, resolve them to the correct place, and return their coordinates and structured geographic information.” Funded by DARPA, the U.S. Army Research Laboratory, the U.S. Army Research Office through the Minerva Initiative and the National Science Foundation. There are ~700 stars and ~100 forks on GitHub.
Unfortunately, we were not able to run and test it, as we got many errors during the installation. Hence we looked into the code to figure out whether it is worth trying further or worth building our own solution that would theoretically outperform it. Mordecai uses spaCy for named entity recognition, then resolves the recognized names against a Geonames gazetteer with some custom logic to get coordinates; Elasticsearch geospatial queries are used for that.
As we have experimented with geoparsing for a while (and even worked with academia on a geoparsing project), we already know some things. First of all, spaCy alone is not sufficient for NER on our corpus of docs. It’s OK to use spaCy, but it’s worth using other tools as well. BERT NER is a newer one; it’s a Transformer. Would Mordecai work if fed toponyms from BERT? Another detail is the use of the Geonames gazetteer only. Other gazetteers and encyclopedias are worth plugging in: OSM, Wikidata.
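To illustrate the NER step, here is a minimal spaCy sketch on the example sentence from above; the model name is a standard small English model, and the label filter is our choice:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # a standard small English model

text = "Like in London's Tate I enjoyed new exhibitions in MoMA in New York."
doc = nlp(text)

# Keep only the geographically relevant entity types; this filter is our choice.
toponym_labels = {"GPE", "LOC", "FAC"}
toponyms = [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in toponym_labels]
print(toponyms)

# The candidates would then be resolved against a gazetteer (Geonames, OSM,
# Wikidata) to decide, e.g., New York City vs. the state of New York.
```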
Though geo entities are included in general NER training data, a focused Geo NER should work better than generic NER. If somebody takes on the big task of geo-labeling big data using IOB or other formats solely for geoparsing purposes, it will be a global good. Current language models could be retrained on a better data set and perform better at toponym extraction.
Text Extraction
The straightforward text data is Open Graph. The page title, associated picture, canonical URL, and other fields are read by known names. The majority of websites use it (though some don’t, e.g. Frommer’s).
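Reading Open Graph fields is straightforward; a sketch with BeautifulSoup (the helper function is just an illustration):

```python
from bs4 import BeautifulSoup

def read_open_graph(html: str) -> dict:
    soup = BeautifulSoup(html, "html5lib")
    og = {}
    # Open Graph fields are <meta property="og:..."> tags with well-known names.
    for meta in soup.find_all("meta"):
        prop = meta.get("property", "")
        if prop.startswith("og:"):
            og[prop] = meta.get("content")
    return og

# Typical keys: og:title, og:image, og:url, og:description.
```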
The next piece is a summary. Good abstractive summarization is preferred over extractive because it produces new content, hence the volume of the content could be bigger. Extractive summarization works more reliably than abstractive, but must be rigidly limited to respect copyrighted data. We have noticed that abstractive summarizers hallucinate, producing garbage and duplicating the same garbage over and over in the same output. We have not tried OpenAI yet but can confirm that Google’s PEGASUS does hallucinate on Wikivoyage text.
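For illustration, an abstractive summarization sketch using the Hugging Face transformers pipeline and a public PEGASUS checkpoint; the length limits are our choice, and the output still has to be checked for hallucinations:

```python
from transformers import pipeline

# A public PEGASUS checkpoint; any summarization model can be swapped in here.
summarizer = pipeline("summarization", model="google/pegasus-xsum")

text = "..."  # main article text from the parsing step
summary = summarizer(text, max_length=60, min_length=20, truncation=True)
print(summary[0]["summary_text"])
# The abstractive output must still be verified: models like PEGASUS can
# hallucinate, e.g. on Wikivoyage text, as noted above.
```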
Keyword and key phrase extraction isn’t easy. When it works, it works fine. But when it doesn’t, it produces torn words and sometimes even garbage. On some web sources, photo attributions are often recognized as keywords. Extractive summarization is also vulnerable to picking up photo attributions as top content.
Naming & Caching
Below is a slightly edited famous quote about the eternal high-level problems in programming, independent of bits, operating systems, and programming languages. It resonates very well with the Curiosio vision.
There are only two hard things in programming: cache invalidation and naming things. — Phil Karlton
Assuming we will index 100,000 pure travel and travel-friendly articles, and define the travel contexts — how to auto-name the trips? We could ask a user — you — to name each trip you are planning… but auto-naming before asking would definitely be useful.
We will try the auto-naming in a future version codenamed “Travel Story” where we will attempt NLG (Natural Language Generation) to produce a custom travel guide at the level of Lonely Planet. Joke! Not possible to make it that good but we will try to approach high quality.
By knowing each good context, we could assume you would need a slightly shorter or longer trip duration, a slightly cheaper or juicier experience, a slightly bigger or smaller perimeter, and so forth. Not all crawled stories will have a practical context, but many stories will have multiple contexts, so the scale is estimable at 100,000–1,000,000. Why not compute in advance and cache? The system response time to your inquiries would then be minimal.
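A minimal sketch of such pre-computation and caching; compute_trip_plan() stands in for the Curiosio engine and is a hypothetical name, as is the key scheme:

```python
import hashlib
import json

# A hypothetical cache of precomputed trip plans keyed by a normalized context.
plan_cache = {}

def context_key(context: dict) -> str:
    # Normalize the context (points, duration, budget, ...) into a stable key.
    canonical = json.dumps(context, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def get_or_compute_plan(context: dict):
    key = context_key(context)
    if key not in plan_cache:
        plan_cache[key] = compute_trip_plan(context)  # hypothetical engine call
    return plan_cache[key]
```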
Machine Learning
Text sources of travel ideas, or travelogues, exist in natural language, which means nothing is structured. There are semi-structured package tours within the websites that sell them, but we don’t touch that type of data: Curiosio is about personal experience, and nothing packaged is personal by design. Though pure crawling of the metadata from those itineraries on the public web should be very much legit.
We are thinking about structuring all travel documents into the same format. Ain’t it cool to convert any Tripadvisor story, any Travel+Leisure article, or any blog post into the same format? Comparison is possible then. Analytics, intelligence, and Machine Learning are possible then. We have already experimented with the trip schema. Maybe we will nudge the Tourism Structured Web Data Community Group to make some new stuff.
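For illustration only, a common format could look like the dataclass below; the fields are our assumption, not the actual Curiosio trip schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# A hypothetical common format for any crawled travel story.
@dataclass
class TripDocument:
    source_url: str
    title: str
    countries: List[str] = field(default_factory=list)
    points: List[str] = field(default_factory=list)   # cities/towns on the route
    places: List[str] = field(default_factory=list)   # museums, parks, landmarks
    duration_days: Optional[int] = None
    budget_usd: Optional[int] = None
    travelers: Optional[int] = None
```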
Assuming we computed it all and created trip plans for all travel contexts in the world in the same format, it unlocks big data magic. We would be able to identify travel secrets nobody in the industry knows. This intelligence could be used to stage a better experience for travelers, right from the planning phase as well as while supporting them en route and in-destination.
Planning & Re-planning
For the Curiosio AI engine there is no difference between planning a trip two weeks in advance or two days ahead, or spending two days on the route and re-planning according to the new requirements. There are two resources needed for travel — time and money.
What happens during a trip? Time is passing, money is being spent, and you are moving from point to point, visiting places, and having experiences. Also, you are eating, resting, and staying overnight. At any moment of the journey, it is possible to identify the visited area and shrink the remaining perimeter, to count the days passed, and to track the budget spending. Hence, it is possible to re-plan under the “new” requirements from now till the end of the journey.
You could add more points or places, or remove initially planned ones. You could do fine-tuning, such as staying longer at some point or being at a point on an exact date. For the Curiosio algorithms, it is all equivalent to planning before the trip.
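A sketch of what re-planning mid-trip reduces to; plan_trip() stands in for the Curiosio engine and is a hypothetical name, as is the trip dictionary:

```python
from datetime import date

# Re-planning mid-trip: shrink the remaining requirements and plan again.
def replan(trip: dict, today: date, spent_usd: float, visited_points: list):
    remaining = {
        "points": [p for p in trip["points"] if p not in visited_points],
        "days": (trip["end_date"] - today).days,
        "budget_usd": trip["budget_usd"] - spent_usd,
        "start_point": trip["current_point"],
        "finish_point": trip["finish_point"],
    }
    # The same planning call as before the trip, just with reduced requirements.
    return plan_trip(remaining)
```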
Computing and adjusting the trip plan on the go is a step toward a Concierge service. The Concierge must also take the burden of booking, re-booking, refunding, insuring, and similar routines off the traveler.