Supertrip Technical

Curiosio
May 22, 2023 · 21 min read

by Vas Mylko, Roman Bilusiak

Ingeenee inside the Library inside the Web

It’s a technical post about the creation of Curiosio v20, aka Supertrip. Geeks shall find something interesting everywhere across the story — it’s real stuff about a real working thing. Making a new Web Search Engine for Travel from scratch is quite a scope. Linking Computational Intelligence to it was a 10x harder challenge. All in all, the effort was dedicated to making the Curiosio AI smarter. Curiosio is powered by our AI engine Ingeenee.

Artificial Smartness

Artificial Intelligence is not the key. A machine could have intelligence but not be smart. The same holds for humans — a high IQ score but not smart, because the other quotients are below the bar: EQ (Emotional Quotient), BQ (Body Quotient), HQ (Health), FQ (Financial), SQ (Social), etc. We, humans, need Artificial Smartness from machines. There are various quotients in the Travel domain too. The Travel AI engine Ingeenee needs the Web to grow up smart.

© M.C. Escher, Swans, 1956

To become smart, the AI must be put inside the domain and grow up in it as it dynamically changes. AI is a method, Smartness is a result. Learning is a feedback loop. There are two feedback loops in Curiosio: Humans-AI-Humans and AI-Humans-AI. Both loops are interrelated and never-ending. Both loops are unbounded — they are spirals, positive spirals, positive-sum game spirals. Let’s dive deeper into each AI loop for details.

Humans-AI-Humans Loop

The circuit is this: a trip description/story/guide by humans on the web → sizing the travel scope’s duration & budget by AI → editing & tuning by humans. The web data acts as a seed, a context to start off, for inspiration. Sizing a travel story/guide goes far beyond time & money. A traveler could add or remove points & places. It includes geography — a traveler can set a custom perimeter she is interested in. Travel themes, car options, number of people — all those parameters could be optimized simultaneously. The final cut is made by the traveler at the end, when a decision is made. AI is learning from humans by helping humans.

Machine Learning is possible on the domain data from the web. It could be used for seeding Ingeenee for computation — the Supertrip function. It could be used for machine-level decision-making about the human likability of different trip plans. That’s Machine Empathy. Humans do not necessarily like the most optimal trip plan for their requirements; inversion: the most loved trip plan within the requirements could be suboptimal. The Humans-AI-Humans loop solves this problem.

AI-Humans-AI Loop

The looping happens in this circuit: AI discovers a high concentration of travel experiences and offers them to humans → travelers prefer some trip plans over others, like, edit, tune them → AI continues multi-factor optimization within the human-defined requirements in the narrow context. Humans are learning from AI by helping AI.

When a “lazy” traveler initiates the creation of a trip in a wide/open context, the AI takes responsibility for what to propose. The traveler could learn a lot that is new to her because the information is unique — created by computational intelligence, by Ingeenee — that travel information did not exist before; the travel knowledge has been created on demand. The traveler is learning from AI while interacting with AI. Searching and researching the options until settled or tired. Then, it repeats.

What we want instead of intelligence is artificial smartness. — Kevin Kelly

Thus, two related never-ending loops are running simultaneously, expanding into spirals, producing new information and travel knowledge that has never existed before — creating artificial smartness for Travel.

Crawling

We knew in advance what web sources we wanted to crawl. Almost every website where we found interesting travel stories inspired us to create Signature Trips on top of them. Signature Trips allow shrinking or stretching the duration, making the trip cheaper or juicier, sizing for different numbers of travelers, and setting a custom perimeter for the geographical area of interest. Among those names: National Geographic, Condé Nast Traveler, Travel+Leisure, Frommer’s, Lonely Planet, Thrillist.

Usually, websites have sitemaps and rules of use. It starts with the robots.txt file. Another file, called sitemap.xml or similar, is usually referenced in robots.txt. Its purpose is to describe the website’s content in a web-crawler-friendly manner: for web spiders, it lists links together with timestamps of creation or change/update. The sitemap is a URL inclusion protocol complementing the exclusion rules of robots.txt.

Sitemaps in the canonical format can be processed automatically. Trafilatura provides tools to process sitemaps from the CLI. All standard Linux tools like cat, grep, sort are also available for link processing, together with pipelining it all. There are specialized tools for link filtering (validation, discrimination, normalization, cleansing, etc.) We were all set with standard Bash tools and throw-away Python scripting to get what we needed from the links.
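
A minimal sketch of such throw-away processing, using only the Python standard library; the sitemap URL and the /travel/ filter below are placeholders for illustration, not our actual sources:

```python
# Pull a canonical sitemap.xml and keep only article-looking links.
import gzip
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def fetch_sitemap_links(sitemap_url):
    """Yield (loc, lastmod) pairs from a canonical <urlset> sitemap."""
    with urllib.request.urlopen(sitemap_url) as resp:
        raw = resp.read()
    if sitemap_url.endswith(".gz"):
        raw = gzip.decompress(raw)  # some sitemaps are gzip-compressed
    root = ET.fromstring(raw)
    for url_node in root.iter(f"{SITEMAP_NS}url"):
        loc = url_node.findtext(f"{SITEMAP_NS}loc")
        lastmod = url_node.findtext(f"{SITEMAP_NS}lastmod")
        if loc:
            yield loc.strip(), lastmod

# Hypothetical filter: keep links that look like travel articles.
links = [
    (loc, lastmod)
    for loc, lastmod in fetch_sitemap_links("https://example.com/sitemap.xml")
    if "/travel/" in loc
]
```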

Some sitemaps are deeply nested and complicated, e.g. Moon. Some websites don’t follow the crawler-friendly protocol but still provide a sitemap in a custom form, e.g. PlanetWare. Many travel blogs on Blogger and WordPress don’t have sitemaps. Some sources are on their own, e.g. Reddit. Where are their sitemaps? How to access them? More crawling problems to be solved and more solutions to be designed.

Parsing

On the Web, text is the primary carrier of information. Extracting information from web pages is still difficult. One size does not fit all. Coding dedicated adapters for each web source is long and expensive.

The main purpose of web extraction is the isolation of the real stuff from the fluff. There is a main article and tons of boilerplate around it on a web page. (The walled web is out of scope.) While stripping off headers and footers seems straightforward, sidebars bring new problems. Embedded ad blocks in the middle of the main article are a problem. Embedded teasers in the middle of the article, pointing to other content at the same web source, are a problem. Infinite scroll is a problem; infinite web pages, where an article follows an article endlessly, are a problem.

Theoretically, isolation of the main content is a solved problem — just switch your favorite web browser to Read/Reader View, and the boilerplate is gone. This [Reader] tool should be available outside the web browser as a library, as a module of the browser, especially for open-source browsers like Firefox and Chromium. Here the harsh reality kicks in.

Readability did not work well enough on our test set of web pages. A deeper analysis revealed that the browser itself cannot grasp certain types of web pages with its native readability implementation. Firefox Reader View shows everything except the main article for this web page from Discovery. Another concern was a security warning for the underlying Readability.js. So we looked at the original Boilerpipe (Python port), the modern and more robust Trafilatura, and several others (the Goose Python port, html2text, etc.)

Testing per web source is needed to decide which parser to apply. Testing within a web source is needed too, because 10-year-old articles on the same website are formatted differently from 5-year-old ones, and the recent articles are different again. When 3rd-party libs don’t work, custom code must be written. It’s hardcore parsing with BeautifulSoup, often looking for the <article> tag or equivalents to get to the main node in the DOM. So it’s official — web parsing is still complicated in 2023.
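
A rough sketch of that per-source fallback, assuming Trafilatura as the general-purpose extractor and BeautifulSoup for the hardcore path; the word-count threshold and the div[role=main] fallback are arbitrary placeholders:

```python
import trafilatura
from bs4 import BeautifulSoup

def extract_main_text(html):
    """Best-effort isolation of the main article from a web page."""
    text = trafilatura.extract(html)  # generic boilerplate removal
    if text and len(text.split()) > 100:  # arbitrary sanity threshold
        return text
    # Hardcore fallback: dig for the <article> tag or a known equivalent.
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find("article") or soup.find("div", attrs={"role": "main"})
    return node.get_text(separator="\n", strip=True) if node else None
```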

It is absolutely OK to approach web parsing from the basics and classics, as Toby Segaran described in Programming Collective Intelligence 15 years ago. BTW, the book is #1 on the recommended reading list for those whom we will hire when we hire.

There is good news too. Some sources contain geo coordinates. Getting the geo metadata directly helps a lot. It merges and simplifies two tasks — Parsing and Scribing. Wikivoyage is such a web source. Some Wikivoyage itineraries have interactive maps with pins. The drawback is the total number of itineraries, ~400, and many of them are local walking/hiking routes or very long train journeys. Despite the low number of itineraries, we are going to parse Wikivoyage again to improve the accuracy of our web documents.

Scalability is possible but questionable. How to parse travel blogs built on WordPress without sitemaps? Which travel blogs to look at? There are good ones, but there is a big number of low-quality, even garbage, ones. Scaling from 100,000 docs to 1,000,000 should better happen with high-quality docs.

We will listen to users about which travel blogs to add to the index. We will look ourselves at which travel blogs are credible and potentially useful for what travelers want. A signal for us could be an article that inspired the creation of a Signature Trip, e.g. Bella from London, an award-winning documentary director, photographer, and writer, and her Turkish road trip. Here is a Supertrip on top of her story — Road Trip on the Aegean Coast.

Scribing

When we create a Signature Trip inspired by a Supertraveler, we walk through the text and write down the names of cities, towns, villages, parks, mountains, eateries, and POIs manually. The idea of automatic transcribing from text had been percolating for a while. There were ideas to transcribe travel-related meta-information from YouTube videos — audio to text, then text to a soup of toponyms. Instagram photos and stories were also considered.

Text is hard enough if top accuracy is expected, but text is good enough to grasp the high-level context, and then our own computational intelligence kicks in. If the web parsing was good, then the scribing was expected to be good too. The majority of the noise comes from imperfect parsing and isolation of the main text.

Some noise is inevitable when the main story says something like: “We loved this sunset so much, it reminded us of the one seen from the Deception Pass between Whidbey and Fidalgo Islands.” If the traveler was actually writing about the Brooklyn Bridge, then the soup’s span suddenly stretches from New York City to Seattle. Not firm, not relevant, not smart.

The side effect of not reaching the top accuracy is that Curiosio is creating brand new trips in similar contexts; Curiosio does not repeat or reuse the original content. If some points or places are missing during auto-scribing then Curiosio could propose them on its own because it knows them. Our goal is to scribe all points and places that are present in our Knowledge Graph. We are OK to omit some eateries that did not exist a year ago and may not exist a year from now.

Extracting all toponyms present in free-form text is not an easy task. It might have become easier with OpenAI GPT, but it is expensive: for our corpus of docs, the cost is estimated at tens of thousands of US dollars. We evaluated the API for GPT-3.5. Only text-davinci-003 worked well enough, while all the simpler and cheaper models — curie, babbage, ada — could not crack the names better than the method we applied in pre-ChatGPT times.

We have not benchmarked GPT-4 yet. It’s expected to outperform GPT-3.5. It could be useful for long multi-word names with diacritics, for long multi-word canonical names with the county and state name included, for names with abbreviations of administrative regions, for the use of special symbols enclosing administrative regions, etc.

Automatic transcribing of toponym candidates was conceptually described and visualized in the previous post — Curiosio Supertrip, scroll to the Genesis section there. We used BERT and spaCy for geo-named entity recognition aka NER, in a special configuration to reach the sweet spot that worked well enough on our corpus of docs. It would have been useful to compare our quality with Mordecai, but a few attempts to install and run it failed. Code review revealed it uses spaCy and doesn’t use BERT, but uses a custom geospatial algorithm for matching in context, and a special neural net trained for that.
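
Not our exact configuration, but a minimal spaCy NER sketch that pulls toponym candidates from an article body; the model name is just whichever pipeline happens to be installed:

```python
import spacy

# Transformer pipeline; swap for en_core_web_sm if the trf model isn't installed.
nlp = spacy.load("en_core_web_trf")

def toponym_candidates(text):
    """Return geo-entity candidates (places, locations, landmarks) from text."""
    doc = nlp(text)
    return sorted({
        ent.text
        for ent in doc.ents
        if ent.label_ in {"GPE", "LOC", "FAC"}
    })

print(toponym_candidates("We drove from Florence to Siena and on to Montalcino."))
```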

Better toponym identification could be unlocked with new LLMs. There is a cool experiment from Digital Digging. Large Language Models could help extract, find, and enhance location data — in context. Text-focused machines can do the digging: extract locations from text blobs, chase coordinates by geolocating addresses, and find additional information to enrich the data. We must be able to talk to travel articles. We must be able to ask questions about the story:

  • What is the story about? Pure travel or something else (food, airline updates, music, science, archaeology, pilgrimage, escape, etc.)?
  • If the story describes a trip, which country does the trip take place in? Or multiple countries?
  • What mode of transportation was used? Driving, flying, railway, multi-modal transportation?
  • What was the intention behind each toponym’s mention? Was it visited on this trip [like the Brooklyn Bridge] or just mentioned to compare the experience [like the Deception Pass]?

Without all those details, we do not know whether the toponyms are relevant to the trip or not. Otherwise, there are many false positives that spoil the data crunching further down the pipeline. The result of automatic transcribing from text is a soup of toponyms, or rather toponym candidates.
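
We have not built this interrogation step; the sketch below only illustrates the idea of asking such questions with an LLM, using the openai Python package (pre-1.0 ChatCompletion API) and an assumed model name, and assuming the model returns valid JSON:

```python
import json
import openai

QUESTIONS = (
    "Is this a travel story or something else (food, music, science, ...)?",
    "Which country or countries does the trip take place in?",
    "What mode of transportation was used?",
    "Which toponyms were actually visited vs. only mentioned for comparison?",
)

def interrogate_article(article_text):
    prompt = (
        "Answer the following questions about the travel article as JSON, "
        "with one key per question.\n\n"
        + "\n".join(f"- {q}" for q in QUESTIONS)
        + "\n\nArticle:\n" + article_text
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # assumption: any capable chat model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # May raise if the model does not return clean JSON — it's a sketch.
    return json.loads(response["choices"][0]["message"]["content"])
```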

Geoparsing

The next main goal is to convert the soup of toponyms into a geo soup — resolve each name to geo coordinates and to a canonical name. Each trip takes place in some geography-geometry. The additional task of geo-fencing the noise pops up. To filter the extracted lists of toponyms, we need to know where in the world each point or place is located, so we can drop the ones outside of the geographical scope, outside of the rational perimeter, and outside of the potential route.

First off, we want to know the country the trip belongs to. If it is a single-country trip, then geoparsing becomes simpler. For multi-country trips, the complexity is an order of magnitude higher. We look for toponyms in the URL and title, as they usually contain such names, though in free form. The canonical form could be present, especially for destination trips.

Then we match the toponyms against our Knowledge Graph (containing points and places with geo coordinates for 24 countries) and GeoNames. While our KG is still growing with new countries (the next three are going to be Spain, Japan, and Greece), the GeoNames gazetteer is very stable and big (containing over 25,000,000 geographical names corresponding to over 11,800,000 unique features). This matching filters out most of the false positives.
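
For illustration only, here is roughly what the gazetteer-matching step looks like, using geopy’s GeoNames geocoder as a stand-in for our own Knowledge Graph lookup; the “demo” username, the sample toponyms, and the country filter are placeholders:

```python
from geopy.geocoders import GeoNames

geocoder = GeoNames(username="demo")  # placeholder: a registered GeoNames account

def resolve_in_country(toponyms, country_code):
    """Resolve toponym candidates and keep only hits inside the detected country."""
    resolved = {}
    for name in toponyms:
        hit = geocoder.geocode(name, exactly_one=True)
        if hit and hit.raw.get("countryCode") == country_code:
            resolved[name] = (hit.address, hit.latitude, hit.longitude)
    return resolved

geosoup = resolve_in_country(["Firenze", "Siena", "Paris"], country_code="IT")
```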

On the other hand, there are names that match many countries simultaneously — like Paris and London, which are also present in the United States. Boston is present in the United Kingdom. There are many Springfields and other very common names. So in the end, we have a geosoup of points, but still with some points not relevant to the story. It’s so weird with those toponyms-homonyms.

Weird Geography

Toponymic homonyms are hard to geoparse. We know what we are talking about; we have been dealing with weird geography for a while. The U.S. is American weird. Weird is beautiful when it is a POI. Weird is hard and annoying when it makes geoparsing ambiguous. The “worst” top three, in descending order, are Russia, China, Iran. We are not planning to do at least two of them because they are bombing Ukraine, killing Ukrainians.

Let’s start from the top of the alphabet. Abbeville is present in the soup of points and places for a web document. How to resolve it — which is the right and true Abbeville? English Wikipedia lists 4 Abbevilles in France, one of them under the plain name; 3 in Ireland; 6 in the United States; 1 in the United Kingdom.

A geoparser may report “Abbeville” and “Georgia” as two different toponyms instead of “Abbeville, Georgia” as a single multi-word toponym. From those two toponyms, we may wrongly infer the French Abbeville and the country of Georgia. The context is paramount, and Natural Language Understanding aka NLU is vital. Knowing the country the document describes helps but isn’t always sufficient; context is the key.

British place names are heavily duplicated around the world. There are 55 Richmonds, 46 Londons, 41 Oxfords, 36 Manchesters, 35 Bristols. In the U.S. only, there are 91 Washingtons, 45 Franklins, 39 Clintons, 38 Arlingtons, 38 Centervilles, 35 Lebanons, 35 Georgetowns, 35 Springfields, 32 Chesters, 32 Fairviews, 31 Greenvilles, 29 Bristols, 28 Daytons, 28 Dovers.

New LLMs should help geoparse smarter — multi-word toponyms with disambiguation in context. Inversion: new LLMs should also help not to pull in the noise around toponyms, i.e. even single-word toponyms could be extracted better.

Clustering

Here goes the final step in the data pipeline — grouping and splitting the toponyms into clusters. A cluster corresponds to the automatic perimeter of a trip. There are trips within an area, and there are long one-way trips. Their routes have totally different geometry. Figuring out how to split the geosoups to make them useful for trip planning is the most difficult part here. It’s difficult to grasp where a cluster is, even for the human eye.

We tried many approaches for density-based clustering of geographical data, starting from DBSCAN and HDBSCAN to Robust Single Linkage and coding our own algorithms. Each algorithm had its own disadvantages in our context. The main issue is that all those algorithms do not handle a low (< 10) number of points well. DBSCAN does not follow long one-way routes. HDBSCAN does follow long one-way routes but is very sensitive to the low number of points and to the linkage distance.
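
For reference, this is roughly how a DBSCAN run over a geosoup looks; the eps radius, min_samples, and the sample points are illustrative, not our tuned values:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# (lat, lon) of toponym candidates, converted to radians for the haversine metric.
points = np.radians([
    [43.77, 11.25],   # Florence
    [43.32, 11.33],   # Siena
    [43.06, 11.49],   # Montalcino
    [40.71, -74.01],  # New York — noise pulled in by the parser
])

# eps is in radians: ~50 km on Earth ≈ 50 / 6371.
labels = DBSCAN(eps=50 / 6371.0, min_samples=2, metric="haversine").fit_predict(points)
# Label -1 marks noise; the remaining labels are cluster ids mapping to trip perimeters.
print(labels)
```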

The approach that worked best for us was a combination of our own algorithm and an Elliptic Envelope to detect outliers and keep only relevant points. The Elliptic Envelope has to be configured on a case-by-case basis, as there is no one-size-fits-all config. It is sensitive to the point distribution and has to be tuned accordingly, e.g. per country.
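
A minimal sketch of the outlier-detection step with scikit-learn’s EllipticEnvelope; the contamination value and the sample points are illustrative only — in practice it is tuned per case, e.g. per country:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

points = np.array([
    [43.77, 11.25], [43.32, 11.33], [43.06, 11.49], [43.47, 11.04],  # Tuscany
    [43.08, 11.68], [43.46, 11.88], [43.40, 10.86], [43.27, 11.99],  # Tuscany
    [48.85, 2.35],                                                   # Paris — outlier
])

# contamination is the expected fraction of outliers; a per-case knob, not a constant.
mask = EllipticEnvelope(contamination=0.1).fit_predict(points)
kept = points[mask == 1]  # +1 = inlier, -1 = outlier
```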

Example of Anomaly aka Outlier Detection in the U.S.

A web document could have multiple clusters of points; when you hit [Supertrip], a uniformly random choice is made of which cluster to take to auto-populate the Create Trip form for you. So when you find a travel article directly from Curiosio and click [Supertrip] multiple times on the same article, it’s OK to see different points & places filled into the blanks.

A usual case for multiple clusters is a Reddit post with comments. One traveler is asking for options. Other Redditors-travelers respond with so many alternatives that together they are all one big soup of points. How could the original post author make a decision from that soup? Curiosio automatically cuts the points into portions and suggests one cluster at a time.

Trip topology is defined based on the cluster shape. If the cluster forms a long worm, then the trip topology is One-way. If the cluster contains only one point, then the topology is Destination aka Flower. In other cases, it’s a Round trip. It is possible that Round and Destination configurations for the same geosoup of points are present simultaneously.
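
A toy heuristic — not Curiosio’s actual rule — for how cluster shape could map to topology; the elongation threshold is an arbitrary guess:

```python
import numpy as np

def guess_topology(points):
    """One point → Destination; strongly elongated cluster → One-way; else Round."""
    pts = np.asarray(points, dtype=float)
    if len(pts) == 1:
        return "Destination"
    # Elongation = ratio of principal-axis spreads (eigenvalues of the covariance).
    eigvals = np.linalg.eigvalsh(np.cov(pts, rowvar=False))
    if eigvals[0] <= 1e-9 or eigvals[-1] / eigvals[0] > 25:  # a "long worm"
        return "One-way"
    return "Round"
```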

Corpus

The corpus of travel documents consists of 100,000 processed web pages from three dozen web sources. We save only metadata: keywords, phrases, toponyms, country/countries, clustered points, etc. A document consists of publicly available Open Graph data; custom distilled metadata such as keywords, key phrases, and a summary; and custom-built data such as clusters, perimeter, and topology.

We used multiple extractive summarizers because some of them were prone to summarizing the noise instead of the real stuff. We wanted to use only abstractive summarization, to distance ourselves from the original data from the very beginning. Abstractive summarization sucked so badly that we stopped using it. (With GPT-3.5+ things have changed, so we could try again.)

Data features reveal that the data from certain sources looks similar. One distinctive cohort is Frommer’s, Lonely Planet, Moon, The Culture Trip. Another cohort is Matador Network, Thrillist, AFAR, Geographical, National Geographic, Australian Geographic. Inversion — the data is very different between Condé Nast Traveler and travel blogs on Medium. Different parameters, or even different algorithms, could be applied per cohort for higher efficiency and quality.

~40% of docs in the corpus do not describe a travel story directly but describe food experiences or a social, historic, or archaeological context that could become a theme for travel. Many docs have a soup of toponyms, but these are mostly false positives parsed from embedded advertisement blocks or teasers. (Parsing could be improved first, then Scribing could be improved.)

With the other ~60%, we had a challenge detecting which country the docs belong to. It’s because the soup of toponyms contains false positives (toponyms from different countries irrelevant to the described trip). Our algorithm is not 100% accurate; we still have about a 3–5% error rate where we detect the wrong country and, as a result, match all toponyms to ones present in that country. It’s a toponymic-homonym paradise when that happens.

These ~100k web pages cover 237 countries (we support only 24 of them so far). We extracted ~50k unique keywords and ~500k key phrases from those web pages. This meta-data unlocks information for us as we analyze how people [describe] travel and plan to improve our AI accordingly. This meta-data unlocks the Netflix of Trips, as we got the axes to rotate the data in multiple dimensions.

Index

Documents are loaded into Elasticsearch to create the Index. We decided to use Elasticsearch because it allows us to index our documents without any preliminary tuning. Even the default mapping works well with Full-Text Search (though we applied some cleansing to the documents to avoid or postpone custom mapping). Loading the entire corpus takes an hour.

Elasticsearch uses the so-called Query DSL to write search requests. It’s JSON-based. It requires some learning to understand the structure, style, and idioms. It is needed to find documents relevant to the user context, such as the selected countries and search queries. A search for a “scenic trip” returns different documents depending on the selected country. That is done by boosting document scores by country. We also boost by document quality to sort the resulting documents and show the best on top.
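
An illustrative request through the Python client — the index name, field names, and boost values below are assumptions, not our production query:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="travel-docs",  # hypothetical index name
    query={
        "bool": {
            "must": [{"match": {"text": "scenic trip"}}],
            # Boost documents from the user's selected country.
            "should": [{"term": {"country": {"value": "IT", "boost": 2.0}}}],
        }
    },
    # Sort by a quality field first, then by relevance score.
    sort=[{"quality": {"order": "desc"}}, "_score"],
    size=10,
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"].get("title"), hit["_score"])
```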

We are running Elasticsearch in a Docker cluster to simplify the deployment. It’s a simple one-click setup then. However, we systematically experience some instability with this Elasticsearch configuration. The main issue is a slowdown in response time for the same search requests. It could be a 10x time difference, and it is inconsistent. Elasticsearch could crash, and it does crash sometimes. We have not investigated the root cause yet and are not sure whether the issue was within Elastic, Docker, or both.

Supertripping

We are continuing with the article you already know from the Curiosio v20 post. Let’s make trip plans based on that article from National Geographic. Go to https://curiosio.com, select Italy, scroll down, and find Tuscany Breathtaking Countryside. You should be able to find it even in two words Tuscany Breathtaking. Click it in the Find results, and hit the red [Supertrip] button to automatically pre-fill the Create Trip form with the points & places from the article.

Sometimes it’s the same list of points and the same topology of the trip every time. Sometimes it’s a different topology for the same list of points & places, e.g. Round or Destination. Sometimes it’s a different list of points. For the Tuscany case, it’s always the same list of points and a Round trip.

User Bob. Bob is planning a trip with his fiancée; he is flexible, he is exploring the options. Bob selected the 2 Travelers and Rental Car options. He left all other fields at their defaults. Bob got 7 trip plans in less than a minute. Here are three of them: Plan1, Plan2, Plan3. As Bob did not set a duration, he got options for 3, 4, and 5 days. Below are screenshots of those three trip plans together with the Create Trip form with the requirements.

User Alice. Alice is planning a trip with her family. She wants to start and finish in Florence, she is strict with the duration (dates) of one week, and she prefers the total budget of this trip to stay within $2500. So Alice edits the pre-filled Create Trip form to start & finish in Florence, moves Buonconvento to waypoints, and gets different trip plans than Bob: Plan1, Plan2, Plan3. There are more alternatives, e.g. one or a couple of points could be omitted for the sake of sneaking in San Marino or Venice.

All trip plans are 7 days long because Alice set the duration. With a longer duration, it becomes possible to drive to more points not mentioned explicitly, as travel-through. To prevent driving too far, a custom Perimeter could be set. Then Curiosio will create the routes within the perimeter. All points & places in the Itinerary and on the route are clickable, opening the corresponding Wikipedia pages for detailed descriptions.

Sketch a Trip

This is an experimental feature working only in the Lab. If you find your favorite travel article outside of Curiosio, you could take the text block you are interested in and paste it into Curiosio for processing. You could copy/paste any text, for example from email or from chat. There are no requirements for formatting, so grab a blob of text describing a trip and put it in Curiosio. Curiosio will do the rest to give you relevant trip plan options built on top of the text you provided.

Curiosio is taking the Japan Romantic Road in plain text as a trip description

The email thread is the first step from plain text to something else. A comment thread on Reddit or a travel forum is definitely not prose anymore. A thread of short comments looks more like a chat, ain’t it true? And there are chats — group chats, peer chats — where you discuss your travel plans.

Article-to-Trip is the first step to a Supertrip — from text that usually contains travel information. There are cases when articles from travel web sources are empty or about something else. Text-to-Trip is the second step — to a Supertrip from any text that potentially contains travel information. Chat-to-Trip is the third step — doing travel planning interactively. Voice-to-Trip is the fourth step — typing is replaced by talking and listening. We will experiment with this more and decide what to deploy publicly and when.

Testing

Officially, Curiosio is certified only for Chrome browsers from Google. We test only Chrome, and only a few recent versions of it. There is no time to test more front-ends because we are building the flagship travel tech, hence the priority is there. As soon as the product is mature enough, we will pay more attention to other browsers. Or we will think of making mobile apps of Curiosio — for iPhones and Droids.

Casual testing for Safari isn’t possible because Safari runs only on Apple’s operating systems, which run only on Apple hardware. Our iPhones and iPads are too old to update the software; we have been on Droids for a while. We stopped using mobile device labs because they charged us when we didn’t expect to be charged (an unfair charge). Only users tell us from time to time what’s wrong in Safari, and then we look. We do, though, casually test in Firefox, because we use Firefox and mobile DuckDuckGo, and sometimes Chromium.

The Curiosio UX is a very thin web layer on top of the Ingeenee computational intelligence. Theoretically, such a simple WUI should work well everywhere [by design]. We are going to keep the UX that way because nobody knows how our next initiative with the Digital Concierge could unfold.

UX and AI

Affective Computing is coming closer and closer. With LLMs, it feels like Emotional Intelligence is emerging. ChatGPT talks to you as a kid when you talk to it as a kid. ChatGPT talks to you as an expert if you talk to it as an expert. If you talk as a scientist, then ChatGPT will talk to you as a scientist. That’s the top trait people don’t consciously notice but feel — and that’s why they embrace the intelligent tool.

Other LLMs are appearing here and there every week, like mushrooms after the rain. AI is becoming smarter, and people/travelers are becoming happier because they are getting useful tools for their travel endeavors. Here is a diagram of the LLM’s role in the UX and the Computational Intelligence role behind the UX. The Ingeenee AI engine is the vantablack vertical monolith on the right, together with Wolfram’s red rhombic hexecontahedron.

User Experience and Artificial Intelligence

The things that don’t change much are our planet and the points & places worth visiting to experience the different sides of the Earth. The human lifespan is not changing much despite longevity efforts. Free time for vacations, or at least for not working, is not growing much year over year.

There is only one planet, one life, and only two resources required for visiting all places before you die — time and money. Curiosio helps to get the maximum from what you have and what you want. You follow your curiosity, Curiosio does the magic. If you got down here by reading this long post but have never tried Curiosio before, then navigate your desktop Chrome browser to https://curiosio.com immediately.

#StandWithUkraine

We kindly ask you to push and encourage your governments to send more and bigger weaponry to Ukraine to help defeat rashists. Donations to Come Back Alive or Prytula Foundation are helpful — these funds are doing vital military procurement at scale from stocks around the world.

Curiosio is being designed and built in Ukraine. Our country is under attack from Ruzzia. The full-scale invasion has been ongoing since the 24th of February, 2022; the war began in February 2014. The enemy is brutal, dishonorable, and atrocious. Look at the image bank for the evidence, watch the ruzzian war crimes in pictures and numbers, and read the witness stories.
