by Vas Mylko, Roman Bilusyak
Curiosio is undergoing Beta4 — we are adapting and scaling the search technology to the size and richness of the United States. The country is huge, even without Alaska. The number of populated places is ~20,000. The number of POIs that we recognized is ~40,000. Big cities usually have many POIs. A small town could be a POI itself.
American POIs
We made two visualizations: where POIs are and where curiosity value is. The first rendering shows the geographical distribution and concentration of POIs.
The second rendering shows the geographical distribution and concentration of curiosity value. The map looks quite different from the first.
By default, Curiosio will try to route you through the most interesting places. Reminder: you can always specify any waypoints, and Curiosio will route you through them, filling the rest of the trip with “default” interesting places.
American Cities
American cities are metal, if you know what the subreddit /r/natureismetal is. It is very difficult [for us] to bind crawled POIs to cities and towns. First, there are many types of populated places in the U.S., including grey areas like CDPs (Census-Designated Places) and Unincorporated Communities. Populated places are nested inside each other like matryoshka dolls. Second, the shapes of those matryoshkas are so weird that even for well-known cities it is difficult to assign POIs to the right one. Third, there are errors in coordinates. There are many other edge cases like these. BTW, if you like weird geography, see our earlier post of the same name, Weird Geography.
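To make the nesting problem concrete, here is a minimal sketch of one way to bind a POI to a populated place: test the POI's coordinates against each place's boundary polygon and, when several nested places contain it, pick the smallest one. This is an illustration under assumptions, not a description of what Curiosio does: the function names and the toy square boundaries are hypothetical, coordinates are treated as flat (lon, lat) pairs, and real boundaries with holes or multiple parts would need a proper GIS library.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]   # (lon, lat), treated as flat x/y for this sketch
Polygon = List[Point]         # boundary vertices, implicitly closed


def contains(polygon: Polygon, point: Point) -> bool:
    """Ray-casting test: count boundary crossings of a ray going right from the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can cross the ray.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def area(polygon: Polygon) -> float:
    """Shoelace formula; only used to rank nested places by size."""
    n = len(polygon)
    total = 0.0
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def assign_place(point: Point, places: Dict[str, Polygon]) -> Optional[str]:
    """Return the smallest place whose boundary contains the point, or None."""
    candidates = [name for name, polygon in places.items() if contains(polygon, point)]
    if not candidates:
        return None
    return min(candidates, key=lambda name: area(places[name]))


def square(low: float, high: float) -> Polygon:
    """Hypothetical toy boundary: an axis-aligned square."""
    return [(low, low), (high, low), (high, high), (low, high)]


# Three nested, made-up places: a county containing a city containing a CDP.
toy_places = {"county": square(0, 10), "city": square(2, 6), "cdp": square(3, 4)}
```

With these toy boundaries, `assign_place((3.5, 3.5), toy_places)` picks the innermost place, while a point outside every boundary yields `None`. Coordinate errors like the ones mentioned above would still put a POI in the wrong place, so a check like this only helps when the coordinates themselves are trustworthy.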
American Size
The U.S. is big, and each state is indeed a state: a kind of country with its own history and legacy. The true size of the United States is better seen in comparison to Canada and the whole of Western Europe. Curiosio Beta3 was made to search for multi-point road trips in big countries: Australia, Canada, and India. Beta4 will be the next level: many cities, towns, and places across a big area. We have finally built a version of the U.S. knowledge graph with sufficient quality and the critical mass needed for diversity.
American Data
Although we crawled more than 100,000 POIs for the U.S., the list shrank about 2.5x because many POIs were absent from English Wikipedia or had no coordinates there. Right now we operate with ~40,000 POIs from populated places and nature, all with multiple attributes. We use CSV Fingerprints by Setosa.io for a high-level preview of the data: to check uniformity and structure, and to pick rows for interactive validation by a human. Below is an example of the data view. It is a table of 7 columns and 40K rows. The data in each column should be of a specific type.
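The two checks described above (drop POIs without a Wikipedia entry or coordinates, then confirm each column holds the expected type) can be sketched in a few lines. Everything specific here is an assumption made for illustration: the column names, the five-column schema (the real table has 7 columns that are not listed in this post), and the sample rows are made up.

```python
import csv
import io


def is_text(value):
    """A text cell is valid when it is non-empty after trimming."""
    return bool(value.strip())


def is_latitude(value):
    try:
        return -90.0 <= float(value) <= 90.0
    except ValueError:
        return False


def is_longitude(value):
    try:
        return -180.0 <= float(value) <= 180.0
    except ValueError:
        return False


# Hypothetical schema: column name -> validator for that column's type.
SCHEMA = {
    "name": is_text,
    "wikipedia": is_text,
    "lat": is_latitude,
    "lon": is_longitude,
    "type": is_text,
}


def find_problems(rows, schema):
    """Return (row_number, column, value) for every cell that fails its check."""
    problems = []
    for row_number, row in enumerate(rows, start=1):
        for column, check in schema.items():
            value = row.get(column) or ""
            if not check(value):
                problems.append((row_number, column, value))
    return problems


def keep_usable(rows):
    """Keep only POIs that have a Wikipedia title and valid coordinates."""
    return [
        row for row in rows
        if is_text(row.get("wikipedia") or "")
        and is_latitude(row.get("lat") or "")
        and is_longitude(row.get("lon") or "")
    ]


# Made-up sample data: one clean row, one without Wikipedia, one with a bad latitude.
SAMPLE = """name,wikipedia,lat,lon,type
Example Bridge,Example_Bridge,37.8,-122.4,bridge
Example Oddity,,37.0,-122.0,novelty
Example Museum,Example_Museum,north,-100.0,museum
"""
sample_rows = list(csv.DictReader(io.StringIO(SAMPLE)))
```

Running `find_problems(sample_rows, SCHEMA)` flags the empty Wikipedia cell and the non-numeric latitude, and `keep_usable(sample_rows)` keeps only the first row. A visual tool like CSV Fingerprints shows the same kind of problem at a glance; a script like this is just the non-visual equivalent.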
The knowledge graph is our secondary technology. We need it to make our primary technology work: the multi-objective search engine on top of the knowledge graph. We have recently described in Frequently Asked Questions where we take the data from and how we use ML to classify the POI data. There are multiple POI types; they were created by significantly reducing the set of GeoNames feature codes. Below are some raw visualizations from the lab: analyses of the distribution of the most curious POIs by type.
Beta4 is not only about size. It is also about improving the quality of ML. We experimented with different classifiers and decided to stick with SVM because it gave very good results quickly and was not power hungry at all. However, the quality was not good enough at the American scale. Several tricky POI types, such as novelty and technofetish, are difficult to classify even for humans.
There are 40+ POI classes, and improving overall quality with the same approach that worked well before would require a massive amount of manual labeling. That was the moment we decided to apply the Active Learning technique. We were inspired by Andy Bosyi at the AI & BigData conference in Lviv, where we also presented. We implemented our own Active Learning framework and raised prediction quality from 80% to 95% on American data.
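For readers new to Active Learning, the core idea is to spend human labeling effort only on the examples the classifier is least sure about. The sketch below shows one round of pool-based active learning with margin sampling, a common query strategy. It is not our framework: the post does not say which strategy we use, and the names `fit`, `ask_human`, and the POI ids are hypothetical placeholders for a real classifier (such as an SVM with probability outputs) and a real labeling UI.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def margin(probabilities: Sequence[float]) -> float:
    """Gap between the two most likely classes; a small gap means the model is unsure."""
    top = sorted(probabilities, reverse=True)
    return top[0] - top[1] if len(top) > 1 else top[0]


def select_queries(pool_probabilities: Dict[str, Sequence[float]], batch_size: int) -> List[str]:
    """Return the ids of the most uncertain items (ties broken by id for determinism)."""
    ranked = sorted(
        pool_probabilities,
        key=lambda poi_id: (margin(pool_probabilities[poi_id]), poi_id),
    )
    return ranked[:batch_size]


def active_learning_round(
    labeled: List[Tuple[object, str]],
    pool: Dict[str, object],
    fit: Callable,        # hypothetical: trains on `labeled`, returns a predict_proba function
    ask_human: Callable,  # hypothetical: returns the true label for a POI id
    batch_size: int,
):
    """One round: train, score the unlabeled pool, have a human label the hardest items."""
    predict_proba = fit(labeled)
    scores = {poi_id: predict_proba(features) for poi_id, features in pool.items()}
    for poi_id in select_queries(scores, batch_size):
        features = pool.pop(poi_id)                    # move the item out of the pool...
        labeled.append((features, ask_human(poi_id)))  # ...and into the training set
    return labeled, pool
```

Repeating `active_learning_round` until the validation score stops improving gives the usual loop. With 40+ classes, a margin-based score is a natural fit because it looks only at the two classes the model is torn between.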
As soon as the search for road trips in the U.S. works fast enough, we will release the U.S. to the public. You will be able to plan curious multi-point road trips in the U.S., and optimize your journeys by time and money. Stay curious.