FAQ - Cyc

Frequently Asked Questions

There is a great deal of confusion out there about AI. And it’s not your fault if you’re confused. Everyone who can spell ‘AI’ is selling it, and the conceptual space has gotten very muddy. On this page, we’ll do our best to clear things up by responding to the questions our potential and actual clients frequently ask us. While we believe these answers will help, we also know there is no substitute for talking with a person; make sure to request a demo.

What differentiates the Cyc ontology?

First, the Cyc knowledge base (KB) is composed of some 25 million assertions. When you combine this with the generality of the knowledge and the efficient inference engines that can leverage this knowledge to generate new conclusions, Cyc easily has trillions of pieces of usable knowledge. When you add to this the knowledge that Cyc possesses by accessing external databases (akin to how you might say you know the phone numbers stored on the contact list in your phone, even if you could not recite them from memory), the size of Cyc’s KB clearly sets it apart from other AI platforms.

Second, the Cyc ontology is expressed in a higher-order logic. While the details of this are important and interesting to logicians, the upshot can be clearly seen in contrast with triplestores (RDF stores). Triplestores are so called because each statement takes three arguments: a subject, an object, and a relation between them. Triplestores are often represented graphically, with two nodes connected by a directed relational arrow. This is useful for saying things like the following:

Casey works as an engineer.

The triplestore can then relate the object <Casey> to the object <engineer> by the <works as a> relation. However, English sentences, and the propositions that they represent, are often much more complicated than these two-place relations can handle. Consider:

Casey believes that Lara had coffee with breakfast.

The latter part, “Lara had coffee with breakfast”, is amenable to a triplestore, but nesting that sentence inside a “Casey believes” component is going to be very complicated to represent in such a framework. On the other hand, Cyc’s language, CycL, is expressive enough that you can say anything in CycL that you can say in English. In the present context, the differentiator is that CycL allows for arbitrarily high-arity relations. Instead of being stuck with relations between two objects, you can relate arbitrarily many things. Take another example:

Wearing the red shirt rather than a green one caused the bull to charge rather than ignore him.

In CycL you can express this with something like:

(#$causes-Contrastive <WearingARedShirt> <WearingAGreenShirt> <BullCharges> <BullDoesNotCharge>)

In that case, we have a relation with four relata. Such expressivity allows for flexible representations that do not have to simplify and lose information when knowledge is encoded in Cyc. For a more in-depth discussion, see our Technology Overview.
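To make the contrast concrete, here is a minimal Python sketch. It is not CycL and makes no claim about how Cyc stores assertions; it only models a triple as a fixed three-slot record and a more expressive sentence as a nested tuple of arbitrary arity. The names Triple, arity, depth, and the predicate strings "worksAs", "hadWithBreakfast", and "believes" are assumptions made up for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A triplestore-style fact: exactly one subject, relation, and object."""
    subject: str
    relation: str
    obj: str


# "Casey works as an engineer." fits the three-slot shape directly.
works_as = Triple("Casey", "worksAs", "engineer")

# A more expressive sentence is modeled here as a tuple whose first element
# is the relation and whose remaining elements are its arguments. Arguments
# may themselves be sentences, so nesting needs no extra machinery.
had_coffee = ("hadWithBreakfast", "Lara", "coffee")
belief = ("believes", "Casey", had_coffee)

# The four-place contrastive-causation example from the text.
causes_contrastive = (
    "causes-Contrastive",
    "WearingARedShirt",
    "WearingAGreenShirt",
    "BullCharges",
    "BullDoesNotCharge",
)


def arity(sentence: tuple) -> int:
    """Number of arguments the relation takes (everything after the head)."""
    return len(sentence) - 1


def depth(sentence) -> int:
    """Nesting depth: 1 for a flat sentence, 2 if an argument is a sentence."""
    if not isinstance(sentence, tuple):
        return 0
    return 1 + max((depth(arg) for arg in sentence[1:]), default=0)
```

In this toy encoding, the belief sentence has two arguments, one of which is itself a sentence, and the causal sentence has four. Neither fits into a single Triple without being split up or given helper nodes.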

Third, the Cyc knowledge base uses “microtheories” to contextualize knowledge. For instance, we could assert in #$TheSimpsonsMt (the microtheory for knowledge about the Simpsons) that Bart is a male fourth-grader. But in another context, #$RealWorldDataMt (the microtheory for knowledge about the real world), we can assert that Bart is a cartoon character. This means that even though Cyc knows that cartoon characters cannot be real persons, we can put ourselves in the context of the cartoon when appropriate. This contextualization is not just useful for fictional contexts. Consider:

There are many different legal contexts: e.g. you should drive on different sides of the road in the United States versus England.
Newtonian and Quantum Physics are inconsistent, but it is often very useful to act as if one or the other is the right model to use.
In personal belief contexts microtheories are very useful: we can build a microtheory that contains all and only the beliefs held by a given agent to see what would be reasonable for that agent to conclude.

This approach to contextual knowledge has allowed Cycorp to build a massive knowledge base without worries of violating global consistency: we only need to maintain consistency within contexts.
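The idea of keeping consistency per context rather than globally can be sketched in a few lines of Python. This is an illustrative toy, not Cyc's implementation: the class ContextualKB, its methods, and the tuple encoding of sentences are assumptions, and the sketch ignores any relationships between contexts.

```python
class ContextualKB:
    """Toy knowledge base whose assertions are scoped to named contexts."""

    def __init__(self):
        # Maps a context name to the set of sentences asserted in it.
        self._assertions = {}

    def assert_in(self, context, sentence):
        """Record that a sentence holds within one context only."""
        self._assertions.setdefault(context, set()).add(sentence)

    def holds(self, context, sentence):
        """True only if the sentence was asserted in this exact context."""
        return sentence in self._assertions.get(context, set())


kb = ContextualKB()

# The Bart example from the text: two claims that would clash if they
# lived in one global pool sit in separate contexts instead.
kb.assert_in("TheSimpsonsMt", ("isa", "Bart", "FourthGrader"))
kb.assert_in("RealWorldDataMt", ("isa", "Bart", "CartoonCharacter"))

# The driving example: the same question has different answers by context.
kb.assert_in("UnitedStatesTrafficMt", ("drivingSide", "right"))
kb.assert_in("EnglandTrafficMt", ("drivingSide", "left"))
```

A query is always asked relative to a context, so asking whether Bart is a fourth-grader succeeds in the Simpsons context and fails in the real-world one. The context names for traffic law are invented for this example.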

Lastly, Cycorp has a distinguished history of working with a wide variety of clients, ranging from government defense agencies to private companies in the health, energy, and financial sectors. This means that Cyc does not have siloed information that is only relevant to a particular domain or sub-domain. Rather, this knowledge is asserted generally, and has been proven in applications ranging from taxes to chemistry, from engineering to natural language understanding.

How does knowledge get added to Cyc?

Cyc already has more than 90% of the knowledge that our new clients will ultimately use, but we do need to add pieces of domain knowledge and proprietary client information to have robust and customized applications. In this FAQ, we will quickly review the four primary methods by which knowledge is added to the knowledge base.

1. Knowledge is hand-crafted by an ontologist.

Cycorp is staffed by Ontological Engineers, typically philosophy Ph.D.s who are trained in making careful distinctions, converting natural language to higher-order logic, and translating this into CycL, the epistemological language for Cyc. This method is used for most projects, especially early on. Ontologists are also available to deal with especially tricky cases after a given product is deployed. However, Cycorp ultimately wants to hand the reins over when a product is deployed, empowering customers to use other methods for adding to the knowledge base as necessary.

2. Knowledge is pulled in from structured or semi-structured data.

Certain documents, databases, and other sources may contain knowledge that is useful and structured enough to scrape into the knowledge base. This is less common than the following method.

3. Knowledge is accessed by connecting to, and/or modifying, a database.

One of Cyc’s greatest strengths is its ability to act as an interlingua between disparate data sources. We can connect to your data where it lives natively, and understand that data just as a new employee would. You simply need to tell Cyc how to connect to the data source, how it is structured, and what the data means. Adding data sources is a simple process that we have developed specialized tools for. In this way, clients can easily augment the knowledge base by adding data sources, or modify it by changing the contents of existing data sources.

4. Knowledge is entered using specialized knowledge addition tools.

As mentioned in (1) above, Cycorp aims to provide software solutions rather than long-term professional services. To this end, we generate customized tools that enable those with little to no Cyc training to expand the knowledge base. We have a number of these tools available for demonstration upon request.

What does Cyc know about <topic>?

Cyc gets its name from “Encyclopedia”, and it has an enormous knowledge base that models the real world (as well as some fictional ones).

As such, it makes sense when considering whether to use Cyc to ask what it knows about your particular domain. For instance, folks in the healthcare industry ask us what Cyc knows about hospitals, medical procedures, and insurance. We can answer this question, and we do, but it turns out this isn’t the best question to ask.

Problem 1: Burying the Lede

The first problem with “what does Cyc know about <topic>?” is that it obscures the more fundamental question: “how long will it take to teach Cyc all that is necessary for my application?” These two questions are indeed related: if Cyc already knew everything about your domain, then no time would be needed to add knowledge. However, because of the expressiveness of the language, the generality at which things are asserted, and the tools available to ontologize concepts, adding knowledge to Cyc is an efficient process.

The generality point bears further explanation. Take some claim like “All horses have heads.” While this is true, we would not outright assert this in Cyc. Instead, it is better to teach Cyc one piece of reusable information, such as “All mammals have heads.” For another example, we can teach Cyc that whenever you pinch a fluid conduit while some fluid is flowing, pressure builds upstream and decreases downstream. This single piece of knowledge can be applied to straws, veins, hoses, and standpipes in oil drilling operations. Because the knowledge base is filled with such reusable bits, adding new domain knowledge is often very quick: a few assertions can provide hooks to leverage a great deal of already present knowledge.
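The payoff of asserting knowledge at a general level can be illustrated with a small Python sketch. It is a toy under stated assumptions: the dictionaries GENERALIZATIONS and HAS_PART and the two helper functions are hypothetical, and real inference in Cyc is far richer than this lookup.

```python
# Hypothetical toy taxonomy: each collection maps to its direct generalizations.
GENERALIZATIONS = {
    "Horse": ["Mammal"],
    "Dog": ["Mammal"],
    "Mammal": ["Animal"],
    "Animal": [],
}

# One reusable assertion, stated once at the general level ("All mammals
# have heads") rather than repeated for horses, dogs, and so on.
HAS_PART = {"Mammal": {"Head"}}


def all_generalizations(collection):
    """Return the collection plus everything it specializes, transitively."""
    seen = []
    stack = [collection]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(GENERALIZATIONS.get(current, []))
    return seen


def parts_of(collection):
    """Collect the parts asserted for the collection or any generalization."""
    parts = set()
    for general in all_generalizations(collection):
        parts |= HAS_PART.get(general, set())
    return parts


# Teaching the system about a new kind of thing takes one taxonomy link;
# the existing general knowledge then applies automatically.
GENERALIZATIONS["Whale"] = ["Mammal"]
```

Here nothing was ever asserted about horses or whales having heads, yet both answers follow from the single statement about mammals. Nothing is inherited in the other direction, so the more general collection Animal gains no parts.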

One way to get insight into this is to find something Cyc doesn’t know about in your domain and then have us teach Cyc about this concept and report how long it took. The sample size of a POC/POV/Phase Zero of a project is sufficient to demonstrate scalability. However, we are also happy to take a very small sample by simply adding some concept that Cyc previously did not know about and transparently recording how long the knowledge addition took.

Problem 2: Measuring Virtues

The second problem with “what does Cyc know about <topic>?” is that it focuses on a single virtue: having lots of knowledge. There are many other virtues: having few falsehoods, being able to reason quickly over a knowledge base, having knowledge with high utility, having knowledge that is internally consistent, etc. Some of these virtues can conflict: the more that one knows, the more difficult it often is to draw out all of the implications of that knowledge.

So, how do you know when knowledge of some domain is complete (enough)? Should we prioritize one of the virtues over others? Internally, we attempt to balance these virtues with test-based development. This means that prior to beginning a particular task, we lay out a series of things in plain English that we expect Cyc to be able to conclude. For example, if we are ontologizing the rules of the road, we may want to ensure that Cyc can answer at least the following questions:

In what direction should you turn your wheels when parking facing uphill without a curb?
On a one-way street, what color is the broken lane marker?
What color is a yield sign?
What shape is a stop sign?

These questions may have different answers in different contexts: not every country has the same set of road signs, for instance. We then teach Cyc about the necessary underlying concepts to answer these questions correctly in any context. However, we don’t just “teach to the test”: as discussed above, we teach Cyc things at a general level, of which the test is supposed to be a representative sample. Once we have done this and all of the tests are passing, we create new tests to see whether our coverage was general enough. Ideally, we ask folks who are not on the existing project to come up with questions that they would expect Cyc to be able to answer if it had mastery over the given domain. Sometimes there are third-party tests that serve as a good basis for evaluating knowledge. In the case of driving, we might use review materials for state driver’s licensing tests.

To be clear, even though we do not solely value quantity of knowledge, Cyc still does quite well by that measure. The knowledge base contains over 25 million assertions, and our inference engines enable us to efficiently conclude trillions of bits of knowledge.

Problem 3: Knowledge Versus Data

At a first pass, knowledge involves general, reusable truths about types of things, whereas data involves specific claims about individuals. A few examples of knowledge:

A birthday is the calendar date when an animal was born. Humans often celebrate the anniversary of birth with parties.
Stocks can be bought or sold in specialized markets called stock markets.

Contrast with some similar examples of data:

Casey Hart’s birthday is August 2, 1986.
The stock price of Amazon as of March 28 at 7:30 AM was $1,765.70.

Cycorp is in the business of knowledge. Data is cheap and ubiquitous: we can call out to Wikidata or other external sources to find such individual facts. This is not to say that data is not important, just that one shouldn’t evaluate the quality of our knowledge base by reference to whether Cyc knows various bits of trivia. To compare: given that your cell phone can store all of the phone numbers on your contact list, we would not expect you to have all of your friends’ and family’s numbers memorized. To the contrary, it might be considered a waste of intellectual resources to remember Aunt Kathy’s number when the task is so easily and cheaply farmed out to your contact list.

This is not to say that there isn’t any data in Cyc. There are some pieces of trivia that are referenced frequently enough that storing them directly in Cyc rather than needing to call out to a database is more efficient. To return to the analogy, you probably have a few frequently used phone numbers memorized, even if you could look them up on your contact list as well.
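The contact-list analogy corresponds to a familiar pattern: keep a few frequently used facts locally and delegate everything else to an external source. The Python sketch below is only an illustration of that pattern. The class FactLookup, the function external_source, and the EXTERNAL_DATA dictionary standing in for a remote database are all assumptions, not part of any Cyc interface.

```python
class FactLookup:
    """Answer from a small local store first, then fall back to an external source."""

    def __init__(self, external_source, local_facts=None):
        self._external_source = external_source
        self._local_facts = dict(local_facts or {})
        # Counts how often we had to leave the local store.
        self.external_calls = 0

    def get(self, key):
        """Return the locally stored value if present, else ask the external source."""
        if key in self._local_facts:
            return self._local_facts[key]
        self.external_calls += 1
        return self._external_source(key)


# Stand-in for a remote database or service. The birthday is the example
# datum from the text; a real deployment would query an actual data source.
EXTERNAL_DATA = {("birthday", "Casey Hart"): "1986-08-02"}


def external_source(key):
    """Hypothetical external lookup; returns None when the fact is unknown."""
    return EXTERNAL_DATA.get(key)


# A frequently referenced fact is worth "memorizing" locally.
lookup = FactLookup(external_source, local_facts={("capital", "France"): "Paris"})
```

Calling lookup.get(("capital", "France")) is answered locally, while lookup.get(("birthday", "Casey Hart")) triggers one external call. Which facts deserve a local copy is a tuning decision based on how often they are referenced.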

Conclusion

Cyc probably knows quite a bit about whatever domain you are interested in, but we should be careful not to focus on the wrong question. Instead, we make sure we 1) target time to solution rather than the current state of the knowledge base, 2) focus on successful inference rather than just quantity of knowledge, and 3) appreciate the difference between knowledge and data.

How is Cyc different from Machine Learning and other AI?

This is a broad question. On the one hand, we are distinguished by our people (and executives), products, and philosophy. But those answers are found elsewhere on this site, so instead we will focus in this FAQ on how Cyc compares to ML AI solutions.

Symbolic Reasoning versus Machine Learning

Cyc leverages symbolic reasoning rather than machine learning (ML). Symbolic reasoners were dubbed Good Old Fashioned Artificial Intelligence (GOFAI) by John Haugeland. In short, ML approaches were pioneered in the late ’50s and ’60s and allow computers to ‘learn’ by training over large data sets. In contrast, GOFAI systems start with a logical representation of knowledge and then search and perform inference over this knowledge to come to conclusions.
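The symbolic style of reasoning described here can be illustrated with a tiny forward-chaining loop in Python. This is a deliberately simplified sketch and not Cyc's inference engine: facts are plain strings, rules are ground if-then pairs with no variables, and the names forward_chain, RULES, and FACTS are assumptions. The example content borrows the pinched-conduit scenario from earlier on this page.

```python
def forward_chain(facts, rules):
    """Apply rules until nothing new follows; return known facts and justifications."""
    known = set(facts)
    why = {}  # Maps each derived conclusion to the premises that produced it.
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                why[conclusion] = premises
                changed = True
    return known, why


# Each rule is (premises, conclusion). Real systems use rules with
# variables; these ground strings keep the sketch short.
RULES = [
    (("the hose is a garden hose",), "the hose is a fluid conduit"),
    (
        (
            "the hose is a fluid conduit",
            "fluid is flowing through the hose",
            "the hose is pinched",
        ),
        "pressure builds upstream of the pinch",
    ),
]

FACTS = [
    "the hose is a garden hose",
    "fluid is flowing through the hose",
    "the hose is pinched",
]

known, why = forward_chain(FACTS, RULES)
```

Because every derived conclusion records the premises it came from, the system can say why it believes pressure builds upstream. A model that has only learned statistical associations cannot offer that kind of justification directly.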

GOFAI and ML have different strengths and weaknesses. ML shines when there are large, representative data sets. This is perfect for tasks like determining what movies Netflix should recommend to users. GOFAI shines when outputs can and must be explained, and when there is value to be gained in re-using the representations. This is why Cycorp has targeted areas with high stakes and regulation: energy, healthcare, and fintech.

While we distinguish our approach from ML-centric AI, we do not want to disparage ML. To the contrary, we firmly believe that solving artificial general intelligence will require a solution with both ML and GOFAI components. Many of our clients utilize Cyc to fill in the gaps left by ML solutions. While ML can harness massive datasets and computing power to find novel solutions, Cyc can serve as a check against those systems. Cyc can apply its understanding of the world to sanity check ML outputs, or to work from first principles in areas with sparse or bad data that defies ML approaches.

How does Semantic Knowledge Source Integration (SKSI) work?

One of the central values of an ontology is the ability to add meaning to your data. Suppose you look at a spreadsheet that has columns for “car_make”, “car_model”, “year”, and “base_price”. You might naively think this table contains all the information that you need to identify the base price of a 2017 Nissan Versa (provided there is a row with those values in it). However, this is not quite right. What you need is not entirely in the spreadsheet: strictly speaking, the sheet only contains certain strings and numbers (“car_make”, “Nissan”, “2017”, etc.), and it takes our understanding of the broader context to know what these things mean. When humans use data like this, we are our own interpretation engines, linking the information in the database to its meaning. If we want computers to leverage our data in the way that we do, we need AI that can serve as a similar interpretation engine.

SKSI

Cycorp connects the knowledge base to data sources through a capability we call Semantic Knowledge Source Integration (SKSI). In this FAQ, we will give a quick but approachable description of how this works. First, we will discuss why we utilize SKSI and what its benefits are. Then, we will lay out the basic architecture for how we represent data sources so that Cyc can leverage them where they natively live.

The Problems

Data Without Meaning

As the introduction to this page brought out, spreadsheets do not wear their meanings on their sleeves. Rather, they relate a variety of strings and numbers (among a few other data types depending on format). These relations are tremendously powerful, but only when the data can be properly interpreted.

Above you saw an example where the field names mapped very closely to their meanings. But often we have field names that are non-obvious: a financial database may have “ma_cpg”. What does this mean? It is not at all obvious unless someone tells you that it refers to the minimum average cents per gallon cost of fuel being referred to on that row.

Cycorp solves this problem by explicitly representing the meaning of your data source and connecting it to the deepest, broadest, and most expressive knowledge base in the world.

Information Silos

You might try to solve the problem of meaning by creating additional documents that provide the context for a table. For instance, we could create another table, .pdf, or other document that explains to users that “base_price” relates a type of car to a character string that appears in the data source. But this merely generates more documentation that users have to find in order to piece together the meaning; the data and meaning are still separated.

Contrast this approach with the Cycorp solution: we allow your data and meaning to be processed at the same time by our AI platform. This means that users can simply ask the plain questions they have of the data.

Unnecessary Technical Barriers

If your data is stored in a specialized format, then you require your data analysts to be proficient in that format in order to extract any value from your data. But this is a waste: why should, say, supply chain analysts need to be master SQL programmers in order to look at the data? Data should be accessible to everyone so that your supply chain experts can interact with the data straightforwardly.

Cycorp solves this problem by providing natural language interfaces for users to query their data. Cyc can generate the SPARQL, Gremlin, or other such queries for you. Your experts simply need to be good analysts. And Cyc can help with that part, too!

“Bad” Data

Everyone cringes when outside parties see their data. We are all self-conscious about the problems with our data: it’s gappy, messy, in different formats from one place to another, and so on. This can generate a serious problem for computer systems that take any data inputs as perfectly representative of the world. However, human users are not so easily fooled: we know that some values simply don’t make sense, and we can therefore keep bad data from corrupting our reasoning.

Cycorp solves this problem by first understanding the world. This means that we do not derive our understanding of the world from data, but instead know how the world works and use data when appropriate. Cyc can therefore spot bad or unreliable data and know when to throw out suspicious data, just as you would.
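As a rough illustration of checking data against prior knowledge, the Python sketch below screens rows using a couple of common-sense constraints before they are trusted. It is a toy under stated assumptions: the functions check_row and split_rows, the two constraints, and the sample rows are invented for this example and say nothing about how Cyc actually evaluates data.

```python
def check_row(row, current_year):
    """Return plain-language reasons the row looks implausible (empty if none)."""
    problems = []
    # Common-sense constraint: a car's base price is a positive amount.
    if row["base_price"] <= 0:
        problems.append("base_price must be positive")
    # Common-sense constraint: model years do not run far ahead of the calendar.
    if row["year"] > current_year + 1:
        problems.append("model year is too far in the future")
    return problems


def split_rows(rows, current_year):
    """Separate rows that pass every check from rows that fail at least one."""
    trusted, suspicious = [], []
    for row in rows:
        problems = check_row(row, current_year)
        if problems:
            suspicious.append((row, problems))
        else:
            trusted.append(row)
    return trusted, suspicious


# Invented placeholder rows; the prices are not real figures.
ROWS = [
    {"car_model": "Versa", "year": 2017, "base_price": 10000.0},
    {"car_model": "Versa", "year": 2017, "base_price": -1.0},
    {"car_model": "Versa", "year": 3017, "base_price": 10000.0},
]

trusted, suspicious = split_rows(ROWS, current_year=2020)
```

The suspicious rows come back paired with the reasons they were flagged, so a person can review them rather than having them silently dropped. The constraints themselves come from knowing how the world works, not from statistics over the data.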

Small Data

In the age of big data, there are still many areas where the data is too sparse to generate a sufficient training base for machine-learning AI technologies. What can AI do in cases where the sample size is only about one oil well, or two hospitals, or the stock prices of five companies? These cases are opaque to a data-only statistical approach, but Cyc’s symbolic reasoning can fall back on first principles to draw meaningful conclusions even when the sample size is one.

Cycorp solves this problem in the same way we deal with bad data. Since we start with an understanding of the world, we can apply general principles to reason about novel cases.

SKSI Architecture

Connection

Cycorp can derive value from data sources without needing to migrate them into some sort of data lake. Instead, we represent the nature of a given knowledge source. For example, if you had some database called SampleDB, we could create a concept for that database in Cyc, call it #$SampleDB. We can then tell Cyc where #$SampleDB lives, providing connection information. This will enable Cyc to access that knowledge source whenever necessary. But before Cyc can meaningfully hit the database, we need to further characterize SampleDB.

Database Structure

The first component of representation is the ‘physical’ information about the source’s structure. In the case of a .csv, this may include noting the number of rows and fields, as well as the datatypes for each field (e.g. string, float). Cyc will then know all of the non-semantic facts about the data source, such as the primary key and all of the field names. What is still missing is the semantics: what does everything mean?
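The kind of non-semantic information meant here can be gathered mechanically. The Python sketch below reads a small CSV and reports its field names, row count, and a guessed datatype per field. It is illustrative only: the function names, the SAMPLE_CSV contents (with invented placeholder prices), and the simple type-guessing rule are assumptions, not a description of Cyc's tooling.

```python
import csv
import io

# Invented sample data; the prices are placeholders, not real figures.
SAMPLE_CSV = """id_no,car_make,car_model,year,base_price
1234,Nissan,Versa,2017,10000.00
1235,Nissan,Sentra,2017,15000.00
"""


def infer_type(values):
    """Guess the narrowest type that parses every value: integer, float, or string."""
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue
    return "string"


def describe_physical_schema(csv_text):
    """Summarize the structure of a CSV without interpreting what it means."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    field_names = list(reader.fieldnames or [])
    return {
        "row_count": len(rows),
        "fields": {
            name: infer_type([row[name] for row in rows]) for name in field_names
        },
    }


schema = describe_physical_schema(SAMPLE_CSV)
```

The resulting description says that “year” holds integers and “base_price” holds floats, but nothing about what a year or a price is. That gap is exactly what the semantic layer has to fill.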

Translation

Having the structure is good, but it requires translation into Cyc’s ontology. This so-called “logical schema” representation of a knowledge source is where we connect the terms in the physical table with the concepts in Cyc. Sometimes the connections are relatively obvious; we might specify that the string “convertible” refers to the term #$ConvertibleCar in Cyc. But, obviously the strings do not need to bear any resemblance to the CycL term that we use to represent the relevant concepts.

Meaning Sentences

The logical schema is also where we can specify the relations that are characterized by linking together the information in various fields. So, when you see a row that contains 1234, 2017, and “Versa” beneath the fields “id_no”, “year”, and “car_model”, you know that car model 1234 is a 2017 Versa. We empower Cyc to draw this conclusion by making explicit the relationship that these fields bear to one another.
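A rough Python sketch of the idea: translate raw strings into ontology terms, then apply templates that state how the fields in a row relate to one another. Everything named here is hypothetical. TERM_FOR_STRING, MEANING_SENTENCES, interpret_row, and the term and predicate names ("NissanVersa", "modelYear") are invented for illustration and are not actual Cyc constants or APIs.

```python
# Hypothetical translation table from strings in the source to ontology terms.
# The strings need not resemble the terms they map to.
TERM_FOR_STRING = {
    "Versa": "NissanVersa",
    "convertible": "ConvertibleCar",
}

# Hypothetical meaning-sentence templates: (predicate, field for the first
# argument, field for the second argument). Each one makes explicit a
# relationship that holds between two fields of the same row.
MEANING_SENTENCES = [
    ("isa", "id_no", "car_model"),
    ("modelYear", "id_no", "year"),
]


def interpret_row(row):
    """Turn one row of raw values into explicit assertions about the world."""

    def term(field):
        raw = row[field]
        # Fall back to the raw value when no translation is defined.
        return TERM_FOR_STRING.get(raw, raw)

    return [(predicate, term(a), term(b)) for predicate, a, b in MEANING_SENTENCES]


# The row from the text: 1234, 2017, and "Versa".
assertions = interpret_row({"id_no": 1234, "year": 2017, "car_model": "Versa"})
```

For that row the sketch yields two assertions: item 1234 is a Versa, and its model year is 2017. The mapping is written once per data source and then reused for every row.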

Schema Modeling Tool

Cycorp facilitates efficient data connection. First, you only need to map a data source once, and then Cyc will never forget what that source looks like, or how to access it. So, data mapping is a one-time task, returning value for the life of that data source. Second, we make this one-time mapping painless by either providing professional services for the mapping or giving you access to our Schema Modeling Tool (SMT). The SMT is a semi-automated method for generating mappings and meaning sentences, automatically building the physical and logical schemas for your new data sources. Demonstrations of the SMT are available upon request.

Is your question missing from our list? Quickly get a personalized response by contacting us.