Upgrade Endeca Indexing Method – Why and How?

Upgrading the Endeca indexing design from the old Forge-based process to the new Forge-less model helps a business align with the product roadmap and benefit from an improved indexing strategy. One of the major design changes officially introduced in Oracle Endeca v11.1 is the elimination of Forge from the indexing process in favor of the Content Acquisition System (CAS). Oracle recommends this approach going forward for all existing and new customers implementing Oracle Endeca Guided Search. To understand the new recommended way of indexing, it helps to know how the conventional indexing process, called the Information Transformation Layer (ITL), functioned.

endeca_diagram1

— The data ingestion workflow starts at the source data and ends at the search index. The raw data from multiple sources is fed to a process called Forge, which transforms all the source data to a standard format, tags it with dimensions (facets) and delivers the normalized, tagged data.

— The Forge process can only import structured data. To extract structured information from unstructured data sources such as file systems and content management systems, software called the Content Acquisition System (CAS) is used.

— Finally, the output from Forge is consumed by the indexing process called Dgidx. Dgidx creates a set of search indices from the Forge output and loads them into the Endeca search engine, called the MDEX engine.

endeca_diagram2

 

The core of the conventional indexing model is the Forge process. A data design tool called “Developer Studio” is used to create the configuration and process flow for Forge, and the configuration itself is called the “Pipeline”.

Why upgrade to the new indexing process?

The new data ingest design introduced in version 11.1 aims to eliminate Forge from the indexing process completely by using the Content Acquisition System to perform the tasks that Forge previously handled. The reasons for, and benefits of, this design change are as follows:

— Forge is a single-threaded process, so processing large datasets can be time-consuming. In several cases, running multiple Forge pipelines in parallel was the only way to save indexing time, but it made the pipeline design more complicated.

— Forge is a 32-bit process that can only pull data for transformation from structured data sources; it cannot extract data from documents or websites as the Content Acquisition System (CAS) does. The Content Acquisition System, on the other hand, is a 64-bit multithreaded server that can handle multiple requests if required.

— The Content Acquisition System is designed to handle both full and incremental data crawling. In other words, you can choose to crawl the data source completely from scratch, or to crawl only the part of the data that has changed since the previous crawl.

— Data manipulation, which previously took place in the Forge pipeline, can be done within the Content Acquisition System.
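The difference between a full and an incremental crawl can be illustrated with a minimal Python sketch. This is not the CAS API: the function names, the dictionary-based source and the hash-based change detection are all assumptions made for illustration, and CAS may detect changes differently.

```python
import hashlib
import json


def fingerprint(record):
    """Stable hash of a record's contents, used here to detect changes."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def full_crawl(source):
    """Return every record in the source plus a fresh snapshot of fingerprints."""
    snapshot = {key: fingerprint(rec) for key, rec in source.items()}
    return list(source.values()), snapshot


def incremental_crawl(source, previous_snapshot):
    """Return only new or changed records, the keys deleted since the
    previous crawl, and the updated snapshot."""
    snapshot = {key: fingerprint(rec) for key, rec in source.items()}
    changed = [
        rec for key, rec in source.items()
        if previous_snapshot.get(key) != snapshot[key]
    ]
    deleted = sorted(set(previous_snapshot) - set(snapshot))
    return changed, deleted, snapshot


# Hypothetical source data keyed by record id.
source = {
    "sku1": {"id": "sku1", "price": 10},
    "sku2": {"id": "sku2", "price": 20},
}
records, snapshot = full_crawl(source)            # first run: everything
source["sku2"] = {"id": "sku2", "price": 25}      # one record changes
source["sku3"] = {"id": "sku3", "price": 30}      # one record is added
del source["sku1"]                                # one record is removed
changed, deleted, snapshot = incremental_crawl(source, snapshot)
```

In this sketch the incremental run touches only the changed and added records and reports the removed key, which is why an incremental crawl is usually cheaper than a full one on a large source.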

How does it work?

endeca_diagram3

— In the new process, Forge is replaced by the Content Acquisition System (CAS), which loads the data and dimensions (facets). The process of loading and processing data is called crawling. When a CAS crawl runs, the data is processed and standardized according to the crawl configuration provided. The output generated by each CAS crawl is stored in a web service called the Record Store.

— The Record Store instances generated by CAS are then merged by the “Record Store Merger”. The Record Store Merger is itself a CAS crawl, but its only purpose is to join the records from all Record Store instances and produce output that the indexing process, Dgidx, can read. Apart from joining Record Store instances, it also pulls in the index configuration, processes dimensions and precedence rules, and writes them to the Dgidx-compatible output.

— Finally, the Dgidx process picks up the Record Store Merger output and creates the search indices that are loaded into the Endeca MDEX engine.
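The three stages above can be sketched in Python to show how data moves through them. This is only a conceptual model under stated assumptions: `crawl`, `merge_record_stores` and `build_index` are hypothetical names, the "crawl configuration" is reduced to a field rename map, and none of this corresponds to the real CAS or Dgidx interfaces.

```python
def crawl(source_rows, field_map):
    """Standardize raw rows according to a crawl configuration.
    Here the configuration is just a map from source field names to
    standard field names; unmapped fields keep their names."""
    return [
        {field_map.get(name, name): value for name, value in row.items()}
        for row in source_rows
    ]


def merge_record_stores(record_stores):
    """Union the records of every record store, in store-name order."""
    merged = []
    for name in sorted(record_stores):
        merged.extend(record_stores[name])
    return merged


def build_index(records, key="id"):
    """Stand-in for the indexing step: map each record key to its record."""
    return {rec[key]: rec for rec in records}


# Each crawl writes its standardized output to its own record store.
record_stores = {
    "products": crawl(
        [{"sku": "p1", "title": "Lamp"}],
        {"sku": "id", "title": "name"},
    ),
    "articles": crawl(
        [{"doc_id": "a1", "headline": "Care guide"}],
        {"doc_id": "id", "headline": "name"},
    ),
}

# The merger combines all record stores; the indexer consumes the result.
index = build_index(merge_record_stores(record_stores))
```

The point of the sketch is the shape of the flow: one crawl per source, one record store per crawl, a single merge step, and a single indexing step at the end.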

Note that the only Forge capability missing from the new model is the ability to join records from multiple sources. The Record Store Merger can only perform a switch join, which is a union of records; it cannot perform a left or right join that combines matching source records into one record. If such a join is required between data from multiple sources, it should be done externally before the data is loaded into the record stores.
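To make the distinction concrete, here is a small Python sketch that contrasts a switch join (union) with a left join performed outside the pipeline. The function names and sample records are hypothetical, and the left join is only one way such an external pre-join might be written before the combined records are loaded into a record store.

```python
def switch_join(*record_sets):
    """Union: every input record passes through unchanged."""
    return [rec for records in record_sets for rec in records]


def left_join(left, right, key):
    """Combine each left record with the matching right record, if any.
    Left-side values win when both records carry the same field."""
    right_by_key = {rec[key]: rec for rec in right}
    return [{**right_by_key.get(rec[key], {}), **rec} for rec in left]


# Hypothetical records from two separate sources.
products = [{"id": "p1", "name": "Lamp"}, {"id": "p2", "name": "Desk"}]
prices = [{"id": "p1", "price": 30}]

# Switch join: three separate records, nothing is combined.
unioned = switch_join(products, prices)

# External left join: two records, with p1 enriched by its price.
joined = left_join(products, prices, "id")
```

With the switch join, the product and its price remain two unrelated records; with the external left join, each product arrives as one combined record.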

To migrate from the old Forge-based approach to the Forge-less approach, the impact on the data design and loading process must be analyzed properly. There are undocumented pitfalls where architectural changes could drastically increase development time, so it is vital to partner with the right people to complete the migration in a short span of time. With our expertise in Oracle Endeca Commerce, we can help you re-platform, redesign and achieve a seamless move to the new data ingestion model without affecting any existing functionality. If you need strategic advice on migrating Endeca to the latest version, drop us a line and one of our consultants will be happy to help you.