Saturday, January 21, 2012

Google AppEngine Full Text Search with Cloud SQL

Introduction

Many Google AppEngine developers have been waiting for the Full Text Search feature, especially coming from Google the biggest search engine on the Web. I was quite happy to see that Google team is working on it as you can check in the Google I/O 2011 session : Full Text Search by Bo Majewski, Ged Ellis . As far as I know the very promising indexing service is not yet available.

In this article I will explain how you can provide some kind of full text search in your application using services available App Engine services.

In my specific use case I do not ask for a lot of feature, I just need to have simple search a string in various attributes of my entities independently of the case, and possible special characters (such as è,é, ... ). I am far of being an expert of Google Datastore API but I did not find any simple way to achieve this directly using the Java API. What I have done to solve this issue is to duplicate a part of my data into the Google Cloud SQL to use the MySQL fulltext search capabilities.

Prerequisites
To achieve the following tasks you need to :

Content

In the following paragraphs I will explain the basics of the integration of Cloud SQL for full text search, but you can, if you want, jump to :




1. Create Articles Entities

Start by creating some simple entities with some attributes for example, an entity name Article, with title and body attributes.


import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;

//...
//...

  Entity article = new Entity("Article");
  article.setProperty("title", "MySQL Tutorial");
  article.setProperty("body", "DBMS stands for DataBase ...");
  datastore.put(article);

  article = new Entity("Article");
  article.setProperty("title", "Datastore Index Selection and Advanced Search");
  article.setProperty("body", "Learn how recent improvements to the query planner ... function in your application");
  datastore.put(article); 

If you look in the Datastore API, or even JDO or JPA you have no simple way to look for all the articles that are related to Triathlon, or Database, or Entities. Google DataStore does not support  clause where with a "OR" between different fields; and I do not want to mention the fact that it is not possible to ignore the text case in a simple way.

This is why we need to have some full text features. Some of you are surely thinking about using Apache Lucene to do the trick, and yes it is possible. You can use for example the GAELucene project : http://code.google.com/p/gaelucene/. I use another approach, may be less advanced in term of "indexing/searching" options but sufficient for my use case:
  • I store the text values on which I want to do some search in Google Cloud SQL and use the Full Text features of MySQL.


2. Create a SQL Table to store Text values (in development environment)

When using Google AppEngine, the Cloud SQL instances are accessed using a specific driver and configuration that we will see later. For now, we are still in development environment, this is where you have to use your local MySQL instance.

In this specific use case we will copy in a table the two fields and add a new unique key based on the entity key.  So here the SQL to create this:

CREATE SCHEMA search_values DEFAULT CHARACTER SET utf8 ;

USE search_values;


CREATE TABLE articles  (
  entity_key varchar(250),
  title text,
  body text,
  PRIMARY KEY RESULTS_PK (entity_key),
  FULLTEXT (title,body)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;


Lines 1 and 3 are here to create the database schema and use it; then the script create a table that will contain a copy of the title and body from the entity. 

3. Configure your development environment 

This section is a short explanation of the Cloud SQL Documentation : Getting Started: Java

  1. Copy the MySQL JDBC driver into your Google App Engine SDK directory, under /lib/impl/. You can download the MySQL JDBC driver here.
  2. In Eclipse, select your Java package.
  3. Click Run > Run Configurations.
  4. Expand the Web Application menu item.
  5. Add the following lines into the VM Arguments pane:
    -Drdbms.server=local
    -Drdbms.driver=com.mysql.jdbc.Driver
    -Drdbms.url=jdbc:mysql://localhost:3306/search_values?user=username&password=password
    
  6. Click the Classpath tab.
  7. Select your project and click Add External JARs...
  8. Navigate to the Google App Engine SDK directory, then lib/impl, and select the JDBC driver JAR file. Click Open. The driver JAR is listed under User Entries.
  9. Click Apply.
Your development environment is now ready to use your local MySQL database. Let's now, use this database.

4. Use your MySQL table and copy the text values from Google Datastore to MySQL Table

Copying the data from Datastore entity to the table is quite easy:

  Connection conn = null;
  try {
   DriverManager.registerDriver(new AppEngineDriver());
   conn = DriverManager.getConnection("jdbc:google:rdbms://[your db instance]/search_values");
   conn.setAutoCommit(false);  
   String statement = "REPLACE INTO articles (entity_key, title, body) VALUES( ? , ? , ? )";
   PreparedStatement stmt = conn.prepareStatement(statement);

   DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
   Query q = new Query("Article");   
   PreparedQuery pq = datastore.prepare(q);

   // loop on each entity and insert the values in the SQL Table
   for (Entity result : pq.asIterable()) {
    stmt.setString(1,  KeyFactory.keyToString(result.getKey())   );
    stmt.setString(2,  result.getProperty("title").toString() );
    stmt.setString(3,  result.getProperty("body").toString() );
    stmt.executeUpdate();
    conn.commit();
   }



  } catch (SQLException e) {
   e.printStackTrace();
  } finally {
   if (conn != null)
    try {
     conn.close();
    } catch (SQLException ignore) {}
  }



Some specials things here, compare to standard Java Web Development:
  • I manage the connection directly in my code (I have not looked yet if I can use datasources/connection pool in the context of Google AppEngine)
  • Line #3: registering the AppEngine driver that is responsible of managing the connection, expecially work in development -local MySQL- or production mode -CloudSQL-.
  • Line #4 : Get the connection. It is interesting to mention that in development the connection URL is grabbed from the environment variable Drdbms.url you have set previously. We will see later how we move this to the cloud. This is the magical part of the AppEngineDriver that manages different connection types Local MySQL or CloudSQL depending of the context
  • After these lines, the code is quite simple :
    • Get all the Articles entities from the datastore and loop
    • "Upsert" the database record (REPLACE INTO syntax)
  • Line #15 is storing the Key of the entity in a safe string using the KeyFactory.keyToString() method.

If you want to test this code just put this lines in a servlet to "sycnhronize" the data from datastore into the MySQL table. Obviously this code is just here for learning propose and should be integrated in a better way in a real application; starting with pushing the data in the database when entities are created/updated (and deleted ;) ). The sample code available from GitHub contains these methods.


5. Implement a search method

The goal is simple return a list of entities returned by a simple search criteria :
  • public Iterable searchEntities(String query)

The logic is here quite simple:
  1. Execute a SQL query
  2. For each result, get the Entity using the Key
  3. Return the list of Entities

 public Iterable searchEntity(String query) {
  List  results = new ArrayList();
  Connection conn = null;
  try {
   DriverManager.registerDriver(new AppEngineDriver());
   conn = DriverManager.getConnection("jdbc:google:rdbms://[your db instance]/search_values");
   String statement = "SELECT entity_key FROM articles WHERE MATCH (title,body) AGAINST (? WITH QUERY EXPANSION);";
   PreparedStatement stmt = conn.prepareStatement(statement);
   stmt.setString(1, query);
   ResultSet rs = stmt.executeQuery();
   while (rs.next()) {
    String keyAsString = rs.getString(1);    
    Entity article = DatastoreServiceFactory.getDatastoreService().get( KeyFactory.stringToKey(keyAsString)  );
    results.add(article);
   }

  } catch (SQLException e) {
   e.printStackTrace();
  } catch (EntityNotFoundException e) {
   e.printStackTrace();
  } finally {
   if (conn != null)
    try {
     conn.close();
    } catch (SQLException ignore) {}
  }
  return results;
 }


In this method, the system connect to the database and then execute a query to search data using any type of SQL/MySQL query. In this exampe I am using the full text function with the "WITH QUERY EXPANSION". You can obviously use any type of SQL queries for example simple LIKE statement if this is enough four your application.

With this approach when I search for :
  • "database" : the method returns all the articles concerning database, mysql, RDBMS independently of the case.
  • "index" " the method returns all the articles talking about indexing/indexes or search.

6.  Deploy to GAE 


Once you have created your application, and activated and configure your CloudSQL instance (here), you can deploy your application and enjoy an easy way of using Full Text Search with GAE.


Conclusion

In this article I explained how you can use Google Cloud SQL to easily supports Full Text Search queries, based on the Full Text support of MySQL.

The code snippets that I have shared in this article are really basic and not ready for real life usage but still a good starting point. For example I have been using this in my application with GAE Queues to manage my indexes on larger volume of data.

As said before, you can test the application online at http://gae-fulltext-search.appspot.com/ and the source code is available on GitHub : https://github.com/tgrall/gae-full-text-search

No comments:

Post a Comment