Search plugin development

Generally you should first test your code on a dev setup (i.e. locally) before running it in a production (or production-like) environment. Debugging in a production environment is much more complicated in any case…

I had a devel server that used to work, but not anymore. I reinstalled the packages and re-created the SSL keys, and redis and postgresql are running. Still getting the same errors when I go to 127.0.0.1:8000. Are they familiar to you?

127.0.0.1 - - [26/Apr/2019 11:23:41] code 400, message Bad request syntax ('\x16\x03\x01\x02\x00\x01\x00\x01\xfc\x03\x030\xcf\x07J\xa7\xb1\x84\xe6-\x92<1Uy\x8f\xab\x04t\xe0\x15\x01dD\xa5\x1c \xe0\x0c\xb5\x97,\xc5 \x0c`\xc8\x1b\x13\xc7\x98\x83D\xd0KD\xd2a\xef3(\xdd\xb4\xbd\xb5\x9bT36\xe3\xb4\x11\x01\x02H_\x00"\xba\xba\x13\x01\x13\x02\x13\x03\xc0+\xc0/\xc0,\xc00\xcc\xa9\xcc\xa8\xc0\x13\xc0\x14\x00\x9c\x00\x9d\x00/\x005\x00')
127.0.0.1 - - [26/Apr/2019 11:23:41] code 400, message Bad HTTP/0.9 request type ("\x16\x03\x01\x02\x00\x01\x00\x01\xfc\x03\x03-\xa8\xd2\xa0*\xe7\xbc\x7f5\xc4\xe7\x1eD\x99\x12Oi\xa9\x833)\x11'\xb8hE\xf8\x9a\xd5\xf62\x9e")
--------------------------------------------------------------------------------
Exception happened during processing of request from 
(Exception happened during processing of request from'127.0. 0('127..1', 0.0.1'56228)
Traceback (most recent call last):
, 56226)
  File "/usr/lib64/python2.7/SocketServer.py", line 593, in process_request_thread
Traceback (most recent call last):
  File "/usr/lib64/python2.7/SocketServer.py", line 593, in process_request_thread
    self.finish_request(request, client_address)
    self.finish_request(request, client_address)
  File "/usr/lib64/python2.7/SocketServer.py", line 334, in finish_request
  File "/usr/lib64/python2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib64/python2.7/SocketServer.py", line 649, in __init__
  File "/usr/lib64/python2.7/SocketServer.py", line 649, in __init__
    self.handle()
    self.handle()
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 293, in handle
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 293, in handle
    rv = BaseHTTPRequestHandler.handle(self)
    rv = BaseHTTPRequestHandler.handle(self)
  File "/usr/lib64/python2.7/BaseHTTPServer.py", line 340, in handle
  File "/usr/lib64/python2.7/BaseHTTPServer.py", line 340, in handle
    self.handle_one_request()
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 327, in handle_one_request
    self.handle_one_request()
    elif self.parse_request():
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 327, in handle_one_request
  File "/usr/lib64/python2.7/BaseHTTPServer.py", line 286, in parse_request
    elif self.parse_request():
    self.send_error(400, "Bad request syntax (%r)" % requestline)
  File "/usr/lib64/python2.7/BaseHTTPServer.py", line 281, in parse_request
  File "/usr/lib64/python2.7/BaseHTTPServer.py", line 368, in send_error
    "Bad HTTP/0.9 request type (%r)" % command)
    self.send_response(code, message)
  File "/usr/lib64/python2.7/BaseHTTPServer.py", line 368, in send_error
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 332, in send_response
    self.log_request(code)
    self.send_response(code, message)
  File "/home/fakeusername/dev/indico/src/indico/cli/devserver.py", line 161, in log_request
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 332, in send_response
    self.log_request(code)
  File "/home/fakeusername/dev/indico/src/indico/cli/devserver.py", line 161, in log_request
    super(QuietWSGIRequestHandler, self).log_request(code, size)
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 373, in log_request
    self.log('info', '"%s" %s %s', msg, code, size)
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 384, in log
    message % args))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 18: ordinal not in range(128)
    super(QuietWSGIRequestHandler, self).log_request(code, size)
----------------------------------------
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 373, in log_request
    self.log('info', '"%s" %s %s', msg, code, size)
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 384, in log
    message % args))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 18: ordinal not in range(128)
----------------------------------------
127.0.0.1 - - [26/Apr/2019 11:23:41] code 400, message Bad request syntax ("\x16\x03\x01\x00\xb5\x01\x00\x00\xb1\x03\x03m\x90\xd6\xda\x16!'3;\x03\xd8_\xa1\xf80\xcd\xe6\xe9\xd0\xd9\x055\xe78F\x9e\xfb\xf5Vv\xca\xb5\x00\x00\x1cJJ\xc0+\xc0/\xc0,\xc00\xcc\xa9\xcc\xa8\xc0\x13\xc0\x14\x00\x9c\x00\x9d\x00/\x005\x00")
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 56230)
Traceback (most recent call last):
  File "/usr/lib64/python2.7/SocketServer.py", line 593, in process_request_thread
    self.finish_request(request, client_address)
  File "/usr/lib64/python2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib64/python2.7/SocketServer.py", line 649, in __init__
    self.handle()
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 293, in handle
    rv = BaseHTTPRequestHandler.handle(self)
  File "/usr/lib64/python2.7/BaseHTTPServer.py", line 340, in handle
    self.handle_one_request()
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 327, in handle_one_request
    elif self.parse_request():
  File "/usr/lib64/python2.7/BaseHTTPServer.py", line 286, in parse_request
    self.send_error(400, "Bad request syntax (%r)" % requestline)
  File "/usr/lib64/python2.7/BaseHTTPServer.py", line 368, in send_error
    self.send_response(code, message)
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 332, in send_response
    self.log_request(code)
  File "/home/fakeusername/dev/indico/src/indico/cli/devserver.py", line 161, in log_request
    super(QuietWSGIRequestHandler, self).log_request(code, size)
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 373, in log_request
    self.log('info', '"%s" %s %s', msg, code, size)
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 384, in log
    message % args))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 14: ordinal not in range(128)
----------------------------------------

Thanks,
Jose

Looks like you’re accessing a dev server that’s running in http mode via https.
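
One way to see this from the log: each rejected request line starts with the bytes \x16\x03\x01, which is how a TLS record carrying a handshake begins (content type 0x16, then protocol major version 0x03). A minimal Python 3 sketch of that check; the function name is made up for illustration and is not part of Indico or Werkzeug:

```python
def looks_like_tls_handshake(data: bytes) -> bool:
    """Return True if the raw bytes start like a TLS handshake record."""
    # TLS record header: byte 0 is the content type (0x16 = handshake),
    # byte 1 is the major protocol version (0x03 for SSL 3.0 / TLS 1.x).
    return len(data) >= 3 and data[0] == 0x16 and data[1] == 0x03


# First bytes of the request line from the log above.
sample = b"\x16\x03\x01\x02\x00\x01\x00\x01\xfc\x03\x03"
print(looks_like_tls_handshake(sample))
print(looks_like_tls_handshake(b"GET / HTTP/1.1\r\n"))
```

A plain-HTTP server has no way to parse such bytes as a request line, hence the 400 responses.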

Hmm, my notes included https… But yes, using http works.

The dev server has some options to use https, or you can put e.g. nginx in front of it (the dev setup docs mention this as an option). But by default it’s http-only since that’s the easiest way to use it in development.

As agreed during yesterday’s meeting, I have created a docker-compose file that sets up the CERN Search microservice alongside Nginx, Postgres, Redis, ElasticSearch and Tika. This should be enough to get us started with the development of the plugin:

In order to run it, you should download the file to the root folder of the cern-search repo. You will also have to generate the test certificates by hand (we could have it in a separate Dockerfile for nginx, though…)

$ sh scripts/gen-cert.sh
$ mkdir nginx/tls
$ mv nginx.crt nginx/tls/tls.crt
$ mv nginx.key nginx/tls/tls.key
$ rm nginx.csr

If OpenSSL complains about the password being too short, just replace pass:x with pass:12345 in gen-cert.sh (I’ll send a PR to fix that upstream).

Then do docker-compose up and you should have your development cluster running.

I managed to log in to Invenio (https://localhost:8080)

(username: test@example.com, password: test1234)

Retrieving records through the REST API results in an error, probably because I haven’t set up the ElasticSearch indices properly. In any case, it’s a start.

Apache Tika seems to work fine when I connect to it using tika-python:

In [16]: from tika import parser

In [17]: parser.from_file('/tmp/test.docx', serverEndpoint="http://localhost:9998")
Out[17]:
{'content': u'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nTEST2\n',
 'metadata': {u'Application-Name': u'LibreOffice/5.3.6.1$Linux_X86_64 LibreOffice_project/30$Build-1',
...

@pferreir I would appreciate it if you could provide a bit more information about docker (I have never used docker before, apart from an initial test running HelloWorld and listing the docker images…).
I installed docker (version: 1.13.1, API version 1.26) and docker-compose (version: 1.24.0).
I created a directory with the docker-compose.yml file you created and tried to run docker-compose up, but this required the Dockerfile. What should the Dockerfile contain? And obviously I am missing the commands to initialize docker and the “development cluster”.
Also, you are using gen-cert.sh to create the certificates. What is the content of this file? A simple openssl command?

I think this answers your question :wink:

In order to run it, you should download the file to the root folder of the cern-search repo.

The repo to clone is GitHub - inveniosoftware-contrib/citadel-search: Citadel: Enterprise Search - it includes the Dockerfile and gen-cert.sh script.

THANK YOU! Yes it does.

Hi,

Quick question: does the search_invenio plugin actually work? Has it ever been seen working?

As you know, I am trying to write a new plugin based on that one, trying to reuse as much as possible. But, was it functional?

For example, are all templates correct?

At this point, until we have CERN Search deployed, I am trying to mock an EventEntry() object as fake output of a query. I thought, naively, that if I build that object properly, I should see its content in the web page. But I am getting “Build Errors”.

  • It could be (hopefully !!) I am not creating the object correctly.
  • Or it could be the template interpolation is failing.

This is what makes me wonder if the code and architecture of search_invenio is correct…


Speaking of classes in entries.py, are they documented somewhere?
The input options are:

  • result_id
  • title
  • location
  • start_date
  • materials
  • authors
  • description

The types of some of them (strings, integers, …) and their meanings are unclear to me. Where can I find some documentation?

Replying to myself…
I have just been reminded that the invenio plugin does not work.
So the idea then is to interpolate the HTML templates directly with the JSON from the queries, correct?

It always helps to have the correct network configuration…
I followed @pferreir’s instructions and it was really simple to get docker running.
The following are all the commands I used on my RHEL 7 VM:

$ yum -y install docker
$ pip install docker-compose
$ systemctl start docker
$ git clone https://github.com/inveniosoftware-contrib/cern-search
$ cd cern-search
$ wget https://gist.githubusercontent.com/pferreir/77ede49adb292879c52e3e4a02e28582/raw/c26b7031a2fd5d01c9ac82293300b785d54dd7c9/docker-compose.yml
$ sh scripts/gen-cert.sh
$ mv nginx.crt nginx/tls/tls.crt
$ mv nginx.key nginx/tls/tls.key
$ rm nginx.csr
$ docker-compose up -d

Then, I was able to access Invenio from my desktop (https://web4604.fnal.gov:8080) and log in with username test@example.com and password test1234.

As for tika, I was able to access it from another server without a problem:

>>> from tika import parser
>>> parser.from_file("./penelope.py", serverEndpoint="http://web4604.fnal.gov:9998")
{'status': 200, 'content': u'\n\n\n\n\n\n\n\n\nfrom __future__ import unicode_literals\
.......

The next steps will be to access the cern-search-api: send indico livesync data to be indexed, then send search requests and receive the search results.
I assume that the example at http://cernsearchdocs.web.cern.ch/cernsearchdocs/example/ and the rest of the documentation should be our starting point.


Just a small suggestion: do not use pip install docker-compose - it installs TONS of dependencies, and when used outside a virtualenv it leaves behind a huge mess of python packages in your system python environment.

Better download a single-file bundle from https://github.com/docker/compose/releases: https://github.com/docker/compose/releases/download/1.24.0/docker-compose-Linux-x86_64, save it as /usr/local/bin/docker-compose and chmod +x it

@ThiefMaster Thank you for the information. Yes, I did notice all the packages it installs but I followed the instruction as I was not sure of what is needed.

Yes, that’s the idea. I wouldn’t spend tons of time with the interface, however. We will have someone working on a fancy UI on our side, this summer. So, a simple Google-like thing would be enough for now.

OK.
Would you then recommend that I adapt the search_cern plugin https://github.com/indico/indico-plugins-cern/tree/master/search_cern ?
I was not planning on changing either the interface or the rendering. My plan was just to change the plugin to handle the new JSON output from the queries to “CERN Search” and let the existing code do the rest. Right?
So, if that sounds like a reasonable approach, then I guess the steps here are:

  1. find the exact method where the queries are performed. In the search_invenio plugin it was _fetch_data(). I need to find out where exactly it is done in search_cern
  2. find out where the output of the query is being used to fill the HTML templates.
  3. massage, if needed, the JSON output to be able to fill the HTML templates with it. The templates in search_cern are supposed to be correct, I assume…

Does that sound correct to you?

The search_cern plugin is not a great example because it uses an <iframe> to show the results. So, it does absolutely no rendering of any results, it just displays the page that is sent by the search engine (Sharepoint in this case). So, yes, you can adapt it, but then you’ll have to write a very basic interface. You can actually just “steal” it from the old Invenio plugin: https://github.com/indico/indico-plugins/blob/master/search_invenio/indico_search_invenio/templates/results.html

find the exact method where the queries are performed. In the search_invenio plugin it was _fetch_data(). I need to find out where exactly it is done in search_cern

You probably want https://github.com/indico/indico-plugins-cern/blob/master/search_cern/indico_search_cern/engine.py#L28.

find out where the output of the query is being used to fill the HTML templates.

It’s not. But you can steal that from the Invenio plugin as I’ve said.

massage, if needed, the JSON output to be able to fill the HTML templates with it.

Yes!
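
A minimal sketch of that “massaging” step, in case it helps. Everything about the response shape here is an assumption (a top-level hits.hits list whose items carry an id and a metadata dict); the real CERN Search output may look different, and the function name is made up. The output keys reuse the field names listed earlier for the classes in entries.py:

```python
def hits_to_template_rows(response):
    """Flatten a search response into plain dicts a template can loop over.

    ASSUMPTION: response looks like
    {"hits": {"hits": [{"id": ..., "metadata": {...}}, ...]}}.
    """
    rows = []
    for hit in response.get("hits", {}).get("hits", []):
        meta = hit.get("metadata", {})
        rows.append({
            "result_id": hit.get("id"),
            "title": meta.get("title", ""),
            "description": meta.get("description", ""),
            "start_date": meta.get("start_date"),
        })
    return rows


# Hypothetical response, only for illustration.
fake_response = {"hits": {"hits": [{"id": 7, "metadata": {"title": "Test event"}}]}}
print(hits_to_template_rows(fake_response))
```

Keeping this in one small function means only one place has to change once the actual JSON format is known.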

Oh. Then, my original approach was not that bad after all.
I was studying the invenio plugin. I more or less got the general idea

[Image: invenio]

In this case, the output is converted (or at least an attempt is made) into Author(), EventEntry(), ContributionEntry(), and SubContributionEntry() objects. After that, they are supposed to be used to fill the HTML templates, if I got the logic correctly.
Therefore, I was working on the assumption that, if you create those Entry() objects properly, the rendering would work.
From your answer I gather that the templates in search_invenio are correct.
So I assume the idea is to reuse

  • code from search_cern, as much as possible
  • templates from search_invenio

Did I get your comments correctly?

Yes, that’s what I mean, and it should be possible.

The Elasticsearch (CERN search) marshmallow schema is almost finished (I placed a draft at: Elasticsearch_Docs/schemas.py at master · penelopec/Elasticsearch_Docs · GitHub)

There are only a few fields for which I am not able to get any values or find any information:

ContributionSchema:

  • creation_date = mm.DateTime(attribute='created_dt')

SubContributionSchema:

  • creation_date = mm.DateTime(attribute='created_dt')
  • start_date = mm.DateTime(attribute='start_dt')
  • end_date = mm.DateTime(attribute='end_dt')

For the implementation I made the following assumptions:
ACL assumptions (for the read entry of _access):

  • For public access, the ACL will contain a single entry, 'ANONYMOUS', or it could just be empty, depending on what Pablo expects.
  • For private access, it will contain the users’ IDs and the users’ emails.
  • The ACL for subcontributions is that of the contribution it belongs to
  • The ACL of the EventNote is that of the object it belongs to (contribution, session or event)
  • For all mappings I have added a URL field to contain the external url for accessing the object from the search results.
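
To make the ACL assumptions concrete, here is a rough sketch of how the read entry could be built. The function names and the (user_id, email) input shape are hypothetical, and this is not code from the plugin:

```python
ANONYMOUS = "ANONYMOUS"


def read_acl(is_public, users=()):
    """Build the 'read' entry of _access under the assumptions listed above.

    users: iterable of (user_id, email) pairs (hypothetical input shape).
    """
    if is_public:
        # Public objects get the single ANONYMOUS entry.
        return [ANONYMOUS]
    acl = []
    for user_id, email in users:
        # Private objects list both the ID and the email of each user.
        acl.append(str(user_id))
        acl.append(email)
    return acl


def subcontribution_acl(contribution_acl):
    """A subcontribution reuses the ACL of the contribution it belongs to."""
    return list(contribution_acl)


print(read_acl(True))
print(read_acl(False, [(42, "someone@example.com")]))
```

If an empty list turns out to be the expected form for public access, only the first branch needs to change.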

The following are the questions that I have in order to move forward with livesync_json (I decided that this is a better name for this plugin):

  1. How do I access the CERN search app (assuming that I have installed in docker what Pedro has supplied)?

  2. How should I call the CERN search app for populating ES?
    For ES I have the following requests.post line:
    response = requests.post(self.url, auth=(self.username, self.password), data={'json': jsondata})
    where jsondata is a string in the format expected by the Bulk API of ES
    (Bulk API | Elasticsearch Guide [8.11] | Elastic).
    I am using only the index and delete operations; _index is the mapping that I am accessing and _id is the object’s id:

POST _bulk

{ "index" : {"_index" : "events",  "_id" : 1 } }\n
{ "field1" : "value1" }\n
{ "delete" : { "_index" : "notes", "_id" : 2 } }\n
  3. What information should the web setup page of the plugin contain?
  • tika server URL
  • CERN search app URL
  • Access username / password for ES(?)
  • ??
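
For reference, regarding the bulk payload in question 2: a small sketch of how the jsondata string could be assembled with the standard json module. The helper name and the tuple layout are made up for illustration; only the index and delete operations from the example are handled:

```python
import json


def build_bulk_payload(actions):
    """Serialize (operation, index, doc_id, source) tuples as NDJSON for _bulk.

    source is ignored for 'delete', which has no document line.
    """
    lines = []
    for op, index, doc_id, source in actions:
        # Action line, e.g. {"index": {"_index": "events", "_id": 1}}
        lines.append(json.dumps({op: {"_index": index, "_id": doc_id}}))
        if op == "index":
            # 'index' is followed by the document itself on the next line.
            lines.append(json.dumps(source))
    # The Bulk API requires every line, including the last, to end with a newline.
    return "\n".join(lines) + "\n"


payload = build_bulk_payload([
    ("index", "events", 1, {"field1": "value1"}),
    ("delete", "notes", 2, None),
])
print(payload)
```

When talking to Elasticsearch directly, a body like this is normally sent as the raw request body with Content-Type: application/x-ndjson rather than as a form field; whether the CERN Search app expects the same is exactly the open question above.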