Search plugin development

I had a devel server that used to work, but not anymore. I reinstalled the packages, redis and postgresql are running, and I re-created the SSL keys. I am still getting the same errors when I go to 127.0.0.1:8000. Do they look familiar to you?

127.0.0.1 - - [26/Apr/2019 11:23:41] code 400, message Bad request syntax ('\x16\x03\x01\x02\x00\x01\x00\x01\xfc\x03\x030\xcf\x07J\xa7\xb1\x84\xe6-\x92<1Uy\x8f\xab\x04t\xe0\x15\x01dD\xa5\x1c \xe0\x0c\xb5\x97,\xc5 \x0c`\xc8\x1b\x13\xc7\x98\x83D\xd0KD\xd2a\xef3(\xdd\xb4\xbd\xb5\x9bT36\xe3\xb4\x11\x01\x02H_\x00"\xba\xba\x13\x01\x13\x02\x13\x03\xc0+\xc0/\xc0,\xc00\xcc\xa9\xcc\xa8\xc0\x13\xc0\x14\x00\x9c\x00\x9d\x00/\x005\x00')
127.0.0.1 - - [26/Apr/2019 11:23:41] code 400, message Bad HTTP/0.9 request type ("\x16\x03\x01\x02\x00\x01\x00\x01\xfc\x03\x03-\xa8\xd2\xa0*\xe7\xbc\x7f5\xc4\xe7\x1eD\x99\x12Oi\xa9\x833)\x11'\xb8hE\xf8\x9a\xd5\xf62\x9e")
--------------------------------------------------------------------------------
Exception happened during processing of request from ('127.0.0.1', 56228)
Traceback (most recent call last):
  File "/usr/lib64/python2.7/SocketServer.py", line 593, in process_request_thread
    self.finish_request(request, client_address)
  File "/usr/lib64/python2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib64/python2.7/SocketServer.py", line 649, in __init__
    self.handle()
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 293, in handle
    rv = BaseHTTPRequestHandler.handle(self)
  File "/usr/lib64/python2.7/BaseHTTPServer.py", line 340, in handle
    self.handle_one_request()
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 327, in handle_one_request
    elif self.parse_request():
  File "/usr/lib64/python2.7/BaseHTTPServer.py", line 286, in parse_request
    self.send_error(400, "Bad request syntax (%r)" % requestline)
  File "/usr/lib64/python2.7/BaseHTTPServer.py", line 368, in send_error
    self.send_response(code, message)
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 332, in send_response
    self.log_request(code)
  File "/home/fakeusername/dev/indico/src/indico/cli/devserver.py", line 161, in log_request
    super(QuietWSGIRequestHandler, self).log_request(code, size)
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 373, in log_request
    self.log('info', '"%s" %s %s', msg, code, size)
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 384, in log
    message % args))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 18: ordinal not in range(128)
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 56226)
Traceback (most recent call last):
  File "/usr/lib64/python2.7/SocketServer.py", line 593, in process_request_thread
    self.finish_request(request, client_address)
  File "/usr/lib64/python2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib64/python2.7/SocketServer.py", line 649, in __init__
    self.handle()
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 293, in handle
    rv = BaseHTTPRequestHandler.handle(self)
  File "/usr/lib64/python2.7/BaseHTTPServer.py", line 340, in handle
    self.handle_one_request()
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 327, in handle_one_request
    elif self.parse_request():
  File "/usr/lib64/python2.7/BaseHTTPServer.py", line 281, in parse_request
    "Bad HTTP/0.9 request type (%r)" % command)
  File "/usr/lib64/python2.7/BaseHTTPServer.py", line 368, in send_error
    self.send_response(code, message)
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 332, in send_response
    self.log_request(code)
  File "/home/fakeusername/dev/indico/src/indico/cli/devserver.py", line 161, in log_request
    super(QuietWSGIRequestHandler, self).log_request(code, size)
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 373, in log_request
    self.log('info', '"%s" %s %s', msg, code, size)
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 384, in log
    message % args))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 18: ordinal not in range(128)
----------------------------------------
127.0.0.1 - - [26/Apr/2019 11:23:41] code 400, message Bad request syntax ("\x16\x03\x01\x00\xb5\x01\x00\x00\xb1\x03\x03m\x90\xd6\xda\x16!'3;\x03\xd8_\xa1\xf80\xcd\xe6\xe9\xd0\xd9\x055\xe78F\x9e\xfb\xf5Vv\xca\xb5\x00\x00\x1cJJ\xc0+\xc0/\xc0,\xc00\xcc\xa9\xcc\xa8\xc0\x13\xc0\x14\x00\x9c\x00\x9d\x00/\x005\x00")
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 56230)
Traceback (most recent call last):
  File "/usr/lib64/python2.7/SocketServer.py", line 593, in process_request_thread
    self.finish_request(request, client_address)
  File "/usr/lib64/python2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib64/python2.7/SocketServer.py", line 649, in __init__
    self.handle()
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 293, in handle
    rv = BaseHTTPRequestHandler.handle(self)
  File "/usr/lib64/python2.7/BaseHTTPServer.py", line 340, in handle
    self.handle_one_request()
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 327, in handle_one_request
    elif self.parse_request():
  File "/usr/lib64/python2.7/BaseHTTPServer.py", line 286, in parse_request
    self.send_error(400, "Bad request syntax (%r)" % requestline)
  File "/usr/lib64/python2.7/BaseHTTPServer.py", line 368, in send_error
    self.send_response(code, message)
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 332, in send_response
    self.log_request(code)
  File "/home/fakeusername/dev/indico/src/indico/cli/devserver.py", line 161, in log_request
    super(QuietWSGIRequestHandler, self).log_request(code, size)
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 373, in log_request
    self.log('info', '"%s" %s %s', msg, code, size)
  File "/home/fakeusername/dev/indico/env/lib/python2.7/site-packages/werkzeug/serving.py", line 384, in log
    message % args))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 14: ordinal not in range(128)
----------------------------------------

Thanks,
Jose

Looks like you’re accessing a dev server that’s running in http mode via https.

Hmm, my notes included https… But yes, using http works.

The dev server has some options to use https, or you can put e.g. nginx in front of it (the dev setup docs mention this as an option). But by default it’s http-only since that’s the easiest way to use it in development.

As agreed during yesterday’s meeting, I have created a docker-compose file that sets up the CERN Search microservice alongside Nginx, Postgres, Redis, ElasticSearch and Tika. This should be enough to get us started with the development of the plugin:

In order to run it, you should download the file to the root folder of the cern-search repo. You will also have to generate the test certificates by hand (we could put this in a separate Dockerfile for nginx, though…)

$ sh scripts/gen-cert.sh
$ mv nginx.crt nginx/tls/tls.crt
$ mv nginx.key nginx/tls/tls.key
$ rm nginx.csr

If OpenSSL complains about the password being too short, just replace pass:x with pass:12345 in gen-cert.sh (I’ll send a PR to fix that upstream).

Then do docker-compose up and you should have your development cluster running.

I managed to log in to Invenio (https://localhost:8080)

(username: test@example.com, password: test1234)

Retrieving records through the REST API results in an error, probably because I haven’t set up the ElasticSearch indices properly. In any case, it’s a start.

Apache Tika seems to work fine when I connect to it using tika-python:

In [16]: from tika import parser

In [17]: parser.from_file('/tmp/test.docx', serverEndpoint="http://localhost:9998")
Out[17]:
{'content': u'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nTEST2\n',
 'metadata': {u'Application-Name': u'LibreOffice/5.3.6.1$Linux_X86_64 LibreOffice_project/30$Build-1',
...

@pferreir I would appreciate it if you could provide a bit more information about Docker (I have never used Docker before, apart from an initial test running HelloWorld and listing the Docker images…).
I installed docker (version 1.13.1, API version 1.26) and docker-compose (version 1.24.0).
I created a directory with the downloaded docker-compose.yml file you created and tried to run docker-compose up, but this required the Dockerfile. What should the Dockerfile contain? And obviously I am missing the commands to initialize docker and the “development cluster”.
Also, you are using gen-cert.sh to create the certificates. What is the content of this file? A simple openssl command?

I think this answers your question :wink:

In order to run it, you should download the file to the root folder of the cern-search repo.

The repo to clone is https://github.com/inveniosoftware-contrib/cern-search - it includes the Dockerfile and the gen-cert.sh script.

THANK YOU! Yes it does.

Hi,

quick question: does the search_invenio plugin actually work? Has it been seen working?

As you know, I am trying to write a new plugin based on that one, reusing as much as possible. But was it ever functional?

For example, are all templates correct?

At this point, until we have CERN Search deployed, I am trying to mock an EventEntry() object as the fake output of a query. I thought, naively, that if I built that object properly, I would see its content in the web page. But I am getting “Build Errors”.

  • It could be (hopefully!!) that I am not creating the object correctly.
  • Or it could be that the template interpolation is failing.

This is what makes me wonder whether the code and architecture of search_invenio are correct…


Speaking of the classes in entries.py, are they documented somewhere?
The input options are:

  • result_id
  • title
  • location
  • start_date
  • materials
  • authors
  • description

The type of some of them (strings, integers, …) and their meaning are unclear to me. Where can I find some documentation?

Replying to myself…
I have just been reminded that the Invenio plugin does not work.
So the idea then is to interpolate the HTML templates directly with the JSON from the queries, correct?
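If I understand that idea correctly, it is roughly the following (a toy sketch: the inline template string and the url/title field names are invented for illustration, not the plugin's actual templates):

```python
# Toy illustration of rendering search-result JSON directly into HTML.
# The template and the 'url'/'title' fields are made up for this example;
# the real plugin would use its own Jinja template files.
from jinja2 import Template

template = Template(
    '<ul>{% for hit in hits %}'
    '<li><a href="{{ hit.url }}">{{ hit.title }}</a></li>'
    '{% endfor %}</ul>'
)
html = template.render(hits=[{'url': '/event/1', 'title': 'Test event'}])
print(html)
```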

It always helps to have the correct network configuration…
I followed @pferreir's instructions and it was really simple to get Docker running.
The following are all the commands I used on my RHEL 7 VM:

$ yum -y install docker
$ pip install docker-compose
$ systemctl start docker
$ git clone https://github.com/inveniosoftware-contrib/cern-search
$ cd cern-search
$ wget https://gist.githubusercontent.com/pferreir/77ede49adb292879c52e3e4a02e28582/raw/c26b7031a2fd5d01c9ac82293300b785d54dd7c9/docker-compose.yml
$ sh scripts/gen-cert.sh
$ mv nginx.crt nginx/tls/tls.crt
$ mv nginx.key nginx/tls/tls.key
$ rm nginx.csr
$ docker-compose up -d

Then I was able to access Invenio from my desktop (https://web4604.fnal.gov:8080) and logged in with username test@example.com / password test1234.

As for Tika, I was able to access it from another server without a problem:

>>> from tika import parser
>>> parser.from_file("./penelope.py", serverEndpoint="http://web4604.fnal.gov:9998")
{'status': 200, 'content': u'\n\n\n\n\n\n\n\n\nfrom __future__ import unicode_literals\
.......

The next steps will be to access the cern-search-api: send indico livesync data to be indexed and then send search requests and receive the search results.
I assume that the example at http://cernsearchdocs.web.cern.ch/cernsearchdocs/example/ and the rest of the documentation should be our starting point.


Just a small suggestion: do not use pip install docker-compose - it installs TONS of dependencies, and when used outside a virtualenv it leaves behind a huge mess of Python packages in your system Python environment.

Better to download a single-file bundle from https://github.com/docker/compose/releases (e.g. https://github.com/docker/compose/releases/download/1.24.0/docker-compose-Linux-x86_64), save it as /usr/local/bin/docker-compose, and chmod +x it.

@ThiefMaster Thank you for the information. Yes, I did notice all the packages it installs, but I followed the instructions as I was not sure what was needed.

Yes, that’s the idea. I wouldn’t spend tons of time on the interface, however. We will have someone working on a fancy UI on our side this summer, so a simple Google-like thing would be enough for now.

OK.
Would you then recommend that I adapt the search_cern plugin (https://github.com/indico/indico-plugins-cern/tree/master/search_cern)?
I was not planning on changing either the interface or the rendering. My plan was just to change the plugin to handle the new JSON output from the queries to “CERN Search” and let the existing code do the rest. Right?
So, if that sounds like a reasonable approach, then I guess the steps here are:

  1. find the exact method where the queries are performed. In the search_invenio plugin it was _fetch_data(); I need to find out where exactly this is done in search_cern.
  2. find out where the output of the query is being used to fill the HTML templates.
  3. massage, if needed, the JSON output to be able to fill the HTML templates with it. The templates in search_cern are supposed to be correct, I assume…

Does that sound correct to you?

The search_cern plugin is not a great example because it uses an <iframe> to show the results. So, it does absolutely no rendering of any results, it just displays the page that is sent by the search engine (Sharepoint in this case). So, yes, you can adapt it, but then you’ll have to write a very basic interface. You can actually just “steal” it from the old Invenio plugin: https://github.com/indico/indico-plugins/blob/master/search_invenio/indico_search_invenio/templates/results.html

find the exact method where the queries are performed. In the search_invenio plugin it was _fetch_data(); I need to find out where exactly this is done in search_cern.

You probably want https://github.com/indico/indico-plugins-cern/blob/master/search_cern/indico_search_cern/engine.py#L28.

find out where the output of the query is being used to fill the HTML templates.

It’s not. But you can steal that from the Invenio plugin as I’ve said.

massage, if needed, the JSON output to be able to fill the HTML templates with it.

Yes!

Oh. Then my original approach was not that bad after all.
I was studying the Invenio plugin and more or less got the general idea.


In this case, the output is converted (or an attempt is made to convert it) into Author(), EventEntry(), ContributionEntry(), and SubContributionEntry() objects. After that, they are supposed to be used to fill the HTML templates, if I got the logic correctly.
Therefore, I was working on the assumption that, if you create those Entry() objects properly, the rendering would work.
From your answer I gather that the templates in the Invenio plugin are correct.
So I assume the idea is to reuse

  • code from search_cern, as much as possible
  • templates from search_invenio

Did I get your comments correctly?

Yes, that’s what I mean, and it should be possible.
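To make the reuse plan concrete, here is a very rough sketch of a search_cern-style engine that fetches JSON and massages each hit into the flat shape a template could render. The class name, the response layout, and the `_source` field names are all assumptions for illustration, not the real plugin code:

```python
# Hypothetical sketch: query the search service the way search_cern's
# engine.py does, then convert each JSON hit into the fields the
# search_invenio-style templates would interpolate.
import requests


class CERNSearchEngine(object):
    def __init__(self, base_url):
        self.base_url = base_url

    def _fetch_data(self, query):
        # hit the search service and return the parsed JSON response
        resp = requests.get(self.base_url, params={'q': query})
        resp.raise_for_status()
        return resp.json()

    def _make_entry(self, hit):
        # massage one JSON hit into the shape the templates expect
        source = hit.get('_source', {})
        return {'title': source.get('title'), 'url': source.get('url')}

    def process(self, query):
        data = self._fetch_data(query)
        return [self._make_entry(h) for h in data.get('hits', {}).get('hits', [])]
```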

The Elasticsearch (CERN search) marshmallow schema is almost finished (I placed a draft at: https://github.com/penelopec/Elasticsearch_Docs/blob/master/schemas.py)

There are only the following fields for which I am not able to get any values, and I am not able to find information about them:

ContributionSchema:

  • creation_date = mm.DateTime(attribute='created_dt')

SubContributionSchema:

  • creation_date = mm.DateTime(attribute='created_dt')
  • start_date = mm.DateTime(attribute='start_dt')
  • end_date = mm.DateTime(attribute='end_dt')

For the implementation I made the following assumptions:
ACL assumptions (for the read entry of _access):

  • For public access the ACL will contain only one entry, 'ANONYMOUS', or it could be just empty, depending on what Pablo expects.
  • For private access it will contain the users’ IDs and the users’ emails.
  • The ACL for subcontributions is that of the contribution it belongs to
  • The ACL of the EventNote is that of the object it belongs to (contribution, session or event)
  • For all mappings I have added a URL field to contain the external url for accessing the object from the search results.
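For what it's worth, the ACL assumptions above boil down to something like this (a sketch only; whether public access should be ['ANONYMOUS'] or just [] still depends on what the CERN Search service expects, as noted):

```python
# Sketch of building the `read` entry of `_access` per the assumptions above.
def build_read_acl(is_public, user_ids=(), user_emails=()):
    if is_public:
        # open question: 'ANONYMOUS' marker vs. an empty list
        return ['ANONYMOUS']
    # private access: both the users' IDs and the users' emails
    return [str(uid) for uid in user_ids] + list(user_emails)
```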

The following are the questions I have in order to move forward with livesync_json (I decided that this is a better name for this plugin):

  1. How do I access the CERN search app (assuming that I have installed in docker what Pedro has supplied)?

  2. How should I call the CERN search app to populate ES?
    For ES I have the following requests.post line:
    response = requests.post(self.url, auth=(self.username, self.password), data={'json': jsondata})
    where jsondata is a string that complies with the format expected by the Bulk API of ES
    (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html).
    I am using only the index and delete operations; _index is the mapping that I am accessing and _id is the object’s id:

POST _bulk

{ "index" : {"_index" : "events",  "_id" : 1 } }\n
{ "field1" : "value1" }\n
{ "delete" : { "_index" : "notes", "_id" : 2 } }\n
  3. What information should the web setup page of the plugin contain?
  • tika server URL
  • CERN search app URL
  • Access username / password for ES(?)
  • ??
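
One caveat on the requests.post line above: passing data={'json': jsondata} makes requests form-encode the body, whereas the Bulk API expects the raw newline-delimited JSON as the request body. A minimal sketch of building that payload (the example documents mirror the snippet above; endpoint and auth are placeholders):

```python
import json


def build_bulk_payload(actions):
    """Serialize (action, document) pairs into the newline-delimited JSON
    the Elasticsearch Bulk API expects; `document` is None for deletes."""
    lines = []
    for action, document in actions:
        lines.append(json.dumps(action))
        if document is not None:
            lines.append(json.dumps(document))
    return '\n'.join(lines) + '\n'  # the Bulk body must end with a newline


payload = build_bulk_payload([
    ({"index": {"_index": "events", "_id": 1}}, {"field1": "value1"}),
    ({"delete": {"_index": "notes", "_id": 2}}, None),
])
# The payload would then be sent as the raw body, e.g.:
#   requests.post(url, auth=(user, pw), data=payload,
#                 headers={'Content-Type': 'application/x-ndjson'})
```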

We don’t have a creation date for (sub-)contributions. I recall discussing that during the move to Postgres, and I believe that at the time we found it to be useless. Not sure I would have the same opinion today.
Also, sub-contributions have no start/end date, only a duration.